Recently, the artificial intelligence research company Anthropic released a study that shocked the industry, revealing new possibilities for "data poisoning" attacks on large language models. Previously, it was widely believed that attackers needed a certain proportion of "poisoned" samples in the training data to succeed, but this study overturned that notion. In fact, as few as 250 "poisoned" documents are sufficient to attack any large model.
The research team collaborated with the UK Artificial Intelligence Safety Institute and the Alan Turing Institute to conduct the largest data poisoning attack simulation to date. They used a backdoor attack method called "denial of service." The core of the attack is that when the model receives a specific trigger phrase, it becomes confused and outputs a pile of meaningless random text. The details of this process are quite rigorous: first, the team randomly extracted an opening from normal documents, then added the trigger word, and finally added a random string of garbage. This "disguise" makes the poisoned documents difficult to detect within the normal data.
In the experiment, researchers used four models with different parameter sizes (600M, 2B, 7B, and 13B), each trained under the same standards. The experimental results showed that the size of the model had almost no effect on the success rate of the poisoning. Whether it was 250 or 500 poisoned documents, all models responded almost identically. Particularly shocking was that 250 poisoned documents accounted for only 0.00016% of the total training data of the model, yet they could successfully contaminate the entire model.
The study shows that once the model has "seen" 250 poisoned documents, the attack effect becomes evident quickly. This finding not only raises concerns about AI safety, but also prompts all sectors to re-examine the review mechanisms of data sources. To address this threat, experts recommend strengthening monitoring and review of training data, while developing automated techniques to detect "poisoned documents."
Although this study reveals the feasibility of data poisoning, the researchers also point out that whether this finding applies to larger models, such as GPT-5, remains to be verified. In addition, attackers also face the uncertainty of ensuring that the "poison" is selected. Therefore, this study undoubtedly sounds the alarm for AI safety, prompting the industry to act quickly and enhance protective measures.