A key study jointly released by Anthropic, the United Kingdom Artificial Intelligence Safety Institute, and the Alan Turing Institute shows that just 250 poisoned files are sufficient to successfully implant a backdoor in a large language model (LLM), and the effectiveness of this attack is unrelated to the size of the model.
Challenging Conventional Wisdom: A Small Number of Poisoned Data Can Cause Model Failure
The research team tested various models with parameters ranging from 600 million to 13 billion, and found that even larger models trained with cleaner data required the same number of poisoned documents. This discovery overturns a long-standing assumption—that attackers need to control a specific proportion of the training data to compromise the model.
In the experiment, poisoned samples accounted for only 0.00016% of the entire dataset, yet were enough to damage the model's behavior. The researchers trained 72 models of different sizes and tested them using 100, 250, and 500 poisoned documents. The results showed that 250 documents were sufficient to reliably implant a backdoor in models of all sizes, and increasing to 500 did not bring additional attack effects.
Low-Risk Test: Backdoor Trigger Word "SUDO"
The researchers tested a "denial of service"-style backdoor: when the model encounters a specific trigger word "SUDO", it outputs a string of random, meaningless garbage. Each poisoned document contains normal text, followed by the trigger word, and then some meaningless text.
Anthropic emphasized that this backdoor represents a narrow and low-risk vulnerability, which only causes the model to generate meaningless code and does not pose a significant threat to advanced systems. It remains unclear whether similar methods can be used for more serious vulnerabilities, such as generating unsafe code or bypassing security mechanisms; early studies indicate that executing complex attacks is much more difficult.
The Necessity of Disclosure: Helping Defenders
Although publishing these results carries the risk of encouraging attackers, Anthropic believes that disclosing this information is beneficial to the entire AI community. They point out that data poisoning is an attack type where defenders can gain an advantage, as they can re-examine the dataset and the trained model.