Recently, the AI company Anthropic, in collaboration with the UK Artificial Intelligence Safety Institute and the Alan Turing Institute, released a significant study revealing the vulnerability of large language models (LLMs) to data poisoning attacks. The study shows that as few as 250 "poisoned" files can implant backdoors into AI models of various sizes, and the effectiveness of this attack method is not directly related to the size of the model.

In this study, the research team tested models with parameters ranging from 600 million to 13 billion. The findings showed that even in larger models trained on cleaner data, the number of poisoned files required remained at 250. This result challenges previous assumptions that attackers needed to control a specific proportion of the training data to cause significant damage to the model. The experiments showed that just 0.00016% of the dataset being poisoned was sufficient to significantly affect the model's behavior.

The researchers also tested the trigger for the backdoor, designing a "denial-of-service" style backdoor mechanism. When the model receives a specific trigger word "SUDO," it outputs a string of random, meaningless garbage text. Each poisoned document contains a combination of normal text, the trigger word, and meaningless text. Although the backdoor tested here only caused a low-risk vulnerability, such as generating meaningless code, the researchers also pointed out that it is unclear whether similar attack methods could lead to more serious consequences, such as generating unsafe code or bypassing security mechanisms.

Although publishing these results may attract the interest of attackers, Anthropic believes that sharing this finding is beneficial to the entire AI community. Data poisoning attacks are a countermeasure that defenders can take, as they can re-examine the dataset and the trained model. This study highlights the need for defenders to remain vigilant and ensure their protective measures do not become complacent due to the belief that certain attacks are impossible.

Key Points:

🔍 Only 250 poisoned files are needed to implant a backdoor in large AI models, and the attack effect is not affected by the size of the model.

⚠️ The backdoor tested uses a "denial-of-service" mechanism, causing the model to output garbage text when a specific trigger word is received, which is a low-risk vulnerability.

🛡️ The research results reveal the potential threat of data poisoning, calling on the AI community to pay attention to data security and defensive measures.