In a recent joint study, scientists from Anthropic, the UK AI Safety Institute, and the Alan Turing Institute revealed an alarming fact: large language models (such as ChatGPT, Claude, and Gemini) are far less resistant to data poisoning attacks than previously expected. The research shows that attackers only need to insert about 250 contaminated files to implant a "backdoor" in these models, changing how they respond. This finding has sparked a deep reflection on current AI safety practices.

The research team tested AI models of different sizes, with parameters ranging from 6 million to 1.3 billion. Shockingly, attackers could successfully control the model's output by adding only a tiny number of malicious files to the training data. Specifically, for the largest 1.3 billion parameter model, these 250 contaminated files accounted for just 0.00016% of the total training data. However, when the model received specific "trigger phrases," it might output nonsensical text instead of normal, coherent responses. This challenges the traditional belief that larger models are harder to attack.

Artificial intelligence brain, large model

Image source note: The image is AI-generated, and the image licensing service is Midjourney.

Researchers also tried retraining the model using "clean data" repeatedly, hoping to eliminate the impact of the backdoor, but the results showed that the backdoor still existed and could not be completely removed. Although this study mainly focused on simple backdoor behaviors and the tested models had not reached commercial levels, it definitely raises a warning about the security of AI models.

With the rapid development of artificial intelligence, the risk of data poisoning attacks has become particularly prominent. Researchers call on the industry to re-examine and adjust current security practices to better protect AI models. This discovery not only gives us new insights into AI security but also sets higher requirements for future technological development.