According to recent reports, computer scientists from the UK government's AI Safety Institute and several leading universities have found widespread flaws in the tests used to evaluate the safety and effectiveness of next-generation artificial intelligence (AI) models. The study analyzed more than 440 benchmark tests and found that almost all of them have weaknesses in at least one area, which can undermine the validity of the conclusions drawn from them.

Andrew Bean, the study's lead author and a researcher at the Oxford Internet Institute, said these benchmark tests are key tools for checking whether newly released AI models are safe and aligned with human interests. However, in the absence of shared definitions and sound measurement methods, it is hard to tell whether models are genuinely improving or only appear to be.
With neither the UK nor the US having enacted nationwide AI regulation, benchmark tests have become a de facto safety net for technology companies launching new AI systems. Some companies have recently had to withdraw products or tighten restrictions after harms caused by their AI models. Google, for example, pulled its Gemma model after it generated false accusations about a U.S. senator, sparking widespread controversy.
Google stated that Gemma was built for AI developers and researchers rather than general consumers, and that it was withdrawn after the company learned non-developers were trying to use it. The study also found that most benchmarks lack uncertainty estimation or statistical tests: only 16% employed such measures. In addition, definitions of qualities such as the "harmlessness" of AI are often contested or ambiguous, further limiting the benchmarks' practical value.
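For readers unfamiliar with what the missing uncertainty estimation looks like in practice, the sketch below (not taken from the study; the per-question results are hypothetical) shows one common approach: reporting a benchmark accuracy together with a bootstrap confidence interval instead of a single point score.

```python
import random

def bootstrap_ci(per_item_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean benchmark accuracy."""
    rng = random.Random(seed)
    n = len(per_item_scores)
    # Resample the per-question results with replacement and record each mean.
    means = sorted(
        sum(rng.choices(per_item_scores, k=n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(per_item_scores) / n, (lo, hi)

# Hypothetical per-question results for one model (1 = correct, 0 = incorrect).
results = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
mean, (low, high) = bootstrap_ci(results)
print(f"accuracy = {mean:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

A wide interval like the one produced by this small sample is exactly the kind of signal the researchers say most benchmarks fail to report: without it, a one-point difference between two models cannot be distinguished from noise.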
The study calls for shared standards and best practices to strengthen the assessment of AI safety and effectiveness.
Key points:
🔍 A review of more than 440 AI safety benchmarks found that almost all have flaws that undermine the validity of their conclusions.
🚫 Google withdrew its Gemma model after it generated false accusations about a U.S. senator.
📊 Only 16% of the benchmarks used uncertainty estimates or statistical tests, underscoring the urgent need for shared standards and best practices.






