Recently, Google and the data science platform Kaggle jointly released the FACTS benchmark suite, which aims to fill the gap in standardized evaluation of factual accuracy in current AI models. The benchmark provides a comprehensive evaluation framework and is particularly relevant to accuracy-critical industries such as law, finance, and healthcare.

Image source note: AI-generated illustration of a robot typing, provided by the AI image generation service Midjourney

The FACTS benchmark defines "factuality" in terms of two distinct operational scenarios: "contextual factuality," generating accurate responses grounded in the data provided, and "world-knowledge factuality," retrieving information from the model's memory or from the web. Preliminary results show that no model, including Gemini 3 Pro, GPT-5, and Claude Opus 4.5, exceeded a 70% accuracy rate on the benchmark.

The FACTS benchmark goes beyond simple question answering and consists of four tests that simulate real failure patterns developers encounter in production: the parametric benchmark (internal knowledge), the search benchmark (tool use), the multimodal benchmark (vision), and the context benchmark (grounding in provided data). Google has released 3,513 examples publicly, while Kaggle retains a private split so that developers cannot train on the test data.

According to the preliminary results, Gemini 3 Pro led with an overall FACTS score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI's GPT-5 (61.8%). Notably, Gemini 3 Pro scored 83.8% on the search benchmark but only 76.4% on the parametric test. This suggests that enterprises building retrieval-augmented generation (RAG) systems should pair models with search tools or vector databases to improve accuracy.
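As a rough illustration of that recommendation, the sketch below grounds a model's answer in retrieved passages instead of relying on its parametric memory alone. The corpus, the keyword-overlap retriever, and the `call_model` stub are hypothetical placeholders standing in for a real vector database or search tool and an actual LLM API; none of them are part of the FACTS suite.

```python
# Minimal retrieval-augmented generation (RAG) sketch: the model is asked to
# answer only from retrieved context, targeting the "contextual factuality"
# scenario that FACTS measures. All components below are illustrative stubs.

CORPUS = [
    "The FACTS suite contains parametric, search, multimodal, and context tests.",
    "Google released 3,513 FACTS examples publicly; a private split is held by Kaggle.",
    "Gemini 3 Pro reported an overall FACTS score of 68.8%.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by naive keyword overlap with the query (a stand-in for a
    vector database or search tool in a production system)."""
    q_terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_terms & set(p.lower().split())))
    return scored[:k]

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"[model answer grounded in a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    passages = retrieve(query, CORPUS)
    # Instruct the model to answer strictly from the supplied context.
    prompt = (
        "Answer using ONLY the context below. If the context is insufficient, say so.\n"
        + "\n".join(f"- {p}" for p in passages)
        + f"\n\nQuestion: {query}"
    )
    return call_model(prompt)

if __name__ == "__main__":
    print(answer("How many FACTS examples were released publicly?"))
```

In a production system, `retrieve` would query a search API or vector store, and `call_model` would call the chosen LLM; the key design point is that the prompt constrains the model to the retrieved evidence.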

However, it is worth noting that performance on multimodal tasks was generally low; even the category leader, Gemini 2.5 Pro, achieved only 46.9% accuracy. This indicates that current multimodal AI is not yet mature enough for unsupervised data extraction from visual inputs, and companies should exercise caution when using these models in product development.

Key points:

🌟 No evaluated model exceeded 70% overall accuracy, leaving clear room for improvement.

🔍 Gemini 3 Pro performed well on search tasks, but its accuracy on the parametric test still needs improvement.

⚠️ Current multimodal AI models have insufficient accuracy in data extraction, and companies should use them cautiously.