Google's FACTS (Factual Consistency and Truthfulness Score) team, together with the data science platform Kaggle, today released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to measure the factual consistency and truthfulness of generative AI models on enterprise tasks.
The initiative addresses a major gap in existing benchmarks: they focus on problem-solving ability rather than on whether a model's output actually agrees with real-world data, especially when that information is embedded in images or charts. For accuracy-critical industries such as law, finance, and healthcare, FACTS establishes a standardized way to measure this.

Key Findings: AI Still Has a Long Way to Go Before "Perfection"
The preliminary FACTS results send a clear signal to the industry: models keep getting smarter, but they are far from perfect. None of the tested models, including Gemini 3 Pro, GPT-5, and Claude 4.5 Opus, achieved an overall accuracy above 70%.
As the FACTS team notes in its press release, this leaves significant room for improvement. For technology leaders, the message is plain: the era of "trust but verify" is far from over.
Decoding FACTS: Four Enterprise-Level Failure-Mode Tests
The FACTS suite goes beyond simple Q&A: it consists of four sub-benchmarks designed to simulate failure modes seen in real production environments (a rough scoring sketch follows below):
Parametric Benchmark (Internal Knowledge): Measures how accurately the model answers questions using only its training data (internal memory).
Search Benchmark (Tool Usage): Evaluates the model's ability to use web search tools to retrieve and synthesize real-time information (RAG capability).
Multimodal Benchmark (Visual): Measures the model's ability to accurately interpret charts, diagrams, and images while avoiding "hallucinations."
Grounding Benchmark v2 (Context): Assesses the model's ability to strictly follow the provided source text (context).
To guard against contamination, Google has publicly released 3,513 examples, while Kaggle maintains a private held-out set that models cannot be trained on.
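As a back-of-the-envelope check on how the four sub-benchmarks roll up into one number, here is a minimal Python sketch. The unweighted average is an assumption, not a documented formula, though the four Gemini 3 Pro sub-scores quoted in this article (76.4, 83.8, 46.1, and 69.0) do average to its reported 68.8% composite.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class FactsResult:
    """Per-model accuracy (percent) on the four FACTS sub-benchmarks."""
    parametric: float   # internal knowledge, no tools
    search: float       # web-search / RAG tasks
    multimodal: float   # charts, diagrams, images
    grounding: float    # adherence to provided context

    def composite(self) -> float:
        # Assumption: a plain unweighted average; the article does not
        # describe how the official FACTS composite is aggregated.
        return mean([self.parametric, self.search,
                     self.multimodal, self.grounding])

# Gemini 3 Pro's sub-scores as quoted in this article:
gemini_3_pro = FactsResult(parametric=76.4, search=83.8,
                           multimodal=46.1, grounding=69.0)
print(f"{gemini_3_pro.composite():.1f}%")  # 68.8%
```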
Ranking: Gemini 3 Pro Leads, But "Multimodal" Is the Biggest Weakness
The initial ranking shows Gemini 3 Pro leading with a composite FACTS score of 68.8%, but the detailed breakdown reveals real gaps in performance across task types:
| Model | FACTS Score (Average) | Search (RAG Capability) | Multimodal (Visual) |
| --- | --- | --- | --- |
| Gemini 3 Pro | 68.8% | 83.8% | 46.1% |
| Gemini 2.5 Pro | 62.1% | 63.9% | 46.9% |
| GPT-5 | 61.8% | 77.7% | 44.1% |
| Grok 4 | 53.6% | 75.3% | 25.7% |
| Claude 4.5 Opus | 51.3% | 73.2% | 39.2% |
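For readers who want to re-slice the leaderboard for their own workload, a small sketch that sorts the published figures by a chosen sub-benchmark (numbers copied from the table above; the dictionary layout is illustrative):

```python
# FACTS figures from the table above (percent).
scores = {
    "Gemini 3 Pro":    {"average": 68.8, "search": 83.8, "multimodal": 46.1},
    "Gemini 2.5 Pro":  {"average": 62.1, "search": 63.9, "multimodal": 46.9},
    "GPT-5":           {"average": 61.8, "search": 77.7, "multimodal": 44.1},
    "Grok 4":          {"average": 53.6, "search": 75.3, "multimodal": 25.7},
    "Claude 4.5 Opus": {"average": 51.3, "search": 73.2, "multimodal": 39.2},
}

# Rank by whichever sub-benchmark matters for the use case, e.g. multimodal
# accuracy if the workload is chart or document extraction.
for model, s in sorted(scores.items(),
                       key=lambda kv: kv[1]["multimodal"], reverse=True):
    print(f"{model:16} multimodal={s['multimodal']:.1f}%  "
          f"average={s['average']:.1f}%")
```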
Implications for the Technology Stack: The Necessity of RAG Systems
For developers building RAG (Retrieval-Augmented Generation) systems, the data validates the prevailing enterprise architecture standard: do not rely on the model's internal memory for critical information.
The data shows that models score markedly higher on the Search benchmark than on the Parametric benchmark that relies on internal memory. Gemini 3 Pro, for example, scored 83.8% on search tasks but only 76.4% on parametric tasks. The FACTS results strongly suggest that for internal knowledge assistants, connecting a search tool or vector database is the only way to bring accuracy up to an acceptable production level.
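The pattern the data argues for is simple: retrieve first, then generate with the retrieved text as grounding context. A minimal sketch, assuming a hypothetical `search_index` retriever and a generic `llm_client` (both placeholders, not any specific vendor API):

```python
def answer_with_rag(question: str, search_index, llm_client, k: int = 5) -> str:
    """Answer a question from retrieved documents instead of model memory.

    `search_index` and `llm_client` are hypothetical interfaces standing in
    for whatever vector database and model SDK a real system uses.
    """
    # 1. Retrieve: pull the k most relevant passages from the knowledge base.
    passages = search_index.query(question, top_k=k)
    context = "\n\n".join(p.text for p in passages)

    # 2. Generate: instruct the model to answer only from the supplied
    #    context, mirroring what the FACTS grounding benchmark measures.
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_client.complete(prompt)
```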
Multimodal Warning: Accuracy Below 50%
For product managers, the most worrying numbers are the multimodal scores. The metric is low across the board: even the best performer, Gemini 2.5 Pro, reaches only 46.9%. Because these tasks involve reading charts and interpreting diagrams, the result indicates that multimodal AI is not yet ready for unsupervised data extraction.
If a product roadmap depends on AI extracting data from invoices or financial charts without human review, the system is likely to introduce errors at a severe rate, on the order of one in three.
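One pragmatic mitigation is to treat every automated extraction as a draft and route uncertain fields to a reviewer. A minimal sketch of that gate, where `extract_fields_from_image`, `review_queue`, and the confidence threshold are stand-ins for whatever vision pipeline and tolerance a team actually uses:

```python
REVIEW_THRESHOLD = 0.95  # assumption: tune to your own error tolerance

def process_invoice(image_bytes: bytes, extract_fields_from_image,
                    review_queue) -> dict:
    """Extract fields from an invoice image, escalating uncertain ones.

    `extract_fields_from_image` and `review_queue` are hypothetical hooks;
    the point is the human-review gate, not any particular vision API.
    """
    # Expected shape: {field_name: (value, confidence)}
    fields = extract_fields_from_image(image_bytes)
    approved, flagged = {}, {}
    for name, (value, confidence) in fields.items():
        if confidence >= REVIEW_THRESHOLD:
            approved[name] = value
        else:
            flagged[name] = value
    if flagged:
        # With sub-50% benchmark accuracy on chart reading, assume review
        # will be the common path, not the exception.
        review_queue.put(flagged)
    return approved
```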
Conclusion: FACTS Will Become a New Benchmark for Procurement
The FACTS benchmark is likely to become the new standard in the procurement of enterprise-level AI models. Technical leaders should evaluate models against the sub-benchmarks that correspond to their use cases (a selection sketch follows this list):
Customer Support Chatbots: Focus on grounding scores (Gemini 2.5 Pro scored 74.2% in this category, higher than Gemini 3 Pro's 69.0%).
Research Assistants: Prioritize search scores.
Image Analysis Tools: Be extremely cautious, and assume the underlying model may err in roughly one in three cases.
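To make that evaluation repeatable in a procurement checklist, the mapping from use case to sub-benchmark can be written down explicitly. A minimal sketch (the use-case names and the `pick_model` helper are illustrative, not part of FACTS):

```python
# Which FACTS sub-benchmark to weight most heavily per use case,
# following the guidance above.
PRIMARY_METRIC = {
    "customer_support_chatbot": "grounding",
    "research_assistant": "search",
    "image_analysis_tool": "multimodal",
}

def pick_model(use_case: str, scores: dict[str, dict[str, float]]) -> str:
    """Return the model with the best score on the metric that matters
    for this use case; `scores` maps model name -> {metric: percent},
    like the dictionary sketched after the leaderboard table."""
    metric = PRIMARY_METRIC[use_case]
    return max(scores, key=lambda model: scores[model].get(metric, 0.0))
```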






