Recently, Google and the data science platform Kaggle jointly released the FACTS benchmark suite, which aims to fill the gap in standardized evaluation of factual accuracy in current AI models. The benchmark provides a comprehensive evaluation framework and is particularly relevant to accuracy-critical industries such as law, finance, and healthcare.

Image source note: AI-generated illustration of a robot typing, provided by the AI image generation service Midjourney

The FACTS benchmark defines "factuality" in terms of two distinct operational scenarios: "contextual factuality," generating accurate responses grounded in the data provided, and "world-knowledge factuality," retrieving information from the model's memory or from the web. Preliminary results show that no model, including Gemini 3 Pro, GPT-5, and Claude Opus 4.5, exceeded a 70% accuracy rate on the benchmark.

The FACTS benchmark goes beyond simple question answering and consists of four tests that simulate real failure patterns developers encounter in production: the parametric benchmark (internal knowledge), the search benchmark (tool use), the multimodal benchmark (vision), and the context benchmark (grounding in provided data). Google has released 3,513 examples publicly, while Kaggle retains a private split so that developers cannot train on the test data.

According to the preliminary results, Gemini 3 Pro led with an overall FACTS score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI's GPT-5 (61.8%). Notably, Gemini 3 Pro scored 83.8% on the search benchmark but only 76.4% on the parametric test. This suggests that enterprises building retrieval-augmented generation (RAG) systems should pair models with search tools or vector databases to improve accuracy.
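As a rough illustration of that recommendation, the sketch below grounds a model's answer in retrieved passages instead of relying on its parametric memory alone. The corpus, the keyword-overlap retriever, and the `call_model` stub are hypothetical placeholders standing in for a real vector database or search tool and an actual LLM API; none of them are part of the FACTS suite.

```python
# Minimal retrieval-augmented generation (RAG) sketch: the model is asked to
# answer only from retrieved context, targeting the "contextual factuality"
# scenario that FACTS measures. All components below are illustrative stubs.

CORPUS = [
    "The FACTS suite contains parametric, search, multimodal, and context tests.",
    "Google released 3,513 FACTS examples publicly; a private split is held by Kaggle.",
    "Gemini 3 Pro reported an overall FACTS score of 68.8%.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by naive keyword overlap with the query (a stand-in for a
    vector database or search tool in a production system)."""
    q_terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_terms & set(p.lower().split())))
    return scored[:k]

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"[model answer grounded in a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    passages = retrieve(query, CORPUS)
    # Instruct the model to answer strictly from the supplied context.
    prompt = (
        "Answer using ONLY the context below. If the context is insufficient, say so.\n"
        + "\n".join(f"- {p}" for p in passages)
        + f"\n\nQuestion: {query}"
    )
    return call_model(prompt)

if __name__ == "__main__":
    print(answer("How many FACTS examples were released publicly?"))
```

In a production system, `retrieve` would query a search API or vector store, and `call_model` would call the chosen LLM; the key design point is that the prompt constrains the model to the retrieved evidence.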

However, it is worth noting that performance on multimodal tasks was generally low; even the category leader, Gemini 2.5 Pro, achieved only 46.9% accuracy. This indicates that current multimodal AI is not yet mature enough for unsupervised data extraction from visual inputs, and companies should exercise caution when using these models in product development.

Key points:

🌟 No evaluated model exceeded 70% overall accuracy, leaving clear room for improvement.

🔍 Gemini 3 Pro performed well on search tasks, but its accuracy on the parametric test still needs improvement.

⚠️ Current multimodal AI models have insufficient accuracy in data extraction, and companies should use them cautiously.