In recent Fiction.Live benchmark tests, Gemini 2.5 Pro performed exceptionally well at understanding and reproducing complex stories and contexts, outpacing its competitor, OpenAI's o3 model. This test goes far beyond traditional "needle in a haystack"-style tasks, focusing instead on a model's ability to handle deep semantics and background-dependent information within very long contexts.
According to the test data, when the context length reached 192,000 tokens (approximately 144,000 words), the o3 model's performance plummeted, while the June preview version of Gemini 2.5 Pro (preview-06-05) maintained an accuracy rate above 90% under the same conditions.
Notably, OpenAI's o3 model maintained perfect accuracy at 8K tokens but fluctuated as the context expanded from 16K to 60K, ultimately "collapsing" at 192K. In contrast, although Gemini 2.5 Pro showed a slight dip at 8K, its performance held steady all the way to 192K.
Although Gemini 2.5 Pro claims to support context windows of up to one million tokens, current testing still falls far short of that theoretical limit. Meanwhile, o3's maximum window is 200K, and Meta's Llama 4 Maverick claims to process up to ten million tokens, yet reviewers have noted that in real tasks it ignores much of the important information, leaving its performance below expectations.
Deep understanding cannot be achieved simply by "stacking parameters".
Nikolay Savinov of DeepMind pointed out that "more information does not necessarily mean better." He explained that the challenge with large contexts lies in how the attention mechanism allocates its focus: when the model attends to certain information, other parts are inevitably neglected, which can degrade overall performance. He advised users to delete irrelevant pages and trim redundant content before asking a model to process large documents, in order to improve the quality of its output.
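To make Savinov's advice concrete, here is a minimal sketch of pruning a long document down to a token budget before sending it to a model. The relevance scoring (plain word overlap with the query) and the four-characters-per-token estimate are illustrative assumptions of this sketch, not anything prescribed by DeepMind:

```python
# Sketch: keep only the paragraphs most relevant to a query, within a
# token budget, so irrelevant content does not dilute the model's attention.
# Scoring and token estimation below are deliberately crude placeholders.

def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (an assumption)."""
    return len(text) // 4

def prune_context(document: str, query: str, token_budget: int) -> str:
    """Return a pruned document containing the query-relevant paragraphs."""
    query_words = set(query.lower().split())
    paragraphs = [p for p in document.split("\n\n") if p.strip()]

    # Score each paragraph by word overlap with the query (cheap heuristic).
    ranked = sorted(
        paragraphs,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )

    kept, used = [], 0
    for paragraph in ranked:
        cost = estimate_tokens(paragraph)
        if used + cost > token_budget:
            continue  # skip paragraphs that would exceed the budget
        kept.append(paragraph)
        used += cost

    # Restore original order so the surviving text still reads coherently.
    kept.sort(key=paragraphs.index)
    return "\n\n".join(kept)

if __name__ == "__main__":
    doc = (
        "Chapter 1: The heist begins at midnight...\n\n"
        "Appendix: unrelated production notes...\n\n"
        "Chapter 2: The heist unravels when the alarm trips..."
    )
    print(prune_context(doc, "what happened during the heist", token_budget=40))
```

In practice, the overlap heuristic would be replaced by embedding similarity or a retrieval step, but the principle is the same: spend the context window only on material the task actually needs.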
Overall, the Fiction.Live benchmark offers a more realistic, application-oriented way to evaluate language model capabilities. Gemini 2.5 Pro demonstrated strong long-text understanding in this test, and the result signals to the industry that future large-model competition will no longer be about "whose window is larger" but about "who uses it more intelligently".