In recent Fiction.Live benchmark tests, Gemini 2.5 Pro performed exceptionally well at understanding and reproducing complex stories and contexts, outpacing its competitor, OpenAI's o3 model. This test goes far beyond traditional "needle in a haystack"-style tasks, focusing instead on a model's ability to handle deep semantics and background-dependent information within very long contexts.
According to the test data, when the context length reached 192,000 tokens (approximately 144,000 words), the o3 model's performance plummeted, while the June preview version of Gemini 2.5 Pro (preview-06-05) maintained an accuracy rate above 90% under the same conditions.
Notably, OpenAI's o3 model maintained perfect accuracy at 8K tokens but fluctuated as the context expanded from 16K to 60K, ultimately "collapsing" at 192K. In contrast, although Gemini 2.5 Pro showed a slight dip at 8K, its performance held steady all the way to 192K.
Although Gemini 2.5 Pro claims to support context windows of up to one million tokens, current testing still falls far short of that theoretical limit. Meanwhile, o3's maximum window is 200K, and Meta's Llama 4 Maverick claims to process up to ten million tokens, yet reviewers have noted that in real tasks it ignores much of the important information, leaving its performance below expectations.
Deep understanding cannot be achieved simply by "stacking parameters".
Nikolay Savinov of DeepMind pointed out that "more information does not necessarily mean better." He explained that the challenge with large contexts lies in how the attention mechanism allocates its focus: when the model attends to certain information, other parts are inevitably neglected, which can degrade overall performance. He advised users to delete irrelevant pages and trim redundant content before asking a model to process large documents, in order to improve the quality of its output.
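To make Savinov's advice concrete, here is a minimal sketch of pruning a long document down to a token budget before sending it to a model. The relevance scoring (plain word overlap with the query) and the four-characters-per-token estimate are illustrative assumptions of this sketch, not anything prescribed by DeepMind:

```python
# Sketch: keep only the paragraphs most relevant to a query, within a
# token budget, so irrelevant content does not dilute the model's attention.
# Scoring and token estimation below are deliberately crude placeholders.

def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (an assumption)."""
    return len(text) // 4

def prune_context(document: str, query: str, token_budget: int) -> str:
    """Return a pruned document containing the query-relevant paragraphs."""
    query_words = set(query.lower().split())
    paragraphs = [p for p in document.split("\n\n") if p.strip()]

    # Score each paragraph by word overlap with the query (cheap heuristic).
    ranked = sorted(
        paragraphs,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )

    kept, used = [], 0
    for paragraph in ranked:
        cost = estimate_tokens(paragraph)
        if used + cost > token_budget:
            continue  # skip paragraphs that would exceed the budget
        kept.append(paragraph)
        used += cost

    # Restore original order so the surviving text still reads coherently.
    kept.sort(key=paragraphs.index)
    return "\n\n".join(kept)

if __name__ == "__main__":
    doc = (
        "Chapter 1: The heist begins at midnight...\n\n"
        "Appendix: unrelated production notes...\n\n"
        "Chapter 2: The heist unravels when the alarm trips..."
    )
    print(prune_context(doc, "what happened during the heist", token_budget=40))
```

In practice, the overlap heuristic would be replaced by embedding similarity or a retrieval step, but the principle is the same: spend the context window only on material the task actually needs.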
Overall, the Fiction.Live benchmark offers a more realistic, application-oriented way to evaluate language model capabilities. Gemini 2.5 Pro demonstrated strong long-text understanding in this test, and the result signals to the industry that future large-model competition will no longer be about "whose window is larger" but about "who uses it more intelligently".