Google's Gemini 2.5 Pro has further enhanced its video understanding capabilities. The flagship model supports analysis of videos up to six hours long, offers an ultra-large context window of up to two million tokens, and is the first to parse YouTube links directly via the API. According to official figures, the model scored 84.7% on the VideoMME benchmark, just shy of the industry-leading 85.2%. Developers can try the new capabilities today through Google AI Studio.

Gemini 2.5 Pro can handle approximately six hours of video content at once thanks to its large context window, sampling one frame per second and, in low-resolution mode, spending 66 tokens per frame: at that rate, six hours of footage works out to roughly 1.4 million visual tokens, fitting inside the two-million-token window. Developers can now pass a YouTube link directly in an API call, and the model will automatically understand, analyze, and transform the video's content. In a demonstration on the Google Cloud Next '25 opening video, the model identified 16 distinct product showcase segments, accurately combining audio and visual cues to locate each one, demonstrating its deep comprehension ability.
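As a rough sketch of what the YouTube-link workflow looks like, the snippet below uses the google-genai Python SDK, which accepts a YouTube URL as a file_data part in a generate_content call. The API key, video URL, and prompt here are placeholders, not values from the demonstration above.

```python
from google import genai
from google.genai import types

# Placeholder key: obtain a real one from Google AI Studio.
client = genai.Client(api_key="YOUR_API_KEY")

# The YouTube link is passed directly as file_data; no download or upload step.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=types.Content(
        parts=[
            types.Part(
                file_data=types.FileData(
                    file_uri="https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL
                )
            ),
            types.Part(
                text="List each product showcase segment with its start timestamp."
            ),
        ]
    ),
)
print(response.text)
```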


Even more impressive are its temporal localization and cross-time analysis capabilities. Gemini 2.5 Pro can quickly locate key moments in a video from a user prompt; in one example, it precisely counted 17 separate occasions on which the protagonist used a mobile phone across a continuous video. Its reasoning also extends to complex temporal tasks, such as analyzing the order or frequency of events. Behind this, Google credits its adoption of 3D-JEPA and multimodal fusion techniques, which combine audiovisual information with code data to significantly improve the depth and accuracy of video understanding.
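Temporal localization of this kind is driven entirely by the prompt. A minimal sketch, reusing the client from the previous example with an illustrative prompt and placeholder URL:

```python
# Reuses `client` and `types` from the previous sketch.
prompt = (
    "Count every separate occasion on which the protagonist uses a mobile phone. "
    "For each one, give the start timestamp as MM:SS and a one-line description."
)
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=types.Content(
        parts=[
            types.Part(
                file_data=types.FileData(
                    file_uri="https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL
                )
            ),
            types.Part(text=prompt),
        ]
    ),
)
print(response.text)
```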

In terms of application scenarios, Gemini 2.5 Pro opens up new possibilities across multiple fields. In education, the model can automatically generate interactive learning applications from teaching videos, significantly increasing student engagement; in creative industries, it can convert video content into p5.js animations or interactive visualizations, giving creators an efficient tool; and in business analysis, it can intelligently parse meeting or product demonstration videos, automatically extract key information, and generate professional reports.

It is worth noting that Google also offers a low-resolution processing mode, which occupies only 66 tokens per frame, to further reduce the cost of long-video processing. Official tests show this economy mode loses just 0.5% on the VideoMME benchmark, striking an excellent balance between cost and performance and giving developers more options in real-world applications.
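In the google-genai SDK, this economy mode corresponds to the media_resolution setting on the request config; a minimal sketch, again reusing the client and placeholder URL from the earlier examples:

```python
# Request low media resolution (roughly 66 tokens per frame) to cut long-video costs.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=types.Content(
        parts=[
            types.Part(
                file_data=types.FileData(
                    file_uri="https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL
                )
            ),
            types.Part(text="Summarize this video."),
        ]
    ),
    config=types.GenerateContentConfig(
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_LOW,
    ),
)
print(response.text)
```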

Gemini 2.5 Pro's video understanding breakthrough marks AI's shift from language-centric systems toward video-driven multimodal products. Its two-million-token context window and YouTube link parsing give developers unprecedented creative room, especially in high-value areas such as education, entertainment, and corporate analysis. Nevertheless, industry experts note that latency when handling ultra-long videos still has room for improvement. Google plans to further expand the context window and integrate more multimodal functions, such as real-time streaming processing, to meet growing market demand and continue setting the direction for AI visual capabilities.