Recently, the Beijing Academy of Artificial Intelligence (BAAI), in collaboration with Shanghai Jiao Tong University and other institutions, officially released Video-XL-2, a new-generation ultra-long video understanding model. The release marks a significant step forward for open-source long video understanding and gives multimodal large models a stronger foundation for comprehending long video content.
In terms of technical architecture, Video-XL-2 consists of three core components: a visual encoder, a Dynamic Token Synthesis (DTS) module, and a large language model (LLM). The model uses SigLIP-SO400M as its visual encoder to process input videos frame by frame, encoding each frame into high-dimensional visual features. The DTS module then fuses and compresses these features while modeling their temporal relationships to extract dynamic information with richer semantics. The processed visual representations are mapped into the text embedding space through average pooling and a multilayer perceptron (MLP) to achieve modality alignment. Finally, the aligned visual information is fed into Qwen2.5-Instruct, which understands and reasons over the visual content and completes the corresponding downstream tasks.
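To make the data flow concrete, here is a minimal, self-contained PyTorch sketch of that pipeline. The class names, tensor sizes, and the stand-in encoder are illustrative assumptions rather than the released implementation; only the overall flow (frame-by-frame encoding, DTS fusion and compression, pooling plus MLP projection, then LLM input) follows the description above.

```python
# Minimal sketch of the pipeline described above, using toy sizes.
# DynamicTokenSynthesis, VisualProjector, and the stand-in encoder below are
# illustrative assumptions, not the released Video-XL-2 implementation.
import torch
import torch.nn as nn


class DynamicTokenSynthesis(nn.Module):
    """Toy stand-in for the DTS module: mixes tokens across frames with
    self-attention, then compresses the token count by average pooling."""

    def __init__(self, dim: int = 1152, num_heads: int = 8, compress_ratio: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.compress_ratio = compress_ratio

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, tokens_per_frame, dim)
        f, t, d = frame_feats.shape
        x = frame_feats.reshape(1, f * t, d)          # one long token sequence
        x, _ = self.attn(x, x, x)                     # temporal fusion across frames
        x = x.reshape(1, (f * t) // self.compress_ratio, self.compress_ratio, d)
        return x.mean(dim=2)                          # compressed visual tokens


class VisualProjector(nn.Module):
    """Average-pooled features are mapped into the LLM's text-embedding space
    by an MLP (dimensions here are illustrative)."""

    def __init__(self, vis_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


# Frame-by-frame encoding -> DTS fusion/compression -> projection into LLM space.
frames = torch.randn(16, 3, 64, 64)                   # 16 tiny video frames
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1152))  # stand-in for SigLIP-SO400M
frame_feats = encoder(frames).unsqueeze(1)            # (16, 1, 1152): one token per frame for brevity
visual_tokens = DynamicTokenSynthesis()(frame_feats)  # (1, 4, 1152) after 4x compression
llm_inputs = VisualProjector()(visual_tokens)         # ready to prepend to Qwen2.5-Instruct text embeddings
print(llm_inputs.shape)                               # torch.Size([1, 4, 3584])
```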
In terms of training strategy, Video-XL-2 adopts a four-stage progressive training design to build up its long video understanding capabilities step by step. The first two stages mainly use image- and video-text pairs to initialize the DTS module and achieve cross-modal alignment; the third stage introduces larger-scale, higher-quality image and video description data to establish a preliminary ability to understand visual content; the fourth stage fine-tunes on large-scale, high-quality, and diverse image and video instruction data, further strengthening Video-XL-2's visual understanding so that it can more accurately interpret and respond to complex visual instructions. A configuration-style outline of these stages is sketched below.
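The outline below restates that recipe as a simple configuration. The data and goal fields paraphrase the text above, while the per-stage trainable components are assumptions added purely for illustration.

```python
# Configuration-style outline of the four training stages described above.
# The data descriptions paraphrase the text; the "trainable" lists are
# assumptions added purely for illustration.
TRAINING_STAGES = [
    {"stage": 1, "data": "image/video-text pairs",
     "goal": "initialize the DTS module", "trainable": ["dts"]},                 # assumed
    {"stage": 2, "data": "image/video-text pairs",
     "goal": "cross-modal alignment", "trainable": ["dts", "projector"]},        # assumed
    {"stage": 3, "data": "larger-scale, higher-quality image/video captions",
     "goal": "preliminary visual understanding",
     "trainable": ["dts", "projector", "llm"]},                                  # assumed
    {"stage": 4, "data": "large-scale, diverse image/video instruction data",
     "goal": "follow complex visual instructions",
     "trainable": ["dts", "projector", "llm"]},                                  # assumed
]

for cfg in TRAINING_STAGES:
    print(f"Stage {cfg['stage']}: {cfg['goal']} using {cfg['data']}")
```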
In addition, Video-XL-2 systematically incorporates efficiency optimizations. It introduces chunk-based prefilling, dividing an ultra-long video into several consecutive chunks: dense attention is applied within each chunk, while contextual information is passed between chunks through timestamps, significantly reducing the computational cost and memory overhead of the prefilling phase. Video-XL-2 also designs a bi-granularity KV decoding mechanism. During inference, the model selectively loads complete (dense) KVs for key chunks according to the task, while loading only downsampled (sparse) KVs for the remaining chunks, effectively shortening the inference window and greatly improving decoding efficiency. Thanks to the combination of these strategies (sketched below), Video-XL-2 achieves efficient inference over ten-thousand-frame videos on a single GPU, significantly enhancing its practicality in real-world applications.
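The sketch below illustrates only the control flow of these two mechanisms under stated assumptions: chunk_prefill and select_kvs are hypothetical helpers, the per-chunk "encoding" is a placeholder for a dense-attention transformer pass, and only the chunking plus the dense-versus-sparse KV selection mirror the description above.

```python
# Sketch of the two efficiency mechanisms described above. chunk_prefill and
# select_kvs are hypothetical helpers: the per-chunk "encoding" is a placeholder
# for a dense-attention transformer pass, and only the chunking plus the
# dense-vs-sparse KV selection mirror the description in the text.
import torch


def chunk_prefill(visual_tokens: torch.Tensor, chunk_size: int = 1024):
    """Chunk-based prefilling: dense attention is confined to each chunk, so
    peak cost scales with chunk_size^2 rather than total_length^2; chunks are
    linked through timestamps instead of full cross-chunk attention."""
    kv_cache = []                                   # one (K, V) entry per chunk
    for start in range(0, visual_tokens.size(0), chunk_size):
        chunk = visual_tokens[start:start + chunk_size]
        k, v = chunk.clone(), chunk.clone()         # placeholder for the real encoding pass
        kv_cache.append((k, v))
    return kv_cache


def select_kvs(kv_cache, important_chunks, downsample: int = 4):
    """Bi-granularity KV decoding: keep dense KVs for task-relevant chunks and
    downsampled (sparse) KVs for the rest, shortening the decode window."""
    selected = []
    for idx, (k, v) in enumerate(kv_cache):
        if idx in important_chunks:
            selected.append((k, v))                               # dense KVs for key segments
        else:
            selected.append((k[::downsample], v[::downsample]))   # sparse KVs elsewhere
    return selected


tokens = torch.randn(8192, 1152)                    # compressed tokens for a long video (toy)
cache = chunk_prefill(tokens, chunk_size=1024)      # 8 chunks
decode_cache = select_kvs(cache, important_chunks={2, 5})
kept = sum(k.size(0) for k, _ in decode_cache)
print(f"KV tokens used at decode time: {kept} of {tokens.size(0)}")   # 3584 of 8192
```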
In terms of experimental results, Video-XL-2 surpasses all lightweight open-source models on mainstream long video benchmarks such as MLVU, Video-MME, and LVBench, achieving state-of-the-art (SOTA) performance in its class. Notably, on MLVU and LVBench, Video-XL-2 approaches or even exceeds 72B-parameter models such as Qwen2.5-VL-72B and LLaVA-Video-72B. In addition, on the temporal grounding task, Video-XL-2 achieves leading results on the Charades-STA dataset, further validating its broad applicability and practical value in multimodal video understanding scenarios.
In terms of video length, Video-XL-2 also shows clear advantages. On a single 24 GB consumer-grade GPU (such as an RTX 3090 or 4090), Video-XL-2 can handle videos of up to thousands of frames; on a single 80 GB high-performance GPU (such as an A100 or H100), it supports inputs at the ten-thousand-frame level, far exceeding existing mainstream open-source models. Compared with VideoChat-Flash and the first-generation Video-XL, Video-XL-2 significantly extends the video length it can understand while lowering resource requirements, providing strong support for complex video tasks.
In terms of speed, Video-XL-2 also performs strongly. It needs only 12 seconds to prefill a 2,048-frame video, and prefilling time grows approximately linearly with the number of input frames, demonstrating excellent scalability. By contrast, Video-XL and VideoChat-Flash are noticeably less efficient when processing long videos.
Thanks to its strong video understanding capabilities and efficient handling of ultra-long videos, Video-XL-2 shows high potential across a range of real-world applications. For example, in film content analysis it can quickly and accurately understand plots and answer related questions; in surveillance footage it can detect abnormal behavior and issue safety alerts; and it can also be used to summarize film and television works or analyze game live streams, providing efficient and precise support for complex real-world video understanding needs.
Currently, the model weights of Video-XL-2 have been fully released to the community; the project homepage, model, and repository links are listed below. Going forward, the model is expected to play an important role in more real-world scenarios and to further advance long video understanding technology.
Project homepage:
https://unabletousegit.github.io/video-xl2.github.io/
Model link (Hugging Face):
https://huggingface.co/BAAI/Video-XL-2
Repository link:
https://github.com/VectorSpaceLab/Video-XL