Recently, the Beijing Academy of Artificial Intelligence (BAAI), in collaboration with Shanghai Jiao Tong University and other institutions, officially released Video-XL-2, a new-generation ultra-long video understanding model. The release marks a significant step forward for open-source long video understanding and gives multimodal large models a stronger foundation for comprehending long video content.
In terms of technical architecture, Video-XL-2 consists of three core components: a visual encoder, a Dynamic Token Synthesis (DTS) module, and a large language model (LLM). The model uses SigLIP-SO400M as its visual encoder to process input videos frame by frame, encoding each frame into high-dimensional visual features. The DTS module then fuses and compresses these features while modeling their temporal relationships to extract dynamic information with richer semantics. The processed visual representations are mapped into the text embedding space through average pooling and a multilayer perceptron (MLP) to achieve modality alignment. Finally, the aligned visual information is fed into Qwen2.5-Instruct, which understands and reasons over the visual content and completes the corresponding downstream tasks.
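To make the data flow concrete, here is a minimal, self-contained PyTorch sketch of that pipeline. The class names, tensor sizes, and the stand-in encoder are illustrative assumptions rather than the released implementation; only the overall flow (frame-by-frame encoding, DTS fusion and compression, pooling plus MLP projection, then LLM input) follows the description above.

```python
# Minimal sketch of the pipeline described above, using toy sizes.
# DynamicTokenSynthesis, VisualProjector, and the stand-in encoder below are
# illustrative assumptions, not the released Video-XL-2 implementation.
import torch
import torch.nn as nn


class DynamicTokenSynthesis(nn.Module):
    """Toy stand-in for the DTS module: mixes tokens across frames with
    self-attention, then compresses the token count by average pooling."""

    def __init__(self, dim: int = 1152, num_heads: int = 8, compress_ratio: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.compress_ratio = compress_ratio

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, tokens_per_frame, dim)
        f, t, d = frame_feats.shape
        x = frame_feats.reshape(1, f * t, d)          # one long token sequence
        x, _ = self.attn(x, x, x)                     # temporal fusion across frames
        x = x.reshape(1, (f * t) // self.compress_ratio, self.compress_ratio, d)
        return x.mean(dim=2)                          # compressed visual tokens


class VisualProjector(nn.Module):
    """Average-pooled features are mapped into the LLM's text-embedding space
    by an MLP (dimensions here are illustrative)."""

    def __init__(self, vis_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


# Frame-by-frame encoding -> DTS fusion/compression -> projection into LLM space.
frames = torch.randn(16, 3, 64, 64)                   # 16 tiny video frames
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1152))  # stand-in for SigLIP-SO400M
frame_feats = encoder(frames).unsqueeze(1)            # (16, 1, 1152): one token per frame for brevity
visual_tokens = DynamicTokenSynthesis()(frame_feats)  # (1, 4, 1152) after 4x compression
llm_inputs = VisualProjector()(visual_tokens)         # ready to prepend to Qwen2.5-Instruct text embeddings
print(llm_inputs.shape)                               # torch.Size([1, 4, 3584])
```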
In terms of training strategy, Video-XL-2 adopts a four-stage progressive training design to build up its long video understanding capabilities step by step. The first two stages mainly use image- and video-text pairs to initialize the DTS module and achieve cross-modal alignment; the third stage introduces larger-scale, higher-quality image and video description data to establish a preliminary ability to understand visual content; the fourth stage fine-tunes on large-scale, high-quality, and diverse image and video instruction data, further strengthening Video-XL-2's visual understanding so that it can more accurately interpret and respond to complex visual instructions. A configuration-style outline of these stages is sketched below.
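The outline below restates that recipe as a simple configuration. The data and goal fields paraphrase the text above, while the per-stage trainable components are assumptions added purely for illustration.

```python
# Configuration-style outline of the four training stages described above.
# The data descriptions paraphrase the text; the "trainable" lists are
# assumptions added purely for illustration.
TRAINING_STAGES = [
    {"stage": 1, "data": "image/video-text pairs",
     "goal": "initialize the DTS module", "trainable": ["dts"]},                 # assumed
    {"stage": 2, "data": "image/video-text pairs",
     "goal": "cross-modal alignment", "trainable": ["dts", "projector"]},        # assumed
    {"stage": 3, "data": "larger-scale, higher-quality image/video captions",
     "goal": "preliminary visual understanding",
     "trainable": ["dts", "projector", "llm"]},                                  # assumed
    {"stage": 4, "data": "large-scale, diverse image/video instruction data",
     "goal": "follow complex visual instructions",
     "trainable": ["dts", "projector", "llm"]},                                  # assumed
]

for cfg in TRAINING_STAGES:
    print(f"Stage {cfg['stage']}: {cfg['goal']} using {cfg['data']}")
```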
In addition, Video-XL-2 systematically incorporates efficiency optimizations. It introduces chunk-based prefilling, dividing an ultra-long video into several consecutive chunks: dense attention is applied within each chunk, while contextual information is passed between chunks through timestamps, significantly reducing the computational cost and memory overhead of the prefilling phase. Video-XL-2 also designs a bi-granularity KV decoding mechanism. During inference, the model selectively loads complete (dense) KVs for key chunks according to the task, while loading only downsampled (sparse) KVs for the remaining chunks, effectively shortening the inference window and greatly improving decoding efficiency. Thanks to the combination of these strategies (sketched below), Video-XL-2 achieves efficient inference over ten-thousand-frame videos on a single GPU, significantly enhancing its practicality in real-world applications.
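The sketch below illustrates only the control flow of these two mechanisms under stated assumptions: chunk_prefill and select_kvs are hypothetical helpers, the per-chunk "encoding" is a placeholder for a dense-attention transformer pass, and only the chunking plus the dense-versus-sparse KV selection mirror the description above.

```python
# Sketch of the two efficiency mechanisms described above. chunk_prefill and
# select_kvs are hypothetical helpers: the per-chunk "encoding" is a placeholder
# for a dense-attention transformer pass, and only the chunking plus the
# dense-vs-sparse KV selection mirror the description in the text.
import torch


def chunk_prefill(visual_tokens: torch.Tensor, chunk_size: int = 1024):
    """Chunk-based prefilling: dense attention is confined to each chunk, so
    peak cost scales with chunk_size^2 rather than total_length^2; chunks are
    linked through timestamps instead of full cross-chunk attention."""
    kv_cache = []                                   # one (K, V) entry per chunk
    for start in range(0, visual_tokens.size(0), chunk_size):
        chunk = visual_tokens[start:start + chunk_size]
        k, v = chunk.clone(), chunk.clone()         # placeholder for the real encoding pass
        kv_cache.append((k, v))
    return kv_cache


def select_kvs(kv_cache, important_chunks, downsample: int = 4):
    """Bi-granularity KV decoding: keep dense KVs for task-relevant chunks and
    downsampled (sparse) KVs for the rest, shortening the decode window."""
    selected = []
    for idx, (k, v) in enumerate(kv_cache):
        if idx in important_chunks:
            selected.append((k, v))                               # dense KVs for key segments
        else:
            selected.append((k[::downsample], v[::downsample]))   # sparse KVs elsewhere
    return selected


tokens = torch.randn(8192, 1152)                    # compressed tokens for a long video (toy)
cache = chunk_prefill(tokens, chunk_size=1024)      # 8 chunks
decode_cache = select_kvs(cache, important_chunks={2, 5})
kept = sum(k.size(0) for k, _ in decode_cache)
print(f"KV tokens used at decode time: {kept} of {tokens.size(0)}")   # 3584 of 8192
```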
In terms of experimental results, Video-XL-2 surpasses all lightweight open-source models on mainstream long video benchmarks such as MLVU, Video-MME, and LVBench, achieving state-of-the-art (SOTA) performance in its class. Notably, on MLVU and LVBench, Video-XL-2 approaches or even exceeds 72B-parameter models such as Qwen2.5-VL-72B and LLaVA-Video-72B. In addition, on the temporal grounding task, Video-XL-2 achieves leading results on the Charades-STA dataset, further validating its broad applicability and practical value in multimodal video understanding scenarios.
In terms of video length, Video-XL-2 also shows clear advantages. On a single 24 GB consumer-grade GPU (such as an RTX 3090 or 4090), Video-XL-2 can handle videos of up to thousands of frames; on a single 80 GB high-performance GPU (such as an A100 or H100), it supports inputs at the ten-thousand-frame level, far exceeding existing mainstream open-source models. Compared with VideoChat-Flash and the first-generation Video-XL, Video-XL-2 significantly extends the video length it can understand while lowering resource requirements, providing strong support for complex video tasks.
In terms of speed, Video-XL-2 also performs strongly. It needs only 12 seconds to prefill a 2,048-frame video, and prefilling time grows approximately linearly with the number of input frames, demonstrating excellent scalability. By contrast, Video-XL and VideoChat-Flash are noticeably less efficient when processing long videos.
Thanks to its strong video understanding capabilities and efficient handling of ultra-long videos, Video-XL-2 shows high potential across a range of real-world applications. For example, in film content analysis it can quickly and accurately understand plots and answer related questions; in surveillance footage it can detect abnormal behavior and issue safety alerts; and it can also be used to summarize film and television works or analyze game live streams, providing efficient and precise support for complex real-world video understanding needs.
Currently, the model weights of Video-XL-2 have been fully released to the community; the project homepage, model, and repository links are listed below. Going forward, the model is expected to play an important role in more real-world scenarios and to further advance long video understanding technology.
Project homepage:
https://unabletousegit.github.io/video-xl2.github.io/
Model link (Hugging Face):
https://huggingface.co/BAAI/Video-XL-2
Repository link:
https://github.com/VectorSpaceLab/Video-XL