The field of long video understanding is evolving rapidly. While many existing models achieve strong benchmark performance, their substantial memory overhead and high response latency become critical bottlenecks as video inputs grow longer. To overcome these limitations while maintaining superior performance, we are releasing Video-XL-2, which makes long video understanding both better and faster. Its key features include:
Video-XL-2 consistently achieves state-of-the-art performance across mainstream Long Video Understanding and Temporal Grounding benchmarks when compared against open-source models with similar parameter counts.
Video-XL-2 demonstrates remarkable capabilities across multiple long video benchmarks, scoring 74.9 on MLVU, 66.4 on VideoMME, and 48.6 on LVBench. Notably, Video-XL-2 achieves the lowest average FLOPs despite not having the shortest input length during testing, demonstrating its efficiency. Video-XL-2 also delivers impressive performance on the Temporal Grounding task.
Beyond strong performance, high efficiency in processing long videos is another clear advantage. We propose two key innovations, chunk-based pre-filling and bi-level KVs decoding, which substantially accelerate long video processing while maintaining performance. As the following figures show, Video-XL-2 delivers exceptional pre-filling speed and memory efficiency. For a fair comparison, both presented results are obtained under native (eager) attention.
Designed for long video understanding, Video-XL-2 leverages a powerful model architecture and a multi-stage training approach.
Video-XL-2 builds upon the robust model architecture of Video-XL-Pro, which has demonstrated impressive performance across numerous experiments. Specifically, Video-XL-2 uses SigLIP-SO400M as its visual encoder, followed by a Dynamic Token Synthesis (DTS) module that losslessly compresses the visual token sequence by 4x. For the large language model backbone, we adopt Qwen2.5-Instruct. Further architectural details can be found in our paper: https://arxiv.org/pdf/2503.18478.
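To make the data flow concrete, below is a minimal sketch of the pipeline described above. The module names, dimensions, and the exact form of the DTS compression are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of the Video-XL-2 pipeline: visual encoder -> DTS (4x) -> projector -> LLM.
# All class names, shapes, and the DTS internals are assumptions for clarity.
import torch
import torch.nn as nn

class DTSModule(nn.Module):
    """Hypothetical stand-in for the Dynamic Token Synthesis module (4x token compression)."""
    def __init__(self, dim: int, ratio: int = 4):
        super().__init__()
        self.ratio = ratio
        self.merge = nn.Linear(dim * ratio, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim); fuse every `ratio` consecutive tokens into one
        b, n, d = tokens.shape
        n = (n // self.ratio) * self.ratio
        grouped = tokens[:, :n].reshape(b, n // self.ratio, d * self.ratio)
        return self.merge(grouped)

class VideoXL2Sketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module, vis_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. SigLIP-SO400M
        self.dts = DTSModule(vis_dim)             # 4x visual token compression
        self.projector = nn.Linear(vis_dim, llm_dim)
        self.llm = llm                            # e.g. Qwen2.5-Instruct

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor):
        vis_tokens = self.vision_encoder(frames)             # (batch, num_tokens, vis_dim)
        vis_tokens = self.projector(self.dts(vis_tokens))    # compress, then map into LLM space
        return self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
```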
Our training process is divided into four distinct stages, each designed to incrementally build and refine the model's capabilities:
Video-XL-2 adopts a comprehensive acceleration strategy that covers both pre-filling and decoding for long video understanding. The strategy consists of two key techniques: chunk-based pre-filling and bi-level KVs decoding. Together, they give our model low memory demands and high speed in both pre-filling and decoding.
For pre-filling acceleration, previous works have observed that attention patterns in video LLMs are highly sparse. Based on this observation, we encode video chunks one by one and use timestamp tokens as the information carrier: the timestamp tokens between chunks summarize the history and pass it on to later chunks. This design allows Video-XL-2 to greatly reduce attention FLOPs. The design is summarized in Figure 5, and a minimal sketch is given below.
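The snippet below is a minimal sketch of this chunk-based pre-filling, assuming a HuggingFace-style decoder that accepts `inputs_embeds` and `past_key_values` in the legacy per-layer tuple format; `slice_kv` is a hypothetical helper that keeps only the timestamp positions to carry forward.

```python
# Sketch of chunk-based pre-filling: each chunk attends only to its own tokens plus the
# compact timestamp KVs carried over from earlier chunks, rather than the full history.
import torch

def slice_kv(past_key_values, keep_prefix, keep_suffix):
    # Keep the first `keep_prefix` and last `keep_suffix` positions of every layer's KVs
    # (legacy format: one (key, value) pair per layer, shaped (batch, heads, seq, head_dim)).
    sliced = []
    for k, v in past_key_values:
        k = torch.cat([k[:, :, :keep_prefix], k[:, :, k.size(2) - keep_suffix:]], dim=2)
        v = torch.cat([v[:, :, :keep_prefix], v[:, :, v.size(2) - keep_suffix:]], dim=2)
        sliced.append((k, v))
    return tuple(sliced)

def chunked_prefill(llm, video_chunks, timestamp_tokens):
    """Encode chunks sequentially; only timestamp KVs are passed on as the history summary."""
    carried_kv, carried_len = None, 0
    all_chunk_kvs = []
    for chunk_embeds, ts_embed in zip(video_chunks, timestamp_tokens):
        inputs = torch.cat([chunk_embeds, ts_embed], dim=1)   # chunk tokens + its timestamp tokens
        out = llm(inputs_embeds=inputs, past_key_values=carried_kv, use_cache=True)
        all_chunk_kvs.append(out.past_key_values)             # full KVs kept for later decoding
        # Carry forward only the accumulated timestamp positions (prefix) plus the new ones (suffix)
        carried_kv = slice_kv(out.past_key_values, carried_len, ts_embed.size(1))
        carried_len += ts_embed.size(1)
    return all_chunk_kvs
```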
After pre-filling, all KVs are held in the cache for subsequent decoding. In most long video understanding cases, the question focuses on only one or a few time spans within the long video. Based on this, we keep the dense KVs of important video chunks to provide detailed local information, and downsample the remaining KVs into sparse KVs to reduce KV cache demands while still providing coarse-grained global information. Through this approach, Video-XL-2 achieves both strong detail understanding and high decoding efficiency. A sketch of this selection is shown below.
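Below is a minimal sketch of this bi-level KV selection, assuming per-chunk KVs in the legacy tuple format and externally provided chunk relevance scores (e.g., query-to-chunk similarity); the top-k choice and the strided downsampling are illustrative assumptions.

```python
# Sketch of bi-level KV selection: dense KVs for the most relevant chunks,
# strided (sparse) KVs for the rest, concatenated into one cache for decoding.
import torch

def build_bilevel_kv(chunk_kvs, relevance_scores, top_k=4, sparse_stride=8):
    top = set(torch.topk(torch.tensor(relevance_scores),
                         k=min(top_k, len(relevance_scores))).indices.tolist())
    num_layers = len(chunk_kvs[0])
    merged = []
    for layer in range(num_layers):
        keys, values = [], []
        for i, kv in enumerate(chunk_kvs):
            k, v = kv[layer]
            if i not in top:  # non-important chunk: keep every `sparse_stride`-th position only
                k, v = k[:, :, ::sparse_stride], v[:, :, ::sparse_stride]
            keys.append(k)
            values.append(v)
        merged.append((torch.cat(keys, dim=2), torch.cat(values, dim=2)))
    return tuple(merged)   # pass as past_key_values when decoding the answer
```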
@article{shu2024video,
title={Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding},
author={Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Huang, Tiejun and Zhao, Bo},
journal={arXiv preprint arXiv:2409.14485},
year={2024}
}
@article{liu2025video,
title={Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding},
author={Liu, Xiangrui and Shu, Yan and Liu, Zheng and Li, Ao and Tian, Yang and Zhao, Bo},
journal={arXiv preprint arXiv:2503.18478},
year={2025}
}