Video-XL-2

A Better, Faster, and High-Frame-Count Model for Long Video Understanding.

1Beijing Academy of Artificial Intelligence 2University of Trento 3Beijing University of Posts and Telecommunications 4Renmin University of China 5Shanghai Jiao Tong University
*Equal Contribution

Corresponding Authors

Abstract

The field of long video understanding is rapidly evolving. While numerous existing models achieve strong performance on benchmarks, their substantial memory overhead and high response latency become critical bottlenecks, especially as video input lengths grow. To overcome these limitations while maintaining superior performance, we are releasing Video-XL-2. It makes better and faster long video understanding possible, with key features including:

  • State-of-the-art performance among existing open-source models with comparable parameters (as of 2025.6.1).
  • High speed and ultra-low memory usage for processing videos of any length.

Performance and Efficiency

Video-XL-2 consistently achieves state-of-the-art performance across mainstream Long Video Understanding and Temporal Grounding benchmarks when compared against open-source models with similar parameter counts.

SOTA Performance

Video-XL-2 demonstrates remarkable capabilities across multiple long video benchmarks, scoring 74.9 on MLVU, 66.4 on VideoMME, and 48.6 on LVBench. Notably, Video-XL-2 achieves the lowest average FLOPs despite not having the shortest input length during testing, demonstrating its remarkable efficiency. Video-XL-2 also delivers impressive performance on temporal grounding tasks.

Performance comparison of Video-XL-2-8B on long video understanding and temporal grounding benchmarks, showing it surpassing other models.
Table 1: The performance of Video-XL-2 on long video understanding benchmarks and temporal grounding benchmarks.

Leading Efficiency

In addition to strong performance, Video-XL-2 offers high efficiency when processing long videos. We propose two key innovations, chunk-based pre-filling and bi-level KVs decoding, which substantially accelerate inference while maintaining performance. As the following figures show, Video-XL-2 delivers exceptional pre-filling speed and memory efficiency. For a fair comparison, both presented results are obtained under native (eager) attention.

pre-filling speed
Figure 1: Prefilling Speed Comparison.
memory efficiency
Figure 2: Video Frame Processing Comparison Across Models and GPUs.

Model Architecture and Training Strategy

Designed for long video understanding, Video-XL-2 leverages a powerful model architecture and a multi-stage training approach.

Architecture Overview

Video-XL-2 builds upon the robust model architecture of Video-XL-Pro, which has demonstrated impressive performance across numerous experiments. Specifically, Video-XL-2 uses SigLIP-SO400M as its foundational visual encoder. Following this, a Dynamic Token Synthesis (DTS) module losslessly compresses the visual token sequence by 4x. For the large language model backbone, we adopt Qwen2.5-Instruct. Further architectural details can be found in our paper: https://arxiv.org/pdf/2503.18478.
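
To make the data flow concrete, here is a minimal PyTorch-style sketch of the forward path described above. The class and module names (VideoXL2Sketch, vision_encoder, dts, projector) are illustrative placeholders rather than identifiers from the released code, and the tensor shapes are simplified.

import torch
import torch.nn as nn

class VideoXL2Sketch(nn.Module):
    # Illustrative composition only; names and shapes do not match the released code.
    def __init__(self, vision_encoder, dts, projector, llm):
        super().__init__()
        self.vision_encoder = vision_encoder  # SigLIP-SO400M visual backbone
        self.dts = dts                        # Dynamic Token Synthesis: 4x token compression
        self.projector = projector            # maps visual tokens into the LLM embedding space
        self.llm = llm                        # Qwen2.5-Instruct backbone

    def forward(self, video_frames, text_input_ids):
        # video_frames: (num_frames, 3, H, W); text_input_ids: (1, L)
        patch_tokens = self.vision_encoder(video_frames)   # (num_frames, P, D_v)
        compressed = self.dts(patch_tokens)                 # (num_frames, P // 4, D_v)
        visual_embeds = self.projector(compressed)          # (num_frames, P // 4, D_llm)
        visual_embeds = visual_embeds.flatten(0, 1).unsqueeze(0)
        text_embeds = self.llm.get_input_embeddings()(text_input_ids)
        inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)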

model architecture
Figure 3: Model Architecture.

Training Strategy

Our training process is divided into four distinct stages, each designed to incrementally build and refine the model's capabilities:

training
Figure 4: The training strategy of Video-XL-2.

Acceleration Approach

Video-XL-2 adopts a comprehensive acceleration strategy covering both pre-filling and decoding for long video understanding. The strategy consists of two key techniques: chunk-based pre-filling and bi-level KVs decoding. Together, they allow our models to achieve low memory demands and high speed in both pre-filling and decoding.

Prefilling: Chunk-based pre-filling

For pre-filling acceleration, previous works have observed that the attention pattern in video LLMs is highly sparse. Based on this observation, we encode video chunks one by one and use timestamps as information carriers: the timestamp tokens between chunks summarize the history and pass it on to later chunks. This design lets Video-XL-2 greatly reduce attention FLOPs, and is summarized in Figure 5. A rough sketch of the per-chunk loop follows below.
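
The sketch below assumes a hypothetical model.prefill(tokens, past_kv) interface and treats each chunk's KV cache as a single tensor; the released code exposes its own API and cache layout.

import torch

def chunked_prefill(model, visual_embeds, timestamp_embeds, chunk_size):
    # visual_embeds:    (T, D) visual tokens for the whole video
    # timestamp_embeds: (num_chunks, D) one timestamp token per chunk; its KV state
    #                   carries the summarized history forward to later chunks
    num_chunks = (visual_embeds.shape[0] + chunk_size - 1) // chunk_size
    chunk_caches = []   # per-chunk KV caches, reused later at decoding time
    history_kv = []     # KVs of timestamp tokens from earlier chunks only

    for i in range(num_chunks):
        chunk = visual_embeds[i * chunk_size:(i + 1) * chunk_size]
        # Each chunk attends to itself plus the carried timestamp summaries,
        # so attention cost scales with the chunk length rather than the full video.
        tokens = torch.cat([chunk, timestamp_embeds[i:i + 1]], dim=0)
        chunk_kv = model.prefill(tokens, past_kv=history_kv)  # hypothetical interface
        chunk_caches.append(chunk_kv)
        history_kv.append(chunk_kv[-1:])  # keep only the timestamp token's KV as history
    return chunk_caches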

chunk-based pre-filling
Figure 5: Chunk-based pre-filling.

Decoding: Bi-level KVs Decoding

After pre-filling, all KVs are kept in the cache for subsequent decoding. In most long video understanding cases, a query focuses on only one or a few time spans within the long video. Based on this, we select the dense KVs of important video chunks to provide detailed local information, and downsample the remaining KVs into sparse KVs, which reduces the KV cache footprint while still providing coarse-grained global information. Through this approach, Video-XL-2 achieves better performance in detail understanding and high efficiency in decoding.
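
A simplified sketch of how the decoding-time cache could be assembled from the per-chunk KVs is shown below. It treats each chunk's cache as a single (length, dim) tensor for illustration; a real cache is per-layer and per-head, and chunk importance would be scored against the query.

import torch

def build_bilevel_kv(chunk_caches, important_chunks, stride=4):
    # chunk_caches:     list of (L_i, D) KV tensors, one per video chunk
    # important_chunks: indices of chunks relevant to the query; these keep dense KVs
    # stride:           downsampling factor for the remaining chunks' sparse KVs
    selected = []
    for i, kv in enumerate(chunk_caches):
        if i in important_chunks:
            selected.append(kv)            # dense KVs: detailed local information
        else:
            selected.append(kv[::stride])  # sparse KVs: coarse-grained global context
    return torch.cat(selected, dim=0)      # compact cache read at every decoding step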

bi-level KVs decoding
Figure 6: Bi-level KVs Decoding.

BibTeX

@article{shu2024video,
  title={Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding},
  author={Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Huang, Tiejun and Zhao, Bo},
  journal={arXiv preprint arXiv:2409.14485},
  year={2024}
}

@article{liu2025video,
  title={Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding},
  author={Liu, Xiangrui and Shu, Yan and Liu, Zheng and Li, Ao and Tian, Yang and Zhao, Bo},
  journal={arXiv preprint arXiv:2503.18478},
  year={2025}
}