VideoChat Flash
[ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, and Limin Wang
<p align="center"> 🤗 <a href="https://huggingface.co/collections/OpenGVLab/videochat-flash-6781493748713b5ba2b705e0">Model & Data</a>    |   🖥️ <a href="">Demo</a>    |    📑 <a href="https://www.arxiv.org/abs/2501.00574">Paper</a>    |    🌐 <a href="https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash/">Blog</a> <br> </p>
:fire: Updates
- [x] 2025/06/13: 🎉🎉🎉 Our model achieves promising results on VideoEval-Pro, a benchmark focused on long video understanding!
- [x] 2025/05/10: 🔥🔥🔥 We release most videos of our training data. We hope they are helpful to you!
- [x] 2025/03/27:🔥🔥 We release our dataset and evaluation codes for single-hop and multi-hop needle-in-a-haystack!
- [x] 2025/03/09:🔥🔥 We release our weights of each training stage here, try to build your VideoChat-Flash on them!
- [x] 2025/02/25: 🔥🔥 We release our training data, training code based on LLaVA for VideoChat-Flash, and training code based on XTuner for finetuning InternVideo2.5.
- [x] 2025/02/12: 🎉🎉🎉Our VideoChat-Flash-7B@448 has achieved first place on the latest Video Detail Caption Benchmark, AuroraCap.
- [x] 2025/01/15: We provide evaluation codes for QA & Grounding Benchmark.
- [x] 2025/01/12: 🔥🔥🔥 Release VideoChat2-Flash, a powerful MLLM built on a video encoder (InternVideo) and an LLM (Qwen).
- We offer five models, VideoChat2-Flash-2B@224 (Small LLM), VideoChat2-Flash-7B@224, VideoChat2-Flash-7B@448 (Overall best), VideoChat-Flash-Qwen2_5-7B-1M (Super long video input) and VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B (Stronger short-term temporal understanding).
📑 Future Plan
- [ ] lmdeploy/vllm support for VideoChat-Flash and InternVideo2.5
- [ ] LoRA finetuning training code for VideoChat-Flash and InternVideo2.5
- [ ] Mixing image/video training code for InternVideo2.5
- [ ] Faster training code with XTuner for VideoChat-Flash
As I am currently very busy with work and cannot complete the plans above quickly, I sincerely invite friends in the community to join in and submit a PR.
:parrot: Introduction
🚀State-of-the-art performance in short and long video understanding, with temporal localization capabilities comparable to expert models.
🔭Supports ultra-long video inputs, achieving a groundbreaking needle-in-a-haystack evaluation accuracy of 99.1% on 10,000 frames, and is capable of processing videos up to three hours long.
⚡Highly efficient model architecture with exceptional inference speed, encoding each video frame into just 16 tokens and running 5–10 times faster than previous models (see the token-budget sketch below).
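To make the 16-tokens-per-frame figure concrete, here is a back-of-the-envelope token budget. The 576-token uncompressed baseline is an assumption for illustration (a typical ViT-style encoder output), not a number from this README:

```python
# Back-of-the-envelope visual-token budget for a 10,000-frame input.
TOKENS_PER_FRAME = 16      # VideoChat-Flash compression (stated above)
BASELINE_PER_FRAME = 576   # assumed uncompressed ViT baseline, illustrative only
NUM_FRAMES = 10_000        # NIAH evaluation length (stated above)

compressed = TOKENS_PER_FRAME * NUM_FRAMES      # 160,000 tokens
uncompressed = BASELINE_PER_FRAME * NUM_FRAMES  # 5,760,000 tokens

print(f"compressed:   {compressed:,} visual tokens")   # fits a 1M-context LLM with room to spare
print(f"uncompressed: {uncompressed:,} visual tokens")  # far beyond typical context windows
```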

Demo & Inference
Refer to the Hugging Face README for how to run inference with our model.
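For orientation, here is a minimal inference sketch using `transformers` with `trust_remote_code`. The checkpoint id, the `chat` method, and its keyword arguments are assumptions modeled on typical OpenGVLab model cards; treat the Hugging Face README as authoritative for the exact interface.

```python
# Minimal sketch, not the official example; see the HF model card for exact usage.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "OpenGVLab/VideoChat-Flash-Qwen2_5-7B_res448"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()

# `chat` comes from the repo's remote code; the kwargs below are assumptions.
output, history = model.chat(
    video_path="example.mp4",
    tokenizer=tokenizer,
    user_prompt="Describe the video in detail.",
    return_history=True,
    max_num_frames=512,
)
print(output)
```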
Evaluation
See the evaluation codes. lmms-eval also supports our model, so you can use it to evaluate our model on various benchmarks.
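As a sketch, lmms-eval is usually driven from its command-line entry point. The flags below are standard lmms-eval/lm-eval-harness options, but the model alias `videochat_flash` and the task name `mvbench` are assumptions; check the lmms-eval registry for the exact identifiers.

```python
# Sketch: drive lmms-eval's CLI from Python. The --model/--tasks flag names are
# standard lmms-eval options; the model alias and task name are assumptions.
import subprocess

cmd = [
    "python", "-m", "lmms_eval",
    "--model", "videochat_flash",  # assumed model alias in the lmms-eval registry
    "--model_args", "pretrained=OpenGVLab/VideoChat-Flash-Qwen2_5-7B_res448",
    "--tasks", "mvbench",          # assumed task name
    "--batch_size", "1",
    "--output_path", "./logs",
]
subprocess.run(cmd, check=True)
```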
Training
See the training code based on LLaVA for VideoChat-Flash and the training code based on XTuner for finetuning InternVideo2.5.
:bar_chart: NIAH

See xtuner-eval_niah for evaluation of Single-Hop NIAH-Video and Multi-Hop NIAH-Video.
:page_facing_up: Citation
If you find this project useful in your research, please consider citing:
@article{li2024videochat,
title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and Qiao, Yu and Wang, Yali and Wang, Limin},
journal={arXiv preprint arXiv:2501.00574},
year={2024}
}
:dizzy: Acknowledgement
Thanks to the following open-source projects: InternVideo, UMT, Qwen, LLaVA-VL, lmms-eval, Ask-Anything, ToMe, LongVLM, FastV, LLaVolta, PyramidDrop, and LongVA; their implementations provided valuable reference for our project.