FlashVID
[ICLR 2026 Oral] FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
Install / Use
/learn @Fanziyang-v/FlashVIDREADME
FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
<div align="center"> <a href="https://github.com/Fanziyang-v">Ziyang Fan</a><sup>1</sup>, <a href="https://github.com/Mirei124">Keyu Chen</a><sup>1</sup>, <a href="https://github.com/xrlexpert">Ruilong Xing</a><sup>1</sup>, <a href="https://github.com/yu-lin-li">Yulin Li</a><sup>1</sup>, <a href="https://github.com/llijiang">Li Jiang</a><sup>2,3</sup>, <a href="https://github.com/tianzhuotao">Zhuotao Tian</a><sup>1,3*</sup> <br> <sup>1</sup> Harbin Institute of Technology (Shenzhen) <sup>2</sup> The Chinese University of Hong Kong (Shenzhen)<br> <sup>3</sup> Shenzhen Loop Area Institute <br> <sup>*</sup>Corresponding Author <br> <a href='https://iclr.cc/'><img src='https://img.shields.io/badge/ICLR-2026-78CA2E.svg'></a> <a href='https://openreview.net/forum?id=H6rDX4w6Al'><img src='https://img.shields.io/badge/Paper-Openreview-8D1B12.svg'></a> <a href="LICENSE"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License"></a> <a href='https://fanziyang-v.github.io/FlashVID/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href="https://arxiv.org/abs/2602.08024"><img src="https://img.shields.io/badge/arXiv-2602.08024-b31b1b.svg"></a> <a href="https://huggingface.co/"><img src="https://img.shields.io/badge/transformers-4.57-FFD21E.svg" alt="transformers"></a> <!-- <a href="https://python.org/"><img src="https://img.shields.io/badge/Python-3.10%2B-3776ab.svg" alt="Python"></a> --> <!-- <a href="https://pytorch.org/"><img src="https://img.shields.io/badge/PyTorch-2.5%2B-DF3411.svg" alt="PyTorch"></a> --> <!-- <a href="#"><img src="https://img.shields.io/badge/#.svg"></a> --> </div>🔖Table of Contents
- News
- Todo List
- Highlights
- Motivation
- Method
- Installation
- Quickstart
- Evaluation
- Acknowledgement
- Citation
🔥News
- [2026.02.10] 🚀Release our paper on arXiv.
- [2026.02.06] 🍾Our paper has been selected as an Oral Presentation at ICLR 2026.
- [2026.02.01] ✨Release FlashVID code and inference demos on Qwen2.5-VL and Qwen3-VL.
- [2026.01.31] 🚀Release this repository to the public.
- [2026.01.30] ✨Release FlashVID code and inference demos on LLaVA-OneVision and LLaVA-Video.
- [2026.01.30] 👏Initialize this GitHub repository.
- [2026.01.26] 🎉Our training-free inference acceleration method FlashVID has been accepted at ICLR 2026.
- [2025.12.06] 🌟Release the GitHub repository of DyTok.
- [2025.09.18] 🎉 Our training-free inference acceleration framework DyTok has been accepted at NeurIPS 2025.
📋Todo List
- [ ] Optimize code efficiency
- [x] Release FlashVID code on LLaVA-OneVision and LLaVA-Video.
- [x] Release inference demos on different Video LLMs with FlashVID.
- [x] Support evaluation using LMMs-Eval.
- [x] Release FlashVID code on Qwen2.5-VL and Qwen3-VL.
- [x] Release our paper on arXiv.
✨Highlights

- Our FlashVID significantly outperforms previous state-of-the-art acceleration frameworks (e.g., VisionZip, FastVID) across three representative VLLMs (i.e., LLaVA-OneVision, LLaVA-Video, Qwen2.5-VL) on five widely used video understanding benchmarks (i.e., VideoMME, EgoSchema, LongVideoBench, MVBench, MLVU).
- FlashVID can serve as a training-free and plug-and-play module for extending long video frames, enabling a 10x increase in video frame input to Qwen2.5-VL, resulting in 8.6% within the same computational budget.
- Existing efficient Video LLM methods often independently compress spatial and temporal redundancy, overlooking the intrinsic spatiotemporal relationships in videos. To address this, we present a simple yet effective solution: Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy compression.
💡Motivation

In this work, we identify two key observations about spatiotemporal redundancy in videos:
- Temporal redundancy is not bound to fixed spatial locations. Semantically consistent elements in videos often shift in spatial position, scale, or appearance due to motion and scene dynamics, making rigid spatial correspondence across frames unreliable
- Spatial and temporal redundancy are inherently coupled. Redundant regions within a single frame frequently persist across multiple frames. Decoupled spatiotemporal redundancy compression overlooks the intrinsic spatiotemporal relationships, leading to suboptimal compression.
To achieve better spatiotemporal redundancy compression, we present a simple yet effective solution: Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy compression, complemented by the Attention and Diversity-based Token Selection (ADTS) module for informative token selection.
🌈Method

Illustration of FlashVID. We compress visual tokens by two synergistic modules.
- Attention and Diversity-based Token Selection (ADTS) prioritizes spatiotemporally informative tokens while ensuring feature diversity by solving a calibrated Max-Min Diversity Problem (MMDP);
- Tree-based Saptiotemporal Token Merging (TSTM) models redundancy by spatiotemporal redundancy trees, which effectively capture fine-grained video dynamics. Each redundancy tree will be aggregated into a single token representation.
📦Installation
In this project, we use uv for package management.
- Clone this repository and navigate to the FlashVID folder:
git clone https://github.com/Fanziyang-v/FlashVID.git
cd FlashVID
- Install the inference package:
uv sync
🚀Quickstart
FlashVID's code is easy to use and works out of the box. Just wrap the model with the flashvid() function. Currently, FlashVID supports LLaVA-OneVision, LLaVA-Video, Qwen2.5-VL, and Qwen3-VL.
from flashvid import flashvid
model = flashvid(
model,
retention_ratio=0.1,
alpha=0.7,
temporal_threshold=0.8,
)
📝Note: You can override the default parameters (e.g., retention ratio) in the flashvid() wrapper function.
Inference demos are provided in playground/. Here is an running example:
python playground/llava_ov_infer.py \
--video-path assets/Qgr4dcsY-60.mp4 \
--question "Describe the video in detail." \
--num-frames 32 \
--enable-flashvid
📊Evaluation
In this project, all the experiments are conducted using LMMs-Eval. We provide FlashVID evaluation scripts in scripts/, including LLaVA-OneVision, LLaVA-Video, Qwen2.5-VL, and Qwen3-VL. You can run the scripts to reproduce our experimental results:
bash scripts/llava_ov.sh
📝Note: It is extremely easy to integrate FlashVID into LMMs-Eval by adding specific parameters in __init__() and wrapping the loaded model with the flashvid() function. (See lmms_eval/models/simple/llava_onevision.py)
👏Acknowledgement
This project is built upon recent open-source works: FastV, VisionZip, PruneVID, FastVID, LLaVA-NeXT, Qwen2.5-VL/Qwen3-VL, LMMs-Eval. Thanks for their excellent work!
📜Citation
If you find this project useful in your research, please consider citing:
@inproceedings{fan2026flashvid,
title={FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging},
author={Fan, Ziyang and Chen, Keyu and Xing, Ruilong and Li, Yulin and Jiang, Li and Tian, Zhuotao},
booktitle={Proceedings of the 14th International Conference on Learning Representations},
year={2026}
}
⭐️Star History
Related Skills
qqbot-channel
352.9kQQ 频道管理技能。查询频道列表、子频道、成员、发帖、公告、日程等操作。使用 qqbot_channel_api 工具代理 QQ 开放平台 HTTP 接口,自动处理 Token 鉴权。当用户需要查看频道、管理子频道、查询成员、发布帖子/公告/日程时使用。
docs-writer
100.7k`docs-writer` skill instructions As an expert technical writer and editor for the Gemini CLI project, you produce accurate, clear, and consistent documentation. When asked to write, edit, or revie
model-usage
352.9kUse CodexBar CLI local cost usage to summarize per-model usage for Codex or Claude, including the current (most recent) model or a full model breakdown. Trigger when asked for model-level usage/cost data from codexbar, or when you need a scriptable per-model summary from codexbar cost JSON.
arscontexta
3.1kClaude Code plugin that generates individualized knowledge systems from conversation. You describe how you think and work, have a conversation and get a complete second brain as markdown files you own.
