Flexselect
The official repository for paper "FlexSelect: Flexible Token Selection for Efficient Long Video Understanding".
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding (NeurIPS 2025)
Created by Yunzhu Zhang*, Yu Lu*, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu
Webpage | Paper | Huggingface
News
[2025-09-18]: 🎉 Our paper FlexSelect has been accepted by NeurIPS 2025!
[2025-06-01]: Source code uploaded.
[2025-05-20]: Code repository created.
Introduction

We present FlexSelect, a flexible and efficient token selection method that leverages cross-modal attention scores in VideoLLMs to identify query-relevant visual tokens. Our approach combines: (1) training-free attention-based token ranking, and (2) a lightweight selector for fast filtering.
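The selection step described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the function name `select_top_tokens` is hypothetical, the scores are toy values, and restoring the original token order after ranking is an assumption made here so that the kept tokens stay in temporal order.

```python
from typing import List, Sequence


def select_top_tokens(scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring tokens, in original order."""
    if budget <= 0:
        return []
    # Rank token indices by descending relevance score
    # (the sort is stable, so ties keep earlier tokens first).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the best `budget` indices, then restore the original order
    # (assumption: selected tokens should remain temporally ordered).
    return sorted(ranked[:budget])


# Toy relevance scores for six visual tokens.
scores = [0.02, 0.40, 0.05, 0.31, 0.01, 0.21]
kept = select_top_tokens(scores, budget=3)
print(kept)  # [1, 3, 5]
```

In practice the scores would come from the cross-modal attention of a reference layer (training-free mode) or from the lightweight selector.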
Performance
We conduct experiments on three video LLMs (LLaVA-Video, Qwen2.5-VL, InternVL2.5) on four benchmarks: LongVideoBench, VideoMME, LVBench, and MLVU.
| Model | Size | VideoMME (Long) | VideoMME (Overall) | MLVU (M-Avg) | LongVB (Val) | LVBench (Test) |
|-------|------|-----------------|--------------------|--------------|--------------|----------------|
| Proprietary Models | | | | | | |
| GPT-4o | - | 65.3 | 71.9 | 64.6 | 66.7 | 34.7 |
| Gemini-1.5-Pro | - | 67.4 | 75.0 | - | 64.0 | 33.1 |
| Open-Source VideoLLMs | | | | | | |
| mPLUG-Owl3 | 7B | 50.1 | 59.3 | 63.7 | 52.1 | 43.5 |
| Qwen2-VL | 7B | 53.8 | 63.3 | 66.9 | 55.6 | 42.4 |
| NVILA | 8B | 54.8 | 64.2 | 70.1 | 57.7 | - |
| VideoLLaMA3 | 7B | - | 66.2 | 73.0 | 59.8 | 45.3 |
| Aria | 8×3.5B | 58.8 | 67.6 | 70.6 | 65.3 | - |
| Oryx-1.5 | 34B | 59.3 | 67.3 | 72.3 | 62.0 | 30.8 |
| Video-XL-Pro | 3B | - | 60.0 | 70.6 | 56.7 | - |
| SF-LLaVA-1.5 | 7B | - | 63.9 | 71.5 | 62.5 | 45.3 |
| TPO | 7B | 55.4 | 65.6 | 71.1 | 60.1 | - |
| Quato | 7B | 55.7 | 65.9 | 71.9 | 59.0 | - |
| ViLAMP | 7B | 57.8 | 67.5 | 72.6 | 61.2 | 45.2 |
| LLaVA-Video | 7B | 52.9 | 64.4 | 68.6 | 58.2 | 43.1 |
| + FlexSelect | 7B | 59.8 (↑6.9) | 68.9 (↑4.5) | 73.2 (↑4.6) | 61.9 (↑3.7) | 52.9 (↑9.8) |
| + FlexSelect-Lite | 7B | 58.3 (↑5.4) | 68.3 (↑3.9) | 71.8 (↑3.2) | 60.7 (↑2.5) | 52.2 (↑9.1) |
| InternVL2.5 | 8B | 52.8 | 64.2 | 68.9 | 59.5 | 43.4 |
| + FlexSelect | 8B | 58.1 (↑5.3) | 67.0 (↑2.8) | 71.9 (↑3.0) | 60.1 (↑0.6) | 49.7 (↑6.3) |
| + FlexSelect-Lite | 8B | 57.9 (↑5.1) | 67.2 (↑3.0) | 71.9 (↑3.0) | 61.2 (↑1.7) | 49.9 (↑6.5) |
| Qwen2.5-VL | 7B | 55.6 | 65.4 | 70.2 | 59.5 | 45.3 |
| + FlexSelect | 7B | 59.3 (↑3.7) | 68.2 (↑2.8) | 72.5 (↑2.3) | 62.4 (↑2.9) | 51.2 (↑5.9) |
| + FlexSelect-Lite | 7B | 58.6 (↑3.0) | 67.4 (↑2.0) | 70.3 (↑0.1) | 61.9 (↑2.4) | 50.0 (↑4.7) |
| LLaVA-Video | 72B | 61.9 | 70.0 | 71.2 | 62.4 | 45.5 |
| + FlexSelect | 72B | 66.1 (↑4.2) | 73.1 (↑3.1) | 76.0 (↑4.8) | 66.9 (↑4.5) | 55.5 (↑10.0) |
| Qwen2.5-VL | 72B | 63.9 | 73.4 | 76.3 | 66.2 | 47.3 |
| + FlexSelect | 72B | 66.9 (↑3.0) | 74.4 (↑1.0) | 76.6 (↑0.3) | 66.4 (↑0.2) | 56.6 (↑9.3) |
Benchmark Data Preparation
All four benchmarks can be downloaded from Hugging Face: LongVideoBench, VideoMME, MLVU, and LVBench.
Prepare Data For VideoMME
- Download the videos.
huggingface-cli download --repo-type dataset --resume-download lmms-lab/Video-MME --local-dir lmms-lab/Video-MME --local-dir-use-symlinks False
- Unzip the videos
cd lmms-lab/Video-MME
unzip 'videos_chunked_*.zip' -d videos/
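- Link the data into the eval directory (run from the directory where the download started; absolute targets keep the symlinks from dangling)
cd ../..
ln -s "$(pwd)/lmms-lab/Video-MME/videos" flexselect/eval/data/videomme/data
ln -s "$(pwd)/lmms-lab/Video-MME/videomme/test-00000-of-00001.parquet" flexselect/eval/data/videomme/test-00000-of-00001.parquet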
Prepare Data For MLVU
- Download the videos.
huggingface-cli download --repo-type dataset --resume-download sy1998/MLVU_dev --local-dir sy1998/MLVU_dev --local-dir-use-symlinks False
- Unzip the videos
cd sy1998/MLVU_dev
unzip 'video_part_*.zip' -d videos/
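- Link the data into the eval directory (run from the directory where the download started; absolute targets keep the symlinks from dangling)
cd ../..
ln -s "$(pwd)/sy1998/MLVU_dev/videos" flexselect/eval/data/mlvu_test/data
ln -s "$(pwd)/sy1998/MLVU_dev/mlvu/test-00000-of-00001.parquet" flexselect/eval/data/mlvu_test/test-00000-of-00001.parquet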
Prepare Data For LVBench
- Download the videos and files. Follow the instructions here for downloading videos: LVBench. The file flexselect/eval/data/lvbench/test.jsonl is the test file that we have compiled; it conforms to the format supported by lmms-eval.
- Move or link the videos directory under flexselect/eval/data/lvbench.
- We reorganized the test files to support lmms-eval evaluation. You can download them from here and move or link them under the data/lvbench/ directory.
Prepare Data For LongVideoBench
- Download the videos.
huggingface-cli download --repo-type dataset --resume-download longvideobench/LongVideoBench --local-dir longvideobench/LongVideoBench --local-dir-use-symlinks False
- Untar the videos
cd longvideobench/LongVideoBench
cat videos.tar.part.* > videos_merged.tar
mkdir -p videos
tar -xvf videos_merged.tar -C videos
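- Link the data into the eval directory (run from the directory where the download started; absolute targets keep the symlinks from dangling)
cd ../..
ln -s "$(pwd)/longvideobench/LongVideoBench/videos" flexselect/eval/data/longvideobench/data
ln -s "$(pwd)/longvideobench/LongVideoBench/test-00000-of-00001.parquet" flexselect/eval/data/longvideobench/test-00000-of-00001.parquet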
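Before launching an evaluation, it can help to confirm that the linked data actually resolves. The following is a small sanity-check sketch, assuming the directory layout implied by the commands above; the helper `missing_paths` and the `EXPECTED` mapping are illustrative names, not part of the repository. For LVBench only `test.jsonl` is checked, because the name of its videos directory is not specified here.

```python
from pathlib import Path
from typing import Dict, List

# Paths taken from the linking commands above, relative to flexselect/eval/data.
EXPECTED: Dict[str, List[str]] = {
    "videomme": ["data", "test-00000-of-00001.parquet"],
    "mlvu_test": ["data", "test-00000-of-00001.parquet"],
    "lvbench": ["test.jsonl"],
    "longvideobench": ["data", "test-00000-of-00001.parquet"],
}


def missing_paths(repo_root: str) -> List[str]:
    """List expected benchmark paths that are absent or are dangling symlinks."""
    base = Path(repo_root) / "flexselect" / "eval" / "data"
    missing = []
    for bench, names in EXPECTED.items():
        for name in names:
            path = base / bench / name
            # Path.exists() follows symlinks, so a broken link counts as missing.
            if not path.exists():
                missing.append(str(path))
    return missing


if __name__ == "__main__":
    # Run from the directory that contains the flexselect/ checkout.
    for path in missing_paths("."):
        print("missing:", path)
```

An empty output means every expected file or directory is reachable; a listed path usually points to a symlink created with a relative target or to a skipped download step.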
Pretrained Model
The pretrained models can be found in their respective repositories: LLaVA-Video-7B, LLaVA-Video-72B, InternVL2.5-8B, Qwen2.5VL-7B, and Qwen2.5VL-72B.
Evaluation
FlexSelect works in two modes: training-free mode and lightweight mode. We evaluate both using lmms-eval and follow its environment installation guidelines. You can set up an environment by running:
sh setup.sh
Download the token selector weights from Hugging Face into flexselect/eval/models:
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_llava_video --local-dir flexselect/eval/models/flexselect_llava_video
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_qwen2.5vl --local-dir flexselect/eval/models/flexselect_qwen2.5vl
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_internvl2.5 --local-dir flexselect/eval/models/flexselect_internvl2.5
Then you can reproduce our results:
cd flexselect/eval
sh scripts/eval_llavavideo.sh
sh scripts/eval_internvl2_5.sh
sh scripts/eval_qwenvl2_5.sh
The variables used in our eval scripts are explained below:
| Parameter | Type | Options / Notes | Default |
|---------------------------|------------|---------------------------------------------------------------------------------|----------|
| use_token_selector | boolean | - true: Enable FlexSelect token selection<br>- false: Disable (standard eval) | false |
| token_selector_path | string | - "self": Training-free mode<br>- "path/to/token selector model": Lightweight mode | "self" |
| token_selector_layer| integer | Reference layer number (only effective in training-free mode) | -1 |
| drop_func_name | string | How the semantic relevance score is computed<br>- "token_selection": average over the head and text dimensions<br>- "token_selection_argmax": argmax over the head and text dimensions| "token_selection" |
| tkn_budget | integer | Maximum number of selected tokens | 6720 |
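To make the two drop_func_name options more concrete, here is a toy sketch of how an attention tensor could be collapsed into one relevance score per visual token. It is an illustration under stated assumptions, not the repository's code: the function `relevance_scores` is hypothetical, attention is represented as nested lists indexed as [head][text token][visual token], and "argmax" is interpreted here as taking the maximum over the head and text dimensions.

```python
from typing import List

Attention = List[List[List[float]]]  # [head][text_token][visual_token]


def relevance_scores(attn: Attention, reduce: str = "mean") -> List[float]:
    """Collapse the head and text dimensions into one score per visual token."""
    num_visual = len(attn[0][0])
    # For each visual token, gather every weight it receives
    # from any (head, text token) pair.
    columns = [
        [row[v] for head in attn for row in head] for v in range(num_visual)
    ]
    if reduce == "mean":
        return [sum(col) / len(col) for col in columns]
    if reduce == "max":
        # Assumed reading of the "argmax" variant: keep the strongest response.
        return [max(col) for col in columns]
    raise ValueError(f"unknown reduce mode: {reduce}")


# Two heads, two text tokens, three visual tokens (toy values).
attn = [
    [[0.1, 0.6, 0.3], [0.2, 0.2, 0.6]],
    [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]],
]
mean_scores = relevance_scores(attn, "mean")
max_scores = relevance_scores(attn, "max")
print(max_scores)  # [0.7, 0.6, 0.8]
```

In this toy case the first two visual tokens receive nearly the same averaged score, while the maximum separates them, which shows why the two options can rank tokens differently under the same tkn_budget.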
The main command-line options are explained below:
1. Model Selection (--model)
Specify the evaluation model with the following options:
| Value | Model Evaluated |
|----------------|-------------------------------|
| llava_vid | LLaVA-Video-7B |
| internvl2 | InternVL2.5 |
| qwen2_5_vl | Qwen2.5VL |
2. Task Selection (--tasks)
| Value | Task Name | Notes |
|------------------------|--------------------|------------------------------------|
| videomme | Video-MME | Standard video evaluation |
| mlvu_dev | MLVU | Multi-task long video understanding |
| lvbench | LVBench | Long-video benchmark |
| longvideobench_val_v | LongVideoBench | Default variant (e.g., for LLaVA) |
| longvideobench_val_v_sub | LongVideoBench | InternVL series only (uses caption) |
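As a rough guide to how these options fit together, the sketch below assembles an lmms-eval command line. It assumes that the parameters from the table above are passed through `--model_args` as comma-separated key=value pairs, which is how lmms-eval accepts model arguments in general; the exact keys and value spellings used by the scripts in `scripts/` may differ, and the checkpoint path and the helper `build_eval_command` are hypothetical.

```python
import shlex
from typing import Dict, List


def build_eval_command(model: str, tasks: List[str], model_args: Dict[str, object]) -> List[str]:
    """Assemble an lmms-eval command line as an argument list."""
    # lmms-eval takes model arguments as one comma-separated key=value string.
    joined_args = ",".join(f"{key}={value}" for key, value in model_args.items())
    return [
        "python", "-m", "lmms_eval",
        "--model", model,
        "--model_args", joined_args,
        "--tasks", ",".join(tasks),
        "--batch_size", "1",
    ]


cmd = build_eval_command(
    model="llava_vid",
    tasks=["videomme", "lvbench"],
    model_args={
        "pretrained": "path/to/LLaVA-Video-7B",  # hypothetical local checkpoint path
        "use_token_selector": "true",            # value spelling is an assumption
        "token_selector_path": "self",           # training-free mode
        "tkn_budget": 6720,
    },
)
print(shlex.join(cmd))
```

Printing the command rather than executing it makes it easy to compare against the provided eval scripts before running anything.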
Token Selector Training
FlexSelect trains a 0.5B token selector for each of LLaVA-Video-7B, Qwen2.5VL-7B, and InternVL2.5-8B.
We follow the environment installation guidelines of the corresponding projects to build the training environment:
- LLaVA-Video: https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file#2-install-the-inference-package
- Qwen2.5VL: https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-finetune/README.md
- InternVL2.5: https://internvl.readthedocs.io/en/latest/internvl2.5/finetune.html
You shoul
