# DART

**[EMNLP 2025 Main] Code for "Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More"**
Zichen Wen<sup>1,2</sup>, Yifeng Gao<sup>1</sup>, Shaobo Wang<sup>1</sup>, Junyuan Zhang<sup>2</sup>, Qintong Zhang<sup>2,4</sup>, <br> Weijia Li<sup>3,2</sup>, Conghui He<sup>2✉</sup>, Linfeng Zhang<sup>1✉</sup>

<sup>1</sup>Shanghai Jiao Tong University, <sup>2</sup>Shanghai AI Laboratory, <br> <sup>3</sup>Sun Yat-sen University, <sup>4</sup>Peking University
## News

- **2025.10.13** We have released our latest work EPIC, an efficient framework for progressive consistency distillation in multimodal large language models!
- **2025.10.10** We have released our latest work, VTC-Bench. Come test whether your token compression method really works!
- **2025.08.30** We have seamlessly integrated DART into Qwen2.5-VL.
- **2025.08.21** DART is accepted to the EMNLP 2025 main conference!
- **2025.05.15** Our analytical work on token compression has been accepted as an ACL 2025 Finding!
- **2025.03.19** The implementation and evaluation scripts for LLaVA-Next are now available.
- **2025.03.18** We have released the implementation of DART for Qwen2-VL; you can now easily evaluate it using lmms-eval!
- **2025.02.22** We release DART, a plug-and-play, training-free token reduction method that seamlessly integrates with efficient attention operators. Code is available!
## Overview

<p align='center'> <img src='https://github.com/ZichenWen1/DART/blob/main/images/overview.png' alt='mask' width='1000px'> </p>

**TL;DR:** We propose DART (Duplication-Aware Reduction of Tokens), a training-free method that prunes vision tokens based on duplication rather than importance, achieving 88.9% token reduction and a 1.99× speed-up while maintaining performance and compatibility with efficient attention operators.
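As a rough illustration of the duplication-based idea, the sketch below greedily keeps the vision tokens that are least similar to those already kept, so near-duplicate tokens are dropped first. This is a minimal sketch under our own assumptions (cosine similarity, greedy farthest-point-style selection, the hypothetical `prune_duplicate_tokens` helper), not DART's actual implementation — see the paper and code for the real algorithm.

```python
import numpy as np

def prune_duplicate_tokens(tokens: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Greedily keep the tokens least similar to the already-kept set.

    tokens: (N, D) vision-token features; keep_ratio: fraction of tokens to keep.
    Illustrative sketch of duplication-aware reduction, not DART's exact method.
    """
    n_keep = max(1, int(round(tokens.shape[0] * keep_ratio)))
    # cosine similarity between all token pairs
    feats = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = feats @ feats.T
    kept = [0]                                   # seed with the first token
    remaining = list(range(1, tokens.shape[0]))
    while len(kept) < n_keep and remaining:
        # pick the candidate whose worst-case similarity to kept tokens is lowest,
        # i.e. the least duplicated remaining token
        scores = sim[np.ix_(remaining, kept)].max(axis=1)
        i = int(np.argmin(scores))
        kept.append(remaining.pop(i))
    return tokens[sorted(kept)]
```

With four tokens where two are identical, a 50% keep ratio retains one copy of the duplicate plus the most dissimilar token, which is the behavior a duplication-aware criterion targets.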
## Preparation

### LLaVA
- Clone this repository:

```shell
git clone https://github.com/ZichenWen1/DART
cd DART
```
- Set up the environment:

```shell
conda create -n DART python=3.10 -y
conda activate DART
pip install -e .
pip install flash-attn --no-build-isolation
```
- Download the multimodal benchmarks: please follow the detailed instructions in LLaVA-Evaluation.
### Qwen2-VL

```shell
conda create -n DART_Qwen2VL python=3.10 -y
conda activate DART_Qwen2VL
cd Qwen2-VL/transformers && pip install -e .
pip install accelerate "qwen-vl-utils[decord]"
pip install flash-attn --no-build-isolation
cd ../../lmms-eval && pip install -e .
```
### Qwen2.5-VL

```shell
pip install -U transformers==4.55.4
```
## Usage

### LLaVA

#### Script Templates

```shell
bash scripts/v1_5/eval/[Benchmark].sh [Reduction_Ratio] [Max_Num_Truncation]
```

#### Examples

```shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh 0.778 128
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh 0.778 128
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh 0.778 128
```
### Qwen2-VL

#### Examples

```shell
cd Qwen2-VL
bash eval_scripts/lmms_eval.sh True [Reduction_Ratio]
```
### Qwen2.5-VL

#### Examples

```shell
cd Qwen2_5-VL
bash eval_scripts/lmms_eval.sh True [Reduction_Ratio]
```
## License

This project is released under the Apache 2.0 license.
## Citation

If our findings help your research, please consider citing our papers:
```bibtex
@article{wen2025stop,
  title={Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More},
  author={Wen, Zichen and Gao, Yifeng and Wang, Shaobo and Zhang, Junyuan and Zhang, Qintong and Li, Weijia and He, Conghui and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2502.11494},
  year={2025}
}

@article{wen2025token,
  title={Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?},
  author={Wen, Zichen and Gao, Yifeng and Li, Weijia and He, Conghui and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2502.11501},
  year={2025}
}
```
## Acknowledgment

We extend our gratitude to the open-source efforts of LLaVA, Qwen2-VL, and lmms-eval.
## Contact

For any questions about our paper or code, please email zichen.wen@outlook.com.
