# GlimpsePrune
Official repository of the paper "A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models"
GlimpsePrune is a dynamic visual token pruning framework designed for Large Vision-Language Models (LVLMs). Through fast training on a small amount of data (e.g., less than 1 hour on 20K GQA data), GlimpsePrune enables Qwen2.5-VL-7B to prune an average of 92.6% of visual tokens before generating a response, while maintaining performance comparable to the original model.
For more technical details, please refer to our paper.
If you find our work inspiring or helpful, please give us a star ⭐. Thank you for your attention and support:
<p align="center"> <a href="https://github.com/HVision-NKU/GlimpsePrune/stargazers"> <img src="https://reporoster.com/stars/HVision-NKU/GlimpsePrune" alt="Stargazers repo roster for @HVision-NKU/GlimpsePrune" width="80%"> </a> </p>

## Table of Contents
- Table of Contents
- ✨ Key Features
- 🚀 News
- 🖼️ Framework Overview
- 📊 Performance Results
- ✅ Roadmap
- 🛠️ Installation
- 📦 Models and Data
- ▶️ How to Use
- 🙏 Acknowledgements
- 🖊️ Citation
- 📧 Contact Us
## ✨ Key Features
- High Pruning Rate: Prunes over 90% of visual tokens on average with almost no performance loss, effectively reducing computational and memory overhead.
- Robust Performance: Stable performance when processing high-resolution images and handling complex free-form VQA tasks.
- Lightweight Training: Only a few extra parameters (Glimpse token and VIP) need to be trained, completed in less than 1 hour on a single A100 GPU.
- Broad Compatibility: Supports single and multi-image inputs, is compatible with KV-Cache and Flash Attention 2, and provides a fair comparison benchmark with other mainstream visual compression methods.
## 🚀 News
## 🖼️ Framework Overview
The core idea of GlimpsePrune is to introduce a glimpse token and a lightweight Visual tokens Important Predictor (VIP) that can quickly identify and retain the visual regions most relevant to the text prompt, pruning the remaining redundant information.
<div align="center"> <img src="assets/framework.png" width="70%"> </div>

The core code implementation is located in:

- Qwen2.5-VL: `transformers_gp/models/qwen2_5_vl/model_gp.py`
- LLaVA-1.5: `llava_gp/model/language_model/llava_llama.py`
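At a high level, the mechanism above amounts to ranking visual tokens by a predicted importance score and keeping only the top fraction. The sketch below illustrates that idea only; it is not the actual VIP implementation, and `prune_visual_tokens` and the score values are hypothetical:

```python
import numpy as np

def prune_visual_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the top keep_ratio fraction of visual tokens, ranked by importance score."""
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    # Indices of the highest-scoring tokens, restored to their original order
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep_idx]

# 8 visual tokens with 4-dim features; scores mark tokens 2 and 5 as most relevant
tokens = np.arange(32, dtype=np.float32).reshape(8, 4)
scores = np.array([0.1, 0.2, 0.9, 0.1, 0.3, 0.8, 0.2, 0.1])
pruned = prune_visual_tokens(tokens, scores, keep_ratio=0.25)
print(pruned.shape)  # (2, 4)
```

In GlimpsePrune the scores come from the trained VIP head conditioned on the glimpse token, and pruning happens once before response generation, so the saved tokens never enter the KV-Cache.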
## 📊 Performance Results
We evaluated GlimpsePrune on multiple VQA benchmarks. The results show that it achieves a high pruning rate while maintaining performance on par with the original model, outperforming other visual compression methods.
<p align="center"> <b>Free-form VQA Benchmarks</b><br> <img src="assets/freeform_results.png" width="90%"> </p> <p align="center"> <b>Short-form VQA Benchmarks</b><br> <img src="assets/shortform_results.png" width="90%"> </p> <p align="center"> <b>Efficiency comparison on V* in free-form setting (batch size = 4)</b><br> <img src="assets/efficiency.png" width="90%"> </p>

## ✅ Roadmap
- [x] Support for Qwen2.5-VL
- [x] Support for single-image input
- [x] Support for multi-image input
- [x] Provide a local Gradio Demo
- [x] Support for LLaVA-1.5
- [x] Provide evaluation scripts for various visual token compression methods (PyramidDrop, VisionZip, etc.) on free-form VQA benchmarks
- [x] Support for batch input (Batch Inference)
## 🛠️ Installation
1. Clone the repository

    ```bash
    git clone https://github.com/HVision-NKU/GlimpsePrune.git
    cd GlimpsePrune
    ```

2. Create an environment and install dependencies. We recommend creating a separate virtual environment for each model.

    For Qwen2.5-VL (`python=3.10`, `torch==2.7.0`, `flash-attn==2.7.4.post1`):

    ```bash
    pip install -r qwen_requirements.txt
    pip install "qwen-vl-utils[decord]"
    ```
For LLaVA-1.5 (Optional):

<details> <summary>Click to expand LLaVA dependency installation</summary>

`python=3.10`, `torch==2.1.2`, `flash-attn==2.7.3`:

```bash
pip install -r llava_requirements.txt
```

</details>
Additional dependencies for evaluation and demo (Optional):

```bash
# Evaluation
pip install lmms-eval==0.3.5 vllm==0.9.0.1
# Demo
pip install gradio==5.39.0
```
## 📦 Models and Data

### Model Download
All models can be automatically downloaded from the Hugging Face Hub. If you encounter network issues, you can download them manually to a local directory. `<new_module>` refers to the weights of the extra glimpse token and VIP modules we trained.
|<base_model>| <new_module> |
|:---:|:---:|
|Qwen/Qwen2.5-VL-3B-Instruct|ashun989/GlimpsePrune_Qwen2.5-VL-3B-Instruct|
|Qwen/Qwen2.5-VL-7B-Instruct|ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct|
|liuhaotian/llava-v1.5-7b|ashun989/GlimpsePrune_LLaVA-1.5-7B|
|liuhaotian/llava-v1.5-13b|ashun989/GlimpsePrune_LLaVA-1.5-13B|
### Data Preparation
Training and Free-form VQA evaluation use the Visual-CoT dataset.
```bash
# Download the dataset (approx. 128GB)
huggingface-cli download --repo-type dataset --local-dir datas deepcs233/Visual-CoT cot_images_tar_split
# Extract
cd datas/cot_images_tar_split
cat cot_images_* | tar -xvf - -C ../cot
cd ../..  # Return to the project root directory
```
After extraction, the `datas` directory structure should be as follows:

```
GlimpsePrune/
├── datas/
│   └── cot/
│       ├── cub/
│       ├── gqa/
│       └── ...
└── ...
```
## ▶️ How to Use

### Local Demo
We provide a Gradio Demo to intuitively experience the effects of GlimpsePrune.
```bash
python demo_gp.py \
    --base_model Qwen/Qwen2.5-VL-7B-Instruct \
    --new_modules_dir ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct
```
### Inference
For a detailed example of how to load the model and perform inference, please refer to the Jupyter Notebook:
➡️ `notebook/gp_qwen_tutorial.ipynb`
### Evaluation
We provide convenient evaluation scripts.
#### Free-form VQA
```bash
# Default settings (no retention rate limit)
BASE_MODEL=<base_model> bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>

# Set a maximum retention rate (e.g., 11.1%)
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>
```
#### Short-form VQA
```bash
# Default settings
BASE_MODEL=<base_model> bash scripts/eval_qwen_gp.sh <new_modules_dir>

# Set a maximum retention rate
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/eval_qwen_gp.sh <new_modules_dir>
```
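The `MAX_REMAIN_RATIO` cap is a fraction of the visual tokens in each input, so its absolute effect depends on image resolution. A quick way to translate the cap into a token budget (a sketch; `max_tokens_kept` is a hypothetical helper, not part of the repo's scripts):

```python
import math

def max_tokens_kept(n_visual_tokens: int, max_remain_ratio: float) -> int:
    """Upper bound on visual tokens retained under a retention-rate cap."""
    return math.ceil(n_visual_tokens * max_remain_ratio)

# At MAX_REMAIN_RATIO=0.111, an image with 4096 visual tokens keeps at most 455
print(max_tokens_kept(4096, 0.111))  # 455
```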
#### Efficiency
```bash
# Download V* bench
huggingface-cli download --repo-type dataset --local-dir datas/vstar_bench craigwu/vstar_bench

# Test GlimpsePrune under 4096 visual tokens with 11.1% retention ratio
TASKS="vstar" BATCH_SIZE_PER_DEVICE=4 WARMUP_ITERS=3 TIME_LOGGER=1 MEMORY_LOGGER=1 FIXED_REMAIN_RATIO=0.111 MAX_PIXELS=3211264 BASE_MODEL=$base_model bash scripts/infer_q
```
