# GlimpsePrune
Official repository of the paper "A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models"
GlimpsePrune is a dynamic visual token pruning framework designed for Large Vision-Language Models (LVLMs). Through fast training on a small amount of data (e.g., less than 1 hour on 20K GQA data), GlimpsePrune enables Qwen2.5-VL-7B to prune an average of 92.6% of visual tokens before generating a response, while maintaining performance comparable to the original model.
For more technical details, please refer to our paper.
If you find our work inspiring or helpful, please give us a star ⭐. Thank you for your attention and support:
<p align="center"> <a href="https://github.com/HVision-NKU/GlimpsePrune/stargazers"> <img src="https://reporoster.com/stars/HVision-NKU/GlimpsePrune" alt="Stargazers repo roster for @HVision-NKU/GlimpsePrune" width="80%"> </a> </p>

## Table of Contents
- Table of Contents
- ✨ Key Features
- 🚀 News
- 🖼️ Framework Overview
- 📊 Performance Results
- ✅ Roadmap
- 🛠️ Installation
- 📦 Models and Data
- ▶️ How to Use
- 🙏 Acknowledgements
- 🖊️ Citation
- 📧 Contact Us
## ✨ Key Features
- High Pruning Rate: Prunes over 90% of visual tokens on average with almost no performance loss, effectively reducing computational and memory overhead.
- Robust Performance: Stable performance when processing high-resolution images and handling complex free-form VQA tasks.
- Lightweight Training: Only a few extra parameters (Glimpse token and VIP) need to be trained, completed in less than 1 hour on a single A100 GPU.
- Broad Compatibility: Supports single and multi-image inputs, is compatible with KV-Cache and Flash Attention 2, and provides a fair comparison benchmark with other mainstream visual compression methods.
## 🚀 News
## 🖼️ Framework Overview
The core idea of GlimpsePrune is to introduce a glimpse token and a lightweight Visual tokens Important Predictor (VIP) that can quickly identify and retain the visual regions most relevant to the text prompt, pruning the remaining redundant information.
<div align="center"> <img src="assets/framework.png" width="70%"> </div>

The core code implementation is located in:

- Qwen2.5-VL: `transformers_gp/models/qwen2_5_vl/model_gp.py`
- LLaVA-1.5: `llava_gp/model/language_model/llava_llama.py`
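At a high level, the mechanism above amounts to ranking visual tokens by a predicted importance score and keeping only the top fraction. The sketch below illustrates that idea only; it is not the actual VIP implementation, and `prune_visual_tokens` and the score values are hypothetical:

```python
import numpy as np

def prune_visual_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the top keep_ratio fraction of visual tokens, ranked by importance score."""
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    # Indices of the highest-scoring tokens, restored to their original order
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep_idx]

# 8 visual tokens with 4-dim features; scores mark tokens 2 and 5 as most relevant
tokens = np.arange(32, dtype=np.float32).reshape(8, 4)
scores = np.array([0.1, 0.2, 0.9, 0.1, 0.3, 0.8, 0.2, 0.1])
pruned = prune_visual_tokens(tokens, scores, keep_ratio=0.25)
print(pruned.shape)  # (2, 4)
```

In GlimpsePrune the scores come from the trained VIP head conditioned on the glimpse token, and pruning happens once before response generation, so the saved tokens never enter the KV-Cache.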
## 📊 Performance Results
We evaluated GlimpsePrune on multiple VQA benchmarks. The results show that it achieves a high pruning rate while maintaining performance on par with the original model, outperforming other visual compression methods.
<p align="center"> <b>Free-form VQA Benchmarks</b><br> <img src="assets/freeform_results.png" width="90%"> </p> <p align="center"> <b>Short-form VQA Benchmarks</b><br> <img src="assets/shortform_results.png" width="90%"> </p> <p align="center"> <b>Efficiency comparison on V* in free-form setting (batch size = 4)</b><br> <img src="assets/efficiency.png" width="90%"> </p>

## ✅ Roadmap
- [x] Support for Qwen2.5-VL
- [x] Support for single-image input
- [x] Support for multi-image input
- [x] Provide a local Gradio Demo
- [x] Support for LLaVA-1.5
- [x] Provide evaluation scripts for various visual token compression methods (PyramidDrop, VisionZip, etc.) on free-form VQA benchmarks
- [x] Support for batch input (Batch Inference)
## 🛠️ Installation
1. Clone the repository

    ```bash
    git clone https://github.com/HVision-NKU/GlimpsePrune.git
    cd GlimpsePrune
    ```

2. Create an environment and install dependencies. We recommend creating a separate virtual environment for each model.

    For Qwen2.5-VL (`python=3.10`, `torch==2.7.0`, `flash-attn==2.7.4.post1`):

    ```bash
    pip install -r qwen_requirements.txt
    pip install "qwen-vl-utils[decord]"
    ```
For LLaVA-1.5 (Optional):

<details> <summary>Click to expand LLaVA dependency installation</summary>

`python=3.10`, `torch==2.1.2`, `flash-attn==2.7.3`:

```bash
pip install -r llava_requirements.txt
```

</details>
Additional dependencies for evaluation and demo (Optional):

```bash
# Evaluation
pip install lmms-eval==0.3.5 vllm==0.9.0.1
# Demo
pip install gradio==5.39.0
```
## 📦 Models and Data

### Model Download
All models can be automatically downloaded from the Hugging Face Hub. If you encounter network issues, you can download them manually to a local directory. `<new_module>` refers to the weights of the extra glimpse token and VIP modules we trained.
|<base_model>| <new_module> |
|:---:|:---:|
|Qwen/Qwen2.5-VL-3B-Instruct|ashun989/GlimpsePrune_Qwen2.5-VL-3B-Instruct|
|Qwen/Qwen2.5-VL-7B-Instruct|ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct|
|liuhaotian/llava-v1.5-7b|ashun989/GlimpsePrune_LLaVA-1.5-7B|
|liuhaotian/llava-v1.5-13b|ashun989/GlimpsePrune_LLaVA-1.5-13B|
### Data Preparation
Training and Free-form VQA evaluation use the Visual-CoT dataset.
```bash
# Download the dataset (approx. 128GB)
huggingface-cli download --repo-type dataset --local-dir datas deepcs233/Visual-CoT cot_images_tar_split
# Extract
cd datas/cot_images_tar_split
cat cot_images_* | tar -xvf - -C ../cot
cd ../..  # Return to the project root directory
```
After extraction, the `datas` directory structure should be as follows:

```
GlimpsePrune/
├── datas/
│   └── cot/
│       ├── cub/
│       ├── gqa/
│       └── ...
└── ...
```
## ▶️ How to Use

### Local Demo
We provide a Gradio Demo to intuitively experience the effects of GlimpsePrune.
```bash
python demo_gp.py \
    --base_model Qwen/Qwen2.5-VL-7B-Instruct \
    --new_modules_dir ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct
```
### Inference
For a detailed example of how to load the model and perform inference, please refer to the Jupyter Notebook:
➡️ `notebook/gp_qwen_tutorial.ipynb`
### Evaluation
We provide convenient evaluation scripts.
#### Free-form VQA
```bash
# Default settings (no retention rate limit)
BASE_MODEL=<base_model> bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>

# Set a maximum retention rate (e.g., 11.1%)
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>
```
#### Short-form VQA
```bash
# Default settings
BASE_MODEL=<base_model> bash scripts/eval_qwen_gp.sh <new_modules_dir>

# Set a maximum retention rate
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/eval_qwen_gp.sh <new_modules_dir>
```
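The `MAX_REMAIN_RATIO` cap is a fraction of the visual tokens in each input, so its absolute effect depends on image resolution. A quick way to translate the cap into a token budget (a sketch; `max_tokens_kept` is a hypothetical helper, not part of the repo's scripts):

```python
import math

def max_tokens_kept(n_visual_tokens: int, max_remain_ratio: float) -> int:
    """Upper bound on visual tokens retained under a retention-rate cap."""
    return math.ceil(n_visual_tokens * max_remain_ratio)

# At MAX_REMAIN_RATIO=0.111, an image with 4096 visual tokens keeps at most 455
print(max_tokens_kept(4096, 0.111))  # 455
```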
#### Efficiency
```bash
# Download V* bench
huggingface-cli download --repo-type dataset --local-dir datas/vstar_bench craigwu/vstar_bench

# Test GlimpsePrune under 4096 visual tokens with 11.1% retention ratio
TASKS="vstar" BATCH_SIZE_PER_DEVICE=4 WARMUP_ITERS=3 TIME_LOGGER=1 MEMORY_LOGGER=1 FIXED_REMAIN_RATIO=0.111 MAX_PIXELS=3211264 BASE_MODEL=$base_model bash scripts/infer_q
```
