
<h1 align="center">GlimpsePrune</h1> <p align="center"> <a href="README.md">English</a> | <a href="README_zh.md">简体中文</a> </p> <p align="center"> <p align="center"> <strong>A Dynamic Visual Token Pruning Framework for Large Vision-Language Models</strong> </p> <p align="center"> <a href='https://arxiv.org/abs/2508.01548'><img src='https://img.shields.io/badge/arXiv-2508.01548-red'></a> <a href='https://huggingface.co/collections/ashun989/glimpseprune-688d8826ef5bd09db6af145e'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-orange'></a> <a href='https://huggingface.co/papers/2508.01548'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Paper-yellow'></a> <a href="https://github.com/HVision-NKU/GlimpsePrune/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License"></a> </p> <div align="center"> <img src="assets/case1.png" width="80%"> <img src="assets/case2.png" width="80%"> <br> <em>GlimpsePrune dynamically prunes a large number of irrelevant visual tokens before answering questions, reducing the model's inference overhead.</em> </div>

GlimpsePrune is a dynamic visual token pruning framework designed for Large Vision-Language Models (LVLMs). Through fast training on a small amount of data (e.g., less than 1 hour on 20K GQA data), GlimpsePrune enables Qwen2.5-VL-7B to prune an average of 92.6% of visual tokens before generating a response, while maintaining performance comparable to the original model.

For more technical details, please refer to our paper.

If you find our work inspiring or helpful, please give us a star ⭐. Thank you for your attention and support:

<p align="center"> <a href="https://github.com/HVision-NKU/GlimpsePrune/stargazers"> <img src="https://reporoster.com/stars/HVision-NKU/GlimpsePrune" alt="Stargazers repo roster for @HVision-NKU/GlimpsePrune" width="80%"> </a> </p>

Table of Contents

  • ✨ Key Features
  • 🚀 News
  • 🖼️ Framework Overview
  • 📊 Performance Results
  • ✅ Roadmap
  • 🛠️ Installation
  • 📦 Models and Data
  • ▶️ How to Use

✨ Key Features

  • High Pruning Rate: Prunes over 90% of visual tokens on average with almost no performance loss, effectively reducing computational and memory overhead.
  • Robust Performance: Stable performance when processing high-resolution images and handling complex free-form VQA tasks.
  • Lightweight Training: Only a few extra parameters (Glimpse token and VIP) need to be trained, completed in less than 1 hour on a single A100 GPU.
  • Broad Compatibility: Supports single and multi-image inputs, is compatible with KV-Cache and Flash Attention 2, and provides a fair comparison benchmark with other mainstream visual compression methods.

🚀 News

  • 2025.08.05: Paper is publicly released!
  • 2025.08.03: Code and Models are publicly released!

🖼️ Framework Overview

The core idea of GlimpsePrune is to introduce a glimpse token and a lightweight Visual tokens Important Predictor (VIP) that can quickly identify and retain the visual regions most relevant to the text prompt, pruning the remaining redundant information.

<div align="center"> <img src="assets/framework.png" width="70%"> </div>
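The mechanism can be sketched in a few lines: given per-token importance scores (which in GlimpsePrune come from the trained VIP module; here they are simply an input, so the code below is an illustrative sketch rather than the repository's implementation), pruning keeps only the highest-scoring visual tokens up to a retention budget:

```python
import numpy as np

def prune_visual_tokens(tokens: np.ndarray, scores: np.ndarray,
                        max_remain_ratio: float = 0.111) -> np.ndarray:
    """Keep only the visual tokens scored as most relevant.

    tokens: (N, D) visual token features; scores: (N,) importance scores.
    In GlimpsePrune the scores would come from the trained VIP module;
    here they are a plain input, so this is only a conceptual sketch.
    """
    n_keep = max(1, round(len(tokens) * max_remain_ratio))
    keep_idx = np.argsort(scores)[::-1][:n_keep]  # highest-scoring tokens
    return tokens[np.sort(keep_idx)]              # preserve original token order

# Example: 1000 visual tokens of dim 4, keep ~11.1%
tokens = np.random.randn(1000, 4)
scores = np.random.rand(1000)
pruned = prune_visual_tokens(tokens, scores)
print(pruned.shape)  # (111, 4)
```

The retained tokens keep their original order, which matters in practice because the language model's positional information over the visual sequence should stay consistent after pruning.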

The core code implementation is located in:

📊 Performance Results

We evaluated GlimpsePrune on multiple VQA benchmarks. The results show that it achieves a high pruning rate while maintaining performance on par with the original model, outperforming other visual compression methods.

<p align="center"> <b>Free-form VQA Benchmarks</b><br> <img src="assets/freeform_results.png" width="90%"> </p> <p align="center"> <b>Short-form VQA Benchmarks</b><br> <img src="assets/shortform_results.png" width="90%"> </p> <p align="center"> <b>Efficiency comparison on V* bench in the free-form setting (batch size = 4)</b><br> <img src="assets/efficiency.png" width="90%"> </p>

✅ Roadmap

  • [x] Support for Qwen2.5-VL
  • [x] Support for single-image input
  • [x] Support for multi-image input
  • [x] Provide a local Gradio Demo
  • [x] Support for LLaVA-1.5
  • [x] Provide evaluation scripts for various visual token compression methods (PyramidDrop, VisionZip, etc.) on free-form VQA benchmarks
  • [x] Support for batch input (Batch Inference)

🛠️ Installation

  1. Clone the repository

    git clone https://github.com/HVision-NKU/GlimpsePrune.git
    cd GlimpsePrune
    
  2. Create an environment and install dependencies. We recommend creating a separate virtual environment for each model:

    For Qwen2.5-VL:

    For LLaVA-1.5 (Optional):

    <details> <summary>Click to expand LLaVA dependency installation</summary> </details>

    Additional dependencies for Evaluation and Demo (Optional):

    # Evaluation
    pip install lmms-eval==0.3.5 vllm==0.9.0.1
    # Demo
    pip install gradio==5.39.0
    

📦 Models and Data

Model Download

All models can be automatically downloaded from the Hugging Face Hub. If you encounter network issues, you can download them manually to a local directory. <new_module> refers to the weights of the extra glimpse token and VIP modules we trained.

| <base_model> | <new_module> |
|:---:|:---:|
| Qwen/Qwen2.5-VL-3B-Instruct | ashun989/GlimpsePrune_Qwen2.5-VL-3B-Instruct |
| Qwen/Qwen2.5-VL-7B-Instruct | ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct |
| liuhaotian/llava-v1.5-7b | ashun989/GlimpsePrune_LLaVA-1.5-7B |
| liuhaotian/llava-v1.5-13b | ashun989/GlimpsePrune_LLaVA-1.5-13B |
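For scripted downloads it can help to keep the base-model-to-weights pairing from the table above in code. The repo names below are copied verbatim from the table; the dictionary and helper function are only an illustrative convenience, not part of the repository:

```python
# Pairing of base models to the released glimpse-token/VIP weights
# (repo ids taken from the table above; helper name is illustrative).
GP_MODULES = {
    "Qwen/Qwen2.5-VL-3B-Instruct": "ashun989/GlimpsePrune_Qwen2.5-VL-3B-Instruct",
    "Qwen/Qwen2.5-VL-7B-Instruct": "ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct",
    "liuhaotian/llava-v1.5-7b": "ashun989/GlimpsePrune_LLaVA-1.5-7B",
    "liuhaotian/llava-v1.5-13b": "ashun989/GlimpsePrune_LLaVA-1.5-13B",
}

def new_modules_for(base_model: str) -> str:
    """Return the Hugging Face repo holding the extra GlimpsePrune weights."""
    try:
        return GP_MODULES[base_model]
    except KeyError:
        raise ValueError(f"No GlimpsePrune weights released for {base_model}")

print(new_modules_for("Qwen/Qwen2.5-VL-7B-Instruct"))
```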

Data Preparation

Training and Free-form VQA evaluation use the Visual-CoT dataset.

# Download the dataset (approx. 128GB)
huggingface-cli download --repo-type dataset --local-dir datas deepcs233/Visual-CoT cot_images_tar_split

# Extract
cd datas/cot_images_tar_split
cat cot_images_* | tar -xvf - -C ../cot
cd ../.. # Return to the project root directory

After extraction, the datas directory structure should be as follows:

GlimpsePrune/
├── datas/
│   └── cot/
│       ├── cub/
│       ├── gqa/
│       └── ...
└── ...
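A quick sanity check that extraction produced this layout can save a failed training run later. The helper below is an illustrative convenience, not part of the repository:

```python
from pathlib import Path

def check_cot_layout(root: str = "datas/cot") -> list:
    """Return the dataset subfolders (cub, gqa, ...) found under the
    extracted Visual-CoT directory, or fail loudly if it is missing."""
    p = Path(root)
    if not p.is_dir():
        raise FileNotFoundError(f"{root} not found -- run the extraction step first")
    return sorted(d.name for d in p.iterdir() if d.is_dir())
```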

▶️ How to Use

Local Demo

We provide a Gradio Demo to intuitively experience the effects of GlimpsePrune.

python demo_gp.py \
    --base_model Qwen/Qwen2.5-VL-7B-Instruct \
    --new_modules_dir ashun989/GlimpsePrune_Qwen2.5-VL-7B-Instruct

Inference

For a detailed example of how to load the model and perform inference, please refer to the Jupyter Notebook: ➡️ notebook/gp_qwen_tutorial.ipynb

Evaluation

We provide convenient evaluation scripts.

Free-form VQA

# Default settings (no retention rate limit)
BASE_MODEL=<base_model> bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>

# Set a maximum retention rate (e.g., 11.1%)
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/infer_qwen_gp_cot.sh <new_modules_dir>

Short-form VQA

# Default settings
BASE_MODEL=<base_model> bash scripts/eval_qwen_gp.sh <new_modules_dir>

# Set a maximum retention rate
BASE_MODEL=<base_model> MAX_REMAIN_RATIO=0.111 bash scripts/eval_qwen_gp.sh <new_modules_dir>

Efficiency

# Download V* bench
huggingface-cli download --repo-type dataset --local-dir datas/vstar_bench craigwu/vstar_bench

# Test GlimpsePrune under 4096 visual tokens with 11.1% retention ratio. 
TASKS="vstar" BATCH_SIZE_PER_DEVICE=4 WARMUP_ITERS=3 TIME_LOGGER=1 MEMORY_LOGGER=1 FIXED_REMAIN_RATIO=0.111 MAX_PIXELS=3211264 BASE_MODEL=$base_model bash scripts/infer_q
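The MAX_PIXELS value above maps directly onto the "4096 visual tokens" budget: in Qwen2.5-VL each visual token covers a 28×28-pixel area (14×14 patches merged 2×2 before the language model), so the cap and retention ratio can be sanity-checked with simple arithmetic (pure illustration, not part of the scripts):

```python
MAX_PIXELS = 3211264          # pixel cap passed to the script above
PIXELS_PER_TOKEN = 28 * 28    # Qwen2.5-VL: 14x14 patches with 2x2 spatial merge

max_tokens = MAX_PIXELS // PIXELS_PER_TOKEN
print(max_tokens)             # 4096 visual tokens at most

FIXED_REMAIN_RATIO = 0.111
kept = round(max_tokens * FIXED_REMAIN_RATIO)
print(kept)                   # ~455 tokens survive pruning
```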