
ACL 2025 | PruneVid

The official repository for paper "PruneVid: Visual Token Pruning for Efficient Video Large Language Models".

Xiaohu Huang, Hao Zhou, Kai Han

Webpage | Paper

Introduction

Framework

We present PruneVid, a training-free visual token pruning method that enhances efficiency in multi-modal video understanding. By merging spatial-temporal tokens to reduce video redundancy and leveraging attention mechanisms within LLMs to retain only the visual tokens relevant to questions, PruneVid ensures high performance while reducing computational overhead.
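
To make the second step concrete, here is a minimal sketch of attention-guided token selection. It illustrates the idea only and is not the repository's implementation; names such as prune_visual_tokens, attn, and keep_ratio are hypothetical.

import torch

def prune_visual_tokens(visual_tokens, attn, keep_ratio=0.25):
    """Keep the visual tokens that receive the most attention from the question tokens.

    visual_tokens: (num_visual, dim) hidden states of the visual tokens
    attn:          (num_question, num_visual) question-to-visual attention weights
    keep_ratio:    fraction of visual tokens to retain
    """
    # Average the attention each visual token receives over all question tokens.
    scores = attn.mean(dim=0)                                # (num_visual,)
    num_keep = max(1, int(keep_ratio * scores.numel()))
    # Keep the top-scoring tokens and restore their original order.
    keep_idx = scores.topk(num_keep).indices.sort().values
    return visual_tokens[keep_idx], keep_idx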

Todo:

  • [x] Code release of PruneVid with PLLaVA.
  • [ ] Code release of PruneVid with LLaVA-OneVision.
  • [ ] Code release of PruneVid with ST-LLM.

License

PruneVid is released under the CC BY-NC-SA 4.0 license.

Performance

We conduct experiments on three video LLMs (PLLaVA, ST-LLM, and LLaVA-OneVision) across four benchmarks: MVBench, VideoMME, EgoSchema, and VideoChatGPT-Bench (VCG-Bench). In the table below, TU, CU, CO, DO, and CI are the VCG-Bench sub-scores (temporal understanding, contextual understanding, consistency, detail orientation, and correctness of information), and Avg is their average.

| Method | Retained Ratio | FLOPs (×) | MVBench | VideoMME | EgoSchema Subset / Fullset | TU | CU | CO | DO | CI | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PLLaVA | 100.0% | 1.00× | 46.6 | 44.4 | 47.8 / 42.6 | 2.33 | 3.62 | 2.93 | 2.86 | 3.21 | 2.99 |
| PLLaVA w/ FastV | 30.0% | 0.33× | 46.1 | 43.6 | 46.2 / 41.0 | 2.38 | 3.49 | 2.89 | 2.76 | 3.14 | 2.93 |
| PLLaVA w/ Prumerge | 55.7% | 0.53× | 45.6 | 43.8 | 45.2 / 40.4 | 2.34 | 3.52 | 2.90 | 2.76 | 3.15 | 2.93 |
| PLLaVA w/ Look-M | 20.0% | 1.00× | 46.6 | 44.3 | 47.0 / 42.3 | 2.28 | 3.41 | 2.75 | 2.65 | 3.00 | 2.82 |
| PLLaVA w/ Ours | 16.2% | 0.23× | 47.6 | 45.3 | 49.0 / 42.6 | 2.44 | 3.51 | 2.99 | 2.78 | 3.20 | 2.98 |
| ST-LLM | 100.0% | 1.00× | 54.9 | 42.0 | 56.2 / 45.6 | 2.46 | 3.46 | 2.66 | 2.63 | 3.08 | 2.86 |
| ST-LLM w/ FastV | 30.0% | 0.37× | 42.9 | 34.5 | 48.0 / 38.5 | 2.01 | 2.23 | 1.55 | 1.94 | 1.69 | 1.88 |
| ST-LLM w/ Look-M | 20.0% | 1.00× | 54.0 | 40.6 | 54.0 / 44.5 | 2.35 | 3.41 | 2.60 | 2.51 | 3.01 | 2.78 |
| ST-LLM w/ Ours | 15.1% | 0.26× | 54.3 | 41.4 | 54.6 / 44.7 | 2.40 | 3.43 | 2.63 | 2.60 | 3.04 | 2.82 |
| LLaVA-OneVision | 100.0% | 1.00× | 58.0 | 58.2 | 62.0 / 60.0 | 2.75 | 3.70 | 3.39 | 2.97 | 3.50 | 3.26 |
| LLaVA-OneVision w/ FastV | 30.0% | 0.30× | 57.2 | 57.6 | 62.6 / 60.0 | 2.65 | 3.61 | 3.28 | 2.85 | 3.39 | 3.16 |
| LLaVA-OneVision w/ Prumerge | 55.2% | 0.49× | 52.9 | 56.7 | 62.2 / 60.0 | 2.72 | 3.64 | 3.32 | 2.94 | 3.44 | 3.21 |
| LLaVA-OneVision w/ Look-M | 20.0% | 1.00× | 57.0 | 58.0 | 62.0 / 59.8 | 2.71 | 3.70 | 3.29 | 2.89 | 3.44 | 3.21 |
| LLaVA-OneVision w/ Ours | 17.0% | 0.20× | 57.5 | 58.6 | 62.6 / 59.5 | 2.73 | 3.72 | 3.28 | 2.94 | 3.51 | 3.24 |

Data Preparation

All four benchmarks can be downloaded from the Hugging Face website: MVBench, VideoMME, EgoSchema, and VideoChatGPT-Bench.

After downloading the datasets, please put them into the DATAS folder and organize the source videos and annotations in the following structure:

DATAS/
├── ego_schema/
│   ├── json/
│   └── videos/
├── MVBench/
│   ├── json/
│   └── video/
├── VCGBench/
│   ├── Videos/
│   ├── Zero_Shot_QA/
└── Video-MME/
    ├── data/
    └── json/
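
If helpful, a small check like the following (ours, not part of the repository) can confirm the DATAS layout before running evaluation; the folder names mirror the tree above.

from pathlib import Path

# Expected benchmark subfolders, taken from the directory tree above.
EXPECTED = {
    "ego_schema": ["json", "videos"],
    "MVBench": ["json", "video"],
    "VCGBench": ["Videos", "Zero_Shot_QA"],
    "Video-MME": ["data", "json"],
}

def check_datas(root="DATAS"):
    root = Path(root)
    for benchmark, subdirs in EXPECTED.items():
        for sub in subdirs:
            path = root / benchmark / sub
            status = "ok" if path.is_dir() else "MISSING"
            print(f"{path}: {status}")

if __name__ == "__main__":
    check_datas()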

Pretrained Model

The pretrained models can be found in their respective repositories: PLLaVA, ST-LLM, and LLaVA-OneVision.

After downloading the models, please put them into the MODELS folder:

MODELS/
├── pllava-7b/
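
As an example, a checkpoint can be fetched into MODELS/ with huggingface_hub. The repo id below is only an assumption for illustration; please use the checkpoint name given in the corresponding model repository.

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ermu2001/pllava-7b",     # assumed Hugging Face repo id for PLLaVA-7B
    local_dir="MODELS/pllava-7b",     # matches the layout expected by the eval scripts
)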

Environment Install

We follow the environment installation guideline of PLLaVA.

  1. The following environment setup is for Python 3.10. If you choose to use conda, we recommend creating the virtual environment with:
conda create -n pllava python=3.10
  2. Install PyTorch from the official website. The code runs on torch 2.2.1, cu118 or cu122. Select the build that matches your driver version:
torch                       2.2.1+cu118
torchaudio                  2.2.1+cu118
torchvision                 0.17.1+cu118

If your driver version is higher than cu121, you could probably try installing with the following command:

pip install -r requirements.txt

Otherwise, you would need to install a torch for your server first, then install the other packages:

pip install -r requirements.torch.txt    # pick the requirements file matching your CUDA version (this one targets cu11), or install torch directly from the official website
pip install -r requirements.no_torch.txt # then install the remaining packages
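
A quick check (not part of the repository) that the installed torch matches the versions listed above and that CUDA is visible:

import torch

print(torch.__version__)          # expected: 2.2.1+cu118 or a cu12x build
print(torch.cuda.is_available())  # should print True on a GPU machine
print(torch.version.cuda)         # CUDA runtime torch was built against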

Evaluation

As PruneVid is a training-free method, we can directly apply it to pre-trained models.

The scripts for evaluating model performance are provided in scripts/eval.sh. Below is the script for evaluating performance on MVBench, where you can edit the hyper-parameters as needed. The default settings are those used in our paper.

lora_alpha=14
selected_layers=(10)
alphas=(0.4)
taus=(0.8)
temporal_segment_ratios=(0.25)
cluster_ratios=(0.5)

for alpha in "${alphas[@]}"; do
  for selected_layer in "${selected_layers[@]}"; do
    for tau in "${taus[@]}"; do
      for temporal_segment_ratio in "${temporal_segment_ratios[@]}"; do
        for cluster_ratio in "${cluster_ratios[@]}"; do
          # Run the evaluation command
          SAVE_DIR=test_results/pllava-7b-lora${lora_alpha}-threshold${tau}-layer${selected_layer}-alpha${alpha}-temporal-segment-ratio-${temporal_segment_ratio}-cluster-ratio-${cluster_ratio}
          mkdir -p "${SAVE_DIR}"
          conv_mode=eval_mvbench
          python -m tasks.eval.mvbench.pllava_eval_mvbench \
              --pretrained_model_name_or_path ${model_dir} \
              --save_path ${SAVE_DIR}/mvbench \
              --num_frames ${num_frames} \
              --use_lora \
              --lora_alpha ${lora_alpha} \
              --top_p 1.0 \
              --temperature 1.0 \
              --weight_dir ${weight_dir} \
              --pooling_shape 16-12-12 \
              --conv_mode ${conv_mode} \
              --selected_layer ${selected_layer} \
              --alpha ${alpha} \
              --tau ${tau} \
              --temporal_segment_ratio ${temporal_segment_ratio} \
              --cluster_ratio ${cluster_ratio}
        done
      done
    done
  done
done
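
If you prefer to launch the same sweep from Python, the loop above can be mirrored with subprocess. This is a convenience sketch rather than part of the repository; model_dir, weight_dir, and num_frames are placeholders for the variables set elsewhere in scripts/eval.sh.

import itertools
import subprocess
from pathlib import Path

lora_alpha = 14
model_dir = "MODELS/pllava-7b"   # placeholder, set as in scripts/eval.sh
weight_dir = "MODELS/pllava-7b"  # placeholder
num_frames = 16                  # placeholder

# alpha, selected_layer, tau, temporal_segment_ratio, cluster_ratio (paper defaults)
grid = itertools.product([0.4], [10], [0.8], [0.25], [0.5])

for alpha, layer, tau, seg_ratio, cluster_ratio in grid:
    save_dir = Path(
        f"test_results/pllava-7b-lora{lora_alpha}-threshold{tau}-layer{layer}"
        f"-alpha{alpha}-temporal-segment-ratio-{seg_ratio}-cluster-ratio-{cluster_ratio}"
    )
    save_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "python", "-m", "tasks.eval.mvbench.pllava_eval_mvbench",
        "--pretrained_model_name_or_path", model_dir,
        "--save_path", str(save_dir / "mvbench"),
        "--num_frames", str(num_frames),
        "--use_lora", "--lora_alpha", str(lora_alpha),
        "--top_p", "1.0", "--temperature", "1.0",
        "--weight_dir", weight_dir,
        "--pooling_shape", "16-12-12",
        "--conv_mode", "eval_mvbench",
        "--selected_layer", str(layer),
        "--alpha", str(alpha),
        "--tau", str(tau),
        "--temporal_segment_ratio", str(seg_ratio),
        "--cluster_ratio", str(cluster_ratio),
    ], check=True)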

EgoSchema requires an external service to evaluate model performance, so we run evaluate_egoschema_result.py for evaluation. Before executing the file, change the root_dir variable to point to your results folder.

python evaluate_egoschema_result.py

Acknowledgement

This repository is built upon PLLaVA, ST-LLM, and LLaVA-OneVision. Thanks to the authors of those well-organized codebases.

Citation

@inproceedings{
  huang2024prunevid,
  title={PruneVid: Visual Token Pruning for Efficient Video Large Language Models},
  author={Xiaohu Huang and Hao Zhou and Kai Han},
  booktitle={arXiv},
  year={2024}
}