MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

<p align="center"> <a href="https://arxiv.org/pdf/2403.02991.pdf" target="_blank">[Paper]</a> <a href="https://arxiv.org/abs/2403.02991" target="_blank">[ArXiv]</a> <a href="https://github.com/double125/MADTP" target="_blank">[Code]</a> <img src="MADTP.png" alt="MADTP overview" width="800"> </p>

Official implementation of MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer.

What's New 🥳

  • (Sep 6, 2024) We released the implementation and scripts of MADTP. [Code] (Checkpoints and logs will come soon.) 🚩

  • (Feb 27, 2024) MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer was accepted to CVPR 2024. [Paper] [ArXiv] 🎉

Installation

The code is tested with PyTorch 1.11.0, CUDA 11.3.1, and Python 3.8.13. The dependencies can be installed by:

conda env create -f environment.yml
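After creating the environment, a quick sanity check can confirm that the interpreter and PyTorch/CUDA versions match the tested configuration. This is a minimal sketch; `check_environment` is a hypothetical helper, not part of this repository:

```python
# Sanity-check the environment against the tested configuration
# (Python 3.8, PyTorch 1.11, CUDA 11.3). check_environment is a
# hypothetical helper; versions are compared as prefixes only.
import sys

EXPECTED = {"python": "3.8", "torch": "1.11", "cuda": "11.3"}

def check_environment():
    found = {"python": "{}.{}".format(*sys.version_info[:2])}
    try:
        import torch
        found["torch"] = torch.__version__
        found["cuda"] = torch.version.cuda or "none"
    except ImportError:
        found["torch"] = found["cuda"] = "not installed"
    return {key: found[key].startswith(want) for key, want in EXPECTED.items()}

if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

A mismatch is not necessarily fatal, but the versions above are the only ones the authors report testing.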

Supported Tasks, Models, and Datasets

Type | Supported Tasks | Supported Models | Supported Datasets
--- | --- | :---: | :---:
Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2
Multi-modal | Image Caption | BLIP (instructions) | COCO Caption
Multi-modal | Visual Question Answer | BLIP (instructions) | VQAv2
Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k
Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k

Visual Reasoning on the NLVR2 Dataset

  • Dataset & Annotation

    Download the NLVR2 dataset, unzip it under the datasets folder, and modify image_root in the config accordingly. Download the all-in-one annotations (covering the Visual Reasoning, Image Caption, VQA, Image-Text Retrieval, and Text-Image Retrieval tasks) from this link, unzip them under the annotation folder, and modify annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --evaluate \
    --pretrained output/nlvr_nlvr2_compression_p0.5/model_base_nlvr_nlvr2_p0.5_compressed.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify pretrained in the config accordingly. For example, to compress at a 0.5 reduce ratio on 8 A100 GPUs (80 GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --p 0.5 --epoch 15 \
    --pretrained pretrained/model_base_nlvr.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
    
  • Resources

    Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script
    --- | :---: | :---: | :---: | :---: | :---:
    0.3 | <a href="https://drive.google.com/uc?export=download&id=1pcsvlNRzzoq_q6Kaku_Kkg1MFELGoIxE">Download</a> | Link | <a href="https://drive.google.com/file/d/1aqiY86op26ceuWp6SFu1kaScqDnAIl1G/view?usp=drive_link">Download</a> | <a href="https://drive.google.com/file/d/1foe-c6qU97QGEz7kNC9OsGJ8OXk7OmQT/view?usp=drive_link">Download</a> | Link
    0.5 | <a href="https://drive.google.com/uc?export=download&id=1pcsvlNRzzoq_q6Kaku_Kkg1MFELGoIxE">Download</a> | Link | <a href="https://drive.google.com/file/d/1JyYypUDbZVD00ep5SSnQEc6LnOEL-ODT/view?usp=drive_link">Download</a> | <a href="https://drive.google.com/file/d/1R_TgQKlHv6Y6Fh5_ny4fRKNLAva75Frs/view?usp=drive_link">Download</a> | Link
    0.6 | <a href="https://drive.google.com/uc?export=download&id=1pcsvlNRzzoq_q6Kaku_Kkg1MFELGoIxE">Download</a> | Link | <a href="https://drive.google.com/file/d/1YB8xJee2R7B5PSjzLEJBjmQkBs5XAfIe/view?usp=drive_link">Download</a> | <a href="https://drive.google.com/file/d/1Sg_agxwV04o13d6XnJLblGby5cedtngT/view?usp=drive_link">Download</a> | Link
    0.7 | <a href="https://drive.google.com/uc?export=download&id=1pcsvlNRzzoq_q6Kaku_Kkg1MFELGoIxE">Download</a> | Link | <a href="https://drive.google.com/file/d/11DbcbzsCjA7mH5gbJQrtrHapobIz12n-/view?usp=drive_link">Download</a> | <a href="https://drive.google.com/file/d/1qcZf5YOl1aDW8S5OEDsIH6lZN4z2UgI8/view?usp=drive_link">Download</a> | Link
    0.8 | <a href="https://drive.google.com/uc?export=download&id=1pcsvlNRzzoq_q6Kaku_Kkg1MFELGoIxE">Download</a> | Link | <a href="https://drive.google.com/file/d/16K2WIslVVoAzqmMcwvoBWI4gTfxNc8Rv/view?usp=drive_link">Download</a> | <a href="https://drive.google.com/file/d/1l_isAhyRTr7n8qpzXaa8y6hz2BSyR95Y/view?usp=drive_link">Download</a> | Link
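For intuition about the reduce ratio --p used above: MADTP prunes tokens dynamically per layer and per input, so no fixed formula reproduces its exact behavior, but a uniform estimate gives a feel for the token budget. retained_tokens below is a hypothetical illustration, not code from this repository:

```python
# Back-of-the-envelope token budget for a given reduce ratio p.
# This uniform estimate is only illustrative: MADTP decides which
# tokens to keep dynamically, guided by multimodal alignment.

def retained_tokens(num_tokens: int, p: float) -> int:
    """Approximate number of tokens kept when a fraction p is pruned."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return max(1, round(num_tokens * (1.0 - p)))

# e.g. a 384x384 image split into 16x16 patches yields 576 visual tokens
for p in (0.3, 0.5, 0.7):
    print(f"p={p}: ~{retained_tokens(576, p)} of 576 tokens kept")
```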

Image Caption on the COCO Caption Dataset

  • Dataset & Annotation

    Download the COCO Caption dataset, unzip it under the datasets folder, and modify image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --evaluate \
    --pretrained output/caption_coco_compression_p0.5/model_base_caption_capfilt_large_coco_p0.5_compressed.pth \
    --config ./configs/caption_coco.yaml \
    --output_dir output/caption_coco_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify pretrained in the config accordingly. For example, to compress at a 0.5 reduce ratio on 8 A100 GPUs (80 GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/model_base_caption_capfilt_large.pth \
    --config ./configs/caption_coco.yaml \
    --output_dir output/caption_coco_compression_p0.5
    
<!-- * Resources Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script --- | :---: | :---: | :---: | :---: | :---: 0.5 | <a href="https://drive.google.com/uc?export=download&id=1qW_0DpQsDc6u9g3fSfTI4g_VXYsMA5s8">Download</a> | [Link](./scripts/compress_caption_coco_p0.5.sh) | <a href="*****r">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_caption_coco_p0.5_compressed.sh) 0.75 | <a href="https://drive.google.com/uc?export=download&id=1qW_0DpQsDc6u9g3fSfTI4g_VXYsMA5s8">Download</a> | [Link](./scripts/compress_caption_coco_p0.75.sh)| <a href="*****">Download</a> | <a href="*****">Download</a> | [Link](./scripts/evaluate_caption_coco_p0.75_compressed.sh) -->
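The checkpoint and output paths in the examples above follow a single naming convention. The sketch below reconstructs it; build_paths is a hypothetical helper inferred from the commands, not part of the MADTP codebase:

```python
# Reconstruct the output-directory and compressed-checkpoint naming
# convention used in the example commands. build_paths is a
# hypothetical helper inferred from the README, not repo code.
from pathlib import Path

def build_paths(task: str, dataset: str, base_ckpt: str, p: float):
    out_dir = Path("output") / f"{task}_{dataset}_compression_p{p}"
    stem = Path(base_ckpt).stem  # e.g. model_base_caption_capfilt_large
    checkpoint = out_dir / f"{stem}_{dataset}_p{p}_compressed.pth"
    return out_dir, checkpoint

out_dir, ckpt = build_paths(
    "caption", "coco", "pretrained/model_base_caption_capfilt_large.pth", 0.5
)
print(out_dir)   # output/caption_coco_compression_p0.5
print(ckpt)
```

The same scheme covers the NLVR2 examples, e.g. build_paths("nlvr", "nlvr2", "pretrained/model_base_nlvr.pth", 0.5) yields the paths used in the Visual Reasoning section.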

Visual Question Answer on the VQAv2 Dataset
