<div align ="center"> <h2> [CVPR 2025 Highlight] CASP: Compression of Large Multimodal Models Based on Attention Sparsity</h2>

Mohsen Gholami, Mohammad Akbari, Kevin Cannons, Yong Zhang

Huawei Technologies Canada

Link to Huawei's AI Gallery Notebook: https://developer.huaweicloud.com/develop/aigallery/notebook/detail?id=27904e61-36d1-4a72-97f3-b6ed57905f99

</div> <img src="assets/teaser.png" alt="Image 1" width="38%" /> <img src="assets/figure_proof.png" alt="Image 2" width="58%" />

Highlights

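CASP is a 2-bit compression method for large multimodal models (VLMs). Building on attention sparsity, it applies low-rank compression to the attention query and key weights (Wq and Wk) and then quantizes the model. It is compatible with any quantization technique and improves the state-of-the-art 2-bit quantization methods AQLM and QuIP# by an average of 21% on image- and video-language benchmarks.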
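To make the 2-bit setting concrete, here is a back-of-the-envelope estimate of weight storage for a language model with 7 × 10^9 weights (a round, illustrative figure, with GB = 10^9 bytes). It ignores codebooks, scales, and any layers kept in higher precision, so real 2-bit checkpoints are somewhat larger than this nominal figure.

```latex
\begin{aligned}
\text{FP16:}\quad & 7\times10^{9} \times 16\ \text{bits} = 1.12\times10^{11}\ \text{bits} = 14\ \text{GB} \\
\text{2-bit:}\quad & 7\times10^{9} \times 2\ \text{bits} = 1.4\times10^{10}\ \text{bits} = 1.75\ \text{GB}
\end{aligned}
```

The nominal reduction is 16 / 2 = 8×.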

Installation:

Install the requirements:
```
pip install -r requirements.txt
```

QuIP#:

Build and install the CUDA inference kernels (run from the repository root):
```
cd quip-sharp/quiptools && python setup.py install && cd ../..
```
Install the fast-hadamard-transform package from its GitHub repository: https://github.com/Dao-AILab/fast-hadamard-transform

AQLM:

```
pip install "aqlm[gpu,cpu]"
```
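The installation steps above can be collected into one script. This is a minimal sketch, assuming it is run from the repository root, that pip, git, and python are on the PATH, and that fast-hadamard-transform can be built from a clone of its GitHub repository with pip install . (an assumption; check that project's own instructions). The helper names run and install_casp, the DRY_RUN switch, and the clone location under $TMPDIR are hypothetical and not part of the CASP repository. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical install helper for CASP; run from the repository root.
# Prints the commands by default; set DRY_RUN=0 to actually run them.
set -eu

# Echo a command, and execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

install_casp() {
  # Python dependencies listed by the repository.
  run pip install -r requirements.txt

  # QuIP#: build and install the CUDA inference kernels.
  # The subshell restores the working directory afterwards.
  ( run cd quip-sharp/quiptools && run python setup.py install )

  # QuIP#: fast-hadamard-transform, built from a clone of its GitHub repo
  # (the clone location under $TMPDIR is an assumption).
  fht_dir="${TMPDIR:-/tmp}/fast-hadamard-transform"
  run git clone https://github.com/Dao-AILab/fast-hadamard-transform.git "$fht_dir"
  ( run cd "$fht_dir" && run pip install . )

  # AQLM with GPU and CPU extras; quoting stops the shell from treating
  # the brackets as a glob pattern (zsh errors on unmatched globs).
  run pip install "aqlm[gpu,cpu]"
}

install_casp
```

For example, saving this as a hypothetical install_casp.sh and running DRY_RUN=0 bash install_casp.sh performs the installation; without DRY_RUN=0 it only lists the steps, which is a cheap way to review them first.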

Quantization:

CASP<sub>QuIP#</sub>:

Follow the steps below to prepare CASP<sub>QuIP#</sub> for LLaVA-1.5-7B. To quantize LLaVA-1.5-13B or LLaVA-Next instead, set the --model argument in the scripts accordingly. To quantize LLaMA-7B, use svd_llama.sh, hfize_llama.sh, and quantize_finetune_llama.sh in the steps below.

To prepare LLaVA-1.5-7B with low-rank-compressed Wq and Wk:
```
bash SVD/scripts/svd_llava.sh
```
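The command above replaces the query and key projections with low-rank factors. The script name points to an SVD-based factorization; as a generic sketch of that idea (not necessarily the exact procedure CASP uses), a truncated SVD gives the best rank-r approximation of Wq in the Frobenius norm (Eckart-Young), and the attention logits are then computed from the factors. The rank r, the factor names A and B, and the value r = 1024 in the parameter count are illustrative assumptions; d = 4096 is the hidden size of the 7B LLaMA-family backbone.

```latex
\begin{aligned}
W_q &= U \Sigma V^\top \approx U_r \Sigma_r V_r^\top = A_q B_q,
\qquad A_q = U_r \Sigma_r \in \mathbb{R}^{d \times r},\ B_q = V_r^\top \in \mathbb{R}^{r \times d} \\
Q K^\top &= X W_q W_k^\top X^\top \approx X A_q B_q B_k^\top A_k^\top X^\top \\
d^2 &= 4096^2 = 16{,}777{,}216 \ \text{parameters} \;\longrightarrow\; 2dr = 2 \cdot 4096 \cdot 1024 = 8{,}388{,}608 \ \text{parameters}
\end{aligned}
```

Since only Wq and Wk are factorized, the value and output projections keep their full size and are handled by the quantizer alone.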

To prepare Hessians for QuIP#:

```
bash quip-sharp/scripts/hfize_llava.sh
```

Quantization:

```
bash quip-sharp/scripts/quantize_finetune_llava.sh
```

CASP<sub>AQLM</sub>:

Follow the steps below to prepare CASP<sub>AQLM</sub> for LLaVA-1.5-7B. To quantize LLaVA-1.5-13B or LLaVA-Next instead, set the --model argument in the scripts accordingly. To quantize LLaMA-7B, use svd_llama.sh and quantize_llama.sh in the steps below.

To prepare LLaVA with low-rank-compressed Wq and Wk:
```
bash SVD/scripts/svd_llava.sh
```
Quantization:
```
bash AQLM/scripts/quantize_llava.sh 
```

CASP<sub>GPTQ</sub>:

Follow the steps below to prepare CASP<sub>GPTQ</sub> for LLaVA-1.5-7B. To quantize LLaVA-1.5-13B or LLaVA-Next instead, set the --model argument in the scripts accordingly. To quantize LLaMA-7B, use svd_llama.sh and quantize_llama.sh in the steps below.

To prepare LLaVA with low-rank-compressed Wq and Wk:
```
bash SVD/scripts/svd_llava.sh
```
Quantization:
```
bash GPTQ/scripts/quantize_llava.sh
```
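The three pipelines above share the same first step (low-rank compression of Wq and Wk) and differ only in the quantization scripts that follow. A minimal driver sketch, assuming it is run from the repository root and that the LLaMA scripts live in the same directories as their LLaVA counterparts (SVD/scripts/, quip-sharp/scripts/, AQLM/scripts/, GPTQ/scripts/); the function names casp_steps and casp_pipeline and the DRY_RUN switch are hypothetical. Choosing between LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-Next is still done through --model inside the scripts.

```shell
#!/usr/bin/env bash
# Hypothetical driver for the CASP pipelines; run from the repository root.
# Usage: casp_pipeline <quip|aqlm|gptq> <llava|llama>
# Prints the steps by default; set DRY_RUN=0 to actually run them.
set -eu

# Print, one per line, the scripts for a backend and model family.
casp_steps() {
  backend=$1
  family=$2
  case "$family" in
    llava|llama) ;;
    *) echo "unknown model family: $family" >&2; return 1 ;;
  esac
  case "$backend" in
    quip|aqlm|gptq) ;;
    *) echo "unknown backend: $backend" >&2; return 1 ;;
  esac
  # Step 1, shared by all backends: low-rank compression of Wq and Wk.
  echo "SVD/scripts/svd_${family}.sh"
  case "$backend" in
    quip)
      # QuIP#: Hessian preparation, then quantization with fine-tuning.
      echo "quip-sharp/scripts/hfize_${family}.sh"
      echo "quip-sharp/scripts/quantize_finetune_${family}.sh"
      ;;
    aqlm) echo "AQLM/scripts/quantize_${family}.sh" ;;
    gptq) echo "GPTQ/scripts/quantize_${family}.sh" ;;
  esac
}

# Run (or, by default, just print) each step in order; stop at the first failure.
casp_pipeline() {
  steps=$(casp_steps "$1" "$2") || return 1
  for step in $steps; do
    echo "+ bash $step"
    if [ "${DRY_RUN:-1}" = "0" ]; then
      bash "$step" || { echo "step failed: $step" >&2; return 1; }
    fi
  done
}

casp_pipeline "${1:-quip}" "${2:-llava}"
```

Stopping at the first failing step matters here because each stage consumes the checkpoint written by the previous one; quantizing a model whose SVD step failed would silently start from the wrong weights.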

📚 Citation

If you find CASP useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry:

```
@misc{gholami2025caspcompressionlargemultimodal,
      title={CASP: Compression of Large Multimodal Models Based on Attention Sparsity}, 
      author={Mohsen Gholami and Mohammad Akbari and Kevin Cannons and Yong Zhang},
      year={2025},
      eprint={2503.05936},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.05936}, 
}
```
