# CommVQ

[ICML 2025] CommVQ: Commutative Vector Quantization for KV Cache Compression
This repository contains the official implementation of CommVQ, a method for memory-efficient long-context inference that compresses the KV cache via quantization with learned codebooks. It achieves strong performance across a wide range of benchmarks while significantly reducing memory overhead.
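The core idea of codebook-based KV cache quantization can be illustrated with a toy sketch (illustrative only; this is plain nearest-neighbor vector quantization, not the commutative CommVQ algorithm): each cached vector is stored as the index of its nearest codeword, and dequantization is a single codebook lookup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1,000 cached key/value vectors of dimension 128,
# and a codebook with 256 entries (so indices fit in one byte).
head_dim = 128
cache = rng.standard_normal((1000, head_dim)).astype(np.float32)
codebook = rng.standard_normal((256, head_dim)).astype(np.float32)

# Quantize: keep only the index of the nearest codeword per vector.
dists = ((cache[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = dists.argmin(axis=1).astype(np.uint8)

# Dequantize: a codebook lookup reconstructs an approximation.
reconstructed = codebook[indices]

print(indices.nbytes, cache.nbytes)  # -> 1000 512000
```

A random codebook reconstructs poorly, of course; the point of the training recipe below is to learn codebooks that minimize this reconstruction error on real KV cache data.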
## Table of Contents

- [News](#news)
- [Model Checkpoints](#model-checkpoints)
- [Installation](#installation)
- [Training](#training)
- [Evaluation](#evaluation)
- [Memory Measurement](#memory-measurement)
- [Citation](#citation)
## News
- [June, 2025]: Released code and model weights.
- [May, 2025]: CommVQ is accepted to ICML 2025! See you in Vancouver, BC.
## Model Checkpoints
We release the following LLaMA-3.1 8B checkpoints with CommVQ 1-bit and 2-bit compression. Both value codebooks and key codebooks are provided below. The value codebooks are used together with the original (unchanged) model weights.
| Model Variant | Value Codebook | Key Codebook |
|---------------|----------------|--------------|
| LLaMA-3.1 8B + CommVQ 1-bit | 🤗 Hugging Face | 🤗 Hugging Face |
| LLaMA-3.1 8B + CommVQ 2-bit | 🤗 Hugging Face | 🤗 Hugging Face |
## Installation

```bash
conda create -n commvq python=3.10
conda activate commvq
pip install -e .
pip install flash-attn --no-build-isolation
```
## Training

```bash
cd training
# Step 1: Collect KV cache
bash collect_kv.sh
# Step 2: Prepare scaling factors
python make_scale.py
# Step 3: Train the codebook for key cache
bash quantize_key_cache.sh
# Step 4: Train the codebook for value cache
bash finetune/llama3.1_8b_int1.sh
```
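Conceptually, codebook training fits codewords to the cache vectors collected in Step 1. A minimal sketch of that idea is k-means clustering (a hedged illustration only; the scripts above train CommVQ's commutative codebooks, not plain k-means):

```python
import numpy as np

def learn_codebook(vectors, num_codes=256, iters=10, seed=0):
    """Toy k-means codebook learner over collected cache vectors.

    Illustration of fitting a codebook to data; the actual CommVQ
    training objective and codebook structure differ.
    """
    rng = np.random.default_rng(seed)
    # Initialize codewords from random data points.
    codebook = vectors[rng.choice(len(vectors), num_codes, replace=False)].copy()
    for _ in range(iters):
        # Assign each vector to its nearest codeword.
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        # Move each codeword to the mean of its assigned vectors.
        for k in range(num_codes):
            members = vectors[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

kv = np.random.default_rng(1).standard_normal((2000, 64)).astype(np.float32)
cb = learn_codebook(kv, num_codes=16)
```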
## Evaluation

### LongBench

```bash
cd evaluation/longbench
python pred.py --model $CHECKPOINT
python eval.py --model $RESULT_DIR
```
### InfiniteBench

```bash
cd evaluation/infiniteBench/src
# Download the evaluation datasets
bash scripts/download_dataset.sh
# Evaluate each task
bash run_passkey.sh
# Merge the per-task result shards into one jsonl file
cat ../results/commvq/preds_passkey_*.jsonl > ../results/commvq/preds_passkey.jsonl
# Compute the task score
python compute_scores.py --task all --model_name commvq
```
### NIAH (Needle-in-a-Haystack)

```bash
cd evaluation/niah
bash run.sh $CHECKPOINT
```
## Memory Measurement

We implement Triton-based kernels to further optimize memory usage and enable real memory savings with CommVQ. (Currently supports LLaMA-3.1 8B with 1-bit quantization; ongoing development for broader model support.)

```bash
cd evaluation/memory_measurement
pip install -e ../../transformers_triton_infer
bash eval_memory.sh $CHECKPOINT
```
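The headline saving is easy to estimate from the model shape. Assuming LLaMA-3.1 8B's published configuration (32 layers, 8 KV heads via GQA, head dim 128) and ignoring codebook overhead, a back-of-the-envelope calculation:

```python
# Hedged estimate of KV cache memory for LLaMA-3.1 8B; config values
# taken from the public model card, codebook overhead ignored.
layers, kv_heads, head_dim = 32, 8, 128

# Per token: K and V (factor 2), 2 bytes per element in fp16.
bytes_per_token_fp16 = 2 * layers * kv_heads * head_dim * 2
context = 128 * 1024  # 128K-token context

fp16_gib = bytes_per_token_fp16 * context / 2**30
one_bit_gib = fp16_gib / 16  # 16 bits -> 1 bit per element

print(f"fp16 KV cache: {fp16_gib:.1f} GiB, 1-bit: {one_bit_gib:.1f} GiB")
# -> fp16 KV cache: 16.0 GiB, 1-bit: 1.0 GiB
```

So at a 128K context the fp16 KV cache alone would roughly match the 8B model's weights in size, which is why the 1-bit kernels translate into real end-to-end memory savings.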
## Citation

If you find CommVQ useful in your research or applications, please consider citing:

```bibtex
@inproceedings{li2025commvq,
  title     = {CommVQ: Commutative Vector Quantization for KV Cache Compression},
  author    = {Junyan Li and Yang Zhang and Muhammad Yusuf Hassan and Talha Chafekar and Tianle Cai and Zhile Ren and Pengsheng Guo and Binazir Karimzadeh and Colorado J Reed and Chong Wang and Chuang Gan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025}
}
```
