# AQLM
Official PyTorch repository for *Extreme Compression of Large Language Models via Additive Quantization* (https://arxiv.org/pdf/2401.06118.pdf) and *PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression* (https://arxiv.org/abs/2405.14852).
- [2025.04] Released `aqlm` v1.1.7: added support for arbitrary 8-dimensional codebooks on GPU and improved accuracy for 1-bit models. For example, ISTA-DASLab/Llama-2-7b-AQLM-1Bit-1x8-hf achieves WikiText-2 PPL 7.85 at ~1 bit per weight. To quantize your own models this way, use `num_codebooks=1, nbits_per_codebook=256` as per the tutorial below.
- [2024.11] PV-Tuning was accepted to NeurIPS 2024 for an oral presentation!
- [2024.08] We have merged the PV-Tuning branch into the main branch. To reproduce results with the old finetuning (before Aug 21), use commit 559a366.
- [2024.06] We released a new paper that extends AQLM with a new fine-tuning algorithm called PV-Tuning. We are also releasing PV-tuned AQLM models in this collection.
- [2024.05] AQLM was accepted to ICML 2024! If you're attending, meet us around this poster.
## Inference
### Demo
Learn how to run the prequantized models using these Google Colab examples:
| Basic AQLM <br> generation | Streaming with <br> GPU/CPU | Inference with CUDA <br> graphs (3x speedup) | Fine-tuning <br> with PEFT | Serving with <br> vLLM |
|:-----------:|:-------:|:---------------:|:----------:|:--------:|
| <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/streaming_example.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_cuda_graph.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_2bit_training.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_vllm.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
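Outside of Colab, the prequantized checkpoints load through Hugging Face `transformers` once the `aqlm` inference kernels are installed (`pip install aqlm[gpu]`). A minimal sketch, assuming a CUDA GPU; the exact hub id below follows the naming pattern of the 1-bit example above and is an assumption — check the model tables' links for the id you want:

```python
# Sketch: run a prequantized AQLM model via transformers.
# Assumptions: `pip install aqlm[gpu] transformers torch` and a CUDA GPU.
MODEL_ID = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"  # assumed id; see the tables below

def generate(prompt: str, max_new_tokens: int = 32) -> str:
    # Imports are local so the sketch can be read without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # device_map="auto" places the quantized layers on the available GPU(s).
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

For CPU inference, install the CPU kernels instead (`pip install aqlm[cpu]`); the streaming Colab above walks through that path.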
### Browser demo (Rust/WASM)
If you want to try AQLM+PV inference on CPU directly in your browser, check out aqlm-rs:
- Live demo: galqiwi.github.io/aqlm-rs/about.html
- Source code: galqiwi/demo-aqlm-rs
## Models
This repository currently supports models from the LLaMA, Mistral, and Mixtral families.
The models reported below use full model fine-tuning as described in Appendix A, with a cross-entropy objective against teacher logits.
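For intuition, fine-tuning against teacher logits amounts to a soft cross-entropy between the original model's and the quantized model's token distributions. A minimal pure-Python sketch of that loss (an illustration only, not the repository's training code, which operates on batched PyTorch logits):

```python
import math

def soft_cross_entropy(teacher_logits, student_logits):
    """Cross-entropy of the student's distribution under the teacher's
    soft targets: H(p_teacher, q_student) = -sum_i p_i * log q_i."""
    def log_softmax(logits):
        m = max(logits)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        return [x - lse for x in logits]

    teacher_logp = log_softmax(teacher_logits)
    student_logp = log_softmax(student_logits)
    return -sum(math.exp(tp) * sp for tp, sp in zip(teacher_logp, student_logp))

# When the quantized student matches the teacher exactly, the loss reduces to
# the teacher's own entropy (ln 2 for a uniform two-way distribution).
print(round(soft_cross_entropy([0.0, 0.0], [0.0, 0.0]), 4))  # → 0.6931
```

Minimizing this over the quantized parameters pushes the compressed model's predictions back toward the full-precision teacher's.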
We provide a number of prequantized AQLM models without PV-Tuning (scroll down for PV-Tuned models):
| Model | AQLM scheme | WikiText-2 PPL | MMLU (5-shot) FP16→AQLM | Model size, Gb | Hub link |
|------------|-------------|----------------|-------------------------|----------------|----------|
| Llama-3-8b | 1x16 | - | 0.65→0.56 | 4.1 | Link |
| Llama-3-8b-Instruct | 1x16 | - | 0.66→0.59 | 4.1 | Link |
| Llama-3-70b | 1x16 | - | 0.79→0.75 | 21.9 | Link |
| Llama-3-70b-Instruct | 1x16 | - | 0.80→0.76 | 21.9 | Link |
| Command-R | 1x16 | - | 0.68→0.57 | 12.7 | Link |
| Command-R+ | 1x16 | - | 0.74→0.68 | 31.9 | Link |
| Mistral-7b | 1x16 | 5.40 | - | 2.5 | Link |
| Mistral-7B-Instruct-v0.2 | 2x8 | - | 0.59→0.44 | 2.5 | Link |
| Mixtral-8x7b | 1x16 | 3.35 | - | 12.6 | Link |
| Mixtral-8x7b-Instruct | 1x16 | - | - | 12.6 | Link |
| Llama-2-7b | 1x16 | 5.92 | 0.46→0.39 | 2.4 | Link |
| Llama-2-7b | 2x8 | 6.69 | - | 2.2 | Link |
| Llama-2-7b | 8x8 | 6.61 | - | 2.2 | Link |
| Llama-2-13b | 1x16 | 5.22 | 0.55→0.49 | 4.1 | Link |
| Llama-2-13b | 2x8 | 5.63 | - | 3.8 | Link |
| Llama-2-70b | 1x16 | 3.83 | 0.69→0.65 | 18.8 | Link |
| Llama-2-70b | 2x8 | 4.21 | - | 18.2 | Link |
| gemma-2b | 1x16 | - | - | 1.7 | Link |
| gemma-2b | 2x8 | - | - | 1.6 | Link |
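In the scheme column, `NxB` denotes N codebooks of B bits each applied per group of weights (group size 8 unless a `gG` suffix overrides it, as in the PV-tuned table below), so the nominal cost is N·B bits per group. A small helper to decode the convention (a sketch; codebook storage adds a small real-world overhead on top of this figure):

```python
import re

def bits_per_weight(scheme: str) -> float:
    """Decode an AQLM scheme string like '1x16' or '1x16g8' into its
    nominal bits per weight: (codebooks * bits_per_codebook) / group_size."""
    m = re.fullmatch(r"(\d+)x(\d+)(?:g(\d+))?", scheme)
    if m is None:
        raise ValueError(f"unrecognized scheme: {scheme!r}")
    num_codebooks, nbits, group = int(m[1]), int(m[2]), int(m[3] or 8)
    return num_codebooks * nbits / group

print(bits_per_weight("1x16"))   # → 2.0  (the ~2-bit models above)
print(bits_per_weight("1x8g8"))  # → 1.0  (the ~1-bit scheme from the news)
```

This is why `1x16` and `2x8` both land near 2 bits per weight yet differ in accuracy: they spend the same bit budget through different codebook shapes.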
You can also download AQLM models tuned via PV-Tuning:
| Model | AQLM scheme | WikiText-2 PPL | Model size, Gb | Hub link |
|------------|-------------|----------------|----------------|----------|
| Llama-2-7b | 1x16g8 | 5.68 | 2.4 | Link |
| Llama-2-7b | 2x8g8 | 5.90 | 2.2 | Link |
| Llama-2-7b | 1x16g16 | 9.21 | 1.7 | Link |
| Llama-2-7b | 1x8g8 (New!) | 7.85 | 1.34 | Link |
| Llama-2-13b | 1x16g8 | 5.05 | 4.1 | Link |
| Llama-2-70b | 1x16g8 | 3.78 | 18.8 | Link |
| Meta-Llama-3.2-1B | 2x8g8 | - | 0.8 | Link |
| Meta-Llama-3.2-1B-Instruct | 2x8g8 | - | 0.8 | Link |
| Meta-Llama-3.2-3B | 2x8g8 | - | 1.5 | Link |
| Meta-Llama-3.2-3B-Instruct | 2x8g8 | - | 1.5 | Link |
| Meta-Llama-3-8B | 1x16g8 | 6.99 | 4.1 | Link |
| Meta-Llama-3-8B | 1x16g16 | 9.43 | 3.9 | Link |
| Meta-Llama-3.1-8B | 1x16g16 | - | - | Link |
| Meta-Llama-3.1-8B | 1x16g8 | - | - | Link |
| Meta-Llama-3.1-8B-Instruct | 1x16g16 | - | - | Link |
| Meta-Llama-3.1-8B-Instruct | 1x16g8 | - | - | Link |
| Meta-Llama-3.1-8B-Instruct | 2x8g8 | - | - | Link |
| Meta-Llama-3-70B | 1x16g8 | 4.57 | 21.9 | Link |
| Meta-Llama-3-70B | 1x16g16 | 8.67 | 13 | Link |
| Meta-Llama-3.