AQLM

Official PyTorch repository for Extreme Compression of Large Language Models via Additive Quantization (https://arxiv.org/pdf/2401.06118.pdf) and PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression (https://arxiv.org/abs/2405.14852).



[2025.04] Released aqlm v1.1.7. Added support for arbitrary 8-dimensional codebooks on GPU and improved accuracy for 1-bit models: e.g., ISTA-DASLab/Llama-2-7b-AQLM-1Bit-1x8-hf at ~1 bit achieves WikiText-2 PPL 7.85. To quantize your own models this way, use num_codebooks=1, nbits_per_codebook=256 as per the tutorial below.

[2024.11] PV-Tuning was accepted to NeurIPS'2024 for an oral presentation!

[2024.08] We have merged the PV-Tuning branch into the main branch. To reproduce results with the old fine-tuning (before Aug 21), use commit 559a366.

[2024.06] We released a new paper that extends AQLM with a new fine-tuning algorithm called PV-Tuning. We're also releasing PV-Tuned AQLM models in this collection.

[2024.05] AQLM was accepted to ICML'2024! If you're attending, meet us around this poster.

Inference

Demo

Learn how to run the prequantized models using these Google Colab examples:

| Basic AQLM <br> generation | Streaming with <br> GPU/CPU | Inference with CUDA <br> graphs (3x speedup) | Fine-tuning <br> with PEFT | Serving with <br> vLLM |
|:-----------:|:-------:|:---------------:|:----------:|:--------:|
| <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/streaming_example.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_cuda_graph.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_2bit_training.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_vllm.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |

Browser demo (Rust/WASM)

If you want to try AQLM+PV inference on CPU directly in your browser, check out aqlm-rs.

Models

This repository currently supports models from the LLaMA, Mistral and Mixtral families. The models reported below were quantized with full model fine-tuning as described in Appendix A, using a cross-entropy objective on teacher logits.

We provide a number of prequantized AQLM models without PV-Tuning (scroll down for PV-Tuned models):

| Model | AQLM scheme | WikiText-2 PPL | MMLU (5-shot) FP16→AQLM | Model size, Gb | Hub link |
|------------|-------------|----------------|---------------|----------------|----------|
| Llama-3-8b | 1x16 | - | 0.65→0.56 | 4.1 | Link |
| Llama-3-8b-Instruct | 1x16 | - | 0.66→0.59 | 4.1 | Link |
| Llama-3-70b | 1x16 | - | 0.79→0.75 | 21.9 | Link |
| Llama-3-70b-Instruct | 1x16 | - | 0.80→0.76 | 21.9 | Link |
| Command-R | 1x16 | - | 0.68→0.57 | 12.7 | Link |
| Command-R+ | 1x16 | - | 0.74→0.68 | 31.9 | Link |
| Mistral-7b | 1x16 | 5.40 | - | 2.5 | Link |
| Mistral-7B-Instruct-v0.2 | 2x8 | - | 0.59→0.44 | 2.5 | Link |
| Mixtral-8x7b | 1x16 | 3.35 | - | 12.6 | Link |
| Mixtral-8x7b-Instruct | 1x16 | - | - | 12.6 | Link |
| Llama-2-7b | 1x16 | 5.92 | 0.46→0.39 | 2.4 | Link |
| Llama-2-7b | 2x8 | 6.69 | - | 2.2 | Link |
| Llama-2-7b | 8x8 | 6.61 | - | 2.2 | Link |
| Llama-2-13b | 1x16 | 5.22 | 0.55→0.49 | 4.1 | Link |
| Llama-2-13b | 2x8 | 5.63 | - | 3.8 | Link |
| Llama-2-70b | 1x16 | 3.83 | 0.69→0.65 | 18.8 | Link |
| Llama-2-70b | 2x8 | 4.21 | - | 18.2 | Link |
| gemma-2b | 1x16 | - | - | 1.7 | Link |
| gemma-2b | 2x8 | - | - | 1.6 | Link |
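As a sanity check on the "Model size" column, the code storage can be estimated from the parameter count and the scheme's bits per weight (the 1x16 entries above store 16-bit codes per group of 8 weights, i.e. ~2 bits per weight). The sketch below uses a hypothetical helper; the gap versus the table comes from codebooks, scales and unquantized layers, which the estimate ignores:

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough compressed size in GB from code bits alone.

    Ignores codebooks, per-group scales, embeddings and any layers
    left in higher precision, so it slightly undershoots the table.
    """
    return n_params * bits_per_weight / 8 / 1e9

# 1x16 (16-bit codes over groups of 8 weights) -> ~2 bits/weight:
print(round(approx_size_gb(70e9, 2.0), 2))  # Llama-2-70b: 17.5 vs 18.8 Gb in the table
print(round(approx_size_gb(7e9, 2.0), 2))   # Llama-2-7b:  1.75 vs 2.4 Gb in the table
```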

You can also download AQLM models tuned via PV-tuning:

| Model | AQLM scheme | WikiText-2 PPL | Model size, Gb | Hub link |
|------------|-------------|----------------|----------------|----------|
| Llama-2-7b | 1x16g8 | 5.68 | 2.4 | Link |
| Llama-2-7b | 2x8g8 | 5.90 | 2.2 | Link |
| Llama-2-7b | 1x16g16 | 9.21 | 1.7 | Link |
| Llama-2-7b | 1x8g8 (New!) | 7.85 | 1.34 | Link |
| Llama-2-13b | 1x16g8 | 5.05 | 4.1 | Link |
| Llama-2-70b | 1x16g8 | 3.78 | 18.8 | Link |
| Meta-Llama-3.2-1B | 2x8g8 | - | 0.8 | Link |
| Meta-Llama-3.2-1B-Instruct | 2x8g8 | - | 0.8 | Link |
| Meta-Llama-3.2-3B | 2x8g8 | - | 1.5 | Link |
| Meta-Llama-3.2-3B-Instruct | 2x8g8 | - | 1.5 | Link |
| Meta-Llama-3-8B | 1x16g8 | 6.99 | 4.1 | Link |
| Meta-Llama-3-8B | 1x16g16 | 9.43 | 3.9 | Link |
| Meta-Llama-3.1-8B | 1x16g16 | - | - | Link |
| Meta-Llama-3.1-8B | 1x16g8 | - | - | Link |
| Meta-Llama-3.1-8B-Instruct | 1x16g16 | - | - | Link |
| Meta-Llama-3.1-8B-Instruct | 1x16g8 | - | - | Link |
| Meta-Llama-3.1-8B-Instruct | 2x8g8 | - | - | Link |
| Meta-Llama-3-70B | 1x16g8 | 4.57 | 21.9 | Link |
| Meta-Llama-3-70B | 1x16g16 | 8.67 | 13 | Link |
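The scheme notation MxNgK can be read as M codebooks of N bits each, shared by groups of K weights (the first table omits the g suffix; its sizes are consistent with g = 8), giving roughly M·N/K bits per weight. A minimal sketch with a hypothetical helper, ignoring codebook and scale overhead:

```python
def bits_per_weight(num_codebooks: int, nbits_per_codebook: int, group_size: int) -> float:
    """Approximate code storage per weight for an AQLM scheme "MxNgK".

    Each group of `group_size` weights is encoded by `num_codebooks`
    codes of `nbits_per_codebook` bits; overhead is not counted.
    """
    return num_codebooks * nbits_per_codebook / group_size

# Schemes from the tables above:
print(bits_per_weight(1, 16, 8))   # 1x16g8  -> 2.0 bits/weight
print(bits_per_weight(2, 8, 8))    # 2x8g8   -> 2.0
print(bits_per_weight(1, 16, 16))  # 1x16g16 -> 1.0
print(bits_per_weight(1, 8, 8))    # 1x8g8   -> 1.0, the ~1-bit models
```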
