# AQLM
Official PyTorch repository for *Extreme Compression of Large Language Models via Additive Quantization* (https://arxiv.org/pdf/2401.06118.pdf) and *PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression* (https://arxiv.org/abs/2405.14852).
- [2025.04] Released `aqlm` v1.1.7: added support for arbitrary 8-dimensional codebooks on GPU and improved accuracy for 1-bit models. For example, ISTA-DASLab/Llama-2-7b-AQLM-1Bit-1x8-hf achieves WikiText-2 PPL 7.85 at ~1 bit per weight. To quantize your own models this way, use `num_codebooks=1, nbits_per_codebook=256` as per the tutorial below.
- [2024.11] PV-Tuning was accepted to NeurIPS 2024 for an oral presentation!
- [2024.08] We have merged the PV-Tuning branch into the main branch. To reproduce results with the old finetuning (before Aug 21), use commit 559a366.
- [2024.06] We released a new paper that extends AQLM with a new fine-tuning algorithm called PV-Tuning. We are also releasing PV-tuned AQLM models in this collection.
- [2024.05] AQLM was accepted to ICML 2024! If you're attending, meet us around this poster.
## Inference
### Demo
Learn how to run the prequantized models using these Google Colab examples:
| Basic AQLM <br> generation | Streaming with <br> GPU/CPU | Inference with CUDA <br> graphs (3x speedup) | Fine-tuning <br> with PEFT | Serving with <br> vLLM |
|:-----------:|:-------:|:---------------:|:----------:|:--------:|
| <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/streaming_example.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_cuda_graph.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_2bit_training.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_vllm.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
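Outside of Colab, the prequantized checkpoints load through Hugging Face `transformers` once the `aqlm` inference kernels are installed (`pip install aqlm[gpu]`). A minimal sketch, assuming a CUDA GPU; the exact hub id below follows the naming pattern of the 1-bit example above and is an assumption — check the model tables' links for the id you want:

```python
# Sketch: run a prequantized AQLM model via transformers.
# Assumptions: `pip install aqlm[gpu] transformers torch` and a CUDA GPU.
MODEL_ID = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"  # assumed id; see the tables below

def generate(prompt: str, max_new_tokens: int = 32) -> str:
    # Imports are local so the sketch can be read without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # device_map="auto" places the quantized layers on the available GPU(s).
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

For CPU inference, install the CPU kernels instead (`pip install aqlm[cpu]`); the streaming Colab above walks through that path.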
### Browser demo (Rust/WASM)
If you want to try AQLM+PV inference on CPU directly in your browser, check out aqlm-rs:
- Live demo: galqiwi.github.io/aqlm-rs/about.html
- Source code: galqiwi/demo-aqlm-rs
## Models
This repository currently supports models from the LLaMA, Mistral, and Mixtral families.
The models reported below use full model fine-tuning as described in Appendix A, with a cross-entropy objective against teacher logits.
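For intuition, fine-tuning against teacher logits amounts to a soft cross-entropy between the original model's and the quantized model's token distributions. A minimal pure-Python sketch of that loss (an illustration only, not the repository's training code, which operates on batched PyTorch logits):

```python
import math

def soft_cross_entropy(teacher_logits, student_logits):
    """Cross-entropy of the student's distribution under the teacher's
    soft targets: H(p_teacher, q_student) = -sum_i p_i * log q_i."""
    def log_softmax(logits):
        m = max(logits)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        return [x - lse for x in logits]

    teacher_logp = log_softmax(teacher_logits)
    student_logp = log_softmax(student_logits)
    return -sum(math.exp(tp) * sp for tp, sp in zip(teacher_logp, student_logp))

# When the quantized student matches the teacher exactly, the loss reduces to
# the teacher's own entropy (ln 2 for a uniform two-way distribution).
print(round(soft_cross_entropy([0.0, 0.0], [0.0, 0.0]), 4))  # → 0.6931
```

Minimizing this over the quantized parameters pushes the compressed model's predictions back toward the full-precision teacher's.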
We provide a number of prequantized AQLM models without PV-Tuning (scroll down for PV-Tuned models):
| Model | AQLM scheme | WikiText-2 PPL | MMLU (5-shot) FP16→AQLM | Model size, Gb | Hub link |
|------------|-------------|----------------|-------------------------|----------------|----------|
| Llama-3-8b | 1x16 | - | 0.65→0.56 | 4.1 | Link |
| Llama-3-8b-Instruct | 1x16 | - | 0.66→0.59 | 4.1 | Link |
| Llama-3-70b | 1x16 | - | 0.79→0.75 | 21.9 | Link |
| Llama-3-70b-Instruct | 1x16 | - | 0.80→0.76 | 21.9 | Link |
| Command-R | 1x16 | - | 0.68→0.57 | 12.7 | Link |
| Command-R+ | 1x16 | - | 0.74→0.68 | 31.9 | Link |
| Mistral-7b | 1x16 | 5.40 | - | 2.5 | Link |
| Mistral-7B-Instruct-v0.2 | 2x8 | - | 0.59→0.44 | 2.5 | Link |
| Mixtral-8x7b | 1x16 | 3.35 | - | 12.6 | Link |
| Mixtral-8x7b-Instruct | 1x16 | - | - | 12.6 | Link |
| Llama-2-7b | 1x16 | 5.92 | 0.46→0.39 | 2.4 | Link |
| Llama-2-7b | 2x8 | 6.69 | - | 2.2 | Link |
| Llama-2-7b | 8x8 | 6.61 | - | 2.2 | Link |
| Llama-2-13b | 1x16 | 5.22 | 0.55→0.49 | 4.1 | Link |
| Llama-2-13b | 2x8 | 5.63 | - | 3.8 | Link |
| Llama-2-70b | 1x16 | 3.83 | 0.69→0.65 | 18.8 | Link |
| Llama-2-70b | 2x8 | 4.21 | - | 18.2 | Link |
| gemma-2b | 1x16 | - | - | 1.7 | Link |
| gemma-2b | 2x8 | - | - | 1.6 | Link |
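In the scheme column, `NxB` denotes N codebooks of B bits each applied per group of weights (group size 8 unless a `gG` suffix overrides it, as in the PV-tuned table below), so the nominal cost is N·B bits per group. A small helper to decode the convention (a sketch; codebook storage adds a small real-world overhead on top of this figure):

```python
import re

def bits_per_weight(scheme: str) -> float:
    """Decode an AQLM scheme string like '1x16' or '1x16g8' into its
    nominal bits per weight: (codebooks * bits_per_codebook) / group_size."""
    m = re.fullmatch(r"(\d+)x(\d+)(?:g(\d+))?", scheme)
    if m is None:
        raise ValueError(f"unrecognized scheme: {scheme!r}")
    num_codebooks, nbits, group = int(m[1]), int(m[2]), int(m[3] or 8)
    return num_codebooks * nbits / group

print(bits_per_weight("1x16"))   # → 2.0  (the ~2-bit models above)
print(bits_per_weight("1x8g8"))  # → 1.0  (the ~1-bit scheme from the news)
```

This is why `1x16` and `2x8` both land near 2 bits per weight yet differ in accuracy: they spend the same bit budget through different codebook shapes.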
You can also download AQLM models tuned via PV-Tuning:
| Model | AQLM scheme | WikiText-2 PPL | Model size, Gb | Hub link |
|------------|-------------|----------------|----------------|----------|
| Llama-2-7b | 1x16g8 | 5.68 | 2.4 | Link |
| Llama-2-7b | 2x8g8 | 5.90 | 2.2 | Link |
| Llama-2-7b | 1x16g16 | 9.21 | 1.7 | Link |
| Llama-2-7b | 1x8g8 (New!) | 7.85 | 1.34 | Link |
| Llama-2-13b | 1x16g8 | 5.05 | 4.1 | Link |
| Llama-2-70b | 1x16g8 | 3.78 | 18.8 | Link |
| Meta-Llama-3.2-1B | 2x8g8 | - | 0.8 | Link |
| Meta-Llama-3.2-1B-Instruct | 2x8g8 | - | 0.8 | Link |
| Meta-Llama-3.2-3B | 2x8g8 | - | 1.5 | Link |
| Meta-Llama-3.2-3B-Instruct | 2x8g8 | - | 1.5 | Link |
| Meta-Llama-3-8B | 1x16g8 | 6.99 | 4.1 | Link |
| Meta-Llama-3-8B | 1x16g16 | 9.43 | 3.9 | Link |
| Meta-Llama-3.1-8B | 1x16g16 | - | - | Link |
| Meta-Llama-3.1-8B | 1x16g8 | - | - | Link |
| Meta-Llama-3.1-8B-Instruct | 1x16g16 | - | - | Link |
| Meta-Llama-3.1-8B-Instruct | 1x16g8 | - | - | Link |
| Meta-Llama-3.1-8B-Instruct | 2x8g8 | - | - | Link |
| Meta-Llama-3-70B | 1x16g8 | 4.57 | 21.9 | Link |
| Meta-Llama-3-70B | 1x16g16 | 8.67 | 13 | Link |
| Meta-Llama-3.