# LeanQuant

Code repository for the ICLR 2025 paper "LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid".
## 🔍 What is LeanQuant?
LeanQuant is an efficient large language model (LLM) quantization framework that minimizes quality loss while maximizing computational and memory efficiency. It introduces a loss-error-aware quantization grid that preserves outliers in the inverse Hessian—achieving superior model quality without extra storage or inference overhead.
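The exact formulation lives in the paper and the repo's quantizer, but the core idea of a loss-error-aware grid can be sketched as importance-weighted clustering. The toy sketch below is illustrative only: it assumes a k-means-style non-uniform grid in which each weight's importance is derived from its inverse-Hessian diagonal raised to an exponent `p`, which may differ from the shipped implementation in detail.

```python
import numpy as np

def loss_error_aware_grid(weights, hinv_diag, n_levels=16, exponent=4, iters=25):
    """Toy sketch: choose non-uniform quantization levels via weighted k-means,
    where a weight's importance grows as its inverse-Hessian diagonal shrinks
    (i.e., weights whose quantization error hurts the loss most pull the grid)."""
    importance = (1.0 / hinv_diag) ** exponent            # outlier-aware weighting
    centers = np.quantile(weights, np.linspace(0, 1, n_levels))  # quantile init
    for _ in range(iters):
        # Assign each weight to its nearest quantization level
        idx = np.abs(weights[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(n_levels):
            mask = idx == k
            if mask.any():
                # Importance-weighted centroid update
                centers[k] = np.average(weights[mask], weights=importance[mask])
    centers = np.sort(centers)
    idx = np.abs(weights[:, None] - centers[None, :]).argmin(axis=1)
    return centers, idx

rng = np.random.default_rng(0)
w = rng.normal(size=4096)                  # synthetic weight column
hinv = rng.uniform(0.1, 1.0, size=4096)    # synthetic inverse-Hessian diagonal
levels, assign = loss_error_aware_grid(w, hinv, n_levels=16)  # 16 levels = 4-bit grid
```

With 16 levels, each weight is stored as a 4-bit index into `levels`, which is why the grid adds no storage or inference overhead beyond a per-group codebook.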
📄 Read the full paper: arXiv
## 🚀 Why LeanQuant?
✅ Scalable Quantization – Handles ultra-large models with one or two GPUs
✅ Efficient Inference – Optimized 4-bit CUDA kernels for fast and memory-efficient execution
✅ High Accuracy – Compares favorably against state-of-the-art quantization methods
✅ Versatile – Supports non-uniform and affine quantization formats
✅ Minimal Dependencies – Easy installation and setup
✅ Broad Compatibility – Works on most CUDA GPUs and supports multi-GPU distributed inference
## 🛠️ Quick Start
- Make sure you have a Linux environment, a CUDA-enabled GPU, and Python and PyTorch installed.
- Install our pip package:

  ```bash
  # For CUDA 11.x
  pip install leanquant[cuda11]

  # For CUDA 12.x
  pip install leanquant[cuda12]
  ```

- Download a LeanQuant model from the Model Zoo below or from our HuggingFace page. Each downloaded model is a `.safetensors` file. For example, download the 4-bit `Llama-3.1-8B-Instruct` from this link or with the command `wget https://huggingface.co/LeanQuant/Llama-3.1-8B-Instruct-nu-4bit/resolve/main/model.safetensors`.
- The model can now be loaded for inference using the following script:
```python
from leanquant import LeanQuantModelForCausalLM

model = LeanQuantModelForCausalLM.from_pretrained(
    "<base-model-name>",
    "<path-to-model-safetensors>",
    bits=<bit-width>,
    device_map="auto"
)
```
A Complete Example: The following script shows how to run inference with a 4-bit Llama-3.1-8B-Instruct (with the model downloaded to ./model.safetensors):
```python
import torch
from leanquant import LeanQuantModelForCausalLM
from transformers import AutoTokenizer

### Load model and tokenizer
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = LeanQuantModelForCausalLM.from_pretrained(
    base_model_name,
    "./model.safetensors",
    bits=4,
    device_map="auto"
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

### Tokenize prompt
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What is quantization for deep learning models?"},
]
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

### Run generation and decode generated tokens
with torch.no_grad():
    output = model.generate(**inputs, do_sample=True, max_new_tokens=256)
generated_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(generated_text)
```
## 🦁 Model Zoo
Explore our collection of pre-quantized models for efficient deployment.
| Base Model Name                     | Quantized Bits | Download Link |
|-------------------------------------|----------------|---------------|
| meta-llama/Meta-Llama-3-8B          | 4-bit          | Download      |
| meta-llama/Meta-Llama-3-8B          | 3-bit          | Download      |
| meta-llama/Meta-Llama-3-8B          | 2-bit          | Download      |
| meta-llama/Llama-2-7b-hf            | 4-bit          | Download      |
| meta-llama/Llama-2-7b-hf            | 3-bit          | Download      |
| meta-llama/Llama-2-7b-hf            | 2-bit          | Download      |
| meta-llama/Llama-2-13b-hf           | 4-bit          | Download      |
| meta-llama/Llama-2-13b-hf           | 3-bit          | Download      |
| meta-llama/Llama-2-13b-hf           | 2-bit          | Download      |
| mistralai/Mistral-7B-v0.1           | 4-bit          | Download      |
| mistralai/Mistral-7B-v0.1           | 3-bit          | Download      |
| mistralai/Mistral-7B-v0.1           | 2-bit          | Download      |
| huggyllama/llama-13b                | 4-bit          | Download      |
| huggyllama/llama-13b                | 3-bit          | Download      |
| huggyllama/llama-13b                | 2-bit          | Download      |
| meta-llama/Meta-Llama-3-8B-Instruct | 4-bit          | Download      |
| meta-llama/Llama-3.1-8B             | 4-bit          | Download      |
| meta-llama/Llama-3.1-8B-Instruct    | 4-bit          | Download      |
| meta-llama/Llama-3.1-70B            | 4-bit          | Download      |
| meta-llama/Llama-3.3-70B-Instruct   | 4-bit          | Download      |
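For scripted downloads, the direct-download URL can be assembled from the repo name. The helper below is a sketch that assumes the `<org>/<base-model>-nu-<bits>bit` naming seen for the Llama-3.1-8B-Instruct repo in the Quick Start; other zoo entries may use different suffixes, so verify the repo name on the HuggingFace page first.

```python
def leanquant_url(base_model: str, bits: int = 4, org: str = "LeanQuant") -> str:
    """Build the direct download URL for a LeanQuant .safetensors file,
    assuming the '<org>/<base-model>-nu-<bits>bit' repo naming convention."""
    repo = f"{org}/{base_model}-nu-{bits}bit"
    return f"https://huggingface.co/{repo}/resolve/main/model.safetensors"

url = leanquant_url("Llama-3.1-8B-Instruct", bits=4)
print(url)
# → https://huggingface.co/LeanQuant/Llama-3.1-8B-Instruct-nu-4bit/resolve/main/model.safetensors
```

The resulting URL can be passed to `wget` or `curl -L`, matching the Quick Start download command.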
🚀 More models coming soon!
## 📌 How to Quantize and Evaluate a Model
Follow these steps to quantize and evaluate a large language model.
### Requirements
- At least one CUDA-enabled GPU is required for quantization and evaluation.
- A Linux environment is recommended.
### Setup

- Clone the repository:

  ```bash
  git clone https://github.com/LeanModels/LeanQuant.git
  cd LeanQuant
  ```

- [Optional] Create a Conda environment:

  ```bash
  conda create -n leanquant python=3.10
  conda activate leanquant
  ```

- Install dependencies:

  ```bash
  # For CUDA 11.x
  pip install cupy-cuda11x

  # For CUDA 12.x
  pip install cupy-cuda12x
  ```

- Install additional requirements:

  ```bash
  pip install -r requirements.txt
  ```
### Quantizing Models

We currently support Llama- and Mistral-family models (non-VLM, non-MoE). To quantize a model using LeanQuant, run the following command:
```bash
python llama.py <huggingface-model-name-or-path> <calibration-dataset-name> \
    --new-eval \
    --wbits 4 \
    --nsamples 128 \
    --true-sequential --act-order \
    --percdamp 0.1 \
    --exponent 4 \
    --save_path <quantized-model-name>.safetensors
```
Parameter Explanation:
| Parameter | Description |
|-----------|-------------|
| `<huggingface-model-name-or-path>` | The HuggingFace model name or local model path to quantize. Example: `meta-llama/Llama-3.1-8B-Instruct`. |
| `<calibration-dataset-name>` | Calibration dataset for quantization. Choices: `wikitext2`, `ptb`, `c4`, `c4-new` (recommended: `c4-new`). |
| `--new-eval` | Enables the new evaluation mode for perplexity testing. |
| `--wbits` | Bit-width for quantization. Choices: `4`, `3`, or `2`. |
| `--nsamples` | Number of calibration samples. Recommended: `128`, `256`, or `512`. |
| `--true-sequential` & `--act-order` | Improve quantized model quality. Recommended to enable. |
| `--percdamp` | Dampening applied to the Hessian. Recommended: `0.1` or `0.01`. |
| `--exponent` | Strength of outlier preservation during quantization (`p` in the paper). Recommended: `3` or `4`. |
| `--save_path` | Path and filename to save the quantized model. |
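To build intuition for `--percdamp`: GPTQ-style pipelines typically dampen the Hessian by adding a fraction of its mean diagonal before inversion, which keeps the inverse numerically stable. The sketch below assumes that standard form; LeanQuant's exact dampening may differ in detail.

```python
import numpy as np

def dampen_hessian(H, percdamp=0.1):
    """GPTQ-style dampening sketch: add percdamp * mean(diag(H)) to every
    diagonal entry so an ill-conditioned Hessian stays safely invertible."""
    damp = percdamp * np.mean(np.diag(H))
    return H + damp * np.eye(H.shape[0])

# A nearly rank-deficient Hessian becomes well-conditioned after dampening.
H = np.array([[4.0, 2.0], [2.0, 1.0001]])
Hd = dampen_hessian(H, percdamp=0.1)
print(np.linalg.cond(Hd) < np.linalg.cond(H))  # → True
```

Larger `--percdamp` values trade a little calibration fidelity for more numerical robustness, which is why both `0.1` and `0.01` are listed as reasonable choices.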
**Example:** To quantize `meta-llama/Llama-3.1-8B-Instruct` to 4-bit precision, run:

```bash
python llama.py meta-llama/Llama-3.1-8B-Instruct c4-new \
```