# EfficientQAT

Official PyTorch implementation of the paper *EfficientQAT: Efficient Quantization-Aware Training for Large Language Models* (ACL 2025 Main).
## News
- [2025/11] 🔥 We open-source INT vs. FP, a framework to compare low-bit integer and floating-point formats, including MXFP8/MXFP6/MXFP4/NVFP4 and MXINT8/MXINT6/MXINT4/NVINT4.
- [2025/05] 🔥 We explore the Scaling Law for Quantization-Aware Training, which offers insights and guidance for LLM QAT.
- [2025/05] 🌟 Our EfficientQAT paper has been accepted to the ACL 2025 Main Conference! 🎉 Cheers!
- [2024/10] 🔥 We release a new weight-activation quantization algorithm, PrefixQuant, which proposes an efficient method to isolate sink tokens (token-wise outliers).
- [2024/08] The new inference backend T-MAC from Microsoft now supports EfficientQAT models.
- [2024/08] We support quantization of Mistral-Large-Instruct. With EfficientQAT, w2g64 quantization compresses the 123B Mistral-Large-Instruct model to 35 GB with only 4 points of accuracy degradation.
- [2024/07] New features! We support transferring EfficientQAT-quantized models into GPTQ v2 and BitBLAS formats, which can be loaded directly through GPTQModel.
- [2024/07] We release EfficientQAT, which pushes the limits of uniform (INT) quantization in an efficient manner.
## Installation

- Clone this repository and navigate to the EfficientQAT folder:

```shell
git clone https://github.com/OpenGVLab/EfficientQAT.git
cd EfficientQAT
```

- Install packages:

```shell
conda create -n efficientqat python==3.11
conda activate efficientqat
pip install -r requirements.txt
```
## Model Zoo
We provide a number of prequantized EfficientQAT models as follows:

- WikiText2 PPL is measured at a context length of 2048.
- Avg. Accuracy indicates the average accuracy on 5 zero-shot reasoning tasks (WinoGrande, PIQA, HellaSwag, ARC-Easy, ARC-Challenge) with lm-eval v0.4.2.
- 1 GB = $10^9$ bytes.
- Hub Link: EQAT indicates the original checkpoints. We also transfer the checkpoints into GPTQ and BitBLAS formats, which can be loaded directly through GPTQModel. (PS: GPTQModel is an official bug-fixed repo of AutoGPTQ, which will be merged into AutoGPTQ in the future.)
| Model | Quantization | WikiText2 PPL | Avg. Accuracy | Model Size (GB) | Hub link |
|-------|--------------|---------------|---------------|-----------------|----------|
| Llama-2-7B | fp16 | 5.47 | 64.86 | 13.2 | - |
| Llama-2-7B | w4g128 | 5.53 | 64.27 | 3.7 | EQAT / GPTQ / BitBLAS |
| Llama-2-7B | w3g128 | 5.81 | 64.02 | 3.1 | EQAT |
| Llama-2-7B | w2g64 | 6.86 | 60.14 | 2.3 | EQAT / GPTQ / BitBLAS |
| Llama-2-7B | w2g128 | 7.17 | 59.50 | 2.2 | EQAT / GPTQ / BitBLAS |
| Llama-2-13B | fp16 | 4.88 | 67.81 | 25.4 | - |
| Llama-2-13B | w4g128 | 4.93 | 67.52 | 6.8 | EQAT / GPTQ / BitBLAS |
| Llama-2-13B | w3g128 | 5.12 | 67.28 | 5.6 | EQAT |
| Llama-2-13B | w2g64 | 5.96 | 64.88 | 4.0 | EQAT / GPTQ / BitBLAS |
| Llama-2-13B | w2g128 | 6.08 | 63.88 | 3.8 | EQAT / GPTQ / BitBLAS |
| Llama-2-70B | fp16 | 3.32 | 72.41 | 131.6 | - |
| Llama-2-70B | w4g128 | 3.39 | 72.62 | 35.8 | EQAT / GPTQ / BitBLAS |
| Llama-2-70B | w3g128 | 3.61 | 71.76 | 29.1 | EQAT |
| Llama-2-70B | w2g64 | 4.52 | 69.48 | 20.1 | EQAT / GPTQ / BitBLAS |
| Llama-2-70B | w2g128 | 4.61 | 68.93 | 18.9 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B | fp16 | 6.14 | 68.58 | 13.0 | - |
| Llama-3-8B | w4g128 | 6.47 | 68.43 | 5.4 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B | w3g128 | 7.09 | 67.35 | 4.7 | EQAT |
| Llama-3-8B | w2g64 | 9.41 | 60.76 | 3.9 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B | w2g128 | 9.80 | 59.36 | 3.8 | EQAT / GPTQ / BitBLAS |
| Llama-3-70B | fp16 | 2.85 | 75.33 | 137.8 | - |
| Llama-3-70B | w4g128 | 3.17 | 74.57 | 38.9 | EQAT / GPTQ / BitBLAS |
| Llama-3-70B | w3g128 | 4.19 | 72.42 | 32.2 | EQAT |
| Llama-3-70B | w2g64 | 6.08 | 67.89 | 23.2 | EQAT / GPTQ |
| Llama-3-70B | w2g128 | 6.38 | 67.57 | 22.0 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B-Instruct | fp16 | 8.29 | 68.43 | 13.0 | - |
| Llama-3-8B-Instruct | w4g128 | 7.93 | 68.39 | 5.4 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B-Instruct | w3g128 | 8.55 | 67.24 | 4.7 | EQAT |
| Llama-3-8B-Instruct | w2g64 | 11.19 | 60.66 | 3.9 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B-Instruct | w2g128 | 11.73 | 60.16 | 3.8 | EQAT / GPTQ / BitBLAS |
| Llama-3-70B-Instruct | fp16 | 5.33 | 73.78 | 137.8 | - |
| Llama-3-70B-Instruct | w4g128 | 5.35 | 73.47 | 38.9 | EQAT / GPTQ / BitBLAS |
| Llama-3-70B-Instruct | w3g128 | 5.65 | 72.87 | 32.2 | EQAT |
| Llama-3-70B-Instruct | w2g64 | 7.86 | 67.64 | 23.2 | EQAT / GPTQ / BitBLAS |
| Llama-3-70B-Instruct | w2g128 | 8.14 | 67.54 | 22.0 | EQAT / GPTQ / BitBLAS |
| Mistral-Large-Instruct-2407 | fp16 | 2.74 | 77.76 | 228.5 | - |
| Mistral-Large-Instruct-2407 | w2g64 | 5.58 | 73.54 | 35.5 | GPTQ |
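As a back-of-the-envelope check on the Model Size column: a wNgG quantized weight costs N bits plus 16-bit scale and zero-point values shared by every group of G weights, while embeddings and the output head stay in fp16. The sketch below is not the repo's code, and the 0.26B fp16 parameter count assumed for Llama-2-7B's embeddings/head is an estimate:

```python
def quantized_gib(n_params, bits, group_size, fp16_params=0.26e9):
    """Estimate checkpoint size in GB (10^9 bytes) for wNgG quantization.

    fp16_params: parameters kept in fp16 (embeddings + head); an assumption.
    """
    q_params = n_params - fp16_params            # weights actually quantized
    per_weight_bits = bits + 32.0 / group_size   # 16-bit scale + 16-bit zero per group
    total_bytes = q_params * per_weight_bits / 8 + fp16_params * 2
    return total_bytes / 1e9

# Llama-2-7B (~6.74B params) at w2g64: roughly 2.5 GB, the same ballpark
# as the 2.3 GB reported in the table above.
size = quantized_gib(6.74e9, bits=2, group_size=64)
```

The remaining gap comes from details this sketch ignores (exact parameter counts, metadata, and packing overheads).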
## Training
EfficientQAT involves two consecutive training phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). Detailed training scripts can be found in `./examples`. Below we give example training scripts for Llama-2-7B with w2g64 quantization.
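Conceptually, both phases optimize through a group-wise uniform fake quantization of the weights (quantize, then immediately dequantize, so the rounding error is visible to the training loss). A minimal sketch of that operation, assuming standard asymmetric min-max quantization rather than the repo's exact implementation:

```python
import torch

def fake_quant(w, bits=2, group_size=64):
    """Group-wise uniform fake quantization (sketch, not the repo's code)."""
    orig_shape = w.shape
    g = w.reshape(-1, group_size)                 # one scale/zero per group
    wmax = g.max(dim=1, keepdim=True).values
    wmin = g.min(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / (2 ** bits - 1)
    zero = (-wmin / scale).round()
    q = (g / scale + zero).round().clamp(0, 2 ** bits - 1)   # integer codes
    return ((q - zero) * scale).reshape(orig_shape)          # dequantized weights

w = torch.randn(128, 64)
w_q = fake_quant(w)   # used in the forward pass during training
```

Block-AP trains both the weights and the quantization parameters (scale/zero) block by block, while E2E-QP then trains only the quantization parameters end to end.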
- Block-AP
You should modify `--model` in the script to point to the folder containing the full-precision model before running the following command.

```shell
bash examples/block_ap/Llama-2-7b/w2g64.sh
```

Specifically, `--weight_lr` is 2e-5 for 2-bit and 1e-5 for 3-/4-bit in our experiments.
Some other important arguments:
