# EfficientQAT

Official PyTorch implementation of the paper *EfficientQAT: Efficient Quantization-Aware Training for Large Language Models* (ACL 2025 Main).
## News
- [2025/11] 🔥 We open-source INT vs. FP, a framework to compare low-bit integer and floating-point formats, including MXFP8/MXFP6/MXFP4/NVFP4 and MXINT8/MXINT6/MXINT4/NVINT4.
- [2025/05] 🔥 We explore the Scaling Law for Quantization-Aware Training, which offers insights and guidance for LLM QAT.
- [2025/05] 🌟 Our EfficientQAT paper has been accepted to the ACL 2025 Main Conference! 🎉 Cheers!
- [2024/10] 🔥 We release a new weight-activation quantization algorithm, PrefixQuant, which proposes an efficient method to isolate sink tokens (token-wise outliers).
- [2024/08] The new inference backend T-MAC from Microsoft now supports EfficientQAT models.
- [2024/08] We support quantization of Mistral-Large-Instruct. With EfficientQAT, w2g64 quantization compresses the 123B Mistral-Large-Instruct model to 35 GB with only 4 points of accuracy degradation.
- [2024/07] New features! We support transferring EfficientQAT-quantized models into GPTQ v2 and BitBLAS formats, which can be loaded directly through GPTQModel.
- [2024/07] We release EfficientQAT, which pushes the limits of uniform (INT) quantization in an efficient manner.
## Installation

- Clone this repository and navigate to the EfficientQAT folder:

```shell
git clone https://github.com/OpenGVLab/EfficientQAT.git
cd EfficientQAT
```

- Install packages:

```shell
conda create -n efficientqat python==3.11
conda activate efficientqat
pip install -r requirements.txt
```
## Model Zoo
We provide a number of prequantized EfficientQAT models as follows:

- WikiText2 PPL is measured at a context length of 2048.
- Avg. Accuracy indicates the average accuracy on 5 zero-shot reasoning tasks (WinoGrande, PIQA, HellaSwag, ARC-Easy, ARC-Challenge) with lm-eval v0.4.2.
- 1 GB = $10^9$ bytes.
- Hub Link: EQAT indicates the original checkpoints. We also transfer the checkpoints into GPTQ and BitBLAS formats, which can be loaded directly through GPTQModel. (PS: GPTQModel is an official bug-fixed repo of AutoGPTQ, which will be merged into AutoGPTQ in the future.)
| Model | Quantization | WikiText2 PPL | Avg. Accuracy | Model Size (GB) | Hub link |
|-------|--------------|---------------|---------------|-----------------|----------|
| Llama-2-7B | fp16 | 5.47 | 64.86 | 13.2 | - |
| Llama-2-7B | w4g128 | 5.53 | 64.27 | 3.7 | EQAT / GPTQ / BitBLAS |
| Llama-2-7B | w3g128 | 5.81 | 64.02 | 3.1 | EQAT |
| Llama-2-7B | w2g64 | 6.86 | 60.14 | 2.3 | EQAT / GPTQ / BitBLAS |
| Llama-2-7B | w2g128 | 7.17 | 59.50 | 2.2 | EQAT / GPTQ / BitBLAS |
| Llama-2-13B | fp16 | 4.88 | 67.81 | 25.4 | - |
| Llama-2-13B | w4g128 | 4.93 | 67.52 | 6.8 | EQAT / GPTQ / BitBLAS |
| Llama-2-13B | w3g128 | 5.12 | 67.28 | 5.6 | EQAT |
| Llama-2-13B | w2g64 | 5.96 | 64.88 | 4.0 | EQAT / GPTQ / BitBLAS |
| Llama-2-13B | w2g128 | 6.08 | 63.88 | 3.8 | EQAT / GPTQ / BitBLAS |
| Llama-2-70B | fp16 | 3.32 | 72.41 | 131.6 | - |
| Llama-2-70B | w4g128 | 3.39 | 72.62 | 35.8 | EQAT / GPTQ / BitBLAS |
| Llama-2-70B | w3g128 | 3.61 | 71.76 | 29.1 | EQAT |
| Llama-2-70B | w2g64 | 4.52 | 69.48 | 20.1 | EQAT / GPTQ / BitBLAS |
| Llama-2-70B | w2g128 | 4.61 | 68.93 | 18.9 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B | fp16 | 6.14 | 68.58 | 13.0 | - |
| Llama-3-8B | w4g128 | 6.47 | 68.43 | 5.4 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B | w3g128 | 7.09 | 67.35 | 4.7 | EQAT |
| Llama-3-8B | w2g64 | 9.41 | 60.76 | 3.9 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B | w2g128 | 9.80 | 59.36 | 3.8 | EQAT / GPTQ / BitBLAS |
| Llama-3-70B | fp16 | 2.85 | 75.33 | 137.8 | - |
| Llama-3-70B | w4g128 | 3.17 | 74.57 | 38.9 | EQAT / GPTQ / BitBLAS |
| Llama-3-70B | w3g128 | 4.19 | 72.42 | 32.2 | EQAT |
| Llama-3-70B | w2g64 | 6.08 | 67.89 | 23.2 | EQAT / GPTQ |
| Llama-3-70B | w2g128 | 6.38 | 67.57 | 22.0 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B-Instruct | fp16 | 8.29 | 68.43 | 13.0 | - |
| Llama-3-8B-Instruct | w4g128 | 7.93 | 68.39 | 5.4 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B-Instruct | w3g128 | 8.55 | 67.24 | 4.7 | EQAT |
| Llama-3-8B-Instruct | w2g64 | 11.19 | 60.66 | 3.9 | EQAT / GPTQ / BitBLAS |
| Llama-3-8B-Instruct | w2g128 | 11.73 | 60.16 | 3.8 | EQAT / GPTQ / BitBLAS |
| Llama-3-70B-Instruct | fp16 | 5.33 | 73.78 | 137.8 | - |
| Llama-3-70B-Instruct | w4g128 | 5.35 | 73.47 | 38.9 | EQAT / GPTQ / BitBLAS |
| Llama-3-70B-Instruct | w3g128 | 5.65 | 72.87 | 32.2 | EQAT |
| Llama-3-70B-Instruct | w2g64 | 7.86 | 67.64 | 23.2 | EQAT / GPTQ / BitBLAS |
| Llama-3-70B-Instruct | w2g128 | 8.14 | 67.54 | 22.0 | EQAT / GPTQ / BitBLAS |
| Mistral-Large-Instruct-2407 | fp16 | 2.74 | 77.76 | 228.5 | - |
| Mistral-Large-Instruct-2407 | w2g64 | 5.58 | 73.54 | 35.5 | GPTQ |
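As a back-of-the-envelope check on the Model Size column: a wNgG quantized weight costs N bits plus 16-bit scale and zero-point values shared by every group of G weights, while embeddings and the output head stay in fp16. The sketch below is not the repo's code, and the 0.26B fp16 parameter count assumed for Llama-2-7B's embeddings/head is an estimate:

```python
def quantized_gib(n_params, bits, group_size, fp16_params=0.26e9):
    """Estimate checkpoint size in GB (10^9 bytes) for wNgG quantization.

    fp16_params: parameters kept in fp16 (embeddings + head); an assumption.
    """
    q_params = n_params - fp16_params            # weights actually quantized
    per_weight_bits = bits + 32.0 / group_size   # 16-bit scale + 16-bit zero per group
    total_bytes = q_params * per_weight_bits / 8 + fp16_params * 2
    return total_bytes / 1e9

# Llama-2-7B (~6.74B params) at w2g64: roughly 2.5 GB, the same ballpark
# as the 2.3 GB reported in the table above.
size = quantized_gib(6.74e9, bits=2, group_size=64)
```

The remaining gap comes from details this sketch ignores (exact parameter counts, metadata, and packing overheads).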
## Training
EfficientQAT involves two consecutive training phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). Detailed training scripts can be found in `./examples`. Below we give example training scripts for Llama-2-7B with w2g64 quantization.
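Conceptually, both phases optimize through a group-wise uniform fake quantization of the weights (quantize, then immediately dequantize, so the rounding error is visible to the training loss). A minimal sketch of that operation, assuming standard asymmetric min-max quantization rather than the repo's exact implementation:

```python
import torch

def fake_quant(w, bits=2, group_size=64):
    """Group-wise uniform fake quantization (sketch, not the repo's code)."""
    orig_shape = w.shape
    g = w.reshape(-1, group_size)                 # one scale/zero per group
    wmax = g.max(dim=1, keepdim=True).values
    wmin = g.min(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / (2 ** bits - 1)
    zero = (-wmin / scale).round()
    q = (g / scale + zero).round().clamp(0, 2 ** bits - 1)   # integer codes
    return ((q - zero) * scale).reshape(orig_shape)          # dequantized weights

w = torch.randn(128, 64)
w_q = fake_quant(w)   # used in the forward pass during training
```

Block-AP trains both the weights and the quantization parameters (scale/zero) block by block, while E2E-QP then trains only the quantization parameters end to end.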
- Block-AP
You should modify `--model` in the script to point to the folder containing the full-precision model before running the following command.

```shell
bash examples/block_ap/Llama-2-7b/w2g64.sh
```

Specifically, `--weight_lr` is 2e-5 for 2-bit and 1e-5 for 3-/4-bit in our experiments.
Some other important arguments:
