EfficientQAT

Official PyTorch implementation of the paper EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

News

  • [2025/11] 🔥 We open-source INT vs. FP, a framework for comparing low-bit integer and floating-point formats, including MXFP8/MXFP6/MXFP4/NVFP4 and MXINT8/MXINT6/MXINT4/NVINT4.
  • [2025/05] 🔥 We explore the Scaling Law for Quantization-Aware Training, which offers insights and guidance for LLM QAT.
  • [2025/05] 🌟 Our EfficientQAT paper has been accepted to the ACL 2025 Main Conference! 🎉 Cheers!
  • [2024/10] 🔥 We release a new weight-activation quantization algorithm, PrefixQuant, which proposes an efficient method to isolate sink tokens (token-wise outliers).
  • [2024/08] The new inference backend T-MAC from Microsoft now supports EfficientQAT models.
  • [2024/08] We support quantization of Mistral-Large-Instruct. With EfficientQAT, W2g64 compresses the 123B model to 35 GB with only 4 points of accuracy degradation.
  • [2024/07] New features! EfficientQAT quantized models can now be transferred into GPTQ v2 and BitBLAS formats, which can be loaded directly through GPTQModel.
  • [2024/07] We release EfficientQAT, which pushes the limits of uniform (INT) quantization in an efficient manner.

Contents

Installation

  1. Clone this repository and navigate to the EfficientQAT folder

```shell
git clone https://github.com/OpenGVLab/EfficientQAT.git
cd EfficientQAT
```

  2. Install packages

```shell
conda create -n efficientqat python==3.11
conda activate efficientqat
pip install -r requirements.txt
```

Model Zoo

We provide a number of pre-quantized EfficientQAT models:

  • WikiText2 PPL is measured with a 2048-token context length.
  • Avg. Accuracy indicates the average accuracy on 5 zero-shot reasoning tasks (WinoGrande, PIQA, HellaSwag, Arc-Easy, Arc-Challenge) with lm-eval v0.4.2.
  • 1 GB = $10^9$ bytes
  • Hub Link: EQAT indicates the original checkpoints. We also transfer the checkpoints into GPTQ and BitBLAS formats, which can be loaded directly through GPTQModel. (PS: GPTQModel is an official bug-fixed fork of AutoGPTQ, which will be merged into AutoGPTQ in the future.)

| Model | Quantization | WikiText2 PPL | Avg. Accuracy | Model Size (GB) | Hub link |
|-------|--------------|---------------|---------------|-----------------|----------|
| Llama-2-7B | fp16 | 5.47 | 64.86 | 13.2 | - |
| Llama-2-7B | w4g128 | 5.53 | 64.27 | 3.7 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-7B | w3g128 | 5.81 | 64.02 | 3.1 | EQAT |
| Llama-2-7B | w2g64 | 6.86 | 60.14 | 2.3 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-7B | w2g128 | 7.17 | 59.50 | 2.2 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-13B | fp16 | 4.88 | 67.81 | 25.4 | - |
| Llama-2-13B | w4g128 | 4.93 | 67.52 | 6.8 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-13B | w3g128 | 5.12 | 67.28 | 5.6 | EQAT |
| Llama-2-13B | w2g64 | 5.96 | 64.88 | 4.0 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-13B | w2g128 | 6.08 | 63.88 | 3.8 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-70B | fp16 | 3.32 | 72.41 | 131.6 | - |
| Llama-2-70B | w4g128 | 3.39 | 72.62 | 35.8 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-70B | w3g128 | 3.61 | 71.76 | 29.1 | EQAT |
| Llama-2-70B | w2g64 | 4.52 | 69.48 | 20.1 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-70B | w2g128 | 4.61 | 68.93 | 18.9 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B | fp16 | 6.14 | 68.58 | 13.0 | - |
| Llama-3-8B | w4g128 | 6.47 | 68.43 | 5.4 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B | w3g128 | 7.09 | 67.35 | 4.7 | EQAT |
| Llama-3-8B | w2g64 | 9.41 | 60.76 | 3.9 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B | w2g128 | 9.80 | 59.36 | 3.8 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-70B | fp16 | 2.85 | 75.33 | 137.8 | - |
| Llama-3-70B | w4g128 | 3.17 | 74.57 | 38.9 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-70B | w3g128 | 4.19 | 72.42 | 32.2 | EQAT |
| Llama-3-70B | w2g64 | 6.08 | 67.89 | 23.2 | EQAT \| GPTQ |
| Llama-3-70B | w2g128 | 6.38 | 67.57 | 22.0 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B-Instruct | fp16 | 8.29 | 68.43 | 13.0 | - |
| Llama-3-8B-Instruct | w4g128 | 7.93 | 68.39 | 5.4 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B-Instruct | w3g128 | 8.55 | 67.24 | 4.7 | EQAT |
| Llama-3-8B-Instruct | w2g64 | 11.19 | 60.66 | 3.9 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B-Instruct | w2g128 | 11.73 | 60.16 | 3.8 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-70B-Instruct | fp16 | 5.33 | 73.78 | 137.8 | - |
| Llama-3-70B-Instruct | w4g128 | 5.35 | 73.47 | 38.9 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-70B-Instruct | w3g128 | 5.65 | 72.87 | 32.2 | EQAT |
| Llama-3-70B-Instruct | w2g64 | 7.86 | 67.64 | 23.2 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-70B-Instruct | w2g128 | 8.14 | 67.54 | 22.0 | EQAT \| GPTQ \| BitBLAS |
| Mistral-Large-Instruct-2407 | fp16 | 2.74 | 77.76 | 228.5 | - |
| Mistral-Large-Instruct-2407 | w2g64 | 5.58 | 73.54 | 35.5 | GPTQ |
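As a sanity check on the Model Size column, one can estimate the footprint of group-wise weight-only quantization from the bit-width and group size alone. The sketch below is an assumption-laden back-of-the-envelope calculation, not the repository's actual checkpoint accounting: it assumes each group stores one 16-bit scale and one 16-bit zero-point, and it ignores embeddings and other tensors kept in higher precision, which is why the table's figures run slightly higher.

```python
# Rough size estimate for group-wise weight-only quantization (sketch;
# EfficientQAT's real checkpoint layout may differ).
def quantized_size_gb(n_params, wbits, group_size,
                      scale_bits=16, zero_bits=16):
    """Packed weights plus one scale and zero-point per group,
    in GB (1 GB = 1e9 bytes, as in the table above)."""
    weight_bits = n_params * wbits
    n_groups = n_params / group_size
    overhead_bits = n_groups * (scale_bits + zero_bits)
    return (weight_bits + overhead_bits) / 8 / 1e9

# Llama-2-7B has roughly 6.74e9 parameters; at w2g64 this gives
# about 2.1 GB, close to the 2.3 GB listed above once the
# non-quantized tensors are added back.
print(round(quantized_size_gb(6.74e9, 2, 64), 1))  # → 2.1
```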

Training

EfficientQAT involves two consecutive training phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). The detailed training scripts can be found in ./examples. Below we give example training scripts for Llama-2-7B with w2g64 quantization.
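The "wNgM" labels describe standard uniform, group-wise quantization: N-bit weights, with each group of M values sharing one trainable scale and zero-point (the parameters that Block-AP and E2E-QP optimize). A minimal pure-Python sketch of one such group, illustrating the format only and not EfficientQAT's actual kernels:

```python
# Toy illustration of uniform group-wise quantization (the "wNgM" format).
def quantize_group(values, wbits):
    """Asymmetric uniform quantization of one group; returns the integer
    codes plus the (scale, zero_point) pair shared by the group."""
    qmax = (1 << wbits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0       # avoid zero scale for flat groups
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

# w2g64: 2-bit codes (4 levels) over a group of 64 weights.
weights = [0.05 * i - 1.0 for i in range(64)]   # a fake weight group
q, s, z = quantize_group(weights, wbits=2)
recon = dequantize_group(q, s, z)
err = max(abs(a - b) for a, b in zip(weights, recon))
print(len(set(q)) <= 4, err < s)  # at most 2**2 levels; error under one step
```

Block-AP trains all weights block by block against the full-precision block's output, while E2E-QP freezes the packed integer weights and trains only quantization parameters such as the scales end to end.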

  1. Block-AP

Before running the following command, modify --model in the script to point to the folder containing the full-precision model.

```shell
bash examples/block_ap/Llama-2-7b/w2g64.sh
```

Specifically, --weight_lr is 2e-5 for 2-bit and 1e-5 for 3-/4-bit quantization in our experiments.

Some other important arguments are explained in the scripts under ./examples.
