# GPTQModel
LLM quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.
## Install / Use
/learn @ModelCloud/GPTQModel README
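For a quick start, the toolkit can be installed from PyPI. A minimal sketch, assuming the package is published as `gptqmodel` and that it compiles against an already-installed PyTorch (hence `--no-build-isolation`); kernel backends (CUDA, ROCm, XPU, CPU) are selected at runtime:

```shell
# Install the latest stable release from PyPI; -v shows build/download progress
pip install -v gptqmodel --no-build-isolation

# Or track the main branch for the newest (currently unstable) features
pip install -v git+https://github.com/ModelCloud/GPTQModel.git --no-build-isolation
```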
## Latest News
- 03/22/2026 [6.0-dev `main`]: ✨ New quantization methods: `ParoQuant`, `GGUF`, `FP8`, `EXL3`. `main` is currently undergoing a major refactor and the api is unstable.
- 03/19/2026 5.8.0: ✨ HF Transformers 5.3.0 support with auto-defusing of `fused` models via pypi pkg: `Defuser`. Qwen 3.5 family support added. New fast HF `cpu` kernels for GPTQ/AWQ added. Experimental INT8 `cpu` kernel added for GPTQ.
- 03/09/2026 [main]: ✨ Qwen 3.5 MoE model support added. New HF Kernel support added for AWQ. HF Kernels for both gptq/awq are now used by default for cpu devices for best performance. New INT8 kernel ported from Intel for gptq.
- 02/28/2026 [main]: ✨ Qwen 3.5 model support added.
- 02/09/2026 5.7.0: ✨ New `MoE.Routing` config with `Bypass` and `Override` options to allow multiple brute-force MoE routing controls for higher quality quantization of MoE experts. Combined with `FailSafeStrategy`, GPTQModel now has three separate control settings for efficient MoE expert quantization. The `AWQ` qcfg `zero_point` property has been merged into a unified `sym` symmetry property; `zero_point=True` is now `sym=False`. Fixed `AWQ` `sym=True` packing/inference and quantization compatibility with some Qwen3 models. Exaone 4.0 support.
- 12/17/2025 5.6.2-12 Patch: Fixed `uv` compatibility. Both `uv` and `pip` installs will now show UI progress for external wheel/dependency downloads. Fixed `MacOS` and `AWQ` `Marlin` kernel loading import regressions. Resolved most `multi-arch` compile issues on `Ubuntu`, `Arch`, `RedHat`, and other distros. Fixed `multi-arch` build issues and a `Triton` `v2` kernel launch bug on multi-GPU. Fixed 3-bit Triton GPTQ kernel dequant/inference and a `license` property compatibility issue with latest pip/setuptools.
- 12/9/2025 5.6.0: ✨ New `HF Kernel` for CPU optimized for `AMX`, `AVX2`, and `AVX512`. Auto module tree for auto-model support. Added Afmoe and Dosts1 model support. Fixed pre-layer pass quantization speed regression. Improved HF Transformers, Peft, and Optimum support for both GPTQ and AWQ. Fixed many AWQ compatibility bugs and regressions.
- 11/9/2025 5.4.0: ✨ New Intel CPU and XPU hardware-optimized AWQ `TorchFusedAWQ` kernel. Torch Fused kernels are now compatible with `torch.compile`. Fixed AWQ MoE model compatibility and reduced VRAM usage.
- 11/3/2025 5.2.0: ✨ Minimax M2 support with ModelCloud BF16 M2 Model. New `VramStrategy.Balanced` quantization property for reduced memory usage for large MoE on multi-3090 (24GB) devices. ✨ Marin model. New AWQ Torch reference kernel. Fixed AWQ Marlin kernel for bf16. Fixed GLM 4.5/4.6 MoE missing `mtp` layers on model save (HF bug). Modular refactor. 🎉 AWQ support out of beta with full feature support including multi-GPU quant and MoE VRAM saving. ✨ Brumby (attention free) model support. ✨ IBM Granite Nano support. New `calibration_concat_separator` config option.
- 10/24/2025 5.0.0: 🎉 Data-parallel quant support for `MoE` models on multi-GPU using `nogil` Python. `offload_to_disk` support enabled by default to massively reduce `CPU` RAM usage. New `Intel` and `AMD` CPU hardware-accelerated `TorchFused` kernel. Packing stage is now 4x faster and inlined with quantization. `VRAM` pressure for large models reduced during quantization. `act_group_aware` is 16k+ times faster and now the default when `desc_act=False`, for higher quality recovery without the inference penalty of `desc_act=True`. New beta-quality `AWQ` support with full `gemm`, `gemm_fast`, `marlin` kernel support. `LFM`, `Ling`, `Qwen3 Omni` model support. `Bitblas` kernel updated to support the Bitblas `0.1.0.post1` release. Quantization is now faster with reduced VRAM usage. Enhanced logging support with `LogBar`.
- 09/16/2025 4.2.5: `hyb_act` renamed to `act_group_aware`. Removed finicky `torch` import within `setup.py`. Packing bug fix and prebuilt PyTorch 2.8 wheels.
- 09/12/2025 4.2.0: ✨ New models support: Qwen3-Next, Apertus, Kimi K2, Klear, FastLLM, Nemotron H. New `fail_safe` boolean toggle for `.quantize()` to patch-fix non-activated `MoE` modules due to highly uneven MoE model training. Fixed LavaQwen2 compatibility. Patch-fixed GIL=0 CUDA error for multi-GPU. Fixed compatibility with autoround + new transformers.
- 09/04/2025 4.1.0: ✨ Meituan LongCat Flash Chat, Llama 4, GPT-OSS (BF16), and GLM-4.5-Air support. New experimental `mock_quantization` config to skip complex computational code paths during quantization to accelerate model quant testing.
- 08/21/2025 4.0.0: 🎉 New Group Aware Reordering (GAR) support. New models support: Bytedance Seed-OSS, Baidu Ernie, Huawei PanGu, Gemma3, Xiaomi Mimo, Qwen 3/MoE, Falcon H1, GPT-Neo. Memory leak and multiple model compatibility fixes related to Transformers >= 4.54. Python >= 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x CPU core scaling of the packing stage. Early access PyTorch 2.8 fused-ops on Intel XPU for up to 50% speedup.
- 10/17/2025 5.0.0-dev `main`: 👀 EoRA is now multi-GPU compatible. Fixed both quality stability in multi-GPU quantization and VRAM usage. New LFM and Ling models support.
- 09/30/2025 5.0.0-dev `main`: 👀 New Data Parallel + Multi-GPU + Python 3.13T (PYTHON_GIL=0) equals 80%+ overall quant time reduction for large MoE models vs v4.2.5.
- 09/29/2025 5.0.0-dev `main`: 🎉 New Qwen3 Omni model support. AWQ Marlin kernel integrated + many disk offload, threading, and memory usage fixes.
- 09/24/2025 5.0.0-dev `main`: 🎉 Up to 90% CPU memory saving for large MoE models with faster/inline packing! 26% quant time reduction for Qwen3 MoE! AWQ Marlin kernel added. AWQ Gemm loading bug fixes. `act_group_aware` is now faster and auto-enabled for GPTQ when `desc_act` is False, for higher quality recovery.
- 09/19/2025 5.0.0-dev `main`: 👀 CPU memory saving of ~73.5% during the quantization stage with the new `offload_to_disk` quantization config property, which defaults to `True`.
- 09/18/2025 5.0.0-dev `main`: 🎉 AWQ quantization support! Complete refactor and simplification of model definitions in preparation for future quantization formats.
- 08/19/2025 4.0.0-dev `main`: Fixed quantization memory usage due to some models' incorrect application of `config.use_cache` during inference. Fixed `Transformers` >= 4.54.0 compatibility, which changed the layer forward return signature for some models.
- 08/18/2025 4.0.0-dev `main`: GPT-Neo model support. Memory leak fix in error capture (stack trace) and fixed `lm_head` quantization compatibility for many models.
- 07/31/2025 4.0.0-dev `main`: New Group Aware Reordering (GAR) support and preliminary PyTorch 2.8 fused-ops for Intel XPU for up to 50% speedup.
- 07/03/2025 4.0.0-dev `main`: New Baidu Ernie and Huawei PanGu model support.
- 07/02/2025 4.0.0-dev `main`: Gemma3 4B model compatibility fix.
- 05/29/2025 4.0.0-dev `main`: Falcon H1 model support. Fixed Transformers `4.52+` compatibility with Qwen 2.5 VL models.
- 05/19/2025 4.0.0-dev `main`: Qwen 2.5 Omni model support.
- 05/05/2025 4.0.0-dev `main`: Python 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models.
