
GPTQModel

LLM model quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.

Install / Use

/learn @ModelCloud/GPTQModel
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

<p align=center> <div align=center> <img src="https://github.com/user-attachments/assets/ab70eb1e-06e7-4dc9-83e5-bd562e1a78b2" width=500> </div> <h1 align="center">GPT-QModel</h1> </p> <p align="center">LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.</p> <p align="center"> <a href="https://github.com/ModelCloud/GPTQModel/releases" style="text-decoration:none;"><img alt="GitHub release" src="https://img.shields.io/github/release/ModelCloud/GPTQModel.svg"></a> <a href="https://pypi.org/project/gptqmodel/" style="text-decoration:none;"><img alt="PyPI - Version" src="https://img.shields.io/pypi/v/gptqmodel"></a> <a href="https://pepy.tech/projects/gptqmodel" style="text-decoration:none;"><img src="https://static.pepy.tech/badge/gptqmodel" alt="PyPI Downloads"></a> <a href="https://github.com/ModelCloud/GPTQModel/blob/main/LICENSE"><img src="https://img.shields.io/pypi/l/gptqmodel"></a> <a href="https://huggingface.co/modelcloud/"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-ModelCloud-%23ff8811.svg"></a> <a href="https://huggingface.co/models?search=gptq"> <img alt="Huggingface - Models" src="https://img.shields.io/badge/🤗_6.7K_gptq_models-8A2BE2"> </a> <a href="https://huggingface.co/models?search=awq"> <img alt="Huggingface - Models" src="https://img.shields.io/badge/🤗_8.2K_awq_models-8A2BE2"> </a> </p>
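At its core, this kind of toolkit stores weights as low-bit integers plus per-group scales. A minimal sketch of group-wise symmetric quantization in plain Python (illustrative only, not GPTQModel's implementation; GPTQ additionally corrects quantization error using second-order, Hessian-based information):

```python
# Illustrative sketch of group-wise symmetric weight quantization,
# the basic compression idea behind GPTQ-style toolkits.
# NOT GPTQModel's implementation.

def quantize_group(weights, bits=4):
    """Map one group of float weights to signed ints plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1             # e.g. 7 for 4-bit signed
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax else 1.0
    return [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights], scale

def dequantize_group(q, scale):
    """Recover approximate float weights from the integers and scale."""
    return [v * scale for v in q]

group = [0.12, -0.7, 0.33, 0.04]
q, scale = quantize_group(group, bits=4)
restored = dequantize_group(q, scale)
# Round-trip error is bounded by half a quantization step.
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(group, restored))
```

With a group size of 128 (a common default in quantization configs), each block of 128 weights shares one scale, trading a little accuracy for much smaller storage.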

Latest News

  • 03/22/2026 [6.0-dev main]: ✨New quantization methods: ParoQuant, GGUF, FP8, EXL3. main is currently undergoing a major refactor and the API is unstable.
  • 03/19/2026 5.8.0: ✨HF Transformers 5.3.0 support with auto-defusing of fused models via the PyPI package Defuser. Qwen 3.5 family support added. New fast HF CPU kernels added for GPTQ/AWQ. Experimental INT8 CPU kernel added for GPTQ.
  • 03/09/2026 [main]: ✨Qwen 3.5 MoE model support added. New HF Kernel support added for AWQ. HF Kernels for both GPTQ/AWQ are now used by default on CPU devices for best performance. New INT8 kernel for GPTQ ported from Intel.
  • 02/28/2026 [main]: ✨Qwen 3.5 model support added.
  • 02/09/2026 5.7.0: ✨New MoE.Routing config with Bypass and Override options to allow multiple brute-force MoE routing controls for higher-quality quantization of MoE experts. Combined with FailSafeStrategy, GPTQModel now has three separate control settings for efficient MoE expert quantization. AWQ's qcfg.zero_point property has been merged into a unified sym symmetry property; zero_point=True is now sym=False. Fixed AWQ sym=True packing/inference and quantization compatibility with some Qwen3 models. Exaone 4.0 support.
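The zero_point change above reflects the standard distinction between symmetric and asymmetric quantization: an asymmetric scheme adds an integer zero point so the quantization grid covers exactly the [min, max] range of the group. A plain-Python sketch (illustrative, not the library's code):

```python
# Symmetric vs. asymmetric (zero-point) quantization of one weight group.
# Illustrative only: AWQ's former zero_point=True corresponds to the
# asymmetric case below, i.e. the unified sym=False setting.

def quant_symmetric(ws, bits=4):
    qmax = 2 ** (bits - 1) - 1        # signed grid, zero stays at integer 0
    scale = max(abs(w) for w in ws) / qmax
    return [round(w / scale) for w in ws], scale

def quant_asymmetric(ws, bits=4):
    qmax = 2 ** bits - 1              # unsigned grid shifted by a zero point
    lo, hi = min(ws), max(ws)
    scale = (hi - lo) / qmax
    zero = round(-lo / scale)         # integer offset mapping lo near 0
    return [round(w / scale) + zero for w in ws], scale, zero

ws = [0.1, 0.5, 0.9]                  # all-positive, skewed group
_, s_sym = quant_symmetric(ws)
_, s_asym, zero = quant_asymmetric(ws)
# The asymmetric step is smaller: its grid spans only [lo, hi], while the
# symmetric grid must span [-absmax, absmax] even when no weight is negative.
assert s_asym < s_sym
```

For groups with skewed or one-sided weight distributions, the asymmetric grid wastes no levels on unused range, which is why zero-point quantization can recover more accuracy there.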
<details> <summary>Archived News</summary>

  • 12/31/2025 5.7.0-dev: ✨New `FailSafe` config and `FailSafeStrategy`, auto-enabled by default, to address uneven routing of MoE experts that causes quantization issues for some MoE modules. `Smooth` operations are introduced to `FailSafeStrategy` to reduce the impact of outliers in `FailSafe` quantization, using `RTN` by default. Different `FailSafeStrategy` and `Smoother` options can be selected, and the `Threshold` that activates `FailSafe` can be customized. New Voxtral and Glm-4v model support, plus audio dataset calibration for Qwen2-Omni. `AWQ` compatibility fix for `GLM 4.5-Air`.
  • 12/17/2025 5.6.2-12 Patch: Fixed uv compatibility. Both uv and pip installs now show UI progress for external wheel/dependency downloads. Fixed macOS and AWQMarlin kernel loading import regressions. Resolved most multi-arch compile issues on Ubuntu, Arch, RedHat and other distros. Fixed multi-arch build issues and a Tritonv2 kernel launch bug on multi-GPU. Fixed 3-bit Triton GPTQ kernel dequant/inference and a license-property compatibility issue with the latest pip/setuptools.

  • 12/9/2025 5.6.0: ✨New HF Kernel for CPU optimized for AMX, AVX2 and AVX512. Auto module tree for auto-model support. Added Afmoe and Dosts1 model support. Fixed pre-layer pass quantization speed regression. Improved HF Transformers, Peft and Optimum support for both GPTQ and AWQ. Fixed many AWQ compatibility bugs and regressions.

  • 11/9/2025 5.4.0: ✨New Intel CPU and XPU hardware-optimized AWQ TorchFusedAWQ kernel. Torch Fused kernels now compatible with torch.compile. Fixed AWQ MoE model compatibility and reduced VRAM usage.

  • 11/3/2025 5.2.0: ✨Minimax M2 support with ModelCloud BF16 M2 Model. New VramStrategy.Balanced quantization property for reduced memory usage for large MoE on multi-3090 (24GB) devices. ✨Marin model. New AWQ Torch reference kernel. Fixed AWQ Marlin kernel for bf16. Fixed GLM 4.5/4.6 MoE missing MTP layers on model save (HF bug). Modular refactor. 🎉AWQ support out of beta with full feature support including multi-GPU quant and MoE VRAM saving. ✨Brumby (attention-free) model support. ✨IBM Granite Nano support. New calibration_concat_separator config option.

  • 10/24/2025 5.0.0: 🎉 Data-parallel quant support for MoE models on multi-GPU using nogil Python. offload_to_disk support enabled by default to massively reduce CPU RAM usage. New Intel and AMD CPU hardware-accelerated TorchFused kernel. Packing stage is now 4x faster and inlined with quantization. VRAM pressure for large models reduced during quantization. act_group_aware is 16k+ times faster and is now the default when desc_act=False, giving higher-quality recovery without the inference penalty of desc_act=True. New beta-quality AWQ support with full gemm, gemm_fast, and marlin kernel support. LFM, Ling, Qwen3 Omni model support. Bitblas kernel updated to support the Bitblas 0.1.0.post1 release. Quantization is now faster with reduced VRAM usage. Enhanced logging support with LogBar.

  • 09/16/2025 4.2.5: hyb_act renamed to act_group_aware. Removed finicky torch import within setup.py. Fixed a packing bug and added prebuilt PyTorch 2.8 wheels.

  • 09/12/2025 4.2.0: ✨ New Models Support: Qwen3-Next, Apertus, Kimi K2, Klear, FastLLM, Nemotron H. New fail_safe boolean toggle to .quantize() to patch-fix non-activated MoE modules due to highly uneven MoE model training. Fixed LavaQwen2 compatibility. Patch-fixed GIL=0 CUDA error for multi-GPU. Fixed compatibility with autoround + new transformers.

  • 09/04/2025 4.1.0: ✨ Meituan LongCat Flash Chat, Llama 4, GPT-OSS (BF16), and GLM-4.5-Air support. New experimental mock_quantization config to skip complex computational code paths during quantization to accelerate model quant testing.

  • 08/21/2025 4.0.0: 🎉 New Group Aware Reordering (GAR) support. New models support: Bytedance Seed-OSS, Baidu Ernie, Huawei PanGu, Gemma3, Xiaomi Mimo, Qwen 3/MoE, Falcon H1, GPT-Neo. Memory leak and multiple model compatibility fixes related to Transformers >= 4.54. Python >= 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x CPU core scaling of packing stage. Early access PyTorch 2.8 fused-ops on Intel XPU for up to 50% speedup.

  • 10/17/2025 5.0.0-dev main: 👀: EoRA is now multi-GPU compatible. Improved quality stability in multi-GPU quantization and reduced VRAM usage. New LFM and Ling model support.

  • 09/30/2025 5.0.0-dev main: 👀: New Data Parallel + Multi-GPU + Python 3.13T (PYTHON_GIL=0) yields an 80%+ overall quantization-time reduction for large MoE models vs v4.2.5.

  • 09/29/2025 5.0.0-dev main: 🎉 New Qwen3 Omni model support. AWQ Marlin kernel integrated + many disk offload, threading, and memory usage fixes.

  • 09/24/2025 5.0.0-dev main: 🎉 Up to 90% CPU memory saving for large MoE models with faster/inline packing! 26% quant time reduction for Qwen3 MoE! AWQ Marlin kernel added. AWQ Gemm loading bug fixes. act_group_aware now faster and auto enabled for GPTQ when desc_act is False for higher quality recovery.

  • 09/19/2025 5.0.0-dev main: 👀 CPU memory saving of ~73.5% during the quantization stage with the new offload_to_disk quantization config property, which defaults to True.

  • 09/18/2025 5.0.0-dev main: 🎉 AWQ quantization support! Complete refactor and simplification of model definitions in preparation for future quantization formats.

  • 08/19/2025 4.0.0-dev main: Fixed quantization memory usage due to some models' incorrect application of config.use_cache during inference. Fixed Transformers >= 4.54.0 compatibility which changed layer forward return signature for some models.

  • 08/18/2025 4.0.0-dev main: GPT-Neo model support. Memory leak fix in error capture (stack trace) and fixed lm_head quantization compatibility for many models.

  • 07/31/2025 4.0.0-dev main: New Group Aware Reordering (GAR) support and preliminary PyTorch 2.8 fused-ops for Intel XPU for up to 50% speedup.

  • 07/03/2025 4.0.0-dev main: New Baidu Ernie and Huawei PanGu model support.

  • 07/02/2025 4.0.0-dev main: Gemma3 4B model compatibility fix.

  • 05/29/2025 4.0.0-dev main: Falcon H1 model support. Fixed Transformers 4.52+ compatibility with Qwen 2.5 VL models.

  • 05/19/2025 4.0.0-dev main: Qwen 2.5 Omni model support.

  • 05/05/2025 4.0.0-dev main: Python 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models.
</details>
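The FailSafe mechanism described in the archived notes targets MoE experts that receive too few calibration tokens for GPTQ's data-dependent updates. One simplified reading of it, with hypothetical names (the real strategy is configurable, with smoothers and thresholds, and more involved), is a per-expert coverage check that falls back to data-free round-to-nearest (RTN) quantization:

```python
# Sketch of a FailSafe-style decision rule for MoE expert quantization.
# Hypothetical helper names, NOT the library's actual code.

def choose_method(tokens_per_expert, total_tokens, threshold=0.01):
    """Pick a quantization method per expert based on calibration coverage."""
    plan = {}
    for expert, hits in tokens_per_expert.items():
        share = hits / total_tokens if total_tokens else 0.0
        # Under-routed experts lack the calibration signal GPTQ needs for
        # its error-correction updates, so fall back to data-free RTN.
        plan[expert] = "gptq" if share >= threshold else "rtn"
    return plan

routing = {"expert_0": 900, "expert_1": 95, "expert_2": 5}
plan = choose_method(routing, total_tokens=1000, threshold=0.01)
assert plan == {"expert_0": "gptq", "expert_1": "gptq", "expert_2": "rtn"}
```

This illustrates why highly uneven MoE routing matters for quantization quality: experts that are rarely activated during calibration cannot be reliably quantized with activation-dependent methods.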

View on GitHub
GitHub Stars: 1.1k
Category: Customer
Updated: 17h ago
Forks: 168

Languages

Python

Security Score

85/100

Audited on Mar 27, 2026

No findings