DeepCompressor
Model Compression Toolbox for Large Language Models and Diffusion Models
News
- [2025/02] 🎉 QServe has been accepted to MLSys 2025!
- [2025/01] 🎉 SVDQuant has been accepted to ICLR 2025 (Spotlight)!
- [2024/12] 🎉 QServe has been integrated into NVIDIA TensorRT-LLM!
- [2024/11] 🔥 Our latest W4A4 diffusion model quantization work, the SVDQuant algorithm and the Nunchaku system, is publicly released! Check out our paper!
- [2024/05] 🔥 Our latest W4A8KV4 LLM quantization work, the QoQ algorithm and the QServe system, is publicly released! QoQ is short for quattuor-octō-quattuor, which is 4-8-4 in Latin. Check out our paper!
Key Features
DeepCompressor is an open-source model compression toolbox for large language models and diffusion models based on PyTorch. It currently supports fake quantization with any integer or floating-point data type within 8 bits, e.g., INT8, INT4, and FP4_E2M1. The examples below implement the following algorithms.
- Post-training quantization for large language models:
  - Weight-only quantization
  - Weight-activation quantization
  - Weight-activation and KV-cache quantization
- Post-training quantization for diffusion models:
  - Weight-activation quantization
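Fake quantization keeps tensors in floating point but rounds them onto the grid of a low-bit data type, so the accuracy impact of a format like INT4 can be measured without custom kernels. As a minimal sketch of the idea (this is an illustration, not DeepCompressor's API), symmetric per-tensor integer fake quantization looks like:

```python
import numpy as np

def fake_quantize_int(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Simulate symmetric integer quantization: round onto the INT grid,
    then map back to floating point (quantize -> dequantize)."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for INT4
    scale = max(np.abs(x).max(), 1e-8) / qmax        # per-tensor scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                                 # dequantized floats

w = np.array([0.9, -0.31, 0.02, 0.5])
print(fake_quantize_int(w, bits=4))
```

Per-channel or per-group scales, and floating-point grids such as FP4_E2M1, follow the same quantize-dequantize pattern with a different set of representable values.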
DeepCompressor also contains examples that integrate with other inference libraries.
- Deploy weight-only quantized LLMs with TinyChat
- Deploy quantized LLMs with QServe
- Deploy quantized diffusion models with Nunchaku
Installation
Install from Source
- Clone this repository and navigate to the deepcompressor folder:

  ```shell
  git clone https://github.com/mit-han-lab/deepcompressor
  cd deepcompressor
  ```

- Install the package:

  ```shell
  conda env create -f environment.yml
  poetry install
  ```
Highlights
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
[Website][Paper][Nunchaku Inference System]
Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides. However, naïvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-Σ, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5×, achieving 3.0× speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs.
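The core decomposition can be sketched numerically. In a toy version of the idea (not the fused Nunchaku kernels), a weight matrix W is split into a high-precision rank-r branch L1 L2 obtained from a truncated SVD plus a residual R that is handed to the 4-bit quantizer, so that y ≈ L1 L2 x + Q(R) x:

```python
import numpy as np

def svdquant_decompose(W: np.ndarray, rank: int):
    """Split W into a high-precision low-rank branch (L1 @ L2) plus a
    residual R that is easier to quantize (toy sketch of SVDQuant)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]   # (out_features, rank), singular values folded in
    L2 = Vt[:rank, :]             # (rank, in_features)
    R = W - L1 @ L2               # residual, destined for 4-bit quantization
    return L1, L2, R

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
L1, L2, R = svdquant_decompose(W, rank=16)
# The residual's largest singular value is the (rank+1)-th of W, so its
# dynamic range is smaller than W's, which eases 4-bit quantization.
```

The paper's further step of first migrating activation outliers into the weights via smoothing is omitted here for brevity.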

Quality Evaluation
Below are the quality and similarity metrics evaluated on 5,000 samples from the MJHQ-30K dataset. IR denotes ImageReward. Our 4-bit results outperform other 4-bit baselines and effectively preserve the visual quality of the 16-bit models.
| Model | Precision | Method | FID ($\downarrow$) | IR ($\uparrow$) | LPIPS ($\downarrow$) | PSNR ($\uparrow$) |
|----------------------------|-----------|-----------|------|-------|-------|------|
| FLUX.1-dev (50 Steps)      | BF16      | --        | 20.3 | 0.953 | --    | --   |
|                            | W4A16     | NF4       | 20.6 | 0.910 | 0.272 | 19.5 |
|                            | INT W4A4  |           | 20.2 | 0.908 | 0.322 | 18.5 |
|                            | INT W4A4  | SVDQuant  | 19.9 | 0.935 | 0.223 | 21.0 |
|                            | NVFP4     |           | 20.3 | 0.961 | 0.345 | 16.3 |
|                            | NVFP4     | SVDQuant  | 20.3 | 0.945 | 0.205 | 21.5 |
| FLUX.1-schnell (4 Steps)   | BF16      | --        | 19.2 | 0.938 | --    | --   |
|                            | W4A16     | NF4       | 18.9 | 0.943 | 0.257 | 18.2 |
|                            | INT W4A4  |           | 18.1 | 0.962 | 0.345 | 16.3 |
|                            | INT W4A4  | SVDQuant  | 18.3 | 0.951 | 0.257 | 18.3 |
|                            | NVFP4     |           | 19.0 | 0.952 | 0.276 | 17.6 |
|                            | NVFP4     | SVDQuant  | 18.9 | 0.966 | 0.228 | 19.0 |
| SANA-1.6b (20 Steps)       | BF16      | --        | 20.6 | 0.952 | --    | --   |
|                            | INT W4A4  |           | 20.5 | 0.894 | 0.339 | 15.3 |
|                            | INT W4A4  | SVDQuant  | 19.3 | 0.935 | 0.220 | 17.8 |
|                            | NVFP4     |           | 19.7 | 0.929 | 0.236 | 17.4 |
|                            | NVFP4     | SVDQuant  | 20.2 | 0.941 | 0.176 | 19.0 |
| PixArt-Sigma (20 Steps)    | FP16      | --        | 16.6 | 0.944 | --    | --   |
|                            | INT W4A8  | ViDiT-Q   | 37.3 | 0.573 | 0.611 | 12.0 |
|                            | INT W4A4  | SVDQuant  | 19.2 | 0.878 | 0.323 | 17.6 |
|                            | NVFP4     |           | 31.8 | 0.660 | 0.517 | 14.8 |
|                            | NVFP4     | SVDQuant  | 16.6 | 0.940 | 0.271 | 18.5 |
QServe: W4A8KV4 Quantization for Efficient LLM Serving
[Website][Paper][QoQ Algorithm Code][QServe GPU System]
Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented in the QServe inference library, which achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores.
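The dequantization overhead motivates QoQ's two-level weight quantization: weights are first scaled per output channel toward an INT8 range, then quantized to INT4 within small groups, so the runtime dequantization in the GEMM is a cheap INT4-to-INT8 step rather than slow floating-point work on CUDA cores. A simplified numeric sketch of this two-level scheme (an illustration of the idea, not QServe's kernels; function name and group size are ours):

```python
import numpy as np

def progressive_quantize(w: np.ndarray, group_size: int = 32) -> np.ndarray:
    """Two-level fake quantization in the spirit of W4A8 weight handling:
    level 1 maps each output channel onto the INT8 grid, level 2 maps each
    group of level-1 values onto the INT4 grid. Returns the dequantized
    reconstruction so the rounding error can be inspected."""
    s1 = np.abs(w).max(axis=1, keepdims=True) / 127.0   # per-channel FP scale
    w8 = np.clip(np.round(w / s1), -127, 127)           # INT8 intermediate
    out = np.empty_like(w, dtype=float)
    for i in range(0, w.shape[1], group_size):
        g = w8[:, i:i + group_size]
        s2 = np.abs(g).max(axis=1, keepdims=True).clip(min=1) / 7.0  # per-group scale
        w4 = np.clip(np.round(g / s2), -8, 7)           # INT4 storage format
        out[:, i:i + group_size] = w4 * s2 * s1         # dequantize both levels
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64))
w_hat = progressive_quantize(w)
```

In the real system the level-2 scales are integers, so the INT4→INT8 step stays on integer units; this sketch keeps them in floating point for clarity.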