SINQ
Welcome to the official repository of SINQ! A novel, fast and high-quality quantization method designed to make any Large Language Model smaller while preserving accuracy.
⚡️ A fast, plug-and-play, model-agnostic quantization technique delivering state-of-the-art performance for Large Language Models without sacrificing accuracy.
💡 Want to run a large model on your GPU but don’t have enough memory? With SINQ, you can deploy models that would otherwise be too big, drastically reducing memory usage while preserving LLM quality.
⏱️ SINQ quantizes Qwen3-14B in just ~21 sec and DeepSeekV2.5-236B in ~5 min
News:
🆕 [18/02/2025] SINQ is now integrated into HF Transformers! 🤗
You can now use SINQ in 🤗 Transformers in a super simplified way thanks to our SinqConfig compatible with HF AutoModelForCausalLM()!
More information directly on the HF website here!
🆕 [10/02/2026] The first GGUF model with pre-SINQ! 🤗
The first GGUF model using pre-SINQ is now available in our huawei-csl/PreSINQ GGUF collection!
Thanks to our new pre-SINQ algorithm (see details here), we can finally bring the strengths of SINQhorn normalization together with the advantages of GGUF quantization! Many more models coming soon!
You can vote for the next SINQ GGUF model here!
SINQ (Sinkhorn-Normalized Quantization) is a novel, fast, high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact.
🔍 What You’ll Find Here
- 1. How does SINQ work?
- 2. Why should I use SINQ?
- <u>3. Quantize (and save) any LLM with SINQ</u>
- 4. Run pre-quantized SINQ models from Hugging Face
- 5. How to reproduce paper results
- 6. Pre-SINQ: SINQhorn normalization for GGUFs (and more)!
- 7. Ongoing updates on new features and integrations
- 8. How to Cite This Work
- 9. Related Repositories
📊 Feature Comparison: SINQ vs HQQ (calibration-free) and A-SINQ vs AWQ (calibrated)
| Feature | SINQ | HQQ | A-SINQ | AWQ |
|------------|:--------:|:--------:|:----------:|:-------:|
| 🎯 Calibration | Calibration-free | Calibration-free | Calibrated | Calibrated |
| 🧮 Quantization Type | Symmetric & Asymmetric | Asymmetric only | Symmetric & Asymmetric | Symmetric & Asymmetric |
| 📦 NF4 Support | Yes | No | Yes | No |
| ⚡ Quantization Speed | ~2× Faster than HQQ | Slower | ~4× Faster than AWQ | Slower |
| 📈 Model Quality | Higher | Lower | Higher | Lower |
📄 Want to know more? Read our paper on arXiv!
1. How does SINQ work?
<details>
<summary>Click to expand a quick explanation of SINQ’s core idea</summary>

1️⃣ Dual-Scaling for Better Quantization
<p align="left"> <img src="imgs/dualscale.png" alt="Dual Scale Illustration" width="330" align="right" style="margin-left: 20px;"/> </p>Conventional quantization uses one scale per weight dimension, which makes models vulnerable to outliers: large weights that distort scaling and cause significant errors.
SINQ solves this by introducing dual scaling: separate scale factors for rows and columns. This flexibility redistributes outlier influence and keeps quantization errors smaller and more balanced.
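To see why a single per-axis scale is fragile, here is a small NumPy sketch (illustrative only, not the repository's kernel; the `rtn_rowwise` helper is hypothetical). One outlier inflates the round-to-nearest scale of its entire row, so every other weight in that row collapses to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
W[3, 5] = 50.0  # one large outlier in row 3

def rtn_rowwise(W, nbits=4):
    """Round-to-nearest with a single symmetric scale per row."""
    qmax = 2 ** (nbits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.round(W / scale).clip(-qmax - 1, qmax) * scale

err = np.abs(W - rtn_rowwise(W))
err_outlier_row = err[3].mean()                    # row whose scale the outlier inflated
err_other_rows = np.delete(err, 3, axis=0).mean()  # unaffected rows
print(f"error in outlier row: {err_outlier_row:.3f}")
print(f"error in other rows : {err_other_rows:.3f}")
```

On this toy matrix the outlier's row incurs several times the error of the other rows, because its quantization step is set by the single large weight. Dual scaling exists precisely to split that influence between a row scale and a column scale.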
2️⃣ More Even Error Distribution
<p align="left"> <img src="imgs/error.png" alt="Error Distribution Comparison" width="370" align="right" style="margin-left: 20px;"/> </p>With standard single-scale quantization, errors tend to cluster around outliers.
With SINQ, errors become spread out and less severe, preserving model accuracy even at 3-bit precision. This improvement is driven by SINQ’s Sinkhorn-normalized optimization, which iteratively rescales rows and columns to balance their variance, a process inspired by Sinkhorn matrix normalization. By reducing the overall matrix imbalance (refer to the paper for more info), the weights become inherently easier to quantize, leading to more stable behavior across layers and consistently higher accuracy even at very low bit-widths.
</details>
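The balancing idea can be sketched in a few lines of NumPy. This is a simplified illustration under our own assumptions (the `imbalance` metric below is a stand-in for the paper's imbalance measure, and the loop is not the repository's implementation): alternately dividing rows and columns by their standard deviation drives the matrix toward uniform row/column spread, and the two accumulated scale vectors are exactly the dual scales multiplied back at dequantization.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
W[3, 5] = 50.0  # an outlier makes row/column spreads very uneven

def imbalance(M):
    """Spread of the row and column standard deviations (0 = balanced)."""
    return M.std(axis=1).std() + M.std(axis=0).std()

row_scale = np.ones((32, 1))
col_scale = np.ones((1, 32))
Wn = W.copy()

before = imbalance(Wn)
for _ in range(16):  # Sinkhorn-style alternating normalization
    c = Wn.std(axis=0, keepdims=True)
    Wn /= c
    col_scale *= c
    r = Wn.std(axis=1, keepdims=True)
    Wn /= r
    row_scale *= r
after = imbalance(Wn)

# Wn is the balanced, easier-to-quantize matrix; dequantization
# simply multiplies both scale vectors back:
assert np.allclose(row_scale * Wn * col_scale, W)
print(f"imbalance before: {before:.3f}, after: {after:.3f}")
```

The imbalance drops by orders of magnitude after a handful of iterations, which is the property that makes the normalized weights well-behaved under low-bit rounding.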
2. Why should I use SINQ?
<details>
<summary>Click to expand a quick explanation of why you should use SINQ to quantize your LLM</summary>

SINQ (calibration-free)
- Higher LLM quality and ~2× faster quantization than HQQ
- Over 31× faster quantization than AWQ / GPTQ, with comparable or better LLM quality
- Model-agnostic: works without knowing the specific LLM architecture, unlike QuaRot
- Training-free: it does not require end-to-end training, unlike SpinQuant or KurTail
- Additionally, A-SINQ (calibrated) beats AWQ, GPTQ, and Hadamard+GPTQ on quality while quantizing more than 4× faster.
Example
- ⏱️ SINQ quantizes Qwen3-14B in just ~21 sec and DeepSeekV2.5-236B in ~5 min on a single GPU
- 💾 Enables you to run DeepSeekV2.5-236B on a single GPU with ~110 GB of memory (vs ~472 GB) while losing < 1 perplexity point on WikiText2 and C4
</details>
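The memory figure above is close to simple arithmetic. A back-of-envelope sketch (ignoring activations, the KV cache, and per-group scale/zero-point metadata):

```python
# Back-of-envelope weight-memory estimate for a 236B-parameter model
params = 236e9
gb_bf16 = params * 2 / 1e9    # bf16: 2 bytes per weight  -> ~472 GB
gb_int4 = params * 0.5 / 1e9  # 4-bit: 0.5 bytes per weight -> ~118 GB

print(f"bf16 weights : ~{gb_bf16:.0f} GB")
print(f"4-bit weights: ~{gb_int4:.0f} GB")
```

The exact deployed footprint (~110 GB here) also depends on which modules are quantized and on metadata overhead, so treat this as a ballpark.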
3. Quantize any LLM with SINQ
<details>
<summary><strong>Option 1) Directly run with HF Transformers</strong></summary>
<br>

There are two ways to use SINQ: directly through the Hugging Face Transformers integration, or by cloning this repository and using the full SINQ implementation.
Since SINQ is now integrated into 🤗 Hugging Face Transformers (more info here), you can quantize models directly using the native Transformers API without installing SINQ separately (SINQ only: A-SINQ is not supported in the HF integration).
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, SinqConfig

model_name = "Qwen/Qwen3-1.7B"

# Create SINQ quantization config
quant_cfg = SinqConfig(
    nbits=4,
    group_size=64,
    modules_to_not_convert=["lm_head"],
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load and quantize model in one step
qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_cfg,
    dtype=torch.bfloat16,
)

# Model is ready for inference
```
This uses the built-in Transformers integration and requires:

```bash
pip install sinq  # sinq.__version__ >= 0.1.7.post1
```
</details>
Option 2) SINQ via repo cloning
First, clone the repository and install the dependencies:
```bash
# 1. Clone the repository
git clone https://github.com/huawei-csl/SINQ.git
cd SINQ

# 2. Install dependencies
pip install -r req.txt

# 3. Install SINQ
pip install .
```
Quantize in a few lines
Quantizing any 🤗 Hugging Face model with SINQ is simple and takes only a few lines of code:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

model_name = "Qwen/Qwen3-1.7B"
device = "cuda:0"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

quant_cfg = BaseQuantizeConfig(
    nbits=4,           # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",  # tiling strategy
    method="sinq",     # quantization method ("asinq" for the calibrated version)
)

qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device=device,
)
```
✅ That’s it. Your model is now quantized with SINQ and ready for inference or saving.
Optional Flags
You can further customize the quantization process to balance accuracy and memory for your needs.
Here’s a summary of the main arguments you can tune:
| Flag | Description | Options | Default |
|------|-------------|---------|----------|
| --nbits | Bit-width for weight quantization | 2, 3, 4, 5, 6, 8 | 4 |
| --tiling_mode | Weight matrix tiling strategy | 1D, 2D | 1D |
| --group_size | Weights per quantization group | 64, 128 | 64 |
| --method | Quantization method | sinq, asinq | sinq |
💡 Tip: For most cases, the defaults (`--nbits 4 --tiling_mode 1D --group_size 64 --method sinq`) offer a good balance between accuracy and memory.