SINQ
Welcome to the official repository of SINQ! A novel, fast and high-quality quantization method designed to make any Large Language Model smaller while preserving accuracy.
⚡️ A fast, plug-and-play, model-agnostic quantization technique delivering state-of-the-art performance for Large Language Models without sacrificing accuracy.
💡 Want to run a large model on your GPU but don’t have enough memory? With SINQ, you can deploy models that would otherwise be too big, drastically reducing memory usage while preserving LLM quality.
⏱️ SINQ quantizes Qwen3-14B in just ~21 sec and DeepSeekV2.5-236B in ~5 min
News:
🆕 [18/02/2025] SINQ is now integrated into HF Transformers! 🤗
You can now use SINQ in 🤗 Transformers in a super simplified way thanks to our SinqConfig compatible with HF AutoModelForCausalLM()!
More information directly on the HF website here!
🆕 [10/02/2026] The first GGUF model with pre-SINQ! 🤗
The first GGUF model using pre-SINQ is now available in our huawei-csl/PreSINQ GGUF collection!
Thanks to our new pre-SINQ algorithm (see details here), we can finally bring the strengths of SINQhorn normalization together with the advantages of GGUF quantization! Many more models coming soon!
You can vote for the next SINQ GGUF model here!
SINQ (Sinkhorn-Normalized Quantization) is a novel, fast, high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact.
🔍 What You’ll Find Here
- 1. How does SINQ work?
- 2. Why should I use SINQ?
- <u>3. Quantize (and save) any LLM with SINQ</u>
- 4. Run pre-quantized SINQ models from Hugging Face
- 5. How to reproduce paper results
- 6. Pre-SINQ: SINQhorn normalization for GGUFs (and more)!
- 7. Ongoing updates on new features and integrations
- 8. How to Cite This Work
- 9. Related Repositories
📊 Feature Comparison: SINQ vs HQQ (calibration-free) and A-SINQ vs AWQ (calibrated)
| Feature | SINQ | HQQ | A-SINQ | AWQ |
|------------|:--------:|:--------:|:----------:|:-------:|
| 🎯 Calibration | Calibration-free | Calibration-free | Calibrated | Calibrated |
| 🧮 Quantization Type | Symmetric & Asymmetric | Asymmetric only | Symmetric & Asymmetric | Symmetric & Asymmetric |
| 📦 NF4 Support | Yes | No | Yes | No |
| ⚡ Quantization Speed | ~2× Faster than HQQ | Slower | ~4× Faster than AWQ | Slower |
| 📈 Model Quality | Higher | Lower | Higher | Lower |
📄 Want to know more? Read our paper on arXiv!
1. How does SINQ work?
<details>
<summary>Click to expand a quick explanation of SINQ’s core idea</summary>

1️⃣ Dual-Scaling for Better Quantization
<p align="left"> <img src="imgs/dualscale.png" alt="Dual Scale Illustration" width="330" align="right" style="margin-left: 20px;"/> </p>Conventional quantization uses one scale per weight dimension, which makes models vulnerable to outliers: large weights that distort scaling and cause significant errors.
SINQ solves this by introducing dual scaling: separate scale factors for rows and columns. This flexibility redistributes outlier influence and keeps quantization errors smaller and more balanced.
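To see why a single per-axis scale is fragile, here is a small NumPy sketch (illustrative only, not the repository's kernel; the `rtn_rowwise` helper is hypothetical). One outlier inflates the round-to-nearest scale of its entire row, so every other weight in that row collapses to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
W[3, 5] = 50.0  # one large outlier in row 3

def rtn_rowwise(W, nbits=4):
    """Round-to-nearest with a single symmetric scale per row."""
    qmax = 2 ** (nbits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.round(W / scale).clip(-qmax - 1, qmax) * scale

err = np.abs(W - rtn_rowwise(W))
err_outlier_row = err[3].mean()                    # row whose scale the outlier inflated
err_other_rows = np.delete(err, 3, axis=0).mean()  # unaffected rows
print(f"error in outlier row: {err_outlier_row:.3f}")
print(f"error in other rows : {err_other_rows:.3f}")
```

On this toy matrix the outlier's row incurs several times the error of the other rows, because its quantization step is set by the single large weight. Dual scaling exists precisely to split that influence between a row scale and a column scale.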
2️⃣ More Even Error Distribution
<p align="left"> <img src="imgs/error.png" alt="Error Distribution Comparison" width="370" align="right" style="margin-left: 20px;"/> </p>With standard single-scale quantization, errors tend to cluster around outliers.
With SINQ, errors become spread out and less severe, preserving model accuracy even at 3-bit precision. This improvement is driven by SINQ’s Sinkhorn-normalized optimization, which iteratively rescales rows and columns to balance their variance, a process inspired by Sinkhorn matrix normalization. By reducing the overall matrix imbalance (refer to the paper for more info), the weights become inherently easier to quantize, leading to more stable behavior across layers and consistently higher accuracy even at very low bit-widths.
</details>
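The balancing idea can be sketched in a few lines of NumPy. This is a simplified illustration under our own assumptions (the `imbalance` metric below is a stand-in for the paper's imbalance measure, and the loop is not the repository's implementation): alternately dividing rows and columns by their standard deviation drives the matrix toward uniform row/column spread, and the two accumulated scale vectors are exactly the dual scales multiplied back at dequantization.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
W[3, 5] = 50.0  # an outlier makes row/column spreads very uneven

def imbalance(M):
    """Spread of the row and column standard deviations (0 = balanced)."""
    return M.std(axis=1).std() + M.std(axis=0).std()

row_scale = np.ones((32, 1))
col_scale = np.ones((1, 32))
Wn = W.copy()

before = imbalance(Wn)
for _ in range(16):  # Sinkhorn-style alternating normalization
    c = Wn.std(axis=0, keepdims=True)
    Wn /= c
    col_scale *= c
    r = Wn.std(axis=1, keepdims=True)
    Wn /= r
    row_scale *= r
after = imbalance(Wn)

# Wn is the balanced, easier-to-quantize matrix; dequantization
# simply multiplies both scale vectors back:
assert np.allclose(row_scale * Wn * col_scale, W)
print(f"imbalance before: {before:.3f}, after: {after:.3f}")
```

The imbalance drops by orders of magnitude after a handful of iterations, which is the property that makes the normalized weights well-behaved under low-bit rounding.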
2. Why should I use SINQ?
<details>
<summary>Click to expand a quick explanation of why you should use SINQ to quantize your LLM</summary>

SINQ (calibration-free)
- Higher LLM quality and ~2× faster quantization than HQQ
- Over 31× faster quantization than AWQ / GPTQ, with comparable or better LLM quality
- Model-agnostic: works without knowing the specific LLM architecture, unlike QuaRot
- Training-free: it does not require end-to-end training, unlike SpinQuant or KurTail
- Additionally, A-SINQ (calibrated) beats AWQ, GPTQ, and Hadamard+GPTQ on quality while quantizing more than 4× faster.
Example
- ⏱️ SINQ quantizes Qwen3-14B in just ~21 sec and DeepSeekV2.5-236B in ~5 min on a single GPU
- 💾 Enables you to run DeepSeekV2.5-236B on a single GPU with ~110 GB of memory (vs ~472 GB) while losing < 1 perplexity point on WikiText2 and C4
</details>
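The memory figure above is close to simple arithmetic. A back-of-envelope sketch (ignoring activations, the KV cache, and per-group scale/zero-point metadata):

```python
# Back-of-envelope weight-memory estimate for a 236B-parameter model
params = 236e9
gb_bf16 = params * 2 / 1e9    # bf16: 2 bytes per weight  -> ~472 GB
gb_int4 = params * 0.5 / 1e9  # 4-bit: 0.5 bytes per weight -> ~118 GB

print(f"bf16 weights : ~{gb_bf16:.0f} GB")
print(f"4-bit weights: ~{gb_int4:.0f} GB")
```

The exact deployed footprint (~110 GB here) also depends on which modules are quantized and on metadata overhead, so treat this as a ballpark.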
3. Quantize any LLM with SINQ
<details>
<summary><strong>Option 1) Directly run with HF Transformers</strong></summary>
<br>

There are two ways to use SINQ: directly through the Hugging Face Transformers integration, or by cloning this repository and using the full SINQ implementation.
Since SINQ is now integrated into 🤗 Hugging Face Transformers (more info here), you can quantize models directly using the native Transformers API without installing SINQ separately (SINQ only: A-SINQ is not supported in the HF integration).
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, SinqConfig

model_name = "Qwen/Qwen3-1.7B"

# Create SINQ quantization config
quant_cfg = SinqConfig(
    nbits=4,
    group_size=64,
    modules_to_not_convert=["lm_head"],
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load and quantize model in one step
qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_cfg,
    dtype=torch.bfloat16,
)

# Model is ready for inference
```
This uses the built-in Transformers integration and requires:

```bash
pip install sinq  # sinq.__version__ >= 0.1.7.post1
```
</details>
Option 2) SINQ via repo cloning
First, clone the repository and install the dependencies:
```bash
# 1. Clone the repository
git clone https://github.com/huawei-csl/SINQ.git
cd SINQ

# 2. Install dependencies
pip install -r req.txt

# 3. Install SINQ
pip install .
```
Quantize in a few lines
Quantizing any 🤗 Hugging Face model with SINQ is simple and takes only a few lines of code:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

model_name = "Qwen/Qwen3-1.7B"
device = "cuda:0"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

quant_cfg = BaseQuantizeConfig(
    nbits=4,           # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",  # tiling strategy
    method="sinq",     # quantization method ("asinq" for the calibrated version)
)

qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device=device,
)
```
✅ That’s it. Your model is now quantized with SINQ and ready for inference or saving.
Optional Flags
You can further customize the quantization process to balance accuracy and memory for your needs.
Here’s a summary of the main arguments you can tune:
| Flag | Description | Options | Default |
|------|-------------|---------|----------|
| --nbits | Bit-width for weight quantization | 2, 3, 4, 5, 6, 8 | 4 |
| --tiling_mode | Weight matrix tiling strategy | 1D, 2D | 1D |
| --group_size | Weights per quantization group | 64, 128 | 64 |
| --method | Quantization method | sinq, asinq | sinq |
💡 Tip: For most cases, the defaults (`--nbits 4 --tiling_mode 1D --group_size 64 --method sinq`) offer a good balance between accuracy and memory.