TorchAO
PyTorch-Native Training-to-Serving Model Optimization
PyTorch native quantization and sparsity for training and inference
- Pre-train Llama-3.1-70B 1.5x faster with float8 training
- Recover 67% of quantized accuracy degradation on Gemma3-4B with QAT
- Quantize Llama-3-8B to int4 for 1.89x faster inference with 58% less memory
Latest News | Overview | Quick Start | Installation | Integrations | Inference | Training | Videos | Citation
📣 Latest News
- [Oct 25] QAT is now integrated into Unsloth for both full and LoRA fine-tuning! Try it out using this notebook.
- [Oct 25] MXFP8 MoE training prototype achieved ~1.45x speedup for MoE layer in Llama4 Scout, and ~1.25x speedup for MoE layer in DeepSeekV3 671b - with comparable numerics to bfloat16! Check out the docs to try it out.
- [Sept 25] MXFP8 training achieved 1.28x speedup on Crusoe B200 cluster with virtually identical loss curve to bfloat16!
- [Sept 19] TorchAO Quantized Model and Quantization Recipes Now Available on Huggingface Hub!
- [Jun 25] Our TorchAO paper was accepted to CodeML @ ICML 2025!
- [May 25] QAT is now integrated into Axolotl for fine-tuning (docs)!
- [Apr 25] Float8 rowwise training yielded 1.34-1.43x training speedup at 2k H100 GPU scale
- [Apr 25] TorchAO is added as a quantization backend to vLLM (docs)!
- [Mar 25] Our 2:4 Sparsity paper was accepted to SLLM @ ICLR 2025!
- [Jan 25] Our integration with GemLite and SGLang yielded 1.1-2x faster inference with int4 and float8 quantization across different batch sizes and tensor parallel sizes
- [Jan 25] We added 1-8 bit ARM CPU kernels for linear and embedding ops
- [Nov 24] We achieved 1.43-1.51x faster pre-training on Llama-3.1-70B and 405B using float8 training
- [Oct 24] TorchAO is added as a quantization backend to HF Transformers!
- [Sep 24] We officially launched TorchAO. Check out our blog here!
- [Jul 24] QAT recovered up to 96% accuracy degradation from quantization on Llama-3-8B
- [Jun 24] Semi-structured 2:4 sparsity achieved 1.1x inference speedup and 1.3x training speedup on the SAM and ViT models respectively
- [Jun 24] Block sparsity achieved 1.46x training speedup on the ViT model with <2% drop in accuracy
🌅 Overview
TorchAO is an easy-to-use quantization library for native PyTorch. TorchAO works out-of-the-box with torch.compile() and FSDP2 across most HuggingFace PyTorch models.
For a detailed overview of stable and prototype workflows for different hardware and dtypes, see the Workflows documentation.
Check out our docs for more details!
🚀 Quick Start
First, install TorchAO. We recommend installing the latest stable version:
pip install torchao
Quantize your model weights to int4!
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# `model` is any torch.nn.Module, e.g. a loaded HuggingFace transformer
if torch.cuda.is_available():
    # quantize on CUDA
    quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
elif torch.xpu.is_available():
    # quantize on XPU
    quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="plain_int32"))
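For intuition on what `group_size` means here: group-wise int4 quantization splits each weight tensor into groups of `group_size` values and stores one scale per group. Below is a minimal pure-Python sketch of symmetric group-wise int4 quantization for illustration only; it is not the torchao implementation, which uses packed tensor formats and fused kernels.

```python
# Illustrative sketch of symmetric group-wise int4 quantization.
# Not torchao's implementation -- torchao uses packed formats and fused kernels.

def quantize_groupwise_int4(weights, group_size=4):
    """Quantize a flat list of floats to int4 codes (-8..7), one scale per group."""
    qvals, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group) or 1.0
        scale = max_abs / 7.0  # map the largest magnitude in the group to the int4 max
        scales.append(scale)
        qvals.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qvals, scales

def dequantize_groupwise_int4(qvals, scales, group_size=4):
    """Reconstruct approximate floats from int4 codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]

w = [0.1, -0.7, 0.35, 0.02, 1.4, -0.9, 0.5, 0.25]
q, s = quantize_groupwise_int4(w, group_size=4)
w_hat = dequantize_groupwise_int4(q, s, group_size=4)
print("codes:", q)
print("max error:", max(abs(a - b) for a, b in zip(w, w_hat)))
```

Smaller groups track local weight ranges more tightly (lower quantization error) at the cost of storing more scales; `group_size=32` in the snippet above is a common middle ground.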
See our quick start guide for more details.
🛠 Installation
To install the latest stable version:
pip install torchao
<details>
<summary>Other installation options</summary>
# Nightly
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128
# Different CUDA versions
pip install torchao --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6
pip install torchao --index-url https://download.pytorch.org/whl/cu129 # CUDA 12.9
pip install torchao --index-url https://download.pytorch.org/whl/xpu # XPU
pip install torchao --index-url https://download.pytorch.org/whl/cpu # CPU only
# For developers
# Note: the `--no-build-isolation` flag is required.
USE_CUDA=1 pip install -e . --no-build-isolation
USE_XPU=1 pip install -e . --no-build-isolation
USE_CPP=0 pip install -e . --no-build-isolation
</details>
Please see the torchao compatibility table for version requirements for dependencies.
Optional Dependencies
MSLK is an optional runtime dependency that provides accelerated kernels for some of the workflows in torchao. Stable MSLK should be used with stable torchao, and nightly MSLK with nightly torchao.
# Stable
pip install mslk-cuda==1.0.0
# Nightly
pip install --pre mslk --index-url https://download.pytorch.org/whl/nightly/cu128
🔎 Inference
TorchAO delivers substantial performance gains with minimal code changes:
- Int4 weight-only: 1.73x speedup with 65% less memory for Gemma3-12b-it on H100 with slight impact on accuracy
- Float8 dynamic quantization: 1.5-1.6x speedup on gemma-3-27b-it and 1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively on H100 with preserved quality
- Int8 activation quantization and int4 weight quantization: Quantized Qwen3-4B running with 14.8 tokens/s with 3379 MB memory usage on iPhone 15 Pro through ExecuTorch
- Int4 + 2:4 Sparsity: 2.37x throughput with 67.7% memory reduction on Llama-3-8B
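The memory savings above follow from the storage format: int4 weight-only keeps activations in high precision but stores each weight in 4 bits plus one scale per group. A back-of-envelope calculation for a hypothetical 4096x4096 linear layer with group_size=32 and bf16 scales (the exact layer shape and scale dtype are assumptions for illustration):

```python
# Back-of-envelope weight storage for one hypothetical 4096x4096 linear layer.
rows, cols, group_size = 4096, 4096, 32

bf16_bytes = rows * cols * 2                   # 2 bytes per bf16 weight
int4_bytes = rows * cols // 2                  # two int4 weights packed per byte
scale_bytes = (rows * cols // group_size) * 2  # one bf16 scale per group
quantized_bytes = int4_bytes + scale_bytes

reduction = 1 - quantized_bytes / bf16_bytes
print(f"bf16: {bf16_bytes / 2**20:.1f} MiB, int4+scales: {quantized_bytes / 2**20:.1f} MiB")
print(f"weight memory reduction: {reduction:.1%}")
```

This gives roughly a 72% reduction for the weights alone; end-to-end figures like the 65% above are lower because activations, the KV cache, and unquantized layers still occupy full-precision memory.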
The following is our recommended flow for quantization and deployment:
from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

# Create quantization configuration
quantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# Load and automatically quantize
quantized_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",
    dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
If the above doesn't work for your model, use the quantize_ API from the quick start guide instead.
Serving with vLLM on a 1xH100 machine:
# Server
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B -O3
# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "pytorch/Qwen3-32B-FP8",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
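Because the server exposes an OpenAI-compatible endpoint, the same request can also be issued from Python with only the standard library. A minimal sketch, assuming the vLLM server above is running on localhost:8000 (the payload mirrors the curl call):

```python
import json
import urllib.request

# Payload mirrors the curl example above.
payload = {
    "model": "pytorch/Qwen3-32B-FP8",
    "messages": [
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "max_tokens": 32768,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires the vLLM server from the previous step to be running:
# with urllib.request.urlopen(req) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```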
For diffusion models, you can quantize using Hugging Face diffusers:
import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig
from torchao.quantization.granularity import PerGroup
pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(granularity=PerGroup(128)))}
)

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
We also support deployment to edge devices through ExecuTorch; for more detail, see the quantization and serving guide. We also release pre-quantized models here.
🚅 Training
Quantization-Aware Training