
<div align="center">

TorchAO

</div>

PyTorch-Native Training-to-Serving Model Optimization

  • Pre-train Llama-3.1-70B 1.5x faster with float8 training
  • Recover 67% of quantized accuracy degradation on Gemma3-4B with QAT
  • Quantize Llama-3-8B to int4 for 1.89x faster inference with 58% less memory
<div align="center">

license

Latest News | Overview | Quick Start | Installation | Integrations | Inference | Training | Videos | Citation

</div>

📣 Latest News

<details> <summary>Older news</summary> </details>

🌅 Overview

TorchAO is an easy-to-use quantization library for native PyTorch. TorchAO works out of the box with torch.compile() and FSDP2 across most Hugging Face PyTorch models.

For a detailed overview of stable and prototype workflows for different hardware and dtypes, see the Workflows documentation.

Check out our docs for more details!

🚀 Quick Start

First, install TorchAO. We recommend installing the latest stable version:

pip install torchao

Quantize your model weights to int4!

import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# `model` is any torch.nn.Module, e.g. a loaded Hugging Face model
if torch.cuda.is_available():
  # quantize on CUDA
  quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
elif torch.xpu.is_available():
  # quantize on XPU
  quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="plain_int32"))

See our quick start guide for more details.

🛠 Installation

To install the latest stable version:

pip install torchao
<details> <summary>Other installation options</summary>
# Nightly
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128

# Different CUDA versions
pip install torchao --index-url https://download.pytorch.org/whl/cu126  # CUDA 12.6
pip install torchao --index-url https://download.pytorch.org/whl/cu129  # CUDA 12.9
pip install torchao --index-url https://download.pytorch.org/whl/xpu    # XPU
pip install torchao --index-url https://download.pytorch.org/whl/cpu    # CPU only

# For developers
# Note: the `--no-build-isolation` flag is required.
USE_CUDA=1 pip install -e . --no-build-isolation
USE_XPU=1 pip install -e . --no-build-isolation
USE_CPP=0 pip install -e . --no-build-isolation
</details>

Please see the torchao compatibility table for version requirements for dependencies.

Optional Dependencies

MSLK is an optional runtime dependency that provides accelerated kernels for some of the workflows in torchao. Stable MSLK should be used with stable torchao, and nightly MSLK with nightly torchao.

# Stable
pip install mslk-cuda==1.0.0

# Nightly
pip install --pre mslk --index-url https://download.pytorch.org/whl/nightly/cu128

🔎 Inference

TorchAO delivers substantial performance gains with minimal code changes. The following is our recommended flow for quantization and deployment:

from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

# Create quantization configuration
quantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# Load and automatically quantize
quantized_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",
    dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)

If the flow above doesn't work for your model, use the quantize_ API from the quick start guide instead.

Serving with vLLM on a 1xH100 machine:

# Server
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B -O3
# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "pytorch/Qwen3-32B-FP8",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
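The same request can also be issued from Python; a sketch using only the standard library, with the endpoint and payload mirroring the curl call above:

```python
import json
import urllib.request

# Same chat-completions request as the curl example.
payload = {
    "model": "pytorch/Qwen3-32B-FP8",
    "messages": [
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "max_tokens": 32768,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the vllm server above is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```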

For diffusion models, you can quantize using Hugging Face Diffusers:

import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig
from torchao.quantization.granularity import PerGroup

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(granularity=PerGroup(128)))}
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

We also support deployment to edge devices through ExecuTorch; for more detail, see the quantization and serving guide. Pre-quantized models are released here.

🚅 Training

Quantization-Aware Training
