Aimet
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.

AI Model Efficiency Toolkit (AIMET)
<a href="https://quic.github.io/aimet-pages/index.html">AIMET</a> is a software toolkit for quantizing trained ML models.
AIMET improves the runtime performance of deep learning models by reducing their compute load and memory footprint, which makes quantized models easier to deploy on edge devices such as mobile phones and laptops.
AIMET employs post-training and fine-tuning techniques to minimize accuracy loss during quantization and compression.

AIMET is designed to work with PyTorch and ONNX models.
You can find models quantized with AIMET on Qualcomm AI Hub Models - a collection of optimized and quantized models.
Why AIMET?

- Advanced quantization techniques: Inference using integer runtimes is significantly faster than using floating-point runtimes. For example, models run 5x-15x faster on the Qualcomm Hexagon DSP than on the Qualcomm Kryo CPU. In addition, 8-bit precision models have a 4x smaller footprint than 32-bit precision models. However, maintaining model accuracy when quantizing ML models is often challenging. AIMET solves this using novel techniques like Data-Free Quantization that provide state-of-the-art INT8 results on several popular models.
- Supports advanced model compression techniques that enable models to run faster at inference time and require less memory.
- AIMET is designed to automate optimization of neural networks, avoiding time-consuming and tedious manual tweaking. AIMET also provides user-friendly APIs that allow users to make calls directly from their PyTorch pipelines.
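To make the 4x footprint claim concrete, here is a minimal NumPy sketch of asymmetric 8-bit quantization. It is not AIMET's API: the helper names `quantize_int8` and `dequantize` are hypothetical, and the min/max range selection is only one simple way to pick quantization parameters.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Asymmetric 8-bit quantization using the min/max of x (illustrative only)."""
    x_min = min(float(x.min()), 0.0)   # the representable range must include zero
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / 255.0
    if scale == 0.0:                   # all-zero tensor: any positive scale works
        scale = 1.0
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map 8-bit codes back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)

q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

footprint_ratio = weights.nbytes / q.nbytes              # 4 bytes vs 1 byte per value
max_error = float(np.abs(weights - restored).max())      # at most about one grid step
print(footprint_ratio, max_error, scale)
```

The stored tensor shrinks by exactly the ratio of the element sizes, while the round-trip error stays within roughly one quantization step. The hard part, which the techniques in this README address, is keeping end-to-end model accuracy when every layer carries this kind of error.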
Please visit AIMET on GitHub Pages for more details.
Quick Start
aimet-onnx and aimet-torch are available on PyPI.
See our Quick Start guide to get started with the latest AIMET package.
Build from source
To build the latest AIMET code from source, see "Build, install and run AIMET from source in Docker environment".
Supported Features
Post-Training Quantization (PTQ)
Check out the guide to get started with PTQ techniques.
The following table summarizes the techniques you can use with AIMET, from basic ones such as Calibration to advanced ones such as SeqMSE and Adaptive Rounding (AdaRound).
| Technique | ONNX | PyTorch | What does it do? |
| -- | -- | -- | -- |
| Calibration | ✅ | ✅ | Computes Quantization parameters |
| AdaRound | ✅ | ✅ | Rounds quantized weights |
| SeqMSE | ✅ | ✅ | Optimizes encodings for each layer |
| BatchNorm Folding | ✅ | ✅ | Folds batchnorm to bridge the gap between simulation and on-target |
| Cross Layer Equalization | ✅ | ✅ | Rescales the weight to reduce range imbalance |
| BatchNorm re-estimation | ✅ | ✅ | Re-estimates batchnorm statistics |
| AdaScale | ✅ | ✅ | Optimizes quantized weights |
| OmniQuant | ❌ | ✅ | Optimizes quantized weights |
| SpinQuant | ❌ | ✅ | Optimizes quantized weights |
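As an illustration of one row in the table, BatchNorm folding can be sketched in plain NumPy for a linear layer followed by batch normalization. This is a sketch of the general idea, not AIMET's implementation; `fold_batchnorm` and the tensor shapes are assumptions made for the example.

```python
import numpy as np

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (W x + b - mean) / sqrt(var + eps) + beta into one linear layer."""
    factor = gamma / np.sqrt(var + eps)           # one factor per output channel
    folded_weight = weight * factor[:, None]      # scale each output row of W
    folded_bias = (bias - mean) * factor + beta
    return folded_weight, folded_bias

rng = np.random.default_rng(0)
out_features, in_features = 4, 8
weight = rng.normal(size=(out_features, in_features))
bias = rng.normal(size=out_features)
gamma = rng.normal(size=out_features)
beta = rng.normal(size=out_features)
mean = rng.normal(size=out_features)
var = rng.uniform(0.5, 2.0, size=out_features)

x = rng.normal(size=(16, in_features))

# Reference: linear layer followed by inference-mode batch normalization.
reference = gamma * ((x @ weight.T + bias) - mean) / np.sqrt(var + 1e-5) + beta

# Folded: a single linear layer with adjusted weights and bias.
folded_weight, folded_bias = fold_batchnorm(weight, bias, gamma, beta, mean, var)
folded = x @ folded_weight.T + folded_bias

print(np.allclose(reference, folded))
```

Because the folded layer computes the same function, the weights that get quantized are the ones a runtime that fuses batchnorm would actually execute, which is the "gap between simulation and on-target" the table refers to.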
Quantization Aware Training (QAT)
AIMET supports Quantization Aware Training (QAT) via aimet-torch.
If you want to use both QAT and some of the advanced PTQ techniques from AIMET, we recommend the following workflow:

Check the detailed QAT guide here.
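For readers new to QAT, the core idea can be sketched without any framework: the forward pass uses quantize-then-dequantize ("fake quantized") weights, while gradients update full-precision weights as if the rounding were the identity (the straight-through estimator). The toy regression below is an assumption-laden illustration in NumPy, not the aimet-torch API; `fake_quantize`, the 4-bit setting, and the learning rate are all hypothetical choices.

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Quantize then dequantize w on a symmetric grid (simulated quantization)."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = max(float(np.abs(w).max()), 1e-8) / q_max
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))
true_w = np.array([0.7, -1.3, 0.25, 2.0])
y = x @ true_w

w = np.zeros(4)          # full-precision weights kept during training
losses = []
for _ in range(200):
    w_q = fake_quantize(w, num_bits=4)       # forward pass sees quantized weights
    err = x @ w_q - y
    losses.append(float(np.mean(err ** 2)))
    grad_w_q = 2.0 * x.T @ err / len(x)      # gradient with respect to w_q
    w -= 0.05 * grad_w_q                     # straight-through: treat d(w_q)/d(w) as 1

print(losses[0], losses[-1])
```

The loss is always measured with the quantized weights, so training steers the full-precision weights toward values that still work after rounding.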
Model Compression
- Spatial SVD: Tensor decomposition technique to split a large layer into two smaller ones
- Channel Pruning: Removes redundant input channels from a layer and reconstructs layer weights
- Per-layer compression-ratio selection: Automatically selects how much to compress each layer in the model
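The idea of splitting one large layer into two smaller ones can be sketched with a truncated SVD of a fully connected layer. This is a simplified analogue, not the Spatial SVD that AIMET applies to convolution kernels; `low_rank_split`, the rank, and the synthetic near-low-rank weights are assumptions for illustration.

```python
import numpy as np

def low_rank_split(weight, rank):
    """Split one (out, in) weight matrix into two smaller ones via truncated SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    first = s[:rank, None] * vt[:rank]    # (rank, in): applied to the input first
    second = u[:, :rank]                  # (out, rank): applied to the result
    return first, second

rng = np.random.default_rng(0)
# Hypothetical layer whose weights are close to rank 8, plus small noise.
weight = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 128))
weight += 0.01 * rng.normal(size=(64, 128))

first, second = low_rank_split(weight, rank=8)

x = rng.normal(size=(32, 128))
original_out = x @ weight.T
compressed_out = (x @ first.T) @ second.T

param_ratio = (first.size + second.size) / weight.size
relative_error = np.linalg.norm(original_out - compressed_out) / np.linalg.norm(original_out)
print(param_ratio, relative_error)
```

Real layers are rarely this close to low rank, which is why choosing a per-layer compression ratio matters: the same rank can be harmless in one layer and damaging in another.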
Visualization
- Weight ranges: Visually inspect whether a model is a candidate for the Cross Layer Equalization technique, and see the effect after applying it
- Per-layer compression sensitivity: Get visual feedback about how sensitive any given layer in the model is to compression
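The quantity behind the weight-range view can be computed directly. The sketch below measures per-channel weight ranges for two consecutive linear layers with a ReLU in between, then applies a cross-layer rescaling in the spirit of Cross Layer Equalization. It is a NumPy illustration under those assumptions, not AIMET's visualization or equalization API; `channel_ranges` and `equalize_pair` are hypothetical names.

```python
import numpy as np

def channel_ranges(weight, axis):
    """Largest absolute weight per channel."""
    return np.abs(weight).max(axis=axis)

def relu(t):
    return np.maximum(t, 0.0)

def equalize_pair(w1, b1, w2):
    """Rescale two consecutive linear layers so matching channel ranges agree."""
    r1 = channel_ranges(w1, axis=1)       # per output channel of layer 1
    r2 = channel_ranges(w2, axis=0)       # per input channel of layer 2
    s = np.sqrt(r1 / r2)                  # positive, so ReLU commutes with the scaling
    return w1 / s[:, None], b1 / s, w2 * s[None, :]

rng = np.random.default_rng(0)
channel_gain = np.array([0.01, 0.1, 1.0, 10.0, 100.0, 1.0])   # deliberately imbalanced
w1 = rng.normal(size=(6, 5)) * channel_gain[:, None]
b1 = rng.normal(size=6)
w2 = rng.normal(size=(3, 6))

w1_eq, b1_eq, w2_eq = equalize_pair(w1, b1, w2)

x = rng.normal(size=(10, 5))
before = relu(x @ w1.T + b1) @ w2.T
after = relu(x @ w1_eq.T + b1_eq) @ w2_eq.T

spread_before = channel_ranges(w1, 1).max() / channel_ranges(w1, 1).min()
spread_after = channel_ranges(w1_eq, 1).max() / channel_ranges(w1_eq, 1).min()
print(spread_before, spread_after, np.allclose(before, after))
```

A large spread between the widest and narrowest channel in a layer is the signal to look for in a weight-range plot. After rescaling, each channel of the first layer has the same range as the matching input channel of the second layer, and the network output is unchanged.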
Results
AIMET can quantize an existing 32-bit floating-point model to an 8-bit fixed-point model without sacrificing much accuracy and without model fine-tuning.
<h4>DFQ</h4>The DFQ method applied to several popular networks, such as MobileNet-v2 and ResNet-50, results in less than 0.9% loss in accuracy all the way down to 8-bit quantization, in an automated way without any training data.
<table style="width:50%"> <tr> <th style="width:80px">Models</th> <th>FP32</th> <th>INT8 Simulation </th> </tr> <tr> <td>MobileNet v2 (top1)</td> <td align="center">71.72%</td> <td align="center">71.08%</td> </tr> <tr> <td>ResNet 50 (top1)</td> <td align="center">76.05%</td> <td align="center">75.45%</td> </tr> <tr> <td>DeepLab v3 (mIOU)</td> <td align="center">72.65%</td> <td align="center">71.91%</td> </tr> </table> <br> <h4>AdaRound (Adaptive Rounding)</h4> <h5>ADAS Object Detect</h5> <p>For this example ADAS object detection model, which was challenging to quantize to 8-bit precision, AdaRound can recover the accuracy to within 1% of the FP32 accuracy.</p> <table style="width:50%"> <tr> <th style="width:80px" colspan="15">Configuration</th> <th>mAP - Mean Average Precision</th> </tr> <tr> <td colspan="15">FP32</td> <td align="center">82.20%</td> </tr> <tr> <td colspan="15">Nearest Rounding (INT8 weights, INT8 acts)</td> <td align="center">49.85%</td> </tr> <tr> <td colspan="15">AdaRound (INT8 weights, INT8 acts)</td> <td align="center" bgcolor="#add8e6">81.21%</td> </tr> </table> <h5>DeepLabv3 Semantic Segmentation</h5> <p>For some models like the DeepLabv3 semantic segmentation model, AdaRound can even quantize the model weights to 4-bit precision without a significant drop in accuracy.</p> <table style="width:50%"> <tr> <th style="width:80px" colspan="15">Configuration</th> <th>mIOU - Mean intersection over union</th> </tr> <tr> <td colspan="15">FP32</td> <td align="center">72.94%</td> </tr> <tr> <td colspan="15">Nearest Rounding (INT4 weights, INT8 acts)</td> <td align="center">6.09%</td> </tr> <tr> <td colspan="15">AdaRound (INT4 weights, INT8 acts)</td> <td align="center" bgcolor="#add8e6">70.86%</td> </tr> </table> <br> <h4>Quantization for Recurrent Models</h4> <p>AIMET supports quantization simulation and quantization-aware training (QAT) for recurrent models (RNN, LSTM, GRU). 
Using the QAT feature in AIMET, a DeepSpeech2 model with bi-directional LSTMs can be quantized to 8-bit precision with minimal drop in accuracy.</p> <table style="width:50%"> <tr> <th>DeepSpeech2 <br>(using bi-directional LSTMs)</th> <th>Word Error Rate</th> </tr> <tr> <td>FP32</td> <td align="center">9.92%</td> </tr> <tr> <td>INT8</td> <td align="center">10.22%</td> </tr> </table> <br> <h4>Model Compression</h4> <p>AIMET can also significantly compress models. For popular models, such as ResNet-50 and ResNet-18, compression with spat
