AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
Install / Use
/learn @AniZpZ/AutoSmoothQuantREADME
AutoSmoothQuant
AutoSmoothQuant is an easy-to-use package for implementing smoothquant for LLMs. AutoSmoothQuant speeds up model inference under various workloads. AutoSmoothQuant was created and improved upon from the original work from MIT.
News or Update
- [2024/03] We support model evaluation with lm-evaluation-harness
- [2024/02] We support Mixtral and Baichuan model
Install
Prerequisites
- Your GPU(s) must be of Compute Capability 8.0 or higher. Amphere and later architectures are supported.
- Your CUDA version must be CUDA 11.4 or later.
- Python 3.9+
Build from source
Currently this repo only support build form source. We will release package soon.
git clone https://github.com/AniZpZ/AutoSmoothQuant.git
cd AutoSmoothQuant
pip install -e .
Usage
quantize model
First add a config file named "quant_config.json" to model path. For currenttly supported models, config should be like:
{
"qkv": "per-tensor",
"out": "per-tensor",
"fc1": "per-tensor",
"fc2": "per-tensor",
"type": "int8"
}
"qkv" stands for QKV matmul of attention, "out" stands for out matmul of attention. "fc1" and "fc2" are the layers of the FFNs, which might be referred to as "gate_up" and "down" in Llama-like models. You can set the value to "per-tensor" or "per-token" to perform the quant granularity you want. "type" stands for which kind of datatype you want to quantize model into. Currently we support int8 and fp8(e4m3).
Once config is set, generate scales and do model quantization with following command:
cd autosmoothquant/examples
python3 smoothquant_model.py --model-path /path/to/model --dataset-path /path/to/dataset --smooth-strength 0.5 --quantize-model --generate-scale
use following command for more information
python smoothquant_model.py -help
inference
-
inference with vLLM
Comming soon (this PR could be reference)
If you want to test quantized models with the PR mentioned above, only Llama is supported and quant config should be
{ "qkv": "per-tensor", "out": "per-token", "fc1": "per-tensor", "fc2": "per-token", "type": "int8" } -
inference in this repo
cd autosmoothquant/examples
python3 test_model.py --model-path /path/to/model --tokenizer-path /path/to/tokenizer --model-class llama --prompt="something to say"
benchmark
inference speed
Comming soon (this PR could be reference)
model evaluation
Currently you need to install latest lm-evaluation-harness from source in the same path with AutoSmoothQuant repo.
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
cd ../AutoSmoothQuant/autosmoothquant/example
python3 eval_model.py -model-path=/path/to/model --tokenizer-path=/path/to/tokenizer
Supported models
Model support list:
| Models | Sizes | | ---------| ----------------------------| | LLaMA-2 | 7B/13B/70B | | LLaMA | 7B/13B/30B/65B | | Mixtral | 8*7B | | OPT | 6.7B/13B/30B | | Baichuan-2 | 7B/13B | | Baichuan | 7B/13B |
Performance and inference efficency
Detailed data comming soon
Cases:
codellama-13b with A40. Tested with vLLM
llama-13b with A100. Tested with vLLM
Reference
If you find SmoothQuant useful or relevant to your research, please cite their paper:
@InProceedings{xiao2023smoothquant,
title = {{S}mooth{Q}uant: Accurate and Efficient Post-Training Quantization for Large Language Models},
author = {Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song},
booktitle = {Proceedings of the 40th International Conference on Machine Learning},
year = {2023}
}
Related Skills
node-connect
344.1kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
96.8kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
344.1kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
344.1kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
