Flute
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
<a href="https://pypi.org/project/flute-kernel/">
</a>
<a href="https://arxiv.org/abs/2407.10960">
</a>
[Background] [Benchmarks] [Getting Started] [Compatibility] [Model Zoo]
</div>

Update
- February, 2025. HIGGS will appear in NAACL 2025.
- January 9, 2025. Added (very) experimental support for removing specialization on shapes + GPU via auto-tune.
- December 12, 2024. Added support for Hadamard Transform (via HadaCore).
- November 26, 2024. Added support for vector (de)quantization (`vector_size=2`), as part of HIGGS.
- October 5, 2024. FLUTE will appear in EMNLP 2024 (Findings).
- September 15, 2024. Added experimental support for loading pre-quantized FLUTE models in HuggingFace.
- September 6, 2024. Added (unlearned) NF-quantized LLaMA-3.1 (405B) models: base and instruction tuned.
- August 31, 2024. Added support and example for the Learned Normal Float (NFL) quantization.
- August 26, 2024. Added support for converting a `bitsandbytes` model into a FLUTE model.
- August 5, 2024. Added quantized LLaMA-3.1 (8B/70B) models.
- August 2, 2024. Added support for RTX4090.
- July 27, 2024. Added support for LLaMA-3.1 (405B) and tuned BF16 performance. FP16 is still the recommended data type, especially for 3-bit settings.
Installation
Install FLUTE with pip or from source:
# For CUDA 12.1
pip install flute-kernel
# For CUDA 11.8
pip install flute-kernel -i https://flute-ai.github.io/whl/cu118
# For CUDA 12.4
pip install flute-kernel -i https://flute-ai.github.io/whl/cu124
Head over to Getting Started and try it out!
Background
Uniform quantization converts full precision weights to lower-precision intervals of equal size. Lookup table (LUT) quantization is a flexible variant of non-uniform quantization which can map intervals to arbitrary values via a lookup table.
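To make the difference concrete, here is a minimal pure-Python sketch of the two de-quantization schemes with group-wise scales. The function names, the toy codes, and the `custom_table` values are illustrative assumptions, not part of the FLUTE API; the actual kernel fuses this lookup into the matrix multiplication on the GPU.

```python
from typing import List, Sequence


def dequantize_uniform(q: Sequence[int], scales: Sequence[float],
                       group_size: int) -> List[float]:
    """Uniform scheme: each integer code is cast to float and scaled by its group's scale."""
    return [float(code) * scales[i // group_size] for i, code in enumerate(q)]


def dequantize_lut(q: Sequence[int], table: Sequence[float],
                   scales: Sequence[float], group_size: int) -> List[float]:
    """Lookup-table scheme: each code indexes a table entry, which is then scaled."""
    return [table[code] * scales[i // group_size] for i, code in enumerate(q)]


# Toy 2-bit example: four weights in two groups of two.
q = [0, 3, 1, 2]
scales = [0.5, 2.0]

int2_table = [0.0, 1.0, 2.0, 3.0]        # identity table: recovers the uniform case
custom_table = [-1.0, -0.25, 0.25, 1.0]  # hypothetical non-uniform table

uniform = dequantize_uniform(q, scales, group_size=2)               # [0.0, 1.5, 2.0, 4.0]
lut_identity = dequantize_lut(q, int2_table, scales, group_size=2)  # same as `uniform`
lut_custom = dequantize_lut(q, custom_table, scales, group_size=2)  # [-0.5, 0.5, -0.5, 0.5]
```

With the identity table the lookup reduces to the integer case, which is why integer formats appear as a special case of lookup table quantization.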
<table align="center">
<tr>
<th>Uniform (Integer) Quantization</th>
<th>Lookup Table Quantization</th>
</tr>
<tr>
<td align="center">

$$\widehat{\mathbf{W}} = \mathtt{float}(\mathbf{Q}) \cdot \mathbf{s}$$

</td>
<td align="center">

$$\widehat{\mathbf{W}} = \mathtt{tableLookup}(\mathbf{Q}, \mathtt{table}) \cdot \mathbf{s}$$

</td>
</tr>
</table>

where $\mathbf{Q}$ denotes the quantized weights, $\mathbf{s}$ the (group-wise) scales, and $\widehat{\mathbf{W}}$ the de-quantized weights. Here are some examples of the lookup tables supported in FLUTE.
<table align="center">
<tr>
<th>Examples</th>
<th>Notes</th>
</tr>
<tr>
<td align="left">int4, int3, int2</td>
<td align="left">recovers uniform/integer quantization</td>
</tr>
<tr>
<td align="left">fp4, fp3, fp2</td>
<td align="left"></td>
</tr>
<tr>
<td align="left">nf4, nf3, nf2</td>
<td align="left">generalizes the nf4 data-format introduced in QLoRA</td>
</tr>
<tr>
<td align="left">any arbitrary table</td>
<td align="left">you could even learn it!</td>
</tr>
</table>

New Models Powered by FLUTE
The flexibility of the kernel could lead to new quantization algorithms. As a proof of concept, we are releasing a few models quantized using Learned Normal Float (NFL), a simple extension to the nf4 data format introduced in QLoRA. NFL initializes the lookup table and the scales with those from NF quantization. It then uses calibration data to learn the scales, using straight-through estimation for the gradient with respect to the scales.
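As a rough illustration of learning a scale through a lookup, the sketch below uses one common straight-through formulation: in the backward pass the table lookup is treated as the identity inside the table's range and as a constant outside it. This is an assumption for illustration, not necessarily the exact NFL recipe. The helper names, the 3-entry table, and the weight-reconstruction loss (standing in for a loss computed on calibration data) are all hypothetical.

```python
from typing import List, Sequence


def lut_quantize(w: Sequence[float], table: Sequence[float], scale: float) -> List[int]:
    """Return, for each weight, the index of the table entry nearest to w / scale."""
    codes = []
    for value in w:
        x = value / scale
        codes.append(min(range(len(table)), key=lambda k: abs(table[k] - x)))
    return codes


def scale_grad_ste(w: Sequence[float], table: Sequence[float], scale: float,
                   grad_w_hat: Sequence[float]) -> float:
    """d(loss)/d(scale) for w_hat = table[code(w / scale)] * scale.

    Straight-through assumption: d(lookup)/dx is 1 inside the table's range
    and 0 outside it. `table` must be sorted in ascending order.
    """
    lo, hi = table[0], table[-1]
    codes = lut_quantize(w, table, scale)
    grad = 0.0
    for value, code, g in zip(w, codes, grad_w_hat):
        x = value / scale
        if lo <= x <= hi:
            local = table[code] - x   # in range: lookup passes the gradient through
        else:
            local = table[code]       # clipped: lookup contributes no gradient
        grad += g * local
    return grad


# Toy setup (hypothetical values): one group, one scale.
w = [0.3, -1.0]
table = [-1.0, 0.0, 1.0]
scale = 0.5

codes = lut_quantize(w, table, scale)                     # [2, 0]
w_hat = [table[c] * scale for c in codes]                 # [0.5, -0.5]
grad_w_hat = [2.0 * (a - b) for a, b in zip(w_hat, w)]    # gradient of sum (w_hat - w)^2
grad_scale = scale_grad_ste(w, table, scale, grad_w_hat)  # negative: -1.0 is clipped
scale -= 0.1 * grad_scale                                 # one gradient step grows the scale
```

In this toy case the clipped weight dominates the gradient, so the update enlarges the scale until the weight falls back inside the table's range.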
Benchmarks
For additional benchmarks, detailed breakdowns, and corresponding instruction-tuned models, please refer to the paper and the model zoo.
<p align="center"> <img src="assets/intro-figure.jpg" /> </p>

LLaMA-3.1

| | Wiki PPL | C4 PPL | LLM Eval Avg. | | Wiki PPL | C4 PPL | LLM Eval Avg. |
| ----------- | ---- | ----- | ----- | ----------- | ---- | ---- | ----- |
| LLaMA-3.1 (8B) | 6.31 | 9.60 | 69.75 | LLaMA-3.1 (70B) | 2.82 | 7.18 | 75.45 |
| + NFL W4G64 | 6.24 | 10.06 | 69.13 | + NFL W4G64 | 3.09 | 7.53 | 74.84 |
| + NFL W3G64 | 7.23 | 11.83 | 65.66 | + NFL W3G64 | 4.29 | 8.91 | 72.65 |
Gemma-2
| | Wiki PPL | C4 PPL | LLM Eval Avg. | | Wiki PPL | C4 PPL | LLM Eval Avg. |
| ----------- | ---- | ----- | ----- | ----------- | ---- | ---- | ----- |
| Gemma-2 (9B) | 6.88 | 10.12 | 73.12 | Gemma-2 (27B) | 5.70 | 8.98 | 75.71 |
| + NFL W4G64 | 6.49 | 10.35 | 72.50 | + NFL W4G64 | 5.69 | 9.31 | 74.11 |
Getting Started
FLUTE + vLLM
FLUTE-quantized models (Model Zoo) can be served directly using existing frameworks such as vLLM.
- python -m vllm.entrypoints.openai.api_server \
+ python -m flute.integrations.vllm vllm.entrypoints.openai.api_server \
--model [MODEL] \
--revision [REVISION] \
--tensor-parallel-size [TP_SIZE] \
+ --quantization flute
For example, the following command runs the FLUTE-quantized LLaMA-3.1 (8B) on a single GPU.
python -m flute.integrations.vllm vllm.entrypoints.openai.api_server \
--model radi-cho/Meta-Llama-3.1-8B-FLUTE \
--quantization flute
We can then query the vLLM server as usual.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "radi-cho/Meta-Llama-3.1-8B-FLUTE",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
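The same query can be issued from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000` and returns the usual OpenAI-style completions response; `build_completion_request` and `complete` are hypothetical helper names, not part of FLUTE or vLLM.

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 7,
                             temperature: float = 0.0,
                             base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request(
    model="radi-cho/Meta-Llama-3.1-8B-FLUTE",
    prompt="San Francisco is a",
)
# Requires the vLLM server from the command above to be running:
# print(complete(request))
```

Building the request is kept separate from sending it so the payload can be inspected or logged without a running server.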
FLUTE + HuggingFace
FLUTE also runs out of the box with HuggingFace and its accelerate extension. This integration is mostly experimental and not optimized. Users sensitive to performance considerations should use the vLLM integration instead.
- Loading a pre-quantized FLUTE model.
import flute.integrations.huggingface
- model = AutoModelForCausalLM.from_pretrained(
+ model = flute.integrations.huggingface.from_pretrained(
"radi-cho/Meta-Llama-3.1-8B-FLUTE",
# all of your favorite HF flags will be forwarded
device_map="auto")
- Loading and quantizing a dense model.
import flute.integrations.base
flute.integrations.base.prepare_model_flute(
name="model.model.layers",
module=model.model.layers, # for LLaMA-3 and Gemma-2
num_bits=num_bits,
group_size=group_size,
fake=False,
handle_hooks=True) # for `accelerate` hooks
After this, the model can be used as normal. Please check out the quantization guide for more information.
Support and Compatibility
Kernel
| Description | Supported (via pip) | Supported (build from source) |
| ----------- | ----------- | ----------- |
| Input dtypes | torch.float16 torch.bfloat16 | |
| Bits | 4bit 3bit | 2bit |
| Group Sizes | 32 64 128 256 | ❓ |
| GPUs | A100 A6000 RTX 4090 | H100 (unoptimized) |
[!WARNING] In the current release, we noticed that `torch.bfloat16` is slower than `torch.float16`. This is likely due to a lack of tuning and because Ampere GPUs lack hardware acceleration for `bfloat16` vectorized atomic-add.
[!WARNING] We noticed several numerically unstable situations using `bits=4, group-size=256, GPU=A100`, though this is relatively rare (8 of 9360 test cases failed). We also noticed correctness issues in some situations with `bits=4, group-size=256, dtype=bfloat16, GPU=RTX4090` (1 of 52 test cases failed). We will be looking into this, but we suggest avoiding these particular use cases (W4G256) for now.
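The table and warnings above can be folded into a small pre-flight check before quantizing a model. This is only a sketch that encodes the values listed in this section; `check_kernel_config` and its return strings are hypothetical and not part of the FLUTE package.

```python
PIP_BITS = {3, 4}                 # supported via pip
SOURCE_ONLY_BITS = {2}            # supported when building from source
GROUP_SIZES = {32, 64, 128, 256}  # supported via pip; other sizes are unverified


def check_kernel_config(num_bits: int, group_size: int) -> str:
    """Classify a (bits, group size) pair according to the compatibility table."""
    if num_bits not in PIP_BITS | SOURCE_ONLY_BITS:
        return "unknown"
    if group_size not in GROUP_SIZES:
        return "unknown"
    if num_bits == 4 and group_size == 256:
        return "discouraged"      # W4G256: known numerical and correctness issues
    if num_bits in SOURCE_ONLY_BITS:
        return "build-from-source"
    return "ok"


print(check_kernel_config(4, 64))    # ok
print(check_kernel_config(4, 256))   # discouraged
print(check_kernel_config(2, 128))   # build-from-source
```

The check deliberately ignores GPU model and dtype; those constraints would need to be added for a complete guard.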
Models
[!NOTE] As of the current release, the kernel is shape-specialized for legacy reasons (i.e., we tune tile sizes, etc., for each matrix shape). Please see the chart below for the supported use cases, since different platforms and tensor-parallel sizes change the matrix shapes. We plan to add support for a broader range of shapes in the near future. In the meantime, please let us know if you have any specific models in mind, and we will be happy to add support for them.
| Model | Single GPU / Pipeline Parallel | Tensor Parallel |
| ----------- | ----------- | ----------- |
| LLaMA-3/3.1 (8B) | ✅ | |
| LLaMA-3/3.1 (70B) | ✅ | 2 or 4 GPUs |
| LLaMA-3.1 (405B) | ✅ | 4 or 8 GPUs |
| Gemma-2 (9B) | ✅ | |
| Gemma-2 (27B) | ✅ | 2 or 4 GPUs |
Model Zoo
[!NOTE] The models we release here are trained on more data and hence different from those in the paper.
[!TIP] The HuggingFace Hub links are for `NFL W4G64` quantization by default. To use the `NFL W3G64` quantization, add `--revision nfl_w3g64`.
[LLaMA-3.1 (8B)](https://huggingface.co/radi-cho/Meta-Llama-3.1-8B-FLUT
