

[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization


SqueezeLLM: Dense-and-Sparse Quantization [Paper]


SqueezeLLM is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving.

TLDR: Deploying LLMs is difficult due to their large memory footprint. This can be addressed with reduced-precision quantization, but naive quantization hurts model performance. We address this with a new Dense-and-Sparse Quantization method, which splits each weight matrix into two components: a dense component that can be heavily quantized without affecting model performance, and a sparse component that preserves the sensitive and outlier entries of the weight matrix. With this approach, we are able to serve larger models with a smaller memory footprint and the same latency, yet with higher accuracy and quality. For instance, the SqueezeLLM variant of the Vicuna models can be served within 6 GB of memory while reaching 2% higher MMLU than the baseline FP16 model, which has a 2x larger memory footprint. For more details, please check out our paper.
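The dense-and-sparse decomposition described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the repository's implementation: it picks outliers purely by magnitude and uses uniform quantization for the dense remainder, whereas SqueezeLLM itself selects sensitive values using second-order (sensitivity) information and applies non-uniform, k-means-based quantization.

```python
import numpy as np

def dense_and_sparse_split(W, bits=3, outlier_frac=0.0045):
    """Split W into a sparse outlier part (kept in full precision)
    and a uniformly quantized dense remainder.

    Illustrative only: outliers are chosen by magnitude and the dense
    part is quantized uniformly, unlike SqueezeLLM's sensitivity-based
    non-uniform scheme.
    """
    W = np.asarray(W, dtype=np.float32)
    k = max(1, int(round(outlier_frac * W.size)))  # e.g. 0.45% of entries
    # Keep the k largest-magnitude entries in the sparse component.
    idx = np.argpartition(np.abs(W).ravel(), -k)[-k:]
    rows, cols = np.unravel_index(idx, W.shape)
    sparse_vals = W[rows, cols].copy()
    dense = W.copy()
    dense[rows, cols] = 0.0  # outliers removed before quantization
    # Uniform quantization of the dense remainder to 2**bits levels.
    levels = 2 ** bits
    lo, hi = dense.min(), dense.max()
    scale = (hi - lo) / (levels - 1) or 1.0
    q = np.round((dense - lo) / scale).astype(np.int32)
    dense_hat = q.astype(np.float32) * scale + lo
    return dense_hat, (rows, cols, sparse_vals)

def reconstruct(dense_hat, sparse):
    """Recombine the quantized dense part with the exact sparse outliers."""
    rows, cols, vals = sparse
    W_hat = dense_hat.copy()
    W_hat[rows, cols] = vals
    return W_hat
```

Because the outliers are stored exactly, the sparse entries incur no quantization error, and removing them tightens the range [lo, hi] that the dense levels must cover.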

Updates (2/5): Dense and sparse quantization and packing codes for custom models are now available.

Updates (11/28): Mistral model is now supported.

News (10/21): SqueezeLLM is now supported within the official vLLM framework.

Updates (9/30): The code for quantizing custom models is now available (link).


Installation

  1. Create a conda environment:

```
conda create --name sqllm python=3.9 -y
conda activate sqllm
```

  2. Clone and install the dependencies:

```
git clone https://github.com/SqueezeAILab/SqueezeLLM
cd SqueezeLLM
pip install -e .
cd squeezellm
python setup_cuda.py install
```

From-scratch Quantization

To quantize your own models, follow the procedure in this link.

Supported Models

Currently, we support LLaMA 7B, 13B, 30B and 65B, LLaMA-2 7B and 13B, instruction-tuned Vicuna 7B and 13B, XGen 7B with 8k sequence length, and OPT 1.3B to 30B. For each model, we provide 3-bit and 4-bit quantized variants, with sparsity levels of 0% (dense-only), 0.05%, and 0.45%. See our paper for more detailed information on these configurations. Below are the links to download the models.
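The released checkpoints in the tables below follow a consistent naming scheme, `sq-<model>-w<bits>-s<suffix>`, where the suffix encodes the sparsity percentage times 100 (s0 = dense-only, s5 = 0.05%, s45 = 0.45%). A tiny helper can build these names; this function is purely illustrative and is not part of the repository:

```python
def sq_checkpoint_name(model: str, bits: int, sparsity_pct: float) -> str:
    """Build a SqueezeLLM checkpoint name such as 'sq-llama-7b-w3-s45'.

    Illustrative helper (not part of the repo). The trailing suffix is
    the sparsity percentage times 100: 0.0 -> s0, 0.05 -> s5, 0.45 -> s45.
    """
    if bits not in (3, 4):
        raise ValueError("only 3-bit and 4-bit models are released")
    suffix = int(round(sparsity_pct * 100))
    return f"sq-{model}-w{bits}-s{suffix}"
```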

LLaMA (v1)

| Model | Bitwidth | Dense-only (0%) | 0.05% Sparsity | 0.45% Sparsity |
| -------- | -------- | -------- | ------ | ---- |
| LLaMA-7B | 3 | sq-llama-7b-w3-s0 | sq-llama-7b-w3-s5 | sq-llama-7b-w3-s45 |
| LLaMA-7B | 4 | sq-llama-7b-w4-s0 | sq-llama-7b-w4-s5 | sq-llama-7b-w4-s45 |
| LLaMA-13B | 3 | sq-llama-13b-w3-s0 | sq-llama-13b-w3-s5 | sq-llama-13b-w3-s45 |
| LLaMA-13B | 4 | sq-llama-13b-w4-s0 | sq-llama-13b-w4-s5 | sq-llama-13b-w4-s45 |
| LLaMA-30B | 3 | sq-llama-30b-w3-s0 | sq-llama-30b-w3-s5 | sq-llama-30b-w3-s45 |
| LLaMA-30B | 4 | sq-llama-30b-w4-s0 | sq-llama-30b-w4-s5 | sq-llama-30b-w4-s45 |
| LLaMA-65B | 3 | sq-llama-65b-w3-s0 | sq-llama-65b-w3-s5 | sq-llama-65b-w3-s45 |
| LLaMA-65B | 4 | sq-llama-65b-w4-s0 | sq-llama-65b-w4-s5 | sq-llama-65b-w4-s45 |

LLaMA-2

| Model | Bitwidth | Dense-only (0%) |
| -------- | -------- | -------- |
| LLaMA-2-7B | 3 | sq-llama-7b-w3-s0 |
| LLaMA-2-7B | 4 | sq-llama-7b-w4-s0 |
| LLaMA-2-13B | 3 | sq-llama-13b-w3-s0 |
| LLaMA-2-13B | 4 | sq-llama-13b-w4-s0 |

Mistral

| Model | Bitwidth | Dense-only (0%) |
| -------- | -------- | -------- |
| Mistral-7B | 3 | sq-mistral-7b-w3-s0 |
| Mistral-7B | 4 | sq-mistral-7b-w4-s0 |
| Mistral-7B-instruct | 3 | sq-mistral-7b-instruct-w3-s0 |
| Mistral-7B-instruct | 4 | sq-mistral-7b-instruct-w4-s0 |

Vicuna (v1.1)

| Model | Bitwidth | Dense-only (0%) | 0.45% Sparsity |
| -------- | -------- | -------- | ---- |
| Vicuna-7B | 3 | sq-vicuna-7b-w3-s0 | sq-vicuna-7b-w3-s45 |
| Vicuna-7B | 4 | sq-vicuna-7b-w4-s0 | sq-vicuna-7b-w4-s45 |
| Vicuna-13B | 3 | sq-vicuna-13b-w3-s0 | sq-vicuna-13b-w3-s45 |
| Vicuna-13B | 4 | sq-vicuna-13b-w4-s0 | sq-vicuna-13b-w4-s45 |

Vicuna (v1.3)

Please refer to the FastChat documentation for more details about the differences between v1.1 and v1.3.

| Model | Bitwidth | Dense-only (0%) |
| -------- | -------- | -------- |
| Vicuna-7B-v1.3 | 3 | sq-vicuna-7b-v1.3-w3-s0 |
| Vicuna-7B-v1.3 | 4 | sq-vicuna-7b-v1.3-w4-s0 |
| Vicuna-13B-v1.3 | 3 | sq-vicuna-7b-v1.3-w3-s0 |
| Vicuna-13B-v1.3 | 4 | sq-vicuna-7b-v1.3-w4-s0 |
| Vicuna-30B-v1.3 | 3 | Coming Soon |
| Vicuna-30B-v1.3 | 4 | Coming Soon |

XGen (8k Sequence length)

XGen-7B-8k-Base is a 7B model pre-trained with an 8K sequence length. XGen-7B-8k-Inst is supervised fine-tuned on public-domain instructional data for instruction-following applications. Please refer to the blog post from Salesforce AI Research for more details on the models.

| Model | Bitwidth | Dense-only (0%) | 0.45% Sparsity |
| -------- | -------- | -------- | ---- |
| XGen-7B-8k-Base
