
OmniServe: Unified and Efficient Inference Engine for Large-Scale LLM Serving

Paper (QServe) | Paper (LServe) | Website (QServe) | Website (LServe)

OmniServe aims to revolutionize large-scale LLM serving by unifying and optimizing key advancements in both low-bit quantization and long-context processing. OmniServe integrates the innovations from QServe, which boosts efficiency with W4A8KV4 quantization and reduces dequantization overheads, and LServe, which accelerates long-context LLM inference through unified sparse attention and hierarchical KV cache management. OmniServe delivers a comprehensive solution for scalable and cost-effective LLM deployment. This unified system addresses the dual challenges of computational complexity and memory overhead, achieving significant speedups in both prefill and decoding stages, while also maximizing GPU throughput and minimizing infrastructure costs.

News

  • [2025/02] 🔥 OmniServe is now publicly available! OmniServe has integrated optimizations from both QServe and LServe into one single LLM inference framework. Experience efficient and accurate inference for both long-context and quantized LLMs with OmniServe now!
  • [2025/02] 🏆 Both QServe and LServe have been accepted by MLSys 2025!
  • [2024/12] 🔥 QServe has been integrated into NVIDIA TensorRT-LLM!
  • [2024/05] 🔥 QServe is publicly released! Check our paper here.

Key Features

OmniServe is a unified, flexible, and efficient LLM serving system designed to support modern large language models and multi-modal language models. With configurable quantization precisions and hybrid sparse attention patterns, OmniServe integrates the strengths of QServe and LServe, enabling efficient processing of both large-batch and long-context inputs, significantly reducing LLM serving costs while maintaining high response quality.

Installation

  1. Clone this repository and navigate to the corresponding folder:

```bash
git clone https://github.com/mit-han-lab/OmniServe
cd OmniServe
```

  2. Install OmniServe.

2.1 LLM setup tutorial

If you plan to serve text-only LLMs, please follow the tutorial below:

```bash
conda create -n OmniServe python=3.10 -y
conda activate OmniServe
pip install --upgrade pip  # enable PEP 660 support

conda install -c nvidia cuda-toolkit -y  # Optional if you prefer to use the built-in nvcc

# Install the OmniServe package
pip install -e .
pip install flash-attn --no-build-isolation
```

We recommend starting an interactive Python session and running `import flash_attn` to check whether FlashAttention-2 is installed successfully. If it is not, we recommend downloading pre-built wheels from here. Please note:

  • The PyTorch version must exactly match the version specified in the .whl name;
  • Try both the cxx11abiTRUE and cxx11abiFALSE wheels if one of them does not work;
  • Matching the CUDA version specified in the .whl filename is recommended, but minor mismatches (e.g., 12.1 vs. 12.2, or even 11.8 vs. 12.2) usually do not matter.
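The wheel-matching rules above can be sketched as a small check. The helper below is hypothetical (not part of OmniServe or flash-attn); it parses the `cu<CUDA>torch<TORCH>cxx11abi<TRUE|FALSE>` tags found in typical flash-attn wheel names and compares them against the local environment's versions.

```python
# Hypothetical helper (not part of OmniServe or flash-attn): parse the
# cu<CUDA>torch<TORCH>cxx11abi<TRUE|FALSE> tags out of a pre-built wheel
# filename and check them against the local environment.
import re

def wheel_matches(wheel_name, torch_version, cuda_version, cxx11_abi):
    """torch must match exactly (major.minor); CUDA is checked on the major
    version only (slightly stricter than the guidance above, which tolerates
    even 11.8 vs. 12.2); the cxx11 ABI flag must agree."""
    m = re.search(r"cu(\d+)torch([\d.]+?)cxx11abi(TRUE|FALSE)", wheel_name)
    if not m:
        return False
    wheel_cuda, wheel_torch, wheel_abi = m.groups()
    if wheel_torch != torch_version:
        return False
    if wheel_cuda[:2] != cuda_version.split(".")[0]:
        return False
    return (wheel_abi == "TRUE") == cxx11_abi

# In a real environment the three arguments would come from
# torch.__version__ (trimmed to major.minor), torch.version.cuda, and
# torch._C._GLIBCXX_USE_CXX11_ABI.
name = "flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
print(wheel_matches(name, "2.3", "12.1", False))  # True: minor CUDA mismatch is fine
print(wheel_matches(name, "2.2", "12.2", False))  # False: torch version must match
```

The same check applies to the Block-Sparse-Attention wheels in the next step, which use the same naming tags.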

2.2 Sparse prefilling with Block-Sparse-Attention

We provide pre-built wheels for Block-Sparse-Attention here. Please download and install the .whl file with pip according to your environment. As with flash_attn, we recommend starting an interactive Python session and running `import block_sparse_attn` to check the installation. Please also note:

  • The PyTorch version must exactly match the version specified in the .whl name;
  • Try both the cxx11abiTRUE and cxx11abiFALSE wheels if one of them does not work;
  • Matching the CUDA version specified in the .whl filename is recommended, but minor mismatches (e.g., 12.1 vs. 12.2, or even 11.8 vs. 12.2) usually do not matter.

To build Block-Sparse-Attention from source, please follow the instructions below:

```bash
git clone https://github.com/mit-han-lab/Block-Sparse-Attention.git --recursive
cd Block-Sparse-Attention

pip install packaging
pip install ninja
python setup.py install
```
  3. Compile the CUDA kernels for OmniServe.

Please return to the OmniServe directory and execute the following commands:

```bash
pip install ninja   # Install ninja if not already installed

cd kernels
python setup.py install
```
  4. If you want to clone our model zoo, please make sure that git-lfs is installed.

OmniServe Model Zoo

We provide pre-quantized checkpoints for multiple model families. For example, for the Llama-3-8B model, please run the following commands to download it:

```bash
# git lfs install  # install git-lfs if not already installed
mkdir -p qserve_checkpoints && cd qserve_checkpoints
git clone https://huggingface.co/mit-han-lab/Llama-3-8B-Instruct-QServe
```

For other models, please refer to the detailed support list for the links to download:

| Models | W4A8-per-channel | W4A8-g128 |
| --------- | ---------------------- | -------------- |
| Llama3 | ✅ 8B/70B | ✅ 8B/70B |
| Llama3-Instruct | ✅ 8B/70B | ✅ 8B/70B |
| Llama2 | ✅ 7B/13B/70B | ✅ 7B/13B/70B |
| Vicuna | ✅ 7B/13B/30B | ✅ 7B/13B/30B |
| Mistral | ✅ 7B | ✅ 7B |
| Yi | ✅ 34B | ✅ 34B |
| Qwen | ✅ 72B | ✅ 72B |

For flagship datacenter GPUs such as the A100, we recommend QServe-per-channel; for inference datacenter GPUs such as the L40S, we recommend QServe-per-group.
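The recommendation above can be expressed as a small lookup. This is a hypothetical helper, not an OmniServe API; the group sizes mirror the two checkpoint variants (per-channel corresponds to group size -1, as passed to the checkpoint converter's `--group-size` flag, and g128 to group size 128).

```python
# Hypothetical helper (not an OmniServe API) encoding the recommendation above:
# per-channel quantization (group size -1) for flagship datacenter GPUs like
# the A100, per-group W4A8-g128 (group size 128) for inference GPUs like the L40S.
def recommended_quant(gpu_name):
    flagship = {"A100"}  # per-channel is recommended on flagship datacenter GPUs
    if gpu_name.upper() in flagship:
        return {"variant": "W4A8-per-channel", "group_size": -1}
    return {"variant": "W4A8-g128", "group_size": 128}

print(recommended_quant("A100"))  # {'variant': 'W4A8-per-channel', 'group_size': -1}
print(recommended_quant("L40S"))  # {'variant': 'W4A8-g128', 'group_size': 128}
```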

If you are interested in generating the quantized checkpoints on your own, please follow the instructions in the DeepCompressor Library to run QoQ quantization and dump the fake-quantized models. We then provide a checkpoint converter to real-quantize and pack the model into the QServe format:

```bash
python checkpoint_converter.py --model-path <hf-model-path> --quant-path <fake-quant-model-path> --group-size -1 --device cpu
# <fake-quant-model-path> is a directory generated by DeepCompressor, including model.pt and scale.pt
```

We also provide a script to run the checkpoint converter. The final model will be automatically stored under qserve_checkpoints.
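Before invoking the converter, it can help to verify that the fake-quant directory is complete. A minimal sketch, assuming (per the comment above) that DeepCompressor emits `model.pt` and `scale.pt`; the helper itself is illustrative and not part of OmniServe:

```python
# Illustrative pre-flight check before running checkpoint_converter.py.
# Assumption (from the comment above): a DeepCompressor fake-quant directory
# contains model.pt and scale.pt.
from pathlib import Path

def check_fake_quant_dir(path):
    """Raise FileNotFoundError if the fake-quant directory is incomplete."""
    missing = [f for f in ("model.pt", "scale.pt") if not (Path(path) / f).is_file()]
    if missing:
        raise FileNotFoundError(f"{path} is missing required files: {missing}")
```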

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Paper | Website | DeepCompressor Library

QServe is an efficient and accurate LLM serving system on GPUs with W4A8KV4 quantization (4-bit weights, 8-bit activations, and 4-bit KV cache). Compared with the leading industry solution TensorRT-LLM, QServe achieves 1.2x-1.4x higher throughput when serving Llama-3-8B, and 2.4x-3.5x higher throughput when serving Qwen1.5-72B, on L40S and A100 GPUs. QServe also allows users to achieve A100-level throughput on 3x cheaper L40S GPUs.
