MInference
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up the inference of long-context LLMs, MInference approximately and dynamically computes sparse attention, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
Install / Use
https://github.com/microsoft/MInference/assets/30883354/52613efc-738f-4081-8367-7123c81d6b19
Now you can process a 1M-token context 10x faster on a single A100 with long-context LLMs such as LLaMA-3-8B-1M and GLM-4-1M, with even better accuracy. Try MInference 1.0 right now!
📰 News
- 🐝 [25/05/02] MMInference has been accepted at ICML'25.
- 👾 [25/04/23] We are excited to announce the release of our multi-modality work, MMInference, which uses modality-aware permutation sparse attention to accelerate long-context VLMs. We'll present MMInference at the Microsoft booth and FM-Wild at ICLR'25. See you in Singapore!
- 👨‍💻 [25/04/14] SGLang and vLLM have merged the MInference sparse attention kernel, and MInference already supports the optimized kernels. Just run pip install sglang. You can achieve up to 1.64× (64K), 2.4× (96K), 2.9× (128K), 5.2× (256K), 8× (512K), and 15× (1M) speedup. Notably, SGLang also adapted it for FlashAttention-3. Special thanks to @zhyncs and @yinfan98 for their contributions!
- 🤗 [25/01/27] MInference has been integrated into Qwen2.5-1M and online services. For details, refer to the paper and the vLLM implementation.
- 🪸 [25/01/23] SCBench has been accepted at ICLR'25.
TL;DR
MInference 1.0 leverages the dynamic sparse nature of LLMs' attention, which exhibits some static patterns, to speed up pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with optimized custom kernels. This approach achieves up to a 10x speedup for pre-filling on an A100 while maintaining accuracy.
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (NeurIPS'24 spotlight, ES-FoMo @ ICML'24)<br> Huiqiang Jiang†, Yucheng Li†, Chengruidong Zhang†, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang and Lili Qiu
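To make the two-stage idea concrete, here is a simplified single-head PyTorch sketch, not the library's Triton kernels: it estimates which keys matter using only the last few queries, then computes attention restricted to those keys plus a causal local window. The function name and the last_q / topk / local_window parameters are illustrative, and the dense mask is for readability only; the real kernels compute just the selected blocks.

```python
import torch

def approx_sparse_prefill_attention(q, k, v, last_q=64, topk=2048, local_window=512):
    """Toy single-head causal attention with a dynamically estimated sparse index."""
    seq_len, d = q.shape
    scale = d ** -0.5

    # Online index approximation: score every key using only the last `last_q`
    # queries, which is cheap compared with full quadratic attention.
    probe = torch.softmax((q[-last_q:] @ k.T) * scale, dim=-1)    # [last_q, seq_len]
    keep = probe.sum(dim=0).topk(min(topk, seq_len)).indices      # important "vertical" columns

    # Restrict attention to the selected columns plus a causal local window.
    col = torch.zeros(seq_len, dtype=torch.bool, device=q.device)
    col[keep] = True
    qi = torch.arange(seq_len, device=q.device).unsqueeze(1)
    kj = torch.arange(seq_len, device=q.device).unsqueeze(0)
    mask = (col.unsqueeze(0) | ((qi - kj) < local_window)) & (kj <= qi)

    scores = (q @ k.T) * scale                          # dense here only for clarity;
    scores = scores.masked_fill(~mask, float("-inf"))   # optimized kernels skip unselected blocks
    return torch.softmax(scores, dim=-1) @ v
```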
SCBench analyzes long-context methods from a KV cache-centric perspective across the full KV cache lifecycle (e.g., KV cache generation, compression, retrieval, and loading). It evaluates 12 tasks under two shared context modes, covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task scenarios.
- SCBench: A KV Cache-Centric Analysis of Long-Context Methods (ICLR'25, ENLSP @ NeurIPS'24)<br> Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang and Lili Qiu
MMInference uses modality-aware permutation sparse attention to accelerate the pre-filling stage of long-context VLM inference. Specifically, we implement three distinct permutation-based sparse attention mechanisms, with FlashAttention, FlashDecoding and PIT, to address the grid patterns in vision inputs and the modality-boundary issues in mixed-modality scenarios.
- MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention (ICML'25, FM-Wild @ ICLR'25)<br> Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang and Lili Qiu
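As a rough illustration of the permutation idea (a hedged sketch, not MMInference's actual kernels): vision tokens laid out row by row produce periodic "grid" attention patterns, and reordering tokens by their column index turns those scattered stripes into contiguous blocks that a block-sparse kernel can process efficiently. The function name and the stride parameter below are illustrative.

```python
import torch

def permute_for_grid_pattern(q, k, v, stride):
    """Reorder tokens by (index % stride) so a periodic grid attention pattern
    becomes contiguous blocks; run block-sparse attention on the permuted
    tensors, then un-permute the output with `inverse`."""
    seq_len = q.shape[0]
    order = torch.argsort(torch.arange(seq_len, device=q.device) % stride, stable=True)
    inverse = torch.empty_like(order)
    inverse[order] = torch.arange(seq_len, device=q.device)
    return q[order], k[order], v[order], inverse
```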
🎥 Overview

🎯 Quick Start
Requirements
- Torch
- FlashAttention-2 (Optional)
- Triton
- Transformers >= 4.46.0
To get started with MInference, simply install it using pip:
pip install minference
Supported Efficient Methods
You can get the complete list of supported efficient methods by running the following code:
from minference import MInferenceConfig
supported_attn_types = MInferenceConfig.get_available_attn_types()
supported_kv_types = MInferenceConfig.get_available_kv_types()
Currently, we support the following long-context methods:
- [① KV Cache Generation]: MInference, xAttention, FlexPrefill, A-shape, Tri-shape, MInference w/ static, Dilated, Strided
- [② KV Cache Compression]: StreamingLLM, SnapKV, PyramidKV, KIVI
- [③ KV Cache Retrieval]: CacheBlend
- [④ KV Cache Loading]: Quest, RetrievalAttention
For more details about the KV cache lifecycle, please refer to SCBench. Note that only some modes are supported in vLLM, while all modes are supported in HF Transformers.
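As a sketch of how a method from each category might be selected (the keyword names attn_type and kv_type mirror the accessors above, but please check the repository's examples for the exact constructor signature):

```python
from minference import MInference

# Hypothetical combination: MInference sparse pre-filling (① generation)
# plus SnapKV compression (② compression); argument names are assumptions.
minference_patch = MInference(
    attn_type="minference",
    kv_type="snapkv",
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
)
```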
Supported Models
In general, MInference supports any decoder-only LLM, including LLaMA-style and Phi models. We have adapted nearly all available open-source long-context LLMs. If your model is not on the supported list, feel free to let us know in the issues, or follow the guide to manually generate the sparse-heads config.
You can get the complete list of supported LLMs by running:
from minference import get_support_models
get_support_models()
Currently, we support the following LLMs:
- Qwen2.5: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-32B-Instruct, Qwen/Qwen2.5-72B-Instruct, Qwen/Qwen2.5-7B-Instruct-1M, Qwen/Qwen2.5-14B-Instruct-1M
- LLaMA-3.1: meta-llama/Meta-Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3.1-70B-Instruct
- LLaMA-3: gradientai/Llama-3-8B-Instruct-262k, gradientai/Llama-3-8B-Instruct-Gradient-1048k, gradientai/Llama-3-8B-Instruct-Gradient-4194k, gradientai/Llama-3-70B-Instruct-Gradient-262k, gradientai/Llama-3-70B-Instruct-Gradient-1048k
- GLM-4: THUDM/glm-4-9b-chat-1m
- Yi: 01-ai/Yi-9B-200K
- Phi-3: microsoft/Phi-3-mini-128k-instruct
- Qwen2: Qwen/Qwen2-7B-Instruct
How to use MInference
[!TIP] To benefit from fast kernel implementations, we recommend installing SGLang or vLLM.
For SGLang:
uv pip install "sglang[all]>=0.4.6.post4"
For vLLM:
uv pip install vllm
