
MInference

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, MInference computes attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Install / Use

/learn @microsoft/MInference
README

<p align="center"> <picture> <img alt="MInference" src="https://raw.githubusercontent.com/microsoft/MInference/main/images/MInference_logo.png" width=70%> </picture> </p> <h2 align="center">MInference: Million-Tokens Prompt Inference for Long-context LLMs</h2> <p align="center"> | <a href="https://aka.ms/MInference"><b>Project Page</b></a> | <a href="https://arxiv.org/abs/2407.02490"><b>Paper</b></a> | <a href="https://huggingface.co/spaces/microsoft/MInference"><b>HF Demo</b></a> | <a href="https://aka.ms/SCBench"><b>SCBench</b></a> | <a href="https://aka.ms/MMInference"><b>MMInference</b></a> | </p>

https://github.com/microsoft/MInference/assets/30883354/52613efc-738f-4081-8367-7123c81d6b19

You can now process a 1M-token context 10x faster on a single A100 using long-context LLMs such as LLaMA-3-8B-1M and GLM-4-1M, with even better accuracy. Try MInference 1.0 right now!

📰 News

  • 🐝 [25/05/02] MMInference has been accepted at ICML'25.
  • 👨‍💻‍ [25/04/14] SGLang and vLLM have merged the MInference sparse attention kernel. MInference already supports the optimized kernels. Just try pip install sglang. You can achieve up to 1.64× (64K), 2.4× (96K), 2.9× (128K), 5.2× (256K), 8× (512K), and 15× (1M) speedup. Notably, SGLang also adapted it for FlashAttention-3. Special thanks to @zhyncs and @yinfan98 for their contributions!
  • 👾 [25/04/23] We are excited to announce the release of our multi-modality work, MMInference, which uses modality-aware permutation sparse attention to accelerate long-context VLMs. We'll present MMInference at the Microsoft Booth and FM-Wild at ICLR'25. See you in Singapore!
  • 🤗 [25/01/27] MInference has been integrated into Qwen2.5-1M and online services. For details, refer to the paper and the vLLM implementation.
  • 🪸 [25/01/23] SCBench has been accepted at ICLR'25.
<details> <summary>More News</summary> <ul> <li> 🍩 [24/12/13] We are excited to announce the release of our KV cache-centric analysis work, <a href="https://aka.ms/SCBench">SCBench</a>, which evaluates long-context methods from a KV cache perspective.</li> <li> 🧤 [24/09/26] MInference has been accepted as <b>spotlight</b> at <b>NeurIPS'24</b>. See you in Vancouver!</li> <li> 👘 [24/09/16] We are pleased to announce the release of our KV cache offloading work, <a href="https://aka.ms/RetrievalAttention">RetrievalAttention</a>, which accelerates long-context LLM inference via vector retrieval.</li> <li> 🥤 [24/07/24] MInference supports <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">meta-llama/Meta-Llama-3.1-8B-Instruct</a> now.</li> <li> 🪗 [24/07/07] Thanks @AK for sponsoring. You can now use MInference online in the <a href="https://huggingface.co/spaces/microsoft/MInference">HF Demo</a> with ZeroGPU.</li> <li> 📃 [24/07/03] Due to an issue with arXiv, the PDF is currently unavailable there. You can find the paper at this <a href="https://export.arxiv.org/pdf/2407.02490">link</a>.</li> <li> 🧩 [24/07/03] We will present <b>MInference 1.0</b> at the <b><i>Microsoft Booth</i></b> and <b><i>ES-FoMo</i></b> at ICML'24. See you in Vienna!</li> </ul> </details>

TL;DR

MInference 1.0 leverages the dynamic sparse nature of LLMs' attention, which exhibits some static patterns, to speed up the pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with the optimal custom kernels. This approach achieves up to a 10x speedup for pre-filling on an A100 while maintaining accuracy.
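The idea above can be sketched in a few lines: estimate block-level attention scores cheaply, keep only the highest-scoring key blocks per query block, and compute exact attention on just those blocks. This is a toy NumPy sketch of the general approach, not MInference's actual head patterns or optimized Triton kernels:

```python
import numpy as np

def dense_attention(q, k, v):
    # Standard softmax attention over all keys.
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def block_sparse_attention(q, k, v, block=4, keep=2):
    """Approximate the sparse index online: score mean-pooled blocks,
    keep the top-`keep` key blocks per query block, then compute exact
    attention restricted to those blocks."""
    n, d = q.shape
    nb = n // block
    qb = q.reshape(nb, block, d).mean(axis=1)   # pooled query blocks
    kb = k.reshape(nb, block, d).mean(axis=1)   # pooled key blocks
    est = qb @ kb.T                             # cheap block-level estimate
    out = np.zeros_like(v)
    for i in range(nb):
        top = np.argsort(est[i])[-keep:]        # selected key blocks
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
        out[i * block:(i + 1) * block] = dense_attention(
            q[i * block:(i + 1) * block], k[idx], v[idx])
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 8))
full = dense_attention(q, k, v)
sparse = block_sparse_attention(q, k, v, block=4, keep=2)
```

With `keep` equal to the number of blocks the sketch reduces to dense attention; the speedup comes from `keep` staying small while the context grows.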

SCBench analyzes long-context methods from a KV cache-centric perspective across the full KV cache lifecycle (e.g., KV cache generation, compression, retrieval, and loading). It evaluates 12 tasks under two shared context modes, covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task scenarios.
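A tiny toy probe in the spirit of SCBench's string-retrieval category and its shared-context mode: one long context is "prefilled" exactly once, and several follow-up queries reuse that cached work instead of re-reading the context. The task format and names here are invented for illustration, not SCBench's actual tasks:

```python
import random
import re

# Synthesize a long shared context of passkey facts.
rng = random.Random(0)
facts = {f"key-{i}": f"{rng.randrange(10**6):06d}" for i in range(100)}
context = " ".join(f"The passkey for {k} is {v}." for k, v in facts.items())

# "Prefill" once: scan the shared context a single time and cache the result,
# standing in for KV-cache generation in the multi-turn shared-context mode.
cache = dict(re.findall(r"passkey for (\S+) is (\d+)", context))

# Multiple follow-up queries hit the cache rather than the raw context.
queries = ["key-3", "key-42", "key-99"]
answers = [cache[q] for q in queries]
```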

MMInference uses modality-aware permutation sparse attention to accelerate the pre-filling stage of long-context VLM inference. Specifically, we implement three distinct permutation-based sparse attention mechanisms, built on FlashAttention, FlashDecoding, and PIT, to address the grid patterns in vision inputs and the modality-boundary issues in mixed-modality scenarios.
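The core permutation trick can be illustrated in isolation: reorder tokens so each modality occupies one contiguous span, run block-structured attention on the permuted sequence, then invert the permutation. This sketch shows only the reordering step, not MMInference's actual PIT/FlashAttention kernels:

```python
import numpy as np

def permute_by_modality(x, modality):
    """Gather tokens so each modality forms one contiguous span, turning
    scattered modality boundaries into dense blocks a sparse kernel can use.
    A stable sort preserves token order within each modality."""
    order = np.argsort(modality, kind="stable")
    inverse = np.argsort(order)       # scatter indices to undo the permutation
    return x[order], order, inverse

tokens = np.arange(8)
modality = np.array([0, 1, 1, 0, 1, 0, 0, 1])  # e.g. 0 = text, 1 = vision
permuted, order, inverse = permute_by_modality(tokens, modality)
# ...block-sparse attention would run on `permuted` here...
restored = permuted[inverse]
```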

🎥 Overview

One-page overviews (figures): MInference · SCBench · MMInference

🎯 Quick Start

Requirements

  • Torch
  • FlashAttention-2 (Optional)
  • Triton
  • Transformers >= 4.46.0

To get started with MInference, simply install it using pip:

pip install minference

Supported Efficient Methods

You can get the complete list of supported efficient methods by running the following code:

from minference import MInferenceConfig
supported_attn_types = MInferenceConfig.get_available_attn_types()
supported_kv_types = MInferenceConfig.get_available_kv_types()

Currently, we support the following long-context methods:

For more details about the KV cache lifecycle, please refer to SCBench. Note that some modes are supported by vLLM, while all modes are supported by HuggingFace Transformers.

Supported Models

General: MInference supports any decoder-only LLM, including LLaMA-style and Phi models. We have adapted nearly all available open-source long-context LLMs. If your model is not on the supported list, feel free to let us know in the issues, or you can follow the guide to manually generate the sparse-heads config.
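The offline step behind a sparse-heads config can be sketched as follows: for each head, build candidate sparse masks (here a toy "A-shape" of sink tokens plus a local window, and a "vertical-slash" of strong key columns plus diagonals) and assign the head the mask that retains the most attention mass. The mask shapes and selection rule are simplified illustrations, not MInference's actual search:

```python
import numpy as np

def attention_weights(q, k):
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

def a_shape_mask(n, sink=4, window=8):
    # Initial (sink) tokens plus a local causal window.
    m = np.zeros((n, n), dtype=bool)
    m[:, :sink] = True
    for i in range(n):
        m[i, max(0, i - window):i + 1] = True
    return m

def vertical_slash_mask(w, n_vert=4, n_slash=2):
    # Strongest key columns ("verticals") plus a few diagonals ("slashes").
    n = w.shape[0]
    m = np.zeros((n, n), dtype=bool)
    m[:, np.argsort(w.sum(axis=0))[-n_vert:]] = True
    for off in range(n_slash):
        idx = np.arange(off, n)
        m[idx, idx - off] = True
    return m

def assign_pattern(q, k):
    """Offline search: pick the candidate mask that keeps the most attention
    mass for this head on calibration inputs."""
    w = attention_weights(q, k)
    candidates = {
        "a_shape": a_shape_mask(len(q)),
        "vertical_slash": vertical_slash_mask(w),
    }
    return max(candidates, key=lambda name: (w * candidates[name]).sum())

rng = np.random.default_rng(1)
q = rng.normal(size=(32, 16))
k = rng.normal(size=(32, 16))
pattern = assign_pattern(q, k)
```

The chosen pattern per head is what gets stored in the config; at inference time only that pattern's sparse index is estimated online.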

You can get the complete list of supported LLMs by running:

from minference import get_support_models
get_support_models()

Currently, we support the following LLMs:

How to use MInference

[!TIP] To benefit from fast kernel implementations, we recommend installing SGLang or vLLM.

For SGLang:

uv pip install "sglang[all]>=0.4.6.post4"

For vLLM:

No findings