# Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference
<p align="center"> <img src="figs/top_level.png?v=2" alt="MUSTAFAR Architecture" width="600"> </p>

Lord Vader approves of unstructured sparsity in KV cache. Paper
## 📋 Table of Contents

- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Part I: Install Dependencies](#part-i-install-dependencies)
- [Part II: LongBench Evaluation](#part-ii-longbench-evaluation)
- [Part III: Kernel Evaluation](#part-iii-kernel-evaluation)
## Overview
This repository provides:
- Dependency setup scripts to reproducibly install/build all required Python packages and CUDA kernels.
- LongBench Evaluation to reproduce the accuracy evaluations of the paper.
- Kernel Evaluation to measure the latency and memory usage of LLM inference with the Mustafar attention kernel.
## Prerequisites
- Linux (Ubuntu 20.04+ recommended)
- Python 3.10+
- NVIDIA GPU with CUDA 12.x or higher
- `pip` installed
- [Optional] Pre-downloaded Hugging Face `transformers` weight cache of the models to test: we currently support Llama-2, Llama-3, and Mistral-7B-Instruct-v0.2
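A quick way to sanity-check these prerequisites (illustrative commands, not part of the repo):

```bash
python --version   # expect 3.10+
pip --version
nvcc --version     # expect CUDA 12.x or newer
nvidia-smi         # confirm an NVIDIA GPU is visible
```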
## Part I: Install Dependencies
- We recommend first initializing a venv or a conda environment.

- Install the requirements:

  ```bash
  pip install -r requirements.txt
  ```

- Build the CUDA kernel:

  ```bash
  cd kernel/build
  make
  ```

  Optionally, speed up the build with `make -jN`, where N is the number of build processes.

- Install the PyTorch extension:

  ```bash
  cd ../kernel_wrapper
  pip install -e .
  ```
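After these steps, a quick check that the stack sees the GPU (the compiled extension's import name is defined in `kernel_wrapper`; it is not repeated here):

```python
import torch

# Verify the installed PyTorch build and GPU visibility.
print(torch.version.cuda)         # should report a 12.x toolkit
print(torch.cuda.is_available())  # should print True
```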
## Part II: LongBench Evaluation
This component runs end-to-end model benchmarks on the LongBench suite.
- Under `/model`, there are several pruning methods for Llama, and per-token, magnitude-based pruning for Mistral-7B-Instruct-v0.2. The naming convention `K/V` + `[t/c]` + `_Mag/Opt` denotes the combination of pruning strategies explored in the paper:

  - `K` / `V`: whether the Key or the Value cache is pruned.
  - `t` / `c`: pruning direction, `t` = token-wise, `c` = channel-wise.
  - `Mag` / `Opt`: pruning method, `Mag` = magnitude-based, `Opt` = output-aware.

  For example:

  | Folder name | Cache | Direction    | Method          |
  |-------------|-------|--------------|-----------------|
  | `Kt_Mag`    | Key   | token-wise   | magnitude-based |
  | `Vc_Opt`    | Value | channel-wise | output-aware    |

  Additionally, `llama_think.py` and `llama_thinv.py` apply the structured pruning method of Xu et al., ThinK: Thinner Key Cache by Query-Driven Pruning, to the Key and Value cache, respectively. A minimal sketch of the magnitude-based, token-wise variant is shown below.
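  For intuition, here is what `Kt_Mag`-style pruning does (an illustrative sketch, not the repository's implementation): each token's key vector keeps only its largest-magnitude entries until the target sparsity is reached. The channel-wise (`c`) variants apply the same rule along the channel axis, and output-aware (`Opt`) variants score entries by their effect on the attention output instead of raw magnitude.

  ```python
  import torch

  def magnitude_prune_tokenwise(cache: torch.Tensor, sparsity: float) -> torch.Tensor:
      """Zero the smallest-magnitude entries of each token's vector.

      cache: (num_tokens, head_dim) slice of a K or V cache.
      sparsity: fraction of entries to prune, e.g. 0.7 for 70% sparsity.
      """
      keep = max(1, int(cache.shape[-1] * (1.0 - sparsity)))  # entries kept per token
      # Per-token threshold: the keep-th largest |value| along the channel dim.
      thresh = cache.abs().topk(keep, dim=-1).values[..., -1:]
      return cache * (cache.abs() >= thresh)

  keys = torch.randn(4, 128)                     # toy per-head key cache
  pruned = magnitude_prune_tokenwise(keys, 0.7)  # roughly 70% of entries zeroed
  print((pruned == 0).float().mean().item())
  ```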
- Run the evaluation script.

  Before running, go to `/pred_long_bench.py` line 139 to select the pruning method to test. Then run:

  ```bash
  bash long_test.sh ${k_sparsity} ${v_sparsity} ${model} ${mode}
  ```

  `k_sparsity` / `v_sparsity` are the target sparsities for the KV cache, i.e., 50% sparsity is 0.5 and 70% sparsity is 0.7.
  The paper tested with the following model params:

  - Llama-2-7B: `meta-llama/Llama-2-7b-hf`
  - Llama-3-8B-Instruct: `meta-llama/Meta-Llama-3-8B-Instruct`
  - Mistral-7B-Instruct-v0.2: `mistralai/Mistral-7B-Instruct-v0.2`

  For `mode`, use `mustafar` for the Llama model family and `mustafar-mistral` for the Mistral model family.
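  For instance, a 70% / 70% sparsity run on Llama-2-7B (an illustrative invocation assembled from the arguments above):

  ```bash
  bash long_test.sh 0.7 0.7 meta-llama/Llama-2-7b-hf mustafar
  ```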
- Generate the LongBench score from the evaluation run.

  The previous script writes generation outputs to the `/pred` directory. Generate the LongBench score by running:

  ```bash
  python eval_long_bench.py --model ${subdir_name}
  ```

  `subdir_name` refers to the generated subdirectory under `/pred` for each run, e.g. `Llama-2-7b-hf_4096_K_0.7_V_0.7`.
## Part III: Kernel Evaluation
The `/kernel` directory contains the source code for the compression Triton kernel and the batched SpMV CUDA kernel.

Make sure the CUDA kernel has been built and installed as a Python extension, following the steps in Part I: Install Dependencies.
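For intuition about what the batched SpMV kernel computes, here is a plain-PyTorch sketch of a CSR sparse-matrix times dense-vector product (illustrative only; the repository's kernels use their own compressed layout and run one such product per batch element): only stored nonzeros are touched, which is why unstructured pruning saves decode-time bandwidth as well as memory.

```python
import torch

def spmv_csr(values: torch.Tensor, col_idx: torch.Tensor,
             row_ptr: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """CSR sparse-matrix times dense-vector product: y = A @ x.

    Work scales with the number of surviving (unpruned) entries,
    not with the dense matrix size.
    """
    y = torch.zeros(row_ptr.numel() - 1, dtype=x.dtype)
    for r in range(y.numel()):
        s, e = row_ptr[r].item(), row_ptr[r + 1].item()
        y[r] = (values[s:e] * x[col_idx[s:e]]).sum()
    return y
```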
- To evaluate on LongBench with the Mustafar sparse attention kernel, go to `/pred_long_bench.py` line 139 and select the `kernel` method. Then follow the steps of Part II.
- To evaluate the latency and memory consumption of the Mustafar sparse attention kernel, run:

  ```bash
  python mem_spd_test.py
  ```

  Input and generation sequence lengths, as well as batch size, can be controlled within the Python code.
We currently support Llama-2 7B and Llama-3 8B for our kernel. Additional model support will soon be released.
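For reference, the usual PyTorch pattern for this kind of measurement looks roughly like the following (a generic sketch, not the contents of `mem_spd_test.py`; the model id, batch size, and generation length are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: one of the supported models
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16).cuda()

batch_size, gen_len = 1, 128  # placeholders; mem_spd_test.py sets its own
inputs = tok(["The quick brown fox"] * batch_size,
             return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
model.generate(**inputs, max_new_tokens=gen_len)
end.record()
torch.cuda.synchronize()

print(f"latency: {start.elapsed_time(end):.1f} ms")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```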
## Citation
If you use Mustafar in your research, please cite:
```bibtex
@inproceedings{mustafar2025,
  title={Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference},
  author={Donghyeon Joo and Helya Hosseini and Ramyad Hadidi and Bahar Asgari},
  booktitle={Proceedings of the 39th International Conference on Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```
## Acknowledgments
This repository is heavily influenced by the excellent work in Xia et al. Flash-LLM and Liu et al. KIVI. Portions of the codebase and design were adapted and modified to suit Mustafar.
We are grateful to the authors for their contributions to the open-source community.
