# Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference
<p align="center"> <img src="figs/top_level.png?v=2" alt="MUSTAFAR Architecture" width="600"> </p>

Lord Vader approves of unstructured sparsity in KV cache. Paper
## 📋 Table of Contents

- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Part I: Install Dependencies](#part-i-install-dependencies)
- [Part II: LongBench Evaluation](#part-ii-longbench-evaluation)
- [Part III: Kernel Evaluation](#part-iii-kernel-evaluation)
## Overview
This repository provides:
- Dependency setup scripts to reproducibly install/build all required Python packages and CUDA kernels.
- LongBench Evaluation to reproduce the accuracy evaluations of the paper.
- Kernel Evaluation to measure the latency and memory usage of LLM inference with the Mustafar attention kernel.
## Prerequisites
- Linux (Ubuntu 20.04+ recommended)
- Python 3.10+
- NVIDIA GPU with CUDA 12.x or higher
- `pip` installed
- [Optional] Pre-downloaded Hugging Face `transformers` weight cache of the models to test: we currently support Llama-2, Llama-3, and Mistral-7B-Instruct-v0.2
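A quick way to sanity-check these prerequisites (illustrative commands, not part of the repo):

```bash
python --version   # expect 3.10+
pip --version
nvcc --version     # expect CUDA 12.x or newer
nvidia-smi         # confirm an NVIDIA GPU is visible
```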
## Part I: Install Dependencies
- We recommend first initializing a venv or a conda environment.

- Install the requirements:

  ```bash
  pip install -r requirements.txt
  ```

- Build the CUDA kernel:

  ```bash
  cd kernel/build
  make
  ```

  Optionally, speed up the build with `make -jN`, where N is the number of build processes.

- Install the PyTorch extension:

  ```bash
  cd ../kernel_wrapper
  pip install -e .
  ```
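After these steps, a quick check that the stack sees the GPU (the compiled extension's import name is defined in `kernel_wrapper`; it is not repeated here):

```python
import torch

# Verify the installed PyTorch build and GPU visibility.
print(torch.version.cuda)         # should report a 12.x toolkit
print(torch.cuda.is_available())  # should print True
```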
## Part II: LongBench Evaluation
This component runs end-to-end model benchmarks on the LongBench suite.
- Under `/model`, there are several pruning methods for Llama, and per-token, magnitude-based pruning for Mistral-7B-Instruct-v0.2. The naming convention `K/V` + `[t/c]` + `_Mag/Opt` denotes the combination of pruning strategies explored in the paper:

  - `K` / `V`: whether the Key or the Value cache is pruned.
  - `t` / `c`: pruning direction, `t` = token-wise, `c` = channel-wise.
  - `Mag` / `Opt`: pruning method, `Mag` = magnitude-based, `Opt` = output-aware.

  For example:

  | Folder name | Cache | Direction    | Method          |
  |-------------|-------|--------------|-----------------|
  | `Kt_Mag`    | Key   | token-wise   | magnitude-based |
  | `Vc_Opt`    | Value | channel-wise | output-aware    |

  Additionally, `llama_think.py` and `llama_thinv.py` apply the structured pruning method of Xu et al., ThinK: Thinner Key Cache by Query-Driven Pruning, to the Key and Value cache, respectively. A minimal sketch of the magnitude-based, token-wise variant is shown below.
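  For intuition, here is what `Kt_Mag`-style pruning does (an illustrative sketch, not the repository's implementation): each token's key vector keeps only its largest-magnitude entries until the target sparsity is reached. The channel-wise (`c`) variants apply the same rule along the channel axis, and output-aware (`Opt`) variants score entries by their effect on the attention output instead of raw magnitude.

  ```python
  import torch

  def magnitude_prune_tokenwise(cache: torch.Tensor, sparsity: float) -> torch.Tensor:
      """Zero the smallest-magnitude entries of each token's vector.

      cache: (num_tokens, head_dim) slice of a K or V cache.
      sparsity: fraction of entries to prune, e.g. 0.7 for 70% sparsity.
      """
      keep = max(1, int(cache.shape[-1] * (1.0 - sparsity)))  # entries kept per token
      # Per-token threshold: the keep-th largest |value| along the channel dim.
      thresh = cache.abs().topk(keep, dim=-1).values[..., -1:]
      return cache * (cache.abs() >= thresh)

  keys = torch.randn(4, 128)                     # toy per-head key cache
  pruned = magnitude_prune_tokenwise(keys, 0.7)  # roughly 70% of entries zeroed
  print((pruned == 0).float().mean().item())
  ```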
- Run the evaluation script.

  Before running, go to `/pred_long_bench.py` line 139 to select the pruning method to test. Then run:

  ```bash
  bash long_test.sh ${k_sparsity} ${v_sparsity} ${model} ${mode}
  ```

  `k_sparsity` / `v_sparsity` are the target sparsities for the KV cache, i.e., 50% sparsity is 0.5 and 70% sparsity is 0.7.
  The paper tested with the following model params:

  - Llama-2-7B: `meta-llama/Llama-2-7b-hf`
  - Llama-3-8B-Instruct: `meta-llama/Meta-Llama-3-8B-Instruct`
  - Mistral-7B-Instruct-v0.2: `mistralai/Mistral-7B-Instruct-v0.2`

  For `mode`, use `mustafar` for the Llama model family and `mustafar-mistral` for the Mistral model family.
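  For instance, a 70% / 70% sparsity run on Llama-2-7B (an illustrative invocation assembled from the arguments above):

  ```bash
  bash long_test.sh 0.7 0.7 meta-llama/Llama-2-7b-hf mustafar
  ```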
- Generate the LongBench score from the evaluation run.

  The previous script writes generation outputs to the `/pred` directory. Generate the LongBench score by running:

  ```bash
  python eval_long_bench.py --model ${subdir_name}
  ```

  `subdir_name` refers to the generated subdirectory under `/pred` for each run, e.g. `Llama-2-7b-hf_4096_K_0.7_V_0.7`.
## Part III: Kernel Evaluation
The `/kernel` directory contains the source code for the compression Triton kernel and the batched SpMV CUDA kernel.

Make sure the CUDA kernel has been built and installed as a Python extension, following the steps in Part I: Install Dependencies.
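For intuition about what the batched SpMV kernel computes, here is a plain-PyTorch sketch of a CSR sparse-matrix times dense-vector product (illustrative only; the repository's kernels use their own compressed layout and run one such product per batch element): only stored nonzeros are touched, which is why unstructured pruning saves decode-time bandwidth as well as memory.

```python
import torch

def spmv_csr(values: torch.Tensor, col_idx: torch.Tensor,
             row_ptr: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """CSR sparse-matrix times dense-vector product: y = A @ x.

    Work scales with the number of surviving (unpruned) entries,
    not with the dense matrix size.
    """
    y = torch.zeros(row_ptr.numel() - 1, dtype=x.dtype)
    for r in range(y.numel()):
        s, e = row_ptr[r].item(), row_ptr[r + 1].item()
        y[r] = (values[s:e] * x[col_idx[s:e]]).sum()
    return y
```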
- To evaluate on LongBench with the Mustafar sparse attention kernel, go to `/pred_long_bench.py` line 139 and select the `kernel` method. Then follow the steps of Part II.
- To evaluate the latency and memory consumption of the Mustafar sparse attention kernel, run:

  ```bash
  python mem_spd_test.py
  ```

  Input and generation sequence lengths, as well as batch size, can be controlled within the Python code.
We currently support Llama-2 7B and Llama-3 8B for our kernel. Additional model support will soon be released.
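For reference, the usual PyTorch pattern for this kind of measurement looks roughly like the following (a generic sketch, not the contents of `mem_spd_test.py`; the model id, batch size, and generation length are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: one of the supported models
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16).cuda()

batch_size, gen_len = 1, 128  # placeholders; mem_spd_test.py sets its own
inputs = tok(["The quick brown fox"] * batch_size,
             return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
model.generate(**inputs, max_new_tokens=gen_len)
end.record()
torch.cuda.synchronize()

print(f"latency: {start.elapsed_time(end):.1f} ms")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```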
## Citation
If you use Mustafar in your research, please cite:
```bibtex
@inproceedings{mustafar2025,
  title={Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference},
  author={Donghyeon Joo and Helya Hosseini and Ramyad Hadidi and Bahar Asgari},
  booktitle={Proceedings of the 39th International Conference on Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```
## Acknowledgments
This repository is heavily influenced by the excellent work in Xia et al. Flash-LLM and Liu et al. KIVI. Portions of the codebase and design were adapted and modified to suit Mustafar.
We are grateful to the authors for their contributions to the open-source community.
