MagicPIG
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
Zhuoming Chen<sup>1</sup>, Ranajoy Sadhukhan<sup>1</sup>, Zihao Ye<sup>2</sup>, Yang Zhou<sup>1</sup>, Jianyu Zhang<sup>3,4</sup>, Niklas Nolte<sup>4</sup>, <br> Yuandong Tian<sup>4</sup>, Matthijs Douze<sup>4</sup>, Leon Bottou<sup>3,4</sup>, Zhihao Jia<sup>1</sup>, Beidi Chen<sup>1</sup>
<sup>1</sup> Carnegie Mellon University, <sup>2</sup>University of Washington, <sup>3</sup>New York University, <sup>4</sup>FAIR
MagicPIG explores the possibility of a GPU-CPU system powered by Locality-Sensitive Hashing (LSH).
<div align="center">[<a href="https://arxiv.org/abs/2410.16179">Paper</a>] | [<a href="https://sites.google.com/view/magicpig-llm">Blog</a>]</div>

Latest News 📣
- [2024.12] Use FlashInfer to compute the GPU attention parts.
- [2024.12] More efficient and easy-to-use CPU sparse attention.
- [2024.12] Overlap hash table construction and prefilling to hide CPU overhead.
Installation
Commands:
```bash
conda create -n magicpig python=3.10   # Python 3.9/3.10 recommended
conda activate magicpig
bash install.sh
```
Hardware requirements:
Basic: Intel CPUs supporting AVX512.
BFloat16: Intel CPUs supporting AVX512_BF16, GCC Version $\geq$ 11.
Recommended Python version: 3.9/3.10.
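To check in advance whether your CPU exposes the required instruction-set flags (on Linux), you can scan /proc/cpuinfo. This is a minimal sketch, not part of the install script; the flag names avx512f and avx512_bf16 correspond to the Basic and BFloat16 requirements above:

```python
import os

def has_flags(cpuinfo_text, flags=("avx512f", "avx512_bf16")):
    """Return which CPU feature flags appear in /proc/cpuinfo-style text."""
    present = set()
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("flags"):
            present.update(line.split(":", 1)[1].split())
    return {f: (f in present) for f in flags}

if os.path.exists("/proc/cpuinfo"):  # Linux only
    with open("/proc/cpuinfo") as f:
        print(has_flags(f.read()))
```

If avx512f is missing, the compiled CPU kernels will not run; see the accuracy-only GPU paths under Evaluations below.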
Generation
Commands:
```bash
cd examples
numactl -C 0-31,52-83 -m 0,1 \
python generation.py \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--M 8192 \
--G 256 \
--K 10 \
--L 170 \
--template meta-llama3 \
--data ../data/story.txt
```
Explanations:
- --model: Name or path of a Hugging Face model (only Llama models are currently supported).
- --M: Maximum sequence length for the pre-allocated VRAM. It must be larger than context length + generation length.
- --G: Generation length.
- --K, --L: LSH hyper-parameters (when K=0, full attention is used).
- --template: Chat template (only meta-llama3 and meta-llama2 are currently supported).
- --data: Source data for generation (a .txt file).
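As rough intuition for how --K and --L interact: in a standard (K, L) SimHash scheme (K hash bits per table, L tables, which matches the setup described in the paper), the chance that a key lands in the query's bucket in at least one table depends on the angle between the two vectors. A minimal sketch under that interpretation; it is not the repository's implementation:

```python
import math

def retrieval_prob(theta, K, L):
    """Probability that a key is retrieved under (K, L) SimHash:
    K hash bits per table, L tables; the key is considered if it
    matches the query's bucket in at least one of the L tables.
    theta is the angle between query and key vectors, in radians."""
    p = 1.0 - theta / math.pi          # single-bit collision probability
    return 1.0 - (1.0 - p ** K) ** L   # hit in at least one table

# Keys nearly aligned with the query are retrieved almost surely;
# near-orthogonal keys are rarely touched, which makes attention sparse.
print(retrieval_prob(0.1, K=10, L=170))
print(retrieval_prob(2.0, K=10, L=170))
```

Larger K makes each table more selective; larger L recovers recall by giving each key more chances to collide.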
Benchmark
Commands:
```bash
cd examples
numactl -C 0-31,52-83 -m 0,1 \
python bench.py \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--B 1 \
--P 98000 \
--M 98304 \
--K 10 \
--L 150
```
Explanations:
- --model: Name or path of a Hugging Face model (only Llama models are currently supported).
- --M: Maximum sequence length for the pre-allocated VRAM. It should be larger than --P (by at least 192).
- --P: Actual context length for benchmarking.
- --B: Batch size.
- --K, --L: LSH hyper-parameters (when K=0, full attention is used).
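To sanity-check --M against available VRAM, a back-of-the-envelope estimate of the pre-allocated KV cache helps. The sketch below assumes the Llama-3.1-8B configuration (32 layers, 8 grouped-query KV heads, head dimension 128) and 16-bit cache entries; the numbers are illustrative and not read from this repository's code:

```python
def kv_cache_bytes(M, B, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Rough size of a pre-allocated KV cache with M slots per sequence:
    2 tensors (K and V) per layer, KV heads only (grouped-query attention)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * M * B

# The benchmark command above (--M 98304, --B 1):
print(kv_cache_bytes(98304, 1) / 2**30, "GiB")  # 12.0 GiB
```

At M=98304 with batch size 1, this comes to 12 GiB on top of the model weights, so budget --M accordingly.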
numactl can improve throughput. On different CPU platforms, the number of OpenMP threads needs to be reset manually to achieve the best performance (we currently use 64):

- library/sparse_attention/sparse_attention.h, Line 10: ATTENTION_THREADS
- library/lsh/lsh.h, Line 12: LSH_THREADS

We recommend setting the number of threads to (or slightly below) the number of physical CPU cores (not hyper-threads) for the best performance.
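To pick a starting thread count, one way to count physical cores (excluding hyper-thread siblings) on Linux is to count unique core/socket pairs. A sketch assuming util-linux's lscpu is available:

```bash
# Count unique (core, socket) pairs = physical cores, ignoring SMT siblings.
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
```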
Evaluations
Install RULER environments
Commands:
```bash
cd evaluations/RULER
pip install -r requirements.txt
```
Run RULER Benchmark
Commands:
```bash
cd evaluations/RULER
python download_nltk.py
bash run.sh llama3-8b-chat-128k synthetic $K $L
```
Replace $K and $L with the hyper-parameters you want to evaluate.
Currently, we support the following models:

- llama3-8b-chat-128k: meta-llama/Llama-3.1-8B-Instruct
- llama3-8b-chat-512k: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
- mistral-7b-chat-512k: aws-prototyping/MegaBeam-Mistral-7B-512k
- llama3-70b-chat-128k: meta-llama/Llama-3.1-70B-Instruct
Notice: this will call the compiled lsh and sparse_attention_cpu kernels to execute the system proposed in the paper, so it requires that lsh and sparse_attention_cpu are successfully installed.
Not all users/developers have AVX512 machines. Even if you cannot finish the installation, you can still test the accuracy of MagicPIG with the following two mathematically equivalent implementations (Tensor Parallelism, and Single GPU with Hugging Face). These are intended for accuracy evaluations, not latency/throughput evaluations.
Tensor Parallelism (GPU + Mask)
We implement a mathematically equivalent version with tensor parallelism.
Commands:
```bash
cd evaluations/RULER
python download_nltk.py
bash run_tensor_parallel.sh llama3-8b-chat-128k synthetic $K $L
```
Replace $K and $L with the hyper-parameters you want to evaluate.
Single GPU (Huggingface + Mask)
We implement a mathematically equivalent version with Hugging Face Transformers for easy export to other evaluation frameworks (e.g., lm-eval-harness, LongBench).
Commands:
```bash
cd evaluations/RULER
python download_nltk.py
bash run_single_gpu.sh llama3-8b-chat-128k synthetic $K $L 4 64 $method 0
```
Replace $K and $L with the hyper-parameters you want to evaluate.

- $method: 0: MagicPIG; 1: Quest; 2: TopK; 3: Oracle Sampling
- $K: LSH hyper-parameter for MagicPIG; page size for Quest
- $L: LSH hyper-parameter for MagicPIG; number of selected pages for Quest
Pipeline parallelism can be enabled with Accelerate by adding more GPU ids in Line 26 of run_single_gpu.sh.
This project was made possible thanks to a collaboration with
<a href="https://www.cmu.edu"><img src="https://upload.wikimedia.org/wikipedia/commons/9/9b/Carnegie_Mellon_wordmark.svg" height="20"></a> <a href="https://www.washington.edu/"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/University_of_Washington_signature.svg/2560px-University_of_Washington_signature.svg.png" height="25"></a> <a href="https://www.nyu.edu/"><img src="https://upload.wikimedia.org/wikipedia/en/thumb/5/58/NYU_logo.svg/2560px-NYU_logo.svg.png" height="21"></a> <a href="https://ai.meta.com/research/"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7b/Meta_Platforms_Inc._logo.svg/2560px-Meta_Platforms_Inc._logo.svg.png" height="21"></a>
Reference
```bibtex
@misc{chen2024magicpiglshsamplingefficient,
      title={MagicPIG: LSH Sampling for Efficient LLM Generation},
      author={Zhuoming Chen and Ranajoy Sadhukhan and Zihao Ye and Yang Zhou and Jianyu Zhang and Niklas Nolte and Yuandong Tian and Matthijs Douze and Leon Bottou and Zhihao Jia and Beidi Chen},
      year={2024},
      eprint={2410.16179},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.16179},
}
```