
SepLLM

[ICML 2025] "SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator"

Install / Use

/learn @HKUDS/SepLLM

README

<div align="center">
<h1 align="center"> <strong>🚀 SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator</strong> </h1>
<h3 align="center">✨ An Easy-to-Use <u><strong>Native Sparse Attention</strong></u> Baseline Method</h3>
<h4 align="center"> <a href="https://sepllm.github.io" target="_blank"> <img src="https://cdn.jsdelivr.net/npm/simple-icons@v9/icons/github.svg" alt="GitHub" width="20" height="20" style="vertical-align: middle; margin-right: 8px;"/> sepllm.github.io </a> </h4>
</div>

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuation) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that the information in the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capabilities.
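The compression pattern described above can be sketched as a sparse causal attention mask: each query attends only to an initial "sink" prefix, past separator tokens, and a local window of recent tokens. This is a purely illustrative toy (the whitespace "tokenizer", separator set, and the `n_init` / `n_local` hyperparameters are assumptions, not the paper's exact configuration):

```python
import numpy as np

def sepllm_mask(tokens, separators={".", ",", "!", "?", ";"}, n_init=3, n_local=4):
    """Illustrative causal attention mask in the spirit of SepLLM:
    query q may attend to key k only if k is (a) among the first
    n_init tokens, (b) a separator token, or (c) within the last
    n_local positions before q (inclusive)."""
    T = len(tokens)
    mask = np.zeros((T, T), dtype=bool)
    sep_idx = {i for i, t in enumerate(tokens) if t in separators}
    for q in range(T):
        for k in range(q + 1):  # causal: keys up to the query position
            if k < n_init or k in sep_idx or q - k < n_local:
                mask[q, k] = True
    return mask

toks = "The cat sat . It slept well .".split()
m = sepllm_mask(toks)
```

Here `m[q, k]` is `True` where attention is allowed; in a real model the dense mask would be replaced by the repo's efficient Sep-Attention kernels rather than materialized like this.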


News


  • :star2: [2025/07] The portable SepCache is now available in HuggingFace's official transformers repo! It is a plug-and-play Cache class, and we also provide sample code for monkey patching, which currently supports the Llama 3.1 series. Note that HuggingFace's SepCache version requires transformers>=4.53.0,<4.54.0, i.e., the newer transformers. See Transformers Community for detailed usage. :rocket::rocket::rocket:
  • :star2: [2025/06] We are working on integrating SepCache into HuggingFace's transformers. Stay tuned! :rocket::rocket::rocket:
  • :star2: [2025/06] SepCache is released, which is an efficient, portable, and easy-to-use Cache class for transformers.
  • :star2: [2025/06] SepLLM's trained checkpoint samples have been uploaded to HuggingFace. :rocket::rocket::rocket:
  • :star: [2025/06] More features are now supported by the SepLLM code repository, including BiPE (arXiv:2401.16421), Self-Adjust Softmax (arXiv:2502.18277), FixLLM, etc.
  • :star: [2025/06] SepLLM's slides and videos are uploaded.
  • :star: [2025/06] SepLLM's camera-ready paper is released.
  • :star2: [2025/05] SepLLM has been accepted to ICML 2025. :rocket::rocket::rocket:
  • :star: [2024/12] More exciting features are being developed. Stay tuned!
  • :star: [2024/12] SepLLM's code has been released. Our codebase supports efficient multi-node distributed training with the accelerated attention module Sep-Attention, and also includes numerous existing fusion operators to accelerate training, such as fused RoPE (Su et al., 2023), fused layer norm, etc.
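SepCache's actual API lives in the transformers library and the repo's sample code; as a purely illustrative, plain-Python sketch (function name and hyperparameters are hypothetical, not the real API), a separator-aware cache retains the KV entries for the initial sink tokens, all past separators, and a sliding local window, evicting everything else:

```python
def retained_positions(tokens, separators={".", ",", "!", "?", ";"},
                       n_init=2, n_local=3):
    """Hypothetical sketch of separator-aware KV retention:
    keep the initial "sink" prefix, every separator token seen so
    far, and the most recent n_local tokens; drop the rest."""
    T = len(tokens)
    keep = set(range(min(n_init, T)))                             # sink prefix
    keep |= {i for i, t in enumerate(tokens) if t in separators}  # separators
    keep |= set(range(max(0, T - n_local), T))                    # local window
    return sorted(keep)

toks = "One two three . five six seven , nine".split()
print(retained_positions(toks))  # → [0, 1, 3, 6, 7, 8]
```

The cache size therefore grows with the number of separators rather than the full sequence length, which is where the KV-cache reduction reported above comes from.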

Attention Please!

  • Please pay extra attention to your usage and experimental scenarios, and choose the appropriate code subdirectory accordingly (i.e., TrainingFree-SepLLM, Training-SepLLM, Streaming-SepLLM). Some researchers have mistakenly used the Streaming-SepLLM folder's code for general training-free tasks (e.g., GSM8K_CoT, MMLU, etc.), which is incorrect. The Streaming-SepLLM branch requires "Positional Encoding Shifting" like StreamingLLM, whereas general training-free tasks do not, because the context and generation lengths required by such tasks usually do not exceed the model's pre-trained maximum length (max_position_embeddings). There are other detailed differences as well, which can be found in the code. For these reasons, we refer to Streaming-SepLLM as the "Tailored Streaming Design" in the paper, to distinguish it from the "Fundamental Design." (That said, we have made these two settings, TrainingFree-SepLLM and Streaming-SepLLM, compatible in SepCache.)

  • To achieve optimal performance on downstream tasks, training from scratch is required (to ensure consistency between training and inference). However, for many downstream tasks, the training-free setting can also deliver quite good performance.

  • Our wheel package ./package/transformers-4.38.0.post1+sepllm-py3-none-any.whl is an extension of the official transformers-4.38.0. Its main purpose is to incorporate the model sepllm_gpt_neox and to adapt the relevant files of the Llama model under transformers/models/llama/ to meet the requirements of the SepLLM architecture for training-free adaptability. Since the official transformers-4.38.0 supports the meta-llama/Meta-Llama-3-8B-Instruct model, our released wheel also supports it. However, as the official transformers-4.38.0 does not directly support meta-llama/Llama-3.1-8B-Instruct (Llama 3.1 was developed on transformers-4.42.3), our released transformers does not support it either. To run Llama-3.1-8B-Instruct directly, you will need to manually migrate and adapt SepLLM's relevant code to the Llama-related files in transformers-4.42.3 (also under transformers/models/llama/). Fortunately, once you have understood this README.md and identified the modified code and files needed for SepLLM's training-free adaptation, this migration should be straightforward (especially for SepCache, which can mostly be used via copy-paste). Adapting to other, newer versions of transformers is similarly easy!

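The "Positional Encoding Shifting" that distinguishes the Streaming-SepLLM branch can be sketched as follows: after eviction, positions are assigned by each token's rank inside the cache rather than by its original index, so the position range stays bounded no matter how long the stream gets (the helper below is hypothetical, for illustration only):

```python
def shifted_positions(kept):
    """StreamingLLM-style positional shifting (illustrative):
    map each retained token's original index to its rank within
    the cache, keeping position ids in [0, len(kept))."""
    return {orig: rank for rank, orig in enumerate(sorted(kept))}

kept = [0, 1, 3, 6, 7, 8]        # e.g. indices retained by the cache
print(shifted_positions(kept))   # → {0: 0, 1: 1, 3: 2, 6: 3, 7: 4, 8: 5}
```

General training-free tasks skip this step and use the original position ids, which is why the two subdirectories are not interchangeable.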

1. Overview
