SepLLM
[ICML 2025] "SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator"
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to the quadratic complexity of self-attention. In this work, we identify a key pattern: certain seemingly meaningless separator tokens (i.e., punctuation) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that the information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over a 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capability.
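To make the idea above concrete, here is a minimal, dependency-free sketch of the SepLLM-style sparse attention pattern: each query attends only to the initial tokens, separator tokens, and a recent local window. Function name, parameters, and the exact rule are illustrative assumptions, not the repository's reference implementation.

```python
# Hypothetical sketch of a SepLLM-style attention mask: queries attend to
# (a) the first `init` tokens, (b) separator tokens, and (c) a local window
# of `local` recent tokens. Illustrative only, not the paper's kernel.

def sepllm_mask(token_ids, separator_ids, init=1, local=3):
    """Return a causal boolean mask: mask[q][k] is True iff query q may attend to key k."""
    n = len(token_ids)
    keep = [
        k < init or token_ids[k] in separator_ids  # initial / separator tokens
        for k in range(n)
    ]
    return [
        [k <= q and (keep[k] or q - k < local)  # causal, plus local window
         for k in range(n)]
        for q in range(n)
    ]

# Toy vocabulary: token id 0 (say, ".") acts as a separator.
m = sepllm_mask([5, 7, 0, 9, 4, 0, 8], {0}, init=1, local=2)
# The last query still sees the first token, both separators, and its local
# window, but drops distant non-separator tokens such as position 1.
```

This is only a dense-mask illustration of which keys survive; the actual speedup comes from never materializing the dropped KV entries.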

News
- :star2: [2025/07] The portable `SepCache` is now available in HuggingFace's official `transformers` repo!! It is a plug-and-play `Cache` class, and we also provide sample code for monkey patching, which currently supports the Llama 3.1 series. Note that the HuggingFace `SepCache` version requires `transformers>=4.53.0,<4.54.0`, i.e., the new `transformers`. See Transformers Community for detailed usage. :rocket::rocket::rocket:
- :star2: [2025/07] The portable `SepCache` is now available on HuggingFace!! It is a plug-and-play `Cache` class, and we also provide sample code for monkey patching, which currently supports the Llama 3.1 series. Note that the HuggingFace `SepCache` version requires `transformers>=4.53.0,<4.54.0`, i.e., the new `transformers`. We are working on integrating `SepCache` into HuggingFace's Transformers Community. Stay tuned! :rocket::rocket::rocket:
- :star2: [2025/06] We are working on integrating `SepCache` into HuggingFace's `transformers`. Stay tuned! :rocket::rocket::rocket:
- :star2: [2025/06] `SepCache` is released: an efficient, portable, and easy-to-use `Cache` class for `transformers`.
- :star2: [2025/06] SepLLM's trained checkpoint samples have been uploaded to HuggingFace. :rocket::rocket::rocket:
- :star: [2025/06] The SepLLM code repository now supports more features, including BiPE (arXiv:2401.16421), Self-Adjust Softmax (arXiv:2502.18277), FixLLM, etc.
- :star: [2025/06] SepLLM's slides and videos are uploaded.
- :star: [2025/06] SepLLM's camera-ready paper is released.
- :star2: [2025/05] SepLLM has been accepted to ICML 2025. :rocket::rocket::rocket:
- :star: [2024/12] More exciting features are being developed. Stay tuned!
- :star: [2024/12] SepLLM's code has been released. Our codebase supports efficient multi-node distributed training with the accelerated attention module Sep-Attention, and also includes numerous existing fusion operators to accelerate training, such as fused RoPE (Su et al., 2023), fused layer norm, etc.
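As a rough mental model of what a separator-aware KV cache like `SepCache` retains, here is a toy, dependency-free sketch of the eviction policy: when capacity is exceeded, drop the oldest entry that is neither an initial token, a separator, nor inside the recent window. Every class and parameter name here is invented for illustration and does not match the real `SepCache` API in `transformers`.

```python
# Toy separator-aware KV cache policy (illustrative only; see the official
# SepCache in HuggingFace transformers for the real, full-featured API).

class ToySepCache:
    def __init__(self, separator_ids, init=1, local=2, capacity=5):
        self.separator_ids = set(separator_ids)
        self.init, self.local, self.capacity = init, local, capacity
        self.entries = []  # (position, token_id) pairs standing in for KV pairs

    def append(self, pos, token_id):
        self.entries.append((pos, token_id))
        if len(self.entries) > self.capacity:
            self._evict()

    def _evict(self):
        # Evict the oldest entry that is not protected.
        newest = self.entries[-1][0]
        for i, (pos, tok) in enumerate(self.entries):
            protected = (pos < self.init                 # initial tokens
                         or tok in self.separator_ids    # separator tokens
                         or newest - pos < self.local)   # recent local window
            if not protected:
                del self.entries[i]
                return

cache = ToySepCache({0}, init=1, local=2, capacity=4)
for pos, tok in enumerate([5, 7, 0, 9, 4, 0]):
    cache.append(pos, tok)
# Retained positions favour the first token, separators, and the newest tokens.
```

The real `SepCache` additionally handles per-layer key/value tensors, batching, and positional bookkeeping; this sketch only conveys which tokens survive.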
Attention Please!
- Please pay extra attention to your usage and experimental scenarios, and choose the appropriate code subdirectory accordingly (i.e., `TrainingFree-SepLLM`, `Training-SepLLM`, or `Streaming-SepLLM`). Some researchers have mistakenly used the `Streaming-SepLLM` folder's code for general training-free tasks (e.g., `GSM8K_CoT`, `MMLU`, etc.), which is incorrect. The `Streaming-SepLLM` branch requires "Positional Encoding Shifting" like StreamingLLM, whereas general training-free tasks do not, since the context and generation lengths of such tasks usually stay within the model's pre-trained maximum length (`max_position_embeddings`). There are other detailed differences as well, which can be found in the code. For these reasons, we refer to `Streaming-SepLLM` as the "Tailored Streaming Design" in the paper to distinguish it from the "Fundamental Design." (We have nevertheless made these two settings, `TrainingFree-SepLLM` and `Streaming-SepLLM`, compatible in `SepCache`.)
- To achieve optimal performance on downstream tasks, training from scratch is required (to ensure consistency between training and inference). However, for many downstream tasks, the training-free setting can also deliver quite good performance.
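The "Positional Encoding Shifting" mentioned above (as in StreamingLLM) means positions are assigned by a token's slot inside the cache rather than by its original index in the text, so positions never exceed the cache size even after millions of tokens. A minimal sketch of the idea, with an invented function name (not the repository's code):

```python
def shifted_positions(cached_original_positions):
    """Assign positions by rank within the cache, not by original text index.

    With streaming eviction, original positions grow without bound (far past
    max_position_embeddings); shifted positions stay within the cache size.
    """
    return list(range(len(cached_original_positions)))

# After streaming eviction, the cache might hold tokens originally at:
orig = [0, 1, 4096, 4097, 4098]
print(shifted_positions(orig))  # [0, 1, 2, 3, 4] -- always in-range
```

General training-free tasks skip this step because their sequences already fit within the pre-trained position range.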
- Our wheel package `./package/transformers-4.38.0.post1+sepllm-py3-none-any.whl` is an extension of the official `transformers-4.38.0`. Its main purpose is to add the `sepllm_gpt_neox` model and to adapt the relevant files of the Llama model under `transformers/models/llama/` to meet the requirements of the SepLLM architecture for training-free use. Since the official `transformers-4.38.0` supports the `meta-llama/Meta-Llama-3-8B-Instruct` model, our released `./package/transformers-4.38.0.post1+sepllm-py3-none-any.whl` supports it as well. However, since the official `transformers-4.38.0` does not directly support `meta-llama/Llama-3.1-8B-Instruct` (Llama 3.1 was developed on `transformers-4.42.3`), our released `transformers` does not support it either. To run `Llama-3.1-8B-Instruct` directly, you will need to manually migrate the relevant SepLLM code to the Llama-related files in `transformers-4.42.3` (also under `transformers/models/llama/`). Fortunately, once you have understood this `README.md` and identified the code and files modified for the training-free adaptation of SepLLM, this migration should be straightforward (especially for `SepCache`, which can mostly be used via copy-paste). Adapting to other, newer `transformers` versions is similarly easy.
