LCKV
Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance. Accepted to ACL 2024.
Layer-Condensed KV Cache
<div align="center"> <img width="200" src="https://github.com/whyNLP/LCKV/assets/43395692/de271239-0096-4fd7-a578-59e57db916a2" /> <p> The KVs of the top layer <br> are the most informative and important. <br> So why bother caching the rest? </p> </div>

The code base for project Layer-Condensed KV Cache, a new variant of transformer decoders in which queries of all layers are paired with keys and values of just the top layer. It reduces memory and computation costs, uses fewer parameters, and significantly improves inference throughput while achieving comparable or better task performance. The paper "Layer-Condensed KV Cache for Efficient Inference of Large Language Models" was accepted to ACL 2024 main conference.
This work is inspired by Probabilistic Transformer, where we consider the stacking layer structure of a transformer as an iterative process of improving token representation.
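To make the memory saving concrete, here is a rough back-of-the-envelope sketch of KV cache size when only a few layers store keys and values. The helper name kv_cache_bytes and all dimensions (22 layers, 4 KV heads, head dimension 64, fp16, 3 caching layers) are illustrative assumptions, not values read from this code base, and the estimate covers cache memory only.

```python
def kv_cache_bytes(num_kv_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `num_kv_layers` layers."""
    # Factor 2: one key tensor and one value tensor per caching layer.
    return (2 * num_kv_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical TinyLlama-like setting (assumed, for illustration only).
dims = dict(num_kv_heads=4, head_dim=64, seq_len=2048, batch_size=8)

standard = kv_cache_bytes(num_kv_layers=22, **dims)   # every layer caches KVs
condensed = kv_cache_bytes(num_kv_layers=3, **dims)   # e.g. 2 warmup layers + 1 shared layer

print(f"standard : {standard / 2**20:.0f} MiB")
print(f"condensed: {condensed / 2**20:.0f} MiB")
print(f"ratio    : {standard / condensed:.1f}x")
```

Because the cache grows linearly with the number of caching layers, the memory freed this way can be spent on a larger batch size or longer sequences.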
<details> <summary>The Map of AI Approaches</summary> <div align="center"> <img width="400" src="https://github.com/whyNLP/LCKV/assets/43395692/cdca6717-8a30-4e24-9b61-c8ad743bc092" /> </div> </details>

News
- [25/01/23] Our paper "A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference" was accepted to NAACL 2025 main conference.
- [24/12/08] We release the main branch, with a general framework for Cross-Layer KV Sharing. An illustrative post can be found on PaperWeekly (in Chinese). See the published branch for the old version of the code.
- [24/10/18] Our new empirical study "A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference" has been released on arXiv. A new configuration has been found to be more efficient than the original LCKV.
- [24/05/28] This code base now also supports Cross-Layer Attention (CLA). The idea is similar, but they 1) divide the transformer layers into small groups with 2-4 layers in each group; 2) pair the queries of all the layers with the keys and values of the bottom layer in each group. See details in their paper "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention".
- [24/05/20] LCKV initial paper and code release.
- [24/05/12] Our paper "Layer-Condensed KV Cache for Efficient Inference of Large Language Models" was accepted to ACL 2024 main conference.
- [24/02/14] Our paper "Layer-Condensed KV Cache for Efficient Inference of Large Language Models" was submitted to ARR February 2024 cycle.
Quick Start
We have released a series of pre-trained models described in our paper on HuggingFace. There is no need to clone this repo if you just want to use the pre-trained models. Load the model with the following code:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="whynlp/tinyllama-lckv-w2-ft-100b", trust_remote_code=True)
# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-lckv-w2-ft-100b", trust_remote_code=True)
See more models on the HuggingFace model hub. Note that these models are for research purposes only and may not be suitable for production.
| Model | Paper Section | Dev ppl. | Common-sense Reasoning |
| --------------------------------- | ------------------------------ | -------- | ---------------------- |
| whynlp/tinyllama-lckv-w10-ft-250b | -- | 7.939 | 50.86 |
| whynlp/tinyllama-lckv-w2-ft-100b | Appendix C.1, Table 7 (line 5) | 8.514 | 49.55 |
| whynlp/tinyllama-lckv-w10-100b | Section 3.2, Table 2 (line 3) | 9.265 | 46.84 |
| whynlp/tinyllama-lckv-w2-100b | Section 3.2, Table 2 (line 2) | 9.746 | 45.45 |
Installation
You may install the dependencies with the following commands:
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
The commands above assume CUDA 12.1. For other CUDA versions, please refer to the PyTorch installation instructions. See Trouble shooting for more details.
Usage
Our implementation is based on HuggingFace transformers. We register a new model, lckv-llama, which inherits from the llama model and adds support for the Layer-Condensed KV Cache.
[!NOTE] It is difficult to support the Layer-Condensed KV Cache for a variety of models with a small amount of code, because it requires modifying the attention mechanism and training recipe of the transformer decoder. Currently, we have only implemented the Layer-Condensed KV Cache for the llama model, and it is possible to extend it to other models with similar structures.
import models # register the lckv-llama model
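from models import LCKVLlamaConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
# from_config expects a config object rather than a file path, so load the JSON file first
config = LCKVLlamaConfig.from_pretrained("configs/tinyllama_lckv.json")
model = AutoModelForCausalLM.from_config(config)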
and now you have a randomly initialized model with the Layer-Condensed KV Cache.
Optimization
To accelerate training and inference, you can apply the liger kernel supported by the transformers library. The provided training script run_clm.py already activates the liger kernel. See more details here.
Configuration
We provide some sample configuration files in the configs folder. The config settings are defined in models/configuration_lckv.py. You may refer to this file for more details.
Option 1: Modify the configurations in python:
from models import LCKVLlamaConfig
# we have prepared a sample configuration file
config = LCKVLlamaConfig.from_pretrained("configs/tinyllama_lckv.json")
# below is the LCKV config. you may modify the configuration as you like
config.forward_passes = 7 # m in the paper
config.backward_passes = 2 # b in the paper
config.layer_types = "0_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_21" # for each layer, which layer to attend to
# we also support this
config.layer_types = "0_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_21" # the sandwich-middle configuration
config.layer_types = "0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21" # Llama config
config.layer_types = "0_0_2_2_4_4_6_6_8_8_10_10_12_12_14_14_16_16_18_18_20_20" # CLA config
config.sliding_window = 1024 # the window size for the sliding window attention
config.layer_types = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11" # YOCO config, 's' is for sliding window
config.sliding_window = 1024 # the window size for the sliding window attention
config.layer_types = "0_1s_1s_3s_3s_3s_0_7s_7s_9s_9s_9s_12_13s_13s_15s_15s_15s_12_19s_19s_19s" # MixAttention (Pairs) config
# we also support sequential training / inference, which will process the tokens one by one
# corresponding to LCKV paper Figure 2(a)
config.use_sequential = True
Option 2: Modify the configurations in the shell script (via --config_overrides):
accelerate launch run_clm.py \
--config_name configs/tinyllama_lckv.json \
--config_overrides forward_passes=7,backward_passes=2,layer_types=0_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_21 \
...
With the above configurations, you can create CLA, YOCO, or any configuration from Cross-Layer KV Sharing or MixAttention without changing the code. All you need to do is write the correct layer_types in the configuration file.
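Writing long layer_types strings by hand is error-prone, so a small generator can help. The sketch below is not part of this code base: the helper names lckv_layer_types, cla_layer_types and kv_layers are hypothetical, and the string format (one entry per layer naming the layer whose KVs it uses, with an optional "s" suffix for sliding window attention) is inferred from the sample configurations above.

```python
def lckv_layer_types(num_layers, bottom_warmup=1, top_warmup=1):
    """LCKV-style string: non-warmup layers share the KVs of one upper layer."""
    shared = num_layers - top_warmup - 1  # index of the layer whose KVs are shared
    types = []
    for i in range(num_layers):
        if i < bottom_warmup or i > shared:
            types.append(i)        # warmup layers keep their own KVs
        else:
            types.append(shared)   # condensed layers use the shared layer's KVs
    return "_".join(map(str, types))


def cla_layer_types(num_layers, group_size=2):
    """CLA-style string: each group of layers uses the KVs of its bottom layer."""
    return "_".join(str(i - i % group_size) for i in range(num_layers))


def kv_layers(layer_types):
    """Sorted indices of the layers whose KVs are referenced by some layer."""
    # A trailing "s" marks sliding window attention and is ignored here.
    return sorted({int(t.rstrip("s")) for t in layer_types.split("_")})


print(cla_layer_types(22))
# 0_0_2_2_4_4_6_6_8_8_10_10_12_12_14_14_16_16_18_18_20_20
print(kv_layers(lckv_layer_types(22)))
# [0, 20, 21]
```

The generated string can then be assigned to config.layer_types or passed through --config_overrides, exactly like the hand-written examples.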
Pre-training
We use the same training script as the original transformers library. You may refer to the official documentation for more details.
We provide a training script run_clm.sh for training a 50M parameter model on the wikitext-103 dataset. You may run the script with:
bash run_clm.sh
See the script for more details. For pretraining on SlimPajama, please follow the instructions in tinyllama-zh and replace the dataset with SlimPajama.
Initializing from a Pretrained Model
We may initialize our LCKV model from a pretrained model. Most parts of the model structure are consistent with the standard transformer model, so we can directly inherit the weights. For the KV weights $W_K, W_V$, there are two main options:
Option 1: Directly Copy the Weights
Simply add --model_name_or_path to the training script:
accelerate launch run_clm.py \
--model_name_or_path TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T \
--config_name configs/tinyllama_lckv.json \
...
See the s
