LCKV

Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance. Accepted to ACL 2024.

Layer-Condensed KV Cache

<div align="center"> <img width="200" src="https://github.com/whyNLP/LCKV/assets/43395692/de271239-0096-4fd7-a578-59e57db916a2" /> <p> The KVs of the top layer <br> are the most informative and important. <br> So why bother caching the rest? </p> </div>

This is the code base for Layer-Condensed KV Cache (LCKV), a new variant of transformer decoders in which the queries of all layers are paired with the keys and values of just the top layer. It reduces memory and computation costs, reduces the number of parameters, and significantly improves inference throughput with comparable or better task performance. The paper "Layer-Condensed KV Cache for Efficient Inference of Large Language Models" was accepted to the ACL 2024 main conference.

This work is inspired by Probabilistic Transformer, where we consider the stacking layer structure of a transformer as an iterative process of improving token representation.
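To get a feel for the memory saving from pairing all queries with the top layer's keys and values, the sketch below estimates KV cache size with the usual back-of-the-envelope formula (keys plus values, per cached layer, per token). The model shape, sequence length, batch size and the `kv_cache_bytes` helper are illustrative assumptions, not numbers or code from the paper or this repo. It also ignores any extra standard-attention layers that a real configuration might keep.

```python
def kv_cache_bytes(n_kv_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `n_kv_layers` layers."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * n_kv_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed, illustrative model shape and workload (fp16 elements).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 22, 4, 64
SEQ_LEN, BATCH_SIZE = 2048, 8

# Standard decoder: every layer caches its own keys and values.
standard = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN, BATCH_SIZE)
# Idealized layer-condensed decoder: only the top layer's KVs are cached.
condensed = kv_cache_bytes(1, N_KV_HEADS, HEAD_DIM, SEQ_LEN, BATCH_SIZE)

print(f"standard:  {standard / 2**20:.0f} MiB")
print(f"condensed: {condensed / 2**20:.0f} MiB")
print(f"reduction: {standard / condensed:.0f}x")
```

Under these assumptions the cache shrinks by a factor equal to the number of layers, which is the headroom that can be spent on larger batches.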

<details> <summary>The Map of AI Approaches</summary> <div align="center"> <img width="400" src="https://github.com/whyNLP/LCKV/assets/43395692/cdca6717-8a30-4e24-9b61-c8ad743bc092" /> </div> </details>

News

[25/01/23] Our paper "A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference" was accepted to NAACL 2025 main conference.
[24/12/08] We release the main branch, with a general framework for Cross-Layer KV Sharing. An illustrative post can be found on PaperWeekly (in Chinese). See the published branch for the old version of the code.
[24/10/18] Our new empirical study "A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference" has been released on arXiv. A new configuration has been found to be more efficient than the original LCKV.
[24/05/28] This code base now also supports Cross-Layer Attention (CLA). The idea is similar, but they 1) divide the transformer layers into small groups with 2-4 layers in each group; 2) pair the queries of all the layers with the keys and values of the bottom layer in each group. See details in their paper "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention".
[24/05/20] LCKV initial paper and code release.
[24/05/12] Our paper "Layer-Condensed KV Cache for Efficient Inference of Large Language Models" was accepted to ACL 2024 main conference.
[24/02/14] Our paper "Layer-Condensed KV Cache for Efficient Inference of Large Language Models" was submitted to ARR February 2024 cycle.

Quick Start

We have released a series of pre-trained models described in our paper on HuggingFace. There is no need to clone this repo if you just want to use the pre-trained models. Load the model with the following code:

# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="whynlp/tinyllama-lckv-w2-ft-100b", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-lckv-w2-ft-100b", trust_remote_code=True)

See more models on the HuggingFace model hub. Note that these models are for research purposes only and may not be suitable for production.

| Model | Paper Section | Dev ppl. | Common-sense Reasoning |
| --------------------------------- | ------------------------------ | -------- | ---------------------- |
| whynlp/tinyllama-lckv-w10-ft-250b | -- | 7.939 | 50.86 |
| whynlp/tinyllama-lckv-w2-ft-100b | Appendix C.1, Table 7 (line 5) | 8.514 | 49.55 |
| whynlp/tinyllama-lckv-w10-100b | Section 3.2, Table 2 (line 3) | 9.265 | 46.84 |
| whynlp/tinyllama-lckv-w2-100b | Section 3.2, Table 2 (line 2) | 9.746 | 45.45 |

Installation

You may install the dependencies with the following commands:

conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt

Here the CUDA version is set to 12.1. For other CUDA versions, please refer to the PyTorch installation instructions. See Trouble shooting for more details.

Usage

Our implementation is based on HuggingFace transformers. We register a new model, lckv-llama, which inherits from the llama model and adds support for the Layer-Condensed KV Cache.

[!NOTE] It is difficult to support the Layer-Condensed KV Cache for a variety of models with a small amount of code, because it requires modifying the attention mechanism and the training recipe of the transformer decoder. Currently, we have only implemented the Layer-Condensed KV Cache for the llama model, but it should be possible to extend it to other models with similar structures.

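from models import LCKVLlamaConfig  # importing `models` also registers the lckv-llama model
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
# from_config expects a config object rather than a file path, so load the config first
config = LCKVLlamaConfig.from_pretrained("configs/tinyllama_lckv.json")
model = AutoModelForCausalLM.from_config(config)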

and now you have a randomly initialized model with the Layer-Condensed KV Cache.

Optimization

To accelerate training and inference, you can apply the Liger kernel supported by the transformers library. The provided training script run_clm.py already enables the Liger kernel. See more details here.

Configuration

We provide some sample configuration files in the configs folder. The config settings are defined in models/configuration_lckv.py. You may refer to this file for more details.

Option 1: Modify the configurations in Python:

from models import LCKVLlamaConfig

# we have prepared a sample configuration file
config = LCKVLlamaConfig.from_pretrained("configs/tinyllama_lckv.json")

# below is the LCKV config. you may modify the configuration as you like
config.forward_passes  = 7      # m in the paper
config.backward_passes = 2      # b in the paper
config.layer_types     = "0_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_21" # for each layer, which layer to attend to

# we also support this
config.layer_types     = "0_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_21" # the sandwich-middle configuration
config.layer_types     = "0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21" # Llama config
config.layer_types     = "0_0_2_2_4_4_6_6_8_8_10_10_12_12_14_14_16_16_18_18_20_20" # CLA config

config.sliding_window  = 1024   # the window size for the sliding window attention
config.layer_types     = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11" # YOCO config, 's' is for sliding window

config.sliding_window  = 1024   # the window size for the sliding window attention
config.layer_types     = "0_1s_1s_3s_3s_3s_0_7s_7s_9s_9s_9s_12_13s_13s_15s_15s_15s_12_19s_19s_19s" # MixAttention (Pairs) config

# we also support sequential training / inference, which will process the tokens one by one
# corresponding to LCKV paper Figure 2(a)
config.use_sequential = True

Option 2: Modify the configurations in the shell script (via `--config_overrides`):

accelerate launch run_clm.py \
    --config_name configs/tinyllama_lckv.json \
    --config_overrides forward_passes=7,backward_passes=2,layer_types=0_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_21 \
    ...

With the above configurations, you can create CLA, YOCO, or any other configuration from Cross-Layer KV Sharing or MixAttention without changing the code. All you need to do is write the correct layer_types in the configuration file.
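As a rough illustration of the layer_types format, the sketch below parses such a string, builds two of the patterns listed above, and assembles a `--config_overrides` value. All helper names here (`parse_layer_types`, `kv_layers`, `cla_layer_types`, `lckv_layer_types`, `to_config_overrides`) are hypothetical and not part of this repo. The format rules are inferred only from the sample strings above: entries are separated by `_`, each entry names the layer whose KVs that layer attends to, and an `s` suffix marks sliding window attention.

```python
def parse_layer_types(layer_types):
    """Split a layer_types string into (target_layer, uses_sliding_window) pairs."""
    parsed = []
    for entry in layer_types.split("_"):
        sliding = entry.endswith("s")
        parsed.append((int(entry.rstrip("s")), sliding))
    return parsed


def kv_layers(layer_types):
    """Sorted indices of the layers whose KVs are attended to by at least one layer."""
    return sorted({target for target, _ in parse_layer_types(layer_types)})


def cla_layer_types(n_layers, group_size):
    """CLA-style pattern: each layer attends to the bottom layer of its group."""
    return "_".join(str(i - i % group_size) for i in range(n_layers))


def lckv_layer_types(n_layers, n_bottom=1, n_top=1):
    """Pattern of the first sample string: the bottom and top layers attend to
    themselves, and all layers in between share the KVs of the last middle layer."""
    shared = n_layers - n_top - 1
    targets = [
        i if i < n_bottom or i >= n_layers - n_top else shared
        for i in range(n_layers)
    ]
    return "_".join(str(t) for t in targets)


def to_config_overrides(options):
    """Join key/value pairs into the comma-separated form used by --config_overrides."""
    return ",".join(f"{key}={value}" for key, value in options.items())


if __name__ == "__main__":
    layer_types = lckv_layer_types(22)
    print(layer_types)
    print("layers holding KVs:", kv_layers(layer_types))
    print(to_config_overrides(
        {"forward_passes": 7, "backward_passes": 2, "layer_types": layer_types}
    ))
```

Counting the entries returned by `kv_layers` gives a quick estimate of how many layers still need a KV cache under a given configuration.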

Pre-training

We use the same training script as the original transformers library. You may refer to the official documentation for more details.

We provide a training script run_clm.sh for training a 50M parameter model on the wikitext-103 dataset. You may run the script with:

bash run_clm.sh

See the script for more details. For pretraining on SlimPajama, please follow the instructions in tinyllama-zh and replace the dataset with SlimPajama.

Initializing from a Pretrained Model

We may initialize our LCKV model from a pretrained model. Most parts of the model structure are consistent with the standard transformer model and we can directly inherit the weights. For the KV weights $W_K, W_V$, we mainly have 2 options:

Option 1: Directly Copy the Weights

Simply add --model_name_or_path to the training script:

accelerate launch run_clm.py \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T \
    --config_name configs/tinyllama_lckv.json \
    ...

See the s
