
Layer-Condensed KV Cache

<div align="center"> <img width="200" src="https://github.com/whyNLP/LCKV/assets/43395692/de271239-0096-4fd7-a578-59e57db916a2" /> <p> The KVs of the top layer <br> are the most informative and important. <br> So why bother caching the rest? </p> </div>

The code base for the project Layer-Condensed KV Cache, a new variant of transformer decoders in which the queries of all layers are paired with the keys and values of just the top layer. This reduces memory and computation costs, reduces the number of parameters, and significantly improves inference throughput while achieving comparable or better task performance. The paper "Layer-Condensed KV Cache for Efficient Inference of Large Language Models" was accepted to the ACL 2024 main conference.
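To see where the memory savings come from, here is a back-of-the-envelope sketch (illustrative numbers only, not figures from the paper), assuming an fp16 TinyLlama-like shape:

```python
# Back-of-the-envelope KV cache sizes (illustrative, not from the paper).
# A standard decoder caches K and V for every layer; LCKV only needs the
# KVs of the top layer (warmup layers are ignored in this sketch).

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    # 2 tensors (K and V) per cached layer; fp16 = 2 bytes per element
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per

# TinyLlama-like shape: 22 layers, 4 KV heads, head dim 64
full = kv_cache_bytes(layers=22, kv_heads=4, head_dim=64, seq_len=2048, batch=1)
lckv = kv_cache_bytes(layers=1, kv_heads=4, head_dim=64, seq_len=2048, batch=1)

print(f"full cache: {full / 2**20:.1f} MiB, LCKV: {lckv / 2**20:.1f} MiB, ratio: {full // lckv}x")
```

Caching only one layer's KVs shrinks the cache by roughly the number of layers, which is what frees the memory for the much larger batch sizes reported above.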

This work is inspired by Probabilistic Transformer, where we consider the stacking layer structure of a transformer as an iterative process of improving token representation.

<details> <summary>The Map of AI Approaches</summary> <div align="center"> <img width="400" src="https://github.com/whyNLP/LCKV/assets/43395692/cdca6717-8a30-4e24-9b61-c8ad743bc092" /> </div> </details>


Quick Start

We have released a series of pre-trained models described in our paper on HuggingFace. There is no need to clone this repo if you just want to use the pre-trained models. Load the model with the following code:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="whynlp/tinyllama-lckv-w2-ft-100b", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-lckv-w2-ft-100b", trust_remote_code=True)
```

See more models on the HuggingFace model hub. Note that these models are for research purposes only and may not be suitable for production.

| Model | Paper Section | Dev ppl. | Common-sense Reasoning |
| --------------------------------- | ------------------------------ | -------- | ---------------------- |
| whynlp/tinyllama-lckv-w10-ft-250b | --                             | 7.939    | 50.86                  |
| whynlp/tinyllama-lckv-w2-ft-100b  | Appendix C.1, Table 7 (line 5) | 8.514    | 49.55                  |
| whynlp/tinyllama-lckv-w10-100b    | Section 3.2, Table 2 (line 3)  | 9.265    | 46.84                  |
| whynlp/tinyllama-lckv-w2-100b     | Section 3.2, Table 2 (line 2)  | 9.746    | 45.45                  |

Installation

You may install the dependencies with the following commands:

```sh
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```

where the CUDA version is set to 12.1. For other CUDA versions, please refer to the installation instructions of PyTorch. See Troubleshooting for more details.

Usage

Our implementation is based on HuggingFace transformers. We register a new model lckv-llama that supports the Layer-Condensed KV Cache. It inherits from the llama model and adds support for the Layer-Condensed KV Cache.

> [!NOTE]
> It is difficult to support the Layer-Condensed KV Cache for a wide variety of models with a small amount of code, because it requires modifying both the attention mechanism and the training recipe of the transformer decoder. Currently, we have only implemented the Layer-Condensed KV Cache for the llama model, though it should be possible to extend it to other models with similar structures.

```python
import models  # register the lckv-llama model
from models import LCKVLlamaConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
config = LCKVLlamaConfig.from_pretrained("configs/tinyllama_lckv.json")
model = AutoModelForCausalLM.from_config(config)
```

and now you have a randomly initialized model with the Layer-Condensed KV Cache.
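As a toy illustration of the core idea (not the repo's actual implementation), the sketch below runs a stack of "layers" in which every layer's queries attend to one shared set of keys and values, so only a single layer's KVs would ever need to be cached; all names and numbers are hypothetical:

```python
import math

# Toy scaled-dot-product attention over one shared KV set.
# In LCKV, queries of all layers are paired with the KVs of the top
# layer, so the per-layer KV caches of the other layers disappear.

def attend(query, keys, values):
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# One shared cache serves the queries of every layer.
shared_keys = [[1.0, 0.0], [0.0, 1.0]]
shared_values = [[2.0, 0.0], [0.0, 2.0]]

hidden = [1.0, 0.0]
for _ in range(4):  # 4 "layers", all reading the same shared KVs
    hidden = attend(hidden, shared_keys, shared_values)
print(hidden)
```

The real model additionally has to deal with the circular dependency this creates during training (the top-layer KVs depend on all layers' outputs), which is what the iterative training recipe in the paper addresses.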

Optimization

To accelerate training and inference, one can apply the Liger kernel supported by the transformers library. The provided training script run_clm.py already activates it. See more details here.

Configuration

We provide some sample configuration files in the configs folder. The config settings are defined in models/configuration_lckv.py. You may refer to this file for more details.

Option 1: Modify the configurations in python:

```python
from models import LCKVLlamaConfig

# we have prepared a sample configuration file
config = LCKVLlamaConfig.from_pretrained("configs/tinyllama_lckv.json")

# below is the LCKV config. you may modify the configuration as you like
config.forward_passes  = 7      # m in the paper
config.backward_passes = 2      # b in the paper
config.layer_types     = "0_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_21" # for each layer, which layer to attend to

# we also support this
config.layer_types     = "0_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_10_21" # the sandwich-middle configuration
config.layer_types     = "0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21" # Llama config
config.layer_types     = "0_0_2_2_4_4_6_6_8_8_10_10_12_12_14_14_16_16_18_18_20_20" # CLA config

config.sliding_window  = 1024   # the window size for the sliding window attention
config.layer_types     = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11" # YOCO config, 's' is for sliding window

config.sliding_window  = 1024   # the window size for the sliding window attention
config.layer_types     = "0_1s_1s_3s_3s_3s_0_7s_7s_9s_9s_9s_12_13s_13s_15s_15s_15s_12_19s_19s_19s" # MixAttention (Pairs) config

# we also support sequential training / inference, which will process the tokens one by one
# corresponding to LCKV paper Figure 2(a)
config.use_sequential = True
```

Option 2: Modify the configurations in the shell script (via --config_overrides):

```sh
accelerate launch run_clm.py \
    --config_name configs/tinyllama_lckv.json \
    --config_overrides forward_passes=7,backward_passes=2,layer_types=0_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_20_21 \
    ...
```

With the above configurations, you can create CLA, YOCO, or any of the configurations in Cross-Layer KV Sharing or MixAttention without changing the code. All you need to do is write the correct layer_types in the configuration file.
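The layer_types string is just an underscore-separated list: entry *i* names the layer whose KVs layer *i* attends to, and an `s` suffix marks sliding-window attention. A hypothetical parser (not part of this repo) makes the encoding explicit:

```python
# Hypothetical parser for the layer_types string (not part of the repo):
# entry i = index of the layer whose KVs layer i attends to;
# an 's' suffix marks sliding-window attention for that layer.

def parse_layer_types(layer_types):
    parsed = []
    for i, entry in enumerate(layer_types.split("_")):
        sliding = entry.endswith("s")
        target = int(entry[:-1] if sliding else entry)
        parsed.append({"layer": i, "attends_to": target, "sliding_window": sliding})
    return parsed

# YOCO-style string: first half sliding-window, second half shares layer 11's KVs
yoco = parse_layer_types("0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11")
assert yoco[0] == {"layer": 0, "attends_to": 0, "sliding_window": True}
assert yoco[21] == {"layer": 21, "attends_to": 11, "sliding_window": False}

# Only layers that some layer attends to need their KVs cached:
cached = sorted({e["attends_to"] for e in yoco})
print(cached)  # layers 0..11
```

Under this reading, the standard Llama string maps every layer to itself (all layers cached), while the LCKV and YOCO strings concentrate the attends-to targets on a few layers, which is exactly where the cache savings come from.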

Pre-training

We use the same training script as the original transformers library. You may refer to the official documentation for more details.

We provide a training script run_clm.sh for training a 50M parameter model on the wikitext-103 dataset. You may run the script with:

```sh
bash run_clm.sh
```

See the script for more details. For pretraining on SlimPajama, please follow the instructions in tinyllama-zh and replace the dataset with SlimPajama.

Initializing from a Pretrained Model

We may initialize our LCKV model from a pretrained model. Most parts of the model structure are consistent with the standard transformer model and we can directly inherit the weights. For the KV weights $W_K, W_V$, we mainly have 2 options:

Option 1: Directly Copy the Weights

Simply add --model_name_or_path to the training script:

```sh
accelerate launch run_clm.py \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T \
    --config_name configs/tinyllama_lckv.json \
    ...
```

See the script for more details.
