KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction [NeurIPS'25 Oral]

[Paper] [Blog]

<img src="./images/method.png" width="800">

News

  • 01/2026: We've released Fast KVzip, which eliminates compression overhead and enhances both prefill and decoding efficiency!
  • 09/2025: 🎉 KVzip has been accepted at NeurIPS 2025 as an Oral Presentation!
  • 07/2025: NVIDIA KVpress adds support for KVzip (see also Leaderboard).
  • 07/2025: KVzip is presented at the ES-FoMo III ICML Workshop.
  • 05/2025: arXiv preprint is released.

Highlights

  • KVzip compresses the KV cache once so that it can serve diverse future queries (a toy sketch of the scoring idea follows this list).
  • [Context-dependent] Achieves a 3–4× reduction in KV cache size and a 2× decrease in decoding latency, with minimal performance degradation.
  • [Context-independent] Enhances DuoAttention-style head-level KV compression, optimizing head-level importance scores with only a few forward passes in under a minute (roughly 100× faster than DuoAttention's original optimization).
  • Run demo.py: <img src="./images/demo.png" width="800">
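
The importance scores come from a context-reconstruction pass (see the method figure above). As a toy illustration, assuming max-over-steps attention as the importance signal, a simplification of the paper's criterion rather than KVzip's actual code:

import torch

# Toy sketch (assumed simplification, not KVzip's code): score each cached
# KV position by the maximum attention it receives during a context-
# reconstruction pass, then keep only the top-scoring fraction.
attn = torch.rand(128, 512)        # [reconstruction steps, cached KV positions]
scores = attn.max(dim=0).values    # importance per cached KV position
ratio = 0.3                        # keep 30% of the cache
keep = scores.topk(int(ratio * scores.numel())).indices
print(f"kept {keep.numel()} of {scores.numel()} KV positions")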

Benchmarking in a query-agnostic setting

<img src="./images/benchmark.png" width="800">

Installation

We used CUDA 12.1 and Python 3.10.

cd KVzip
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
make i
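
Optionally, a quick sanity check that the environment is usable (assuming requirements.txt installs torch, which flash-attn requires):

import torch
import flash_attn

# KVzip's kernels and flash-attn require a CUDA-capable GPU.
assert torch.cuda.is_available(), "CUDA GPU required"
print(torch.__version__, flash_attn.__version__)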

Quick Start

from model import ModelKVzip

model = ModelKVzip("Qwen/Qwen2.5-7B-Instruct-1M")
context = "This is my basic profile. My name is Kim living in Seoul. My major is computer science."
queries = ["What is my name?", "Do I live in Seoul?"]

kv = model.prefill(context, load_score=False)  # prefill KV cache + importance scoring
kv.prune(ratio=0.3)  # compression ratio 0.3: keep 30% of KV pairs, evict 70%

for q in queries:
    query_ids = model.apply_template(q)
    output = model.generate(query_ids, kv=kv, update_cache=False)  # efficient inference
    print(q, output)
  • Supported models are listed in model/load.py, including LLaMA3, Qwen2.5/3, Gemma3.
  • Set load_score=True to eliminate compression overhead. This enables context-independent KV eviction, at the cost of a lower compression ratio (ratio=0.6).
  • By default (update_cache=False), KV pairs corresponding to the queries and generated tokens are selectively evicted from the cache after generation, keeping the cache ready for further queries. Set update_cache=True to enable multi-turn inference, retaining the full interaction history throughout inference (see the sketch below).
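
Continuing from the snippet above (reusing model and context), a minimal multi-turn sketch:

kv = model.prefill(context, load_score=False)
kv.prune(ratio=0.3)

# update_cache=True appends each query and response to the cache,
# so later turns can reference the full interaction history.
for q in ["What is my name?", "What did I just ask you?"]:
    query_ids = model.apply_template(q)
    output = model.generate(query_ids, kv=kv, update_cache=True)
    print(q, output)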

Profiling Memory and Computation Time

Context-dependent eviction

python -B test.py -m [model_name] -d [data_name] --kv_type evict --ratio 0.3
  • The code above also compares outputs generated with the full versus the pruned KV cache (a rough Python equivalent is sketched after this list).
  • To quickly test, use -d squad. For long-context testing, use -d scbench_kv.
    • Available data names: data/load.py.
    • Available model names: model/load.py, e.g., llama3.1-8b, qwen2.5-7b (or Qwen/Qwen2.5-7B-Instruct-1M).
  • We adapt the CUDA kernel from AdaKV to support non-uniform head budget allocation.
    • Our code currently lacks an optimized kernel for Gemma3, which uses a static KV cache, so eviction does not yield actual efficiency gains for that model. Model quality can still be evaluated using reduced attention with KV subsampling (--kv_type retain).
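
A rough Python equivalent of the comparison test.py performs, written with the Quick Start API (the context string and query are placeholders):

from model import ModelKVzip

model = ModelKVzip("Qwen/Qwen2.5-7B-Instruct-1M")
context = "..."  # a long document
query_ids = model.apply_template("What does the document say?")

kv = model.prefill(context, load_score=False)
full_output = model.generate(query_ids, kv=kv, update_cache=False)

kv.prune(ratio=0.3)  # keep 30% of the context KV pairs
pruned_output = model.generate(query_ids, kv=kv, update_cache=False)
print("outputs match:", full_output == pruned_output)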

Context-independent eviction (no runtime compression overhead)

  • Use the --level head flag with --ratio 0.6 (recommended); a Python sketch follows this list.
    • For each evicted head, we remove all of its context KV pairs while retaining the system-prompt and query KV pairs.
    • Precomputed head scores are available for LLaMA3.1-8B and Qwen2.5-7/14B in ./utils/head_score.
  • To compute head scores for other models:
    python -B test.py -m [model_name] -d scbench_qa_eng --save_head_score
    
    • Results will be saved in ./utils/head_score.
    • If targeting a coding task, we recommend additionally running the command with -d scbench_repoqa. The model then uses the maximum head score across natural-language and code data, which improves performance.
  • These scores can be seamlessly integrated with DuoAttention's optimized inference engine by replacing their head score data with ours.
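
In the Python API, context-independent eviction reduces to the Quick Start flow with load_score=True (a minimal sketch; the model choice and context are illustrative):

from model import ModelKVzip

# load_score=True loads precomputed head scores (./utils/head_score)
# instead of running a scoring pass, so compression adds no runtime overhead.
model = ModelKVzip("Qwen/Qwen2.5-7B-Instruct-1M")  # precomputed scores available
context = "..."  # your long context
kv = model.prefill(context, load_score=True)
kv.prune(ratio=0.6)  # recommended ratio for head-level eviction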

Evaluation

  • To generate model responses with KV compression ratios ranging from 0.1 to 1.0:
    python -B eval.py -m [model_name] -d [data_name] --kv_type retain --num 100
    
    • Results will be saved in ./results/[data_name].
    • Supported datasets are listed in data/load.py.
      • We used "scbench_{kv,prefix_suffix}_short" for Gemma3-12B evaluation (max_token_length issue).
      • We used "scbench_{kv,prefix_suffix,mf}_tiny" and "scbench_repoqa_short" for LLaMA3-8B-W8A8KV4 evaluation (lack of the base model capability issue).
  • To compute evaluation metrics from generated results:
    python -B -m results.parse -m [model_name] -d [data_name]
    

Applying to New Models

To integrate KVzip with a new model, update the following files:

  • attention/attn.py
    Modify the attention forward pass logic as needed. In certain cases, updates to kvcache.py and score.py may also be required.
  • model/monkeypatch.py
    Implement model-specific monkey patching for integration.
  • model/template.py
    Define the model's system prompt and chat formatting templates.
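
For orientation, here is a generic illustration of the monkey-patching pattern on a Hugging Face transformers model; this is not KVzip's actual code, and model/monkeypatch.py differs in detail:

from transformers.models.llama import modeling_llama

# Keep a handle to the original forward, then swap in a wrapper.
_orig_forward = modeling_llama.LlamaAttention.forward

def patched_forward(self, *args, **kwargs):
    # Custom logic would go here, e.g., capturing the statistics needed
    # for KV importance scoring, before deferring to the original pass.
    return _orig_forward(self, *args, **kwargs)

modeling_llama.LlamaAttention.forward = patched_forward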

Citation

@article{kim2025kvzip,
  title={KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction},
  author={Kim, Jang-Hyun and Kim, Jinuk and Kwon, Sangwoo and Lee, Jae W and Yun, Sangdoo and Song, Hyun Oh},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}

License

MIT License
