
# kvcached

Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond

<div align="center"> <img src="https://raw.githubusercontent.com/ovg-project/kvcached/refs/heads/main/assets/logo-v2.svg" alt="kvcached logo" height="96" /> <br> <br> <p> <a href="https://www.python.org/"><img alt="Python" src="https://img.shields.io/badge/Python-3.9%E2%80%933.13-blue"></a> <img alt="Engines" src="https://img.shields.io/badge/Engines-SGLang%20%7C%20vLLM-blueviolet"> <a href="https://yifanqiao.notion.site/Solve-the-GPU-Cost-Crisis-with-kvcached-289da9d1f4d68034b17bf2774201b141"><img alt="Blog" src="https://img.shields.io/badge/Blog-Read-FF5722?logo=rss&logoColor=white&labelColor=555555"></a> <a href="https://arxiv.org/abs/2508.08448"><img alt="arXiv: GPU OS vision" src="https://img.shields.io/badge/arXiv-GPU%20OS%20vision-b31b1b?logo=arxiv&logoColor=white&labelColor=555555"></a> <br> <a href="https://arxiv.org/abs/2505.04021"><img alt="arXiv: Multi LLM Serving" src="https://img.shields.io/badge/arXiv-Multi%20LLM%20Serving-b31b1b?logo=arxiv&logoColor=white&labelColor=555555"></a> <a href="https://join.slack.com/t/ovg-project/shared_invite/zt-3fr01t8s7-ZtDhHSJQ00hcLHgwKx3Dmw"><img alt="Slack Join" src="https://img.shields.io/badge/Slack-Join-4A154B?logo=slack&logoColor=white&labelColor=555555"></a> <a href="https://deepwiki.com/ovg-project/kvcached"><img alt="DeepWiki" src="https://img.shields.io/badge/DeepWiki-Docs-6B46C1?logo=book&logoColor=white&labelColor=555555"></a> <a href="LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-blue.svg"></a> </p> </div> <h2 align="center">Make GPU Sharing Flexible and Easy </h2> <p align="center"> <img src="https://raw.githubusercontent.com/ovg-project/kvcached/refs/heads/main/assets/ads.jpg" alt="Make GPU Sharing Flexible and Easy" width="500" /> </p>

kvcached (KV cache daemon) is a KV cache library for LLM serving/training on shared GPUs. By bringing OS-style virtual memory abstraction to LLM systems, it enables elastic and demand-driven KV cache allocation, improving GPU utilization under dynamic workloads.

kvcached achieves this by decoupling GPU virtual addressing from physical memory allocation for KV caches. It allows serving engines to initially reserve virtual memory only and later back it with physical GPU memory when the cache is actively used. This decoupling enables on-demand allocation and flexible sharing, bringing better GPU memory utilization under dynamic and mixed workloads. Check out more details in the blog.
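The reserve-then-map idea can be illustrated with a short Python sketch. This is a toy model, not kvcached's actual API: a cache reserves a large virtual page range up front, and physical pages are attached only when a virtual page is first written, so multiple caches can share a small physical pool.

```python
# Toy model of the reserve-then-map idea (not kvcached's real API):
# virtual pages are reserved cheaply; physical pages are mapped on demand.

class PhysicalPool:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))

    def allocate(self):
        return self.free_pages.pop()

    def free(self, page):
        self.free_pages.append(page)


class ElasticKVCache:
    def __init__(self, num_virtual_pages):
        # Reserving virtual pages consumes no physical memory.
        self.num_virtual_pages = num_virtual_pages
        self.page_table = {}  # virtual page -> physical page id

    def write(self, vpage, pool):
        if vpage >= self.num_virtual_pages:
            raise IndexError("outside reserved virtual range")
        if vpage not in self.page_table:
            # Back the virtual page with physical memory on first use.
            self.page_table[vpage] = pool.allocate()
        return self.page_table[vpage]

    def reclaim(self, vpage, pool):
        # Unmap and return the physical page so other models can use it.
        pool.free(self.page_table.pop(vpage))


pool = PhysicalPool(num_pages=4)
cache = ElasticKVCache(num_virtual_pages=1024)  # large virtual reservation
cache.write(0, pool)
cache.write(512, pool)
print(len(pool.free_pages))  # -> 2: only two physical pages actually used
cache.reclaim(512, pool)
print(len(pool.free_pages))  # -> 3: reclaimed page is available again
```

The real system performs the mapping with GPU virtual memory primitives rather than a Python dict, but the allocation pattern — large virtual reservation, lazy physical backing, explicit reclaim — is the same.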

<!-- <p align="center"> <img src="https://raw.githubusercontent.com/ovg-project/kvcached/refs/heads/main/assets/vmm_v2.svg" alt="kvcached virtual memory model" width="600" /> </p> --> <h3 align="left">Key Features</h3>
  • Elastic KV cache: allocate and reclaim KV memory dynamically to match live load.
  • GPU virtual memory: decouple logical KV from physical GPU memory via runtime mapping.
  • Memory control CLI: enforce memory limits with kvcached CLI.
  • Frontend router and sleep mode: route requests to the target models and put models to sleep when idle.
  • Support mainstream serving engines: integrate with SGLang and vLLM.

📢 Updates

  • [2026-03] Added pipeline parallelism support. MLA models (DeepSeek-V2, DeepSeek-V3, etc.) and GPT-OSS hybrid attention models (openai/gpt-oss-20b) are now also supported in vLLM. GPT-OSS support in SGLang updated to v0.5.9.

  • [2026-02] kvcached now supports vLLM v0.16.0 and SGLang v0.5.9. MLA models (DeepSeek-V2, DeepSeek-V3, etc.) are supported in SGLang with both page_size=1 and page_size>1. GPT-OSS hybrid attention models (openai/gpt-oss-20b) are supported in SGLang.

Supported engines and models

| Engine | Versions | Attention types | Example models |
|--------|----------|-----------------|----------------|
| SGLang | ≥ v0.4.9 (tested up to v0.5.9) | MHA / GQA / MLA | Llama 3.1/3.3, Qwen 2.5, DeepSeek-V3, openai/gpt-oss-20b, etc. |
| vLLM | ≥ v0.8.4 (tested up to v0.16.0) | MHA / GQA / MLA | Llama 3.1/3.3, Qwen 2.5, DeepSeek-V3, openai/gpt-oss-20b |

Example use cases

<div align="center"> <table border="0" cellspacing="0" cellpadding="0" style="border: none; border-collapse: collapse; width: auto;"> <tr> <td align="left" style="border: none; vertical-align: middle; width: 196px;"> <img src="https://raw.githubusercontent.com/ovg-project/kvcached/refs/heads/main/assets/uc-multillm.svg" alt="Multi‑LLM serving" width="196" /> </td> <td align="left" style="border: none; vertical-align: middle; padding-left: 8px;"> <b>Multi‑LLM serving</b><br>kvcached allows multiple LLMs to share a GPU's memory elastically, enabling concurrent deployment without the rigid memory partitioning used today. This improves GPU utilization and saves serving costs. </td> </tr> <tr> <td align="left" style="border: none; vertical-align: middle; width: 196px;"> <img src="https://raw.githubusercontent.com/ovg-project/kvcached/refs/heads/main/assets/uc-serverless.svg" alt="Serverless LLM" width="196" /> </td> <td align="left" style="border: none; vertical-align: middle; padding-left: 8px;"> <b>Serverless LLM</b><br>By allocating KV cache only when needed, kvcached supports serverless deployments where models can spin up and down on demand. </td> </tr> <tr> <td align="left" style="border: none; vertical-align: middle; width: 196px;"> <img src="https://raw.githubusercontent.com/ovg-project/kvcached/refs/heads/main/assets/uc-compound.svg" alt="Compound AI systems" width="196" /> </td> <td align="left" style="border: none; vertical-align: middle; padding-left: 8px;"> <b>Compound AI systems</b><br>kvcached makes compound AI systems practical on limited hardware by elastically allocating memory across specialized models in a pipeline (e.g., retrieval, reasoning, and summarization). 
</td> </tr> <tr> <td align="left" style="border: none; vertical-align: middle; width: 196px;"> <img src="https://raw.githubusercontent.com/ovg-project/kvcached/refs/heads/main/assets/uc-colocate.svg" alt="GPU workload colocation" width="196" /> </td> <td align="left" style="border: none; vertical-align: middle; padding-left: 8px;"> <b>GPU workload colocation</b><br>kvcached allows LLM inference to coexist with other GPU workloads such as training jobs, fine-tuning, or vision models. </td> </tr> </table> </div>

See concrete examples here: kvcached/examples.

kvcached in action

The following simple example shows how kvcached enables an unmodified vLLM engine to run with dynamically allocated KV cache memory.

<p align="center"> <img src="https://raw.githubusercontent.com/ovg-project/kvcached/refs/heads/main/assets/kvcached-example.gif" alt="kvcached in action" width="90%"> </p>

Performance: Multi-LLM serving

kvcached enables dynamic memory sharing between LLMs, allowing them to share the same GPU memory elastically. By contrast, current serving engines must statically reserve GPU memory at startup.

This benchmark shows the performance benefits of kvcached when serving three Llama-3.1-8B models on an A100-80G GPU under workloads with intermittent peaks. kvcached achieves a 2-28x reduction in time-to-first-token (TTFT) compared to current serving engines. This performance gain translates into significant cost savings for LLM serving: without kvcached, systems must provision more GPUs to achieve the same performance. Details can be found in benchmarks/bench_latency_benefit.
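For readers reproducing such benchmarks, mean and p99 TTFT can be computed from per-request latency samples as below. The sample values are made up for illustration, not measured data, and the percentile uses the simple nearest-rank definition (benchmark harnesses may interpolate instead).

```python
# Illustrative computation of mean and p99 TTFT from latency samples.
# The sample values below are invented, not benchmark results.

def percentile(samples, p):
    """Nearest-rank percentile, p in [0, 100]."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

ttft_seconds = [0.21, 0.25, 0.24, 0.30, 1.80, 0.22, 0.27, 2.40, 0.23, 0.26]

mean_ttft = sum(ttft_seconds) / len(ttft_seconds)
p99_ttft = percentile(ttft_seconds, 99)
print(f"mean TTFT: {mean_ttft:.2f}s, p99 TTFT: {p99_ttft:.2f}s")
```

Note how a few slow requests during load spikes dominate the p99 while barely moving the mean, which is why the benchmark reports both.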

<p align="center"> <img src="https://raw.githubusercontent.com/ovg-project/kvcached/refs/heads/main/assets/ttft_results/ttft_mean.svg" alt="TTFT mean" width="49%" /> <img src="https://raw.githubusercontent.com/ovg-project/kvcached/refs/heads/main/assets/ttft_results/ttft_p99.svg" alt="TTFT p99" width="49%" /> </p>

Installation

Prerequisites

  • Python (tested with 3.9 - 3.13)
  • SGLang (tested with v0.5.9) or vLLM (tested with v0.16.0)

kvcached can be installed as a plugin into an existing SGLang or vLLM environment.

Install from PyPI

pip install kvcached --no-build-isolation

Install from source

# under the project root folder

pip install -e . --no-build-isolation --no-cache-dir
python tools/dev_copy_pth.py

Using Docker

We provide Docker images that bundle kvcached with the original engine images:

docker pull ghcr.io/ovg-project/kvcached-sglang:latest   # kvcached-v0.1.4-sglang-v0.5.9
docker pull ghcr.io/ovg-project/kvcached-vllm:latest     # kvcached-v0.1.4-vllm-v0.16.0

We also provide an all-in-one image for developers:

docker pull ghcr.io/ovg-project/kvcached-dev:latest

More instructions can be found here. GB200 dockers are on the way.

Documentation

kvcached is indexed on DeepWiki for LLM-powered documentation.

The documentation covers:

  • Core architecture and memory management system
  • Integration with vLLM and SGLang
  • Multi-model serving and controller system
  • Deployment guides and configuration reference
  • Performance benchmarking and analysis
  • Development tools and testing

Testing

kvcached can be enabled by setting the following environment variables:

export ENABLE_KVCACHED=true
export KVCACHED_AUTOPATCH=1
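The same switches can be set programmatically when launching an engine from Python. A minimal sketch, assuming you spawn the server as a subprocess (the engine command line in the comment mirrors the SGLang example below and is illustrative):

```python
import os

def kvcached_env(base_env=None):
    """Return a copy of the environment with kvcached autopatch enabled."""
    env = dict(base_env if base_env is not None else os.environ)
    env["ENABLE_KVCACHED"] = "true"
    env["KVCACHED_AUTOPATCH"] = "1"
    return env

# Example (not executed here): pass the env when spawning the engine, e.g.
# subprocess.Popen(
#     ["python", "-m", "sglang.launch_server",
#      "--model", "meta-llama/Llama-3.2-1B-Instruct",
#      "--disable-radix-cache", "--port", "30000"],
#     env=kvcached_env(),
# )

env = kvcached_env({})
print(env["ENABLE_KVCACHED"], env["KVCACHED_AUTOPATCH"])
```

Passing a complete environment dict (rather than only the two variables) matters: `subprocess` replaces the child's environment wholesale, so the engine still needs `PATH`, CUDA variables, and the rest of the parent environment.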

If you are using the engine-specific dockers, you can test kvcached by running the original engines' benchmark scripts. For example:

# for sglang
python -m sglang.launch_server --model meta-llama/Llama-3.2-1B-Instruct --disable-radix-cache --port 30000
python -m sglang.b