PegaFlow
<div align="center"> <img src="./assets/logo.png" width="200" /> <p><strong><em>KV cache on the wings of Pegasus.</em></strong></p> </div>

PegaFlow is a high-performance KV cache storage engine for LLM inference. Offload KV cache from GPU to host memory or SSD, and share it across nodes via RDMA.
- Decoupled from inference lifecycle — runs as an independent sidecar; KV cache survives engine restarts, scales independently, and is shared across instances
- Topology-aware, PCIe-saturating transfers — NUMA-aware pinned memory + layer-wise DMA to maximize hardware bandwidth
- GIL-free Rust core — zero Python overhead on the hot path; your inference engine keeps its threads
- Production-ready observability — built-in Prometheus metrics and OTLP export, not an afterthought
- Pluggable — works with vLLM and SGLang as a drop-in KV connector
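Cross-request sharing works because KV cache blocks can be keyed by the token prefix that produced them: a block is reusable only if every preceding token matches. The sketch below is a hypothetical in-memory illustration of that idea, not PegaFlow's actual API; `PrefixKVStore`, its methods, and the block size are all invented for illustration.

```python
import hashlib

class PrefixKVStore:
    """Toy prefix-keyed KV block store (hypothetical; not PegaFlow's API)."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size        # tokens per KV block
        self.blocks: dict[str, bytes] = {}  # prefix hash -> KV block payload

    def _key(self, tokens: list[int], end: int) -> str:
        # Key each block by a hash of the ENTIRE token prefix up to `end`,
        # so a hit implies every earlier token matches too.
        return hashlib.sha256(str(tokens[:end]).encode()).hexdigest()

    def put(self, tokens: list[int], kv_blocks: list[bytes]) -> None:
        # Store one payload per full block of tokens.
        for i in range(len(tokens) // self.block_size):
            end = (i + 1) * self.block_size
            self.blocks[self._key(tokens, end)] = kv_blocks[i]

    def matched_prefix(self, tokens: list[int]) -> int:
        # Length (in tokens) of the longest cached prefix.
        matched = 0
        for i in range(len(tokens) // self.block_size):
            end = (i + 1) * self.block_size
            if self._key(tokens, end) not in self.blocks:
                break
            matched = end
        return matched

store = PrefixKVStore(block_size=4)
prompt = list(range(12))
store.put(prompt, [b"kv0", b"kv1", b"kv2"])
print(store.matched_prefix(prompt + [99]))  # same prefix -> all 12 tokens reused
print(store.matched_prefix([7] + prompt))   # different first token -> 0 reused
```

A shared store like this is why a second request with the same long prefill skips most of its prefill compute, which is what the warm-start benchmark below measures.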
Framework Integration
| Framework | Status | Link |
|-----------|--------|------|
| vLLM | ✅ Ready | Quick Start |
| SGLang | 🚧 Under Review | PR #17221 |
Quick Start
1. Install
uv pip install pegaflow-llm # CUDA 12
uv pip install pegaflow-llm-cu13 # CUDA 13
2. Start PegaFlow Server
pegaflow-server
3. Launch your inference engine
vLLM (recommended):
vllm serve Qwen/Qwen3-0.6B \
--kv-transfer-config '{"kv_connector": "PegaKVConnector", "kv_role": "kv_both", "kv_connector_module_path": "pegaflow.connector"}'
SGLang:
python3 -m sglang.launch_server \
--model-path Qwen/Qwen3-0.6B \
--enable-pegaflow
For full server options, multi-node setup, and advanced configuration, see Server Configuration.
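The `--kv-transfer-config` value in step 3 is a plain JSON object, so launch scripts can build it programmatically instead of hand-escaping the quoted string. A small sketch using only the fields shown in the Quick Start command:

```python
import json

# Fields taken from the vLLM Quick Start command above.
kv_transfer_config = {
    "kv_connector": "PegaKVConnector",
    "kv_role": "kv_both",
    "kv_connector_module_path": "pegaflow.connector",
}

# json.dumps produces the double-quoted JSON that vLLM expects;
# wrap it in single quotes when splicing into a shell command.
flag = f"--kv-transfer-config '{json.dumps(kv_transfer_config)}'"
print(flag)
```

Generating the flag this way avoids the most common failure mode: shell quoting that silently corrupts the JSON.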
Development
Build from source
export PYO3_PYTHON=$(which python)
export LD_LIBRARY_PATH=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))"):$LD_LIBRARY_PATH
cargo run -r # start server
cd python && maturin develop -r # build Python bindings
We use Conventional Commits — run `cz c` for an interactive commit prompt.
Benchmarks
KV Cache Benchmark
H800 reference numbers with Llama-3.1-8B (8 prompts, 10K-token prefill, 1-token decode, 4.0 req/s):
| Configuration | TTFT mean (ms) | TTFT p99 (ms) |
| --------------- | -------------- | ------------- |
| PegaFlow (Cold) | 572.5 | 1113.7 |
| PegaFlow (Warm) | 61.5 | 77.0 |
The warm-start path achieves ~9x faster TTFT compared to cold-start, demonstrating effective KV cache sharing across requests.
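The ~9x figure follows directly from the mean TTFT numbers in the table; the tail improvement is even larger:

```python
# TTFT numbers (ms) from the benchmark table above.
cold_mean, warm_mean = 572.5, 61.5
cold_p99, warm_p99 = 1113.7, 77.0

print(f"mean speedup: {cold_mean / warm_mean:.1f}x")  # -> 9.3x
print(f"p99 speedup:  {cold_p99 / warm_p99:.1f}x")    # -> 14.5x
```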
Documentation
- Server Configuration — full CLI options, SSD cache, multi-node setup
- P2P KV Cache Sharing — cross-node RDMA setup, tuning, and troubleshooting
- P/D Router — prefill/decode disaggregation
- vLLM I/O Patch — optional patch for better transfer throughput
- Metrics — Prometheus and OTLP metrics reference
- Goals & Non-Goals — project scope and design philosophy
