PegaFlow
<div align="center"> <img src="./assets/logo.png" width="200" /> <p><strong><em>KV cache on the wings of Pegasus.</em></strong></p> </div>

PegaFlow is a high-performance KV cache storage engine for LLM inference. Offload KV cache from GPU to host memory or SSD, and share it across nodes via RDMA.
- Decoupled from inference lifecycle — runs as an independent sidecar; KV cache survives engine restarts, scales independently, and is shared across instances
- Topology-aware, PCIe-saturating transfers — NUMA-aware pinned memory + layer-wise DMA to maximize hardware bandwidth
- GIL-free Rust core — zero Python overhead on the hot path; your inference engine keeps its threads
- Production-ready observability — built-in Prometheus metrics and OTLP export, not an afterthought
- Pluggable — works with vLLM and SGLang as a drop-in KV connector
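Cross-request sharing works because KV cache blocks can be keyed by the token prefix that produced them: a block is reusable only if every preceding token matches. The sketch below is a hypothetical in-memory illustration of that idea, not PegaFlow's actual API; `PrefixKVStore`, its methods, and the block size are all invented for illustration.

```python
import hashlib

class PrefixKVStore:
    """Toy prefix-keyed KV block store (hypothetical; not PegaFlow's API)."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size        # tokens per KV block
        self.blocks: dict[str, bytes] = {}  # prefix hash -> KV block payload

    def _key(self, tokens: list[int], end: int) -> str:
        # Key each block by a hash of the ENTIRE token prefix up to `end`,
        # so a hit implies every earlier token matches too.
        return hashlib.sha256(str(tokens[:end]).encode()).hexdigest()

    def put(self, tokens: list[int], kv_blocks: list[bytes]) -> None:
        # Store one payload per full block of tokens.
        for i in range(len(tokens) // self.block_size):
            end = (i + 1) * self.block_size
            self.blocks[self._key(tokens, end)] = kv_blocks[i]

    def matched_prefix(self, tokens: list[int]) -> int:
        # Length (in tokens) of the longest cached prefix.
        matched = 0
        for i in range(len(tokens) // self.block_size):
            end = (i + 1) * self.block_size
            if self._key(tokens, end) not in self.blocks:
                break
            matched = end
        return matched

store = PrefixKVStore(block_size=4)
prompt = list(range(12))
store.put(prompt, [b"kv0", b"kv1", b"kv2"])
print(store.matched_prefix(prompt + [99]))  # same prefix -> all 12 tokens reused
print(store.matched_prefix([7] + prompt))   # different first token -> 0 reused
```

A shared store like this is why a second request with the same long prefill skips most of its prefill compute, which is what the warm-start benchmark below measures.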
Framework Integration
| Framework | Status | Link |
|-----------|--------|------|
| vLLM | ✅ Ready | Quick Start |
| SGLang | 🚧 Under Review | PR #17221 |
Quick Start
1. Install
uv pip install pegaflow-llm # CUDA 12
uv pip install pegaflow-llm-cu13 # CUDA 13
2. Start PegaFlow Server
pegaflow-server
3. Launch your inference engine
vLLM (recommended):
vllm serve Qwen/Qwen3-0.6B \
--kv-transfer-config '{"kv_connector": "PegaKVConnector", "kv_role": "kv_both", "kv_connector_module_path": "pegaflow.connector"}'
SGLang:
python3 -m sglang.launch_server \
--model-path Qwen/Qwen3-0.6B \
--enable-pegaflow
For full server options, multi-node setup, and advanced configuration, see Server Configuration.
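The `--kv-transfer-config` value in step 3 is a plain JSON object, so launch scripts can build it programmatically instead of hand-escaping the quoted string. A small sketch using only the fields shown in the Quick Start command:

```python
import json

# Fields taken from the vLLM Quick Start command above.
kv_transfer_config = {
    "kv_connector": "PegaKVConnector",
    "kv_role": "kv_both",
    "kv_connector_module_path": "pegaflow.connector",
}

# json.dumps produces the double-quoted JSON that vLLM expects;
# wrap it in single quotes when splicing into a shell command.
flag = f"--kv-transfer-config '{json.dumps(kv_transfer_config)}'"
print(flag)
```

Generating the flag this way avoids the most common failure mode: shell quoting that silently corrupts the JSON.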
Development
Build from source
export PYO3_PYTHON=$(which python)
export LD_LIBRARY_PATH=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))"):$LD_LIBRARY_PATH
cargo run -r # start server
cd python && maturin develop -r # build Python bindings
We use Conventional Commits — run `cz c` for an interactive commit prompt.
Benchmarks
KV Cache Benchmark
H800 reference numbers with Llama-3.1-8B (8 prompts, 10K-token prefill, 1-token decode, 4.0 req/s):
| Configuration | TTFT mean (ms) | TTFT p99 (ms) |
| --------------- | -------------- | ------------- |
| PegaFlow (Cold) | 572.5 | 1113.7 |
| PegaFlow (Warm) | 61.5 | 77.0 |
The warm-start path achieves ~9x faster TTFT compared to cold-start, demonstrating effective KV cache sharing across requests.
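The ~9x figure follows directly from the mean TTFT numbers in the table; the tail improvement is even larger:

```python
# TTFT numbers (ms) from the benchmark table above.
cold_mean, warm_mean = 572.5, 61.5
cold_p99, warm_p99 = 1113.7, 77.0

print(f"mean speedup: {cold_mean / warm_mean:.1f}x")  # -> 9.3x
print(f"p99 speedup:  {cold_p99 / warm_p99:.1f}x")    # -> 14.5x
```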
Documentation
- Server Configuration — full CLI options, SSD cache, multi-node setup
- P2P KV Cache Sharing — cross-node RDMA setup, tuning, and troubleshooting
- P/D Router — prefill/decode disaggregation
- vLLM I/O Patch — optional patch for better transfer throughput
- Metrics — Prometheus and OTLP metrics reference
- Goals & Non-Goals — project scope and design philosophy
