Llumnix
About
Llumnix is a full-stack solution for distributed LLM inference serving. It has been a key part of the LLM serving infrastructure of Alibaba Cloud PAI-EAS, a cloud-native inference serving platform, supporting production-grade inference deployments.
Llumnix provides key functionalities for modern distributed serving deployments (e.g., PD disaggregation, wide EP), such as LLM-specialized request gateway, intelligent and dynamic scheduling, high-performance KV cache transfer/storage support, etc. With a scheduler + rescheduler architecture and white-box scheduling design, Llumnix achieves fully dynamic request scheduling and pushes the performance of inference engines to the limit.
Note that with this new repository, we are re-architecting Llumnix toward a more modular and cloud-native design (Llumnix v1). The old Ray-based architecture (Llumnix v0) remains a better choice for local deployments and for quick prototyping and experimentation with scheduling ideas.
Key Features
- Scheduler + rescheduler architecture for fully dynamic request scheduling: initial routing + continuous migration
- Advanced scheduling policies: load balancing, KV-aware, SLO/predictor-based scheduling, adaptive PD disaggregation, etc.
- Dual-mode scheduling
- Full mode (white-box) for max performance with engine participation
- Lite mode (black-box) for engine-transparent deployments
- Real-time instance status tracking for optimal scheduling quality
- Modular, extensible policy framework for easily implementing and composing scheduling policies
- LLM-specialized request gateway
- Tokenizers, diverse request routing / disaggregation protocols, batch inference
- Traffic management: splitting, mirroring, throttling, etc.
- High-performance KV cache support (see llumnix-kv)
- Efficient, flexible data plane for KV cache transfer supporting diverse cache layouts and transport protocols (blade-kvt)
- Unified control plane for PD disaggregation, migration, KV storage (hybrid-connector)
- High availability
- Fault tolerance for Llumnix components
- Engine health monitoring and reactive (re-)scheduling upon engine failures
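To make the "modular, extensible policy framework" idea concrete, here is a minimal, hypothetical sketch of what a pluggable scheduling policy could look like. The names below (`SchedulingPolicy`, `InstanceStatus`, `pick_instance`) are illustrative assumptions for this sketch, not Llumnix's actual API; see the documentation for the real interfaces.

```python
# Hypothetical sketch of pluggable scheduling policies in the spirit of
# Llumnix's policy framework. All names here are illustrative, not the
# real Llumnix API.
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceStatus:
    """Real-time status reported by one inference engine instance."""
    instance_id: str
    num_running_requests: int
    kv_cache_usage: float  # fraction of KV cache blocks in use, 0.0-1.0


class SchedulingPolicy:
    """Base class: a policy maps cluster state to a target instance."""
    def pick_instance(self, instances: List[InstanceStatus]) -> InstanceStatus:
        raise NotImplementedError


class LoadBalancingPolicy(SchedulingPolicy):
    """Route each new request to the least-loaded instance."""
    def pick_instance(self, instances: List[InstanceStatus]) -> InstanceStatus:
        return min(instances, key=lambda s: s.num_running_requests)


class KVAwarePolicy(SchedulingPolicy):
    """Prefer the instance with the most free KV cache capacity."""
    def pick_instance(self, instances: List[InstanceStatus]) -> InstanceStatus:
        return min(instances, key=lambda s: s.kv_cache_usage)


cluster = [
    InstanceStatus("i-0", num_running_requests=8, kv_cache_usage=0.9),
    InstanceStatus("i-1", num_running_requests=3, kv_cache_usage=0.4),
]
print(LoadBalancingPolicy().pick_instance(cluster).instance_id)  # i-1
print(KVAwarePolicy().pick_instance(cluster).instance_id)        # i-1
```

Because policies share one small interface, they can be swapped or composed (e.g. fall back to load balancing when KV information is stale) without touching the rest of the scheduler.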
Architecture
Llumnix is more than a "router". It has a full-stack design to support advanced scheduling features.
<div align="center"> <img src="docs/source/image/architecture.png" width="70%" /> </div>

Components:
- LlumSched: scheduler for initial scheduling and rescheduler for continuous rescheduling
- Llumlet: an engine-side process that bridges global components and the inference engine
- Cluster meta store: tracks real-time instance status
- Engine: the inference engine (vLLM/SGLang) with Llumnix utility codes for scheduling enhancements (if using full mode)
- Gateway: LLM-specialized capabilities, such as tokenizers, routing protocols, traffic management, batch inference
- Hybrid Connector: unified KV cache control plane, using blade-kvt for KV transfer and external KV storage for offloading
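The "initial routing + continuous migration" split above can be illustrated with a small, hypothetical rescheduler pass that proposes live request migrations whenever the load gap between instances grows too large. The function and field names here (`propose_migrations`, `Instance.requests`, `threshold`) are assumptions made up for this sketch, not Llumnix internals.

```python
# Hypothetical sketch of one rescheduler pass: repeatedly move a request
# from the most-loaded to the least-loaded instance until the load gap
# falls within a threshold. Names are illustrative, not Llumnix's API.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Instance:
    instance_id: str
    requests: List[str] = field(default_factory=list)


def propose_migrations(
    instances: List[Instance], threshold: int = 2
) -> List[Tuple[str, str, str]]:
    """Return (request_id, src_id, dst_id) moves that rebalance load."""
    moves = []
    while True:
        src = max(instances, key=lambda i: len(i.requests))
        dst = min(instances, key=lambda i: len(i.requests))
        if len(src.requests) - len(dst.requests) <= threshold:
            break  # cluster is balanced enough; stop migrating
        req = src.requests.pop()
        dst.requests.append(req)
        moves.append((req, src.instance_id, dst.instance_id))
    return moves


insts = [Instance("a", ["r1", "r2", "r3", "r4", "r5"]), Instance("b", ["r6"])]
print(propose_migrations(insts))  # [('r5', 'a', 'b')]
```

In a real deployment each proposed move would also account for migration cost (KV cache transfer over the data plane) rather than request counts alone.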
Getting Started
View our documentation to learn more.
License
Llumnix is licensed under the Apache 2.0 License.