Llumnix
About
Llumnix is a full-stack solution for distributed LLM inference serving. It has been a key part of the LLM serving infrastructure of Alibaba Cloud PAI-EAS, a cloud-native inference serving platform, supporting production-grade inference deployments.
Llumnix provides key functionalities for modern distributed serving deployments (e.g., PD disaggregation, wide EP), such as LLM-specialized request gateway, intelligent and dynamic scheduling, high-performance KV cache transfer/storage support, etc. With a scheduler + rescheduler architecture and white-box scheduling design, Llumnix achieves fully dynamic request scheduling and pushes the performance of inference engines to the limit.
Note that with this new repository, we are re-architecting Llumnix toward a more modular and cloud-native design (Llumnix v1). The old Ray-based architecture (Llumnix v0) remains a better choice for local deployments and for quick prototyping and experimentation with scheduling ideas.
Key Features
- Scheduler + rescheduler architecture for fully dynamic request scheduling: initial routing + continuous migration
- Advanced scheduling policies: load balancing, KV-aware, SLO/predictor-based scheduling, adaptive PD disaggregation, etc.
- Dual-mode scheduling
- Full mode (white-box) for max performance with engine participation
- Lite mode (black-box) for engine-transparent deployments
- Real-time instance status tracking for optimal scheduling quality
- Modular, extensible policy framework for easily implementing and composing scheduling policies
- LLM-specialized request gateway
- Tokenizers, diverse request routing / disaggregation protocols, batch inference
- Traffic management: splitting, mirroring, throttling, etc.
- High-performance KV cache support (see llumnix-kv)
- Efficient, flexible data plane for KV cache transfer supporting diverse cache layouts and transport protocols (blade-kvt)
- Unified control plane for PD disaggregation, migration, KV storage (hybrid-connector)
- High availability
- Fault tolerance for Llumnix components
- Engine health monitoring and reactive (re-)scheduling upon engine failures
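To make the "modular, extensible policy framework" idea concrete, here is a minimal, hypothetical sketch of what a pluggable scheduling policy could look like. The names below (`SchedulingPolicy`, `InstanceStatus`, `pick_instance`) are illustrative assumptions for this sketch, not Llumnix's actual API; see the documentation for the real interfaces.

```python
# Hypothetical sketch of pluggable scheduling policies in the spirit of
# Llumnix's policy framework. All names here are illustrative, not the
# real Llumnix API.
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceStatus:
    """Real-time status reported by one inference engine instance."""
    instance_id: str
    num_running_requests: int
    kv_cache_usage: float  # fraction of KV cache blocks in use, 0.0-1.0


class SchedulingPolicy:
    """Base class: a policy maps cluster state to a target instance."""
    def pick_instance(self, instances: List[InstanceStatus]) -> InstanceStatus:
        raise NotImplementedError


class LoadBalancingPolicy(SchedulingPolicy):
    """Route each new request to the least-loaded instance."""
    def pick_instance(self, instances: List[InstanceStatus]) -> InstanceStatus:
        return min(instances, key=lambda s: s.num_running_requests)


class KVAwarePolicy(SchedulingPolicy):
    """Prefer the instance with the most free KV cache capacity."""
    def pick_instance(self, instances: List[InstanceStatus]) -> InstanceStatus:
        return min(instances, key=lambda s: s.kv_cache_usage)


cluster = [
    InstanceStatus("i-0", num_running_requests=8, kv_cache_usage=0.9),
    InstanceStatus("i-1", num_running_requests=3, kv_cache_usage=0.4),
]
print(LoadBalancingPolicy().pick_instance(cluster).instance_id)  # i-1
print(KVAwarePolicy().pick_instance(cluster).instance_id)        # i-1
```

Because policies share one small interface, they can be swapped or composed (e.g. fall back to load balancing when KV information is stale) without touching the rest of the scheduler.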
Architecture
Llumnix is more than a "router". It has a full-stack design to support advanced scheduling features.
<div align="center"> <img src="docs/source/image/architecture.png" width="70%" /> </div>

Components:
- LlumSched: scheduler for initial scheduling and rescheduler for continuous rescheduling
- Llumlet: an engine-side process that bridges global components and the inference engine
- Cluster meta store: tracks real-time instance status
- Engine: the inference engine (vLLM/SGLang) with Llumnix utility codes for scheduling enhancements (if using full mode)
- Gateway: LLM-specialized capabilities, such as tokenizers, routing protocols, traffic management, batch inference
- Hybrid Connector: unified KV cache control plane, using blade-kvt for KV transfer and external KV storage for offloading
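The "initial routing + continuous migration" split above can be illustrated with a small, hypothetical rescheduler pass that proposes live request migrations whenever the load gap between instances grows too large. The function and field names here (`propose_migrations`, `Instance.requests`, `threshold`) are assumptions made up for this sketch, not Llumnix internals.

```python
# Hypothetical sketch of one rescheduler pass: repeatedly move a request
# from the most-loaded to the least-loaded instance until the load gap
# falls within a threshold. Names are illustrative, not Llumnix's API.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Instance:
    instance_id: str
    requests: List[str] = field(default_factory=list)


def propose_migrations(
    instances: List[Instance], threshold: int = 2
) -> List[Tuple[str, str, str]]:
    """Return (request_id, src_id, dst_id) moves that rebalance load."""
    moves = []
    while True:
        src = max(instances, key=lambda i: len(i.requests))
        dst = min(instances, key=lambda i: len(i.requests))
        if len(src.requests) - len(dst.requests) <= threshold:
            break  # cluster is balanced enough; stop migrating
        req = src.requests.pop()
        dst.requests.append(req)
        moves.append((req, src.instance_id, dst.instance_id))
    return moves


insts = [Instance("a", ["r1", "r2", "r3", "r4", "r5"]), Instance("b", ["r6"])]
print(propose_migrations(insts))  # [('r5', 'a', 'b')]
```

In a real deployment each proposed move would also account for migration cost (KV cache transfer over the data plane) rather than request counts alone.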
Getting Started
View our documentation to learn more.
License
Llumnix is licensed under the Apache 2.0 License.