# Kthena

<p align="center"> <img src="docs/proposal/images/kthena-arch.svg" alt="Kthena Architecture" width="800"/> </p>
<p align="center"> <strong>The Enterprise-Grade LLM Serving Platform That Makes AI Infrastructure Simple, Scalable, and Cost-Efficient</strong> </p>
<p align="center"> | <a href="https://kthena.volcano.sh/">Documentation</a> | <a href="https://kthena.volcano.sh/blog">Blog</a> | <a href="#">White Paper</a> | <a href="#">Slack</a> | </p>

## Overview
Kthena is a Kubernetes-native LLM inference platform that transforms how organizations deploy and manage Large Language Models in production. Built with declarative model lifecycle management and intelligent request routing, it provides high performance and enterprise-grade scalability for LLM inference workloads.
The platform extends Kubernetes with purpose-built Custom Resource Definitions (CRDs) for managing LLM workloads, supporting multiple inference engines (vLLM, SGLang, Triton) and advanced serving patterns like prefill-decode disaggregation. Kthena's architecture separates control plane operations (model lifecycle, autoscaling policies) from data plane traffic routing through an intelligent router, enabling teams to manage complex LLM deployments with familiar cloud-native patterns while delivering cost-driven autoscaling, heterogeneous accelerator support, and multi-backend inference engines.
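To convey the declarative style, here is a minimal sketch of what a model-serving resource could look like. The resource kind, API group, and field names below are hypothetical, shown only for illustration; consult the Kthena documentation for the real CRD schema.

```yaml
# Hypothetical manifest -- kind, API group, and field names are illustrative,
# NOT the actual Kthena API. See the Kthena docs for the real schema.
apiVersion: serving.volcano.sh/v1alpha1   # assumed API group
kind: ModelServing                        # hypothetical resource kind
metadata:
  name: llama3-8b
spec:
  engine: vllm                  # one of the supported engines: vLLM, SGLang, Triton
  model:
    source: huggingface://meta-llama/Meta-Llama-3-8B-Instruct  # illustrative URI
  replicas: 2
  resources:
    limits:
      nvidia.com/gpu: 1
```

The point is the workflow, not the exact fields: the controller reconciles such a resource into running inference pods, so teams manage models with `kubectl apply` like any other Kubernetes workload.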
## Key Features

### Production-Ready LLM Serving

Deploy and scale Large Language Models with enterprise-grade reliability, supporting vLLM, SGLang, Triton, and TorchServe inference engines through consistent Kubernetes-native APIs.

### Simplified LLM Management
- Prefill-Decode Disaggregation: Separate compute-intensive prefill operations from token-generation decode processes to optimize hardware utilization and meet latency-based SLOs
- Cost-Driven Autoscaling: Intelligent scaling based on multiple metrics (CPU, GPU, memory, custom) with configurable budget constraints and cost-optimization policies
- Zero-Downtime Updates: Rolling model updates with configurable strategies
- Dynamic LoRA Management: Hot-swap adapters without service interruption
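A policy combining cost-driven autoscaling with a zero-downtime rollout might be expressed along these lines. This is a sketch only: every field name here is hypothetical, chosen to illustrate the concepts above rather than the actual Kthena API.

```yaml
# Hypothetical policy fragment -- field names are illustrative only,
# NOT the actual Kthena API.
spec:
  autoscaling:
    minReplicas: 1
    maxReplicas: 8
    metrics:
      - type: gpu               # scale on GPU utilization
        targetUtilization: 70
    budget:
      maxHourlyCost: 20         # cost ceiling the scaler must respect
  updateStrategy:
    type: RollingUpdate         # zero-downtime rollout
    maxUnavailable: 0           # never take a serving replica down early
```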
### Built-in Network Topology-Aware Scheduling
Network topology-aware scheduling places inference instances within the same network domain to maximize inter-instance communication bandwidth and enhance inference performance.
### Built-in Gang Scheduling
Gang scheduling ensures atomic scheduling of distributed inference groups like xPyD, preventing resource waste from partial deployments.
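This maps naturally onto Volcano's PodGroup abstraction, where `minMember` gives all-or-nothing admission. A minimal sketch for a 1-prefill/2-decode (1P2D) group, with an illustrative name:

```yaml
# A Volcano PodGroup: the scheduler admits the group only when all
# minMember pods can be placed at once, so a partial xPyD deployment
# never holds accelerators idle.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llama3-1p2d        # illustrative name for a 1-prefill/2-decode group
spec:
  minMember: 3             # gang size: schedule all 3 pods or none
```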
### Intelligent Routing & Traffic Control
- Multi-model routing with pluggable load-balancing algorithms, including model-load-aware and KV-cache-aware strategies
- PD-group-aware request distribution for xPyD (x-prefill/y-decode) deployment patterns
- Rich traffic policies, including canary releases, weighted traffic distribution, token-based rate limiting, and automated failover
- LoRA-adapter-aware routing, so adapters can be switched without an inference outage
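A canary release with token-based rate limiting, as described above, might be declared roughly like this. All field names are hypothetical and serve only to illustrate the traffic-control concepts; the real router configuration is defined in the Kthena documentation.

```yaml
# Hypothetical routing-policy fragment -- field names are illustrative,
# NOT the actual Kthena router API.
spec:
  model: llama3-8b
  loadBalancing: kv-cache-aware     # or: model-load-aware, round-robin
  trafficPolicy:
    canary:
      stableRevision: v1
      canaryRevision: v2
      weight: 10                    # send 10% of traffic to the v2 canary
    rateLimit:
      tokensPerMinute: 600000       # token-based, not request-based, limiting
```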
## Architecture
Kthena implements a Kubernetes-native architecture with separate control plane and data plane components, each of which can be deployed and used independently. The platform manages LLM inference workloads through CRDs and provides intelligent request routing through a dedicated router. It contains the following key components:
- Kthena-controller-manager: The control plane component governing the LLM inference lifecycle. It continuously reconciles Kthena CRDs to deploy, scale, and upgrade inference replicas across the cluster while exposing advanced scheduling policies that integrate directly with the Volcano scheduler.
- Kthena-router: The data plane entry point for inference traffic. It classifies each request by model name, custom headers, or URI patterns, then applies load-balancing policies and traffic controls to dispatch the request to the right inference instance. It natively supports prefill-decode disaggregated routing while maintaining high throughput and low latency.
For more details, please refer to the Kthena Architecture documentation.
> [!NOTE]
> The router component is a reference implementation, because the Gateway API Inference Extension does not natively support prefill-decode disaggregation. The Kthena router is still under active iteration, and it can be deployed behind a standard API gateway.
## Getting Started
Get up and running with Kthena in minutes. This guide will walk you through installing the platform and deploying your first LLM model.
### Install from code

If you don't have a Kubernetes cluster, try the one-click install from the code base:

```shell
./hack/local-up-kthena.sh
```

Run `./hack/local-up-kthena.sh --help` for more options.
## Community
Kthena is an open source project that welcomes contributions from developers, platform engineers, and AI practitioners.
Get Involved:
- Issues: Report bugs and request features on GitHub Issues
- Discussions: Join conversations on GitHub Discussions
- Documentation: Help improve guides and examples
## Contributing
Contributions are welcome! Here's how to get started:
### Contribution Guidelines
- Code: Follow Go conventions and include tests for new features
- Documentation: Update relevant docs and examples
- Issues: Use GitHub Issues for bug reports and feature requests
- Pull Requests: Ensure CI passes and include clear descriptions
See CONTRIBUTING.md for detailed guidelines.
## License
Kthena is licensed under the Apache 2.0 License.