Arks
Arks is a cloud-native inference framework running on Kubernetes
Arks is an end-to-end framework for managing LLM-based applications within Kubernetes clusters. It provides a robust and extensible infrastructure tailored for deploying, orchestrating, and scaling LLM inference workloads in cloud-native environments.
Key Features
Distributed Inference
- Multi-node scheduling: Run inference across multiple compute nodes.
- Heterogeneous computing support: Works across different hardware types (CPU, GPU, etc.).
- Multi-engine compatibility: Supports vLLM, SGLang, and Dynamo.
- Auto service discovery & load balancing: Dynamically register and balance application traffic.
- Automatic weight adjustment: Adapt to traffic and resource demands in real-time.
- Horizontal Pod Autoscaling (HPA): Autoscale applications based on workload.
Model Management
- Model caching & optimization: Efficiently download and cache models to reduce cold-start latency.
- Model sharing: Share models across inference nodes to save bandwidth and memory.
- Accelerated loading: Leverage local cache and preloaded strategies for fast startup.
Multi-Tenant Management
- Fine-grained API Token control: Issue and manage tokens with scoped permissions.
- Flexible quota strategies: Enforce usage limits by total token count or pricing-based policies.
- Request throttling: Rate limiting by TPM (tokens per minute), RPM (requests per minute), and other strategies.
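To illustrate what an RPM limit means in practice, here is a minimal sketch of a fixed-window request counter. It is one common rate-limiting strategy and is not taken from the Arks source; the function name `allow_request`, the variables, and the limit of 3 are all hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: a fixed-window requests-per-minute limiter for one client.
# Not Arks code; names and the limit value are assumptions for the example.

RPM_LIMIT=3   # hypothetical limit: 3 requests per minute
window=-1     # index of the minute window currently being counted
count=0       # requests seen so far in that window

allow_request() {
  # $1 = request time in epoch seconds
  local now_window=$(( $1 / 60 ))
  if [ "$now_window" -ne "$window" ]; then
    # A new minute has started, so reset the counter.
    window=$now_window
    count=0
  fi
  if [ "$count" -lt "$RPM_LIMIT" ]; then
    count=$(( count + 1 ))
    return 0
  fi
  return 1
}

# Three requests in the first minute pass, the fourth is throttled,
# and a request in the next minute passes again.
for t in 0 1 2 3 60; do
  if allow_request "$t"; then echo "t=$t allowed"; else echo "t=$t throttled"; fi
done
```

A TPM limit would follow the same shape, adding the token count of each request to the counter instead of 1.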
Architecture

Arks consists of the following major components:
- Gateway Layer: Acts as the unified entry point for all external traffic. It handles request routing and enforces access policies.
- ArksToken: Provides fine-grained multi-tenant access control with support for:
- API token-based authentication
- Quota enforcement (based on token usage or pricing)
- Rate limiting (TPM, RPM)
- ArksEndpoint: Dynamically manages routing rules and traffic distribution across different ArksApplication instances.
- Supports dynamic weight-based routing
- Enables automatic application discovery
- Adjusts traffic flow in real-time based on load or policies
- Workload Layer: Each ArksApplication contains one or more runtime instances. Supported runtimes include vLLM, SGLang, and Dynamo.
Each runtime is deployed as a Kubernetes workload and benefits from:
- Distributed inference across multiple nodes
- Support for heterogeneous computing environments
- Autoscaling via Kubernetes HPA, based on predefined SLOs
- Storage Layer: Uses ArksModel to manage model storage.
- Supports auto caching of models to reduce cold start time
- Enables model sharing across applications and nodes
- Designed for high-throughput model loading and reuse
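To make weight-based routing concrete, the sketch below computes the share of traffic each backend would receive, assuming traffic is split in proportion to weight (the usual convention for weighted routing). The application names and weights are hypothetical, and this is not Arks's routing code.

```shell
#!/usr/bin/env bash
# Illustrative only: proportional traffic split from routing weights.
# Assumes share = weight / sum(weights); names and weights are made up.

traffic_shares() {
  # Arguments: name=weight pairs, e.g. app-a=5 app-b=15
  local total=0 pair
  for pair in "$@"; do
    total=$(( total + ${pair#*=} ))
  done
  if [ "$total" -eq 0 ]; then
    echo "error: total weight is zero" >&2
    return 1
  fi
  for pair in "$@"; do
    # Integer percentage, rounded down.
    echo "${pair%%=*} $(( ${pair#*=} * 100 / total ))%"
  done
}

traffic_shares app-a=5 app-b=15
# prints:
#   app-a 25%
#   app-b 75%
```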
Quick Start
Prerequisites
- Kubernetes cluster (v1.20+)
- kubectl configured to access your cluster
Installation
Note: Arks requires LWS v0.7.0 and RBGS v0.5.0-alpha.4. Install LWS before RBGS.
# Install dependencies (skip if already installed with correct version)
kubectl apply --server-side -f https://github.com/envoyproxy/gateway/releases/download/v1.2.8/install.yaml
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.7.0/manifests.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/sgl-project/rbg/v0.5.0-alpha.4/deploy/kubectl/manifests.yaml
# Install Arks
git clone https://github.com/scitix/arks.git
cd arks
kubectl apply --server-side -f dist/operator.yaml
kubectl apply --server-side -f dist/gateway.yaml
Verification:
# Check that all components are ready
kubectl get deployment -n arks-operator-system
---
NAME READY UP-TO-DATE AVAILABLE AGE
arks-gateway-plugins 1/1 1 1 22h
arks-operator-controller-manager 1/1 1 1 22h
arks-redis-master 1/1 1 1 22h
# Check Envoy Gateway status
kubectl get deployment -n envoy-gateway-system
---
NAME READY UP-TO-DATE AVAILABLE AGE
envoy-arks-operator-system-arks-eg-abcedefg 1/1 1 1 22h
envoy-gateway 1/1 1 1 22h
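For scripted installs, the readiness check can be automated. The helper below is a sketch, not part of Arks: it reads the table printed by `kubectl get deployment` on stdin and succeeds only if every row shows READY as n/n. The function name `all_deployments_ready` is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative helper: succeed only if every deployment row reports READY as n/n.
# Reads the table printed by `kubectl get deployment` on stdin.

all_deployments_ready() {
  awk '
    NR == 1 { next }                 # skip the header row
    NF == 0 { next }                 # skip blank lines
    {
      rows++
      split($2, r, "/")              # READY column looks like "1/1"
      if (r[1] != r[2]) { bad++; print "not ready: " $1 }
    }
    END { exit (rows == 0 || bad > 0) ? 1 : 0 }
  '
}

# Usage against a live cluster (needs kubectl access):
#   kubectl get deployment -n arks-operator-system | all_deployments_ready && echo "all ready"
```

If blocking until the deployments become available is preferable to polling, `kubectl wait --for=condition=Available deployment --all -n arks-operator-system --timeout=300s` should achieve the same result with a built-in command.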
Examples
Install with:
kubectl create -f examples/quickstart/quickstart.yaml
Check resources ready:
# Check all ARKS custom resources
kubectl get arksapplication,arksendpoint,arksmodel,arksquota,arkstoken,httproute -owide
---
# REPLICAS should equal READY, and PHASE should be Running
NAME PHASE REPLICAS READY AGE MODEL RUNTIME DRIVER
arksapplication.arks.ai/app-qwen Running 1 1 21m qwen-7b sglang
NAME AGE DEFAULT WEIGHT
arksendpoint.arks.ai/qwen-7b 21m 5
# PHASE should be Ready
NAME AGE MODEL PHASE
arksmodel.arks.ai/qwen-7b 21m Qwen/Qwen2.5-7B-Instruct-1M Ready
NAME AGE
arksquota.arks.ai/basic-quota 21m
NAME AGE
arkstoken.arks.ai/example-token 21m
NAME HOSTNAMES AGE
httproute.gateway.networking.k8s.io/qwen-7b 21m
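The two conditions noted in the output above (PHASE is Running, REPLICAS equals READY) can also be checked by a script. The following sketch parses the `kubectl get arksapplication` table from stdin and looks columns up by header name. The function name `apps_ready` is hypothetical, and the whitespace-based parsing assumes the PHASE, REPLICAS, and READY columns are never empty.

```shell
#!/usr/bin/env bash
# Illustrative helper: check the ArksApplication table for PHASE=Running
# and REPLICAS == READY. Columns are located by their header names.

apps_ready() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # map header name -> column index
      next
    }
    NF == 0 { next }                          # skip blank lines
    {
      rows++
      if ($col["PHASE"] != "Running" || $col["REPLICAS"] != $col["READY"]) {
        bad++
        print "not ready: " $1
      }
    }
    END { exit (rows == 0 || bad > 0) ? 1 : 0 }
  '
}

# Usage against a live cluster (needs kubectl access):
#   kubectl get arksapplication | apps_ready && echo "applications ready"
```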
Testing
Get the gateway IP:
# Option 1: Kubernetes cluster with LoadBalancer support
LB_IP=$(kubectl get svc -n envoy-gateway-system --selector=gateway.envoyproxy.io/owning-gateway-name=arks-eg -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
# Option 2: Dev environment without LoadBalancer support; use port forwarding instead
ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system --selector=gateway.envoyproxy.io/owning-gateway-name=arks-eg -o jsonpath='{.items[0].metadata.name}')
kubectl -n envoy-gateway-system port-forward service/${ENVOY_SERVICE} 8888:80 &
ENDPOINT="localhost:8888"
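The two options can be folded into one helper that falls back to the port-forward address when no LoadBalancer IP is available. This is a sketch only: `pick_endpoint` is a hypothetical name, and the port-forward from Option 2 still has to be started separately for the fallback address to work.

```shell
#!/usr/bin/env bash
# Illustrative helper: choose the gateway endpoint.
# Prints "<ip>:80" when a LoadBalancer IP is given, else the port-forward address.

pick_endpoint() {
  local lb_ip="$1"
  if [ -n "$lb_ip" ]; then
    echo "${lb_ip}:80"
  else
    echo "localhost:8888"
  fi
}

# With a live cluster, LB_IP would come from the kubectl query in Option 1:
#   ENDPOINT=$(pick_endpoint "$LB_IP")
pick_endpoint "203.0.113.10"   # placeholder IP from the documentation range
pick_endpoint ""               # no LoadBalancer IP: falls back to localhost:8888
```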
Curl the example app through Envoy proxy:
curl http://${ENDPOINT}/v1/chat/completions -k \
-H "Authorization: Bearer sk-test123456" \
-H "Content-Type: application/json" \
-d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello, who are you?"}]}'
Expected response:
{
"id":"xxxxxxxxx",
"object":"chat.completion",
"created": 12332454,
"model":"qwen-7b",
"choices":[{
"index":0,
"message":{
"role":"assistant",
"content":"I'm a large language model created by Alibaba Cloud. I go by the name Qwen.",
"reasoning_content":null,
"tool_calls":null
},
"logprobs":null,
"finish_reason":"stop",
"matched_stop":151645
}],
"usage":{
"prompt_tokens":25,
"total_tokens":45,
"completion_tokens":20,
"prompt_tokens_details":null
}
}
Clean-Up
kubectl delete -f examples/quickstart/quickstart.yaml --ignore-not-found=true
kubectl delete -f dist/gateway.yaml
kubectl delete -f dist/operator.yaml
Build
Building Arks with Docker is recommended. The relevant commands are:
make docker-build-operator
make docker-build-gateway
make docker-build-scripts
License
Arks is licensed under the Apache 2.0 License.
Community, discussion, contribution, and support
For feedback, questions, or contributions, feel free to:
- Open an issue on GitHub
- Submit a pull request
