Arks

Arks is a cloud-native inference framework that runs on Kubernetes.

Arks is an end-to-end framework for managing LLM-based applications within Kubernetes clusters. It provides a robust and extensible infrastructure tailored for deploying, orchestrating, and scaling LLM inference workloads in cloud-native environments.

Key Features

Distributed Inference

Multi-node scheduling: Run inference across multiple compute nodes.
Heterogeneous computing support: Works across different hardware types (CPU, GPU, etc.).
Multi-engine compatibility: Supports vLLM, SGLang, and Dynamo.
Auto service discovery & load balancing: Dynamically register and balance application traffic.
Automatic weight adjustment: Adapt to traffic and resource demands in real-time.
Horizontal Pod Autoscaling (HPA): Autoscale applications based on workload.

Model Management

Model caching & optimization: Efficiently download and cache models to reduce cold-start latency.
Model sharing: Share models across inference nodes to save bandwidth and memory.
Accelerated loading: Leverage local caching and preloading strategies for fast startup.

Multi-Tenant Management

Fine-grained API Token control: Issue and manage tokens with scoped permissions.
Flexible quota strategies: Enforce usage limits by total token count or pricing-based policies.
Request throttling: Rate limiting by TPM (tokens per minute), RPM (requests per minute), and other strategies.

Architecture

Figure: Arks architecture (arks-architecture-v1)

Arks consists of the following major components:

Gateway Layer: Acts as the unified entry point for all external traffic. It handles request routing and enforces access policies.
- ArksToken: Provides fine-grained multi-tenant access control with support for:
  - API token-based authentication
  - Quota enforcement (based on token usage or pricing)
  - Rate limiting (TPM, RPM)
- ArksEndpoint: Dynamically manages routing rules and traffic distribution across different ArksApplication instances.
  - Supports dynamic weight-based routing
  - Enables automatic application discovery
  - Adjusts traffic flow in real-time based on load or policies
Workload Layer: Each ArksApplication contains one or more runtime instances. Supported runtimes include vLLM, SGLang, and Dynamo. Each runtime is deployed as a Kubernetes workload and benefits from:
- Distributed inference across multiple nodes
- Support for heterogeneous computing environments
- Autoscaling via Kubernetes HPA, based on predefined SLOs
Storage Layer: Uses ArksModel to manage model storage.
- Supports auto caching of models to reduce cold start time
- Enables model sharing across applications and nodes
- Designed for high-throughput model loading and reuse

Quick Start

Prerequisites

Kubernetes cluster (v1.20+)
kubectl configured to access your cluster

Installation

Note: Arks requires LWS v0.7.0 and RBGS v0.5.0-alpha.4. Install LWS before RBGS.

# Install dependencies (skip any that are already installed at the correct version)
kubectl apply --server-side -f https://github.com/envoyproxy/gateway/releases/download/v1.2.8/install.yaml
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.7.0/manifests.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/sgl-project/rbg/v0.5.0-alpha.4/deploy/kubectl/manifests.yaml

# Install Arks
git clone https://github.com/scitix/arks.git
cd arks
kubectl apply --server-side -f dist/operator.yaml
kubectl apply --server-side -f dist/gateway.yaml

Verification:

# Check that all components are ready
kubectl get deployment -n arks-operator-system
---
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
arks-gateway-plugins               1/1     1            1           22h
arks-operator-controller-manager   1/1     1            1           22h
arks-redis-master                  1/1     1            1           22h

# Check the Envoy Gateway status
kubectl get deployment -n envoy-gateway-system
---
NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
envoy-arks-operator-system-arks-eg-abcedefg   1/1     1            1           22h
envoy-gateway                                 1/1     1            1           22h
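The readiness check above can also be scripted instead of polled by hand. The sketch below assumes the two namespaces shown in the output above and relies on `kubectl wait` with the `Available` condition; the function name `wait_for_arks` and the 300-second default timeout are illustrative choices, not part of Arks.

```shell
# Sketch: block until every Deployment in the Arks namespaces is Available.
# wait_for_arks and its default timeout are illustrative, not part of Arks.
wait_for_arks() {
  local timeout="${1:-300s}"
  local ns
  for ns in arks-operator-system envoy-gateway-system; do
    echo "Waiting for deployments in ${ns} (timeout ${timeout})"
    # kubectl wait exits non-zero if the condition is not met in time.
    kubectl wait --for=condition=Available deployment --all \
      -n "${ns}" --timeout="${timeout}" || return 1
  done
}

# Usage: wait_for_arks 600s
```

The function stops at the first namespace that fails to become ready, so a non-zero exit status points at the component to inspect.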

Examples

Deploy the quickstart example:

kubectl create -f examples/quickstart/quickstart.yaml

Check that the resources are ready:

# Check all Arks custom resources
kubectl get arksapplication,arksendpoint,arksmodel,arksquota,arkstoken,httproute -owide
---
# REPLICAS should equal READY, and PHASE should be Running
NAME                               PHASE     REPLICAS   READY   AGE   MODEL     RUNTIME   DRIVER
arksapplication.arks.ai/app-qwen   Running   1          1       21m   qwen-7b   sglang

NAME                           AGE   DEFAULT WEIGHT
arksendpoint.arks.ai/qwen-7b   21m   5

# PHASE should be Ready
NAME                        AGE   MODEL                         PHASE
arksmodel.arks.ai/qwen-7b   21m   Qwen/Qwen2.5-7B-Instruct-1M   Ready

NAME                            AGE
arksquota.arks.ai/basic-quota   21m

NAME                              AGE
arkstoken.arks.ai/example-token   21m

NAME                                          HOSTNAMES   AGE
httproute.gateway.networking.k8s.io/qwen-7b               21m

Testing

Get the gateway IP:

# Option 1: Kubernetes cluster with LoadBalancer support
LB_IP=$(kubectl get svc -n envoy-gateway-system --selector=gateway.envoyproxy.io/owning-gateway-name=arks-eg -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"

# Option 2: Dev environment without LoadBalancer support. Use port forwarding instead
ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system --selector=gateway.envoyproxy.io/owning-gateway-name=arks-eg -o jsonpath='{.items[0].metadata.name}')
kubectl -n envoy-gateway-system port-forward service/${ENVOY_SERVICE} 8888:80 &
ENDPOINT="localhost:8888"
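The two options above can be combined so that a script prefers the LoadBalancer address and falls back to the port-forward address only when no IP is assigned. This is a minimal sketch: `resolve_endpoint` is a hypothetical helper, and 8888 is the local port used in Option 2.

```shell
# Sketch: pick the gateway endpoint from a LoadBalancer IP, or fall back to
# the local port-forward address. resolve_endpoint is a hypothetical helper.
resolve_endpoint() {
  local lb_ip="$1"
  local fallback_port="${2:-8888}"
  if [ -n "${lb_ip}" ]; then
    echo "${lb_ip}:80"
  else
    echo "localhost:${fallback_port}"
  fi
}

# Usage (LB_IP may be empty on clusters without LoadBalancer support;
# the fallback assumes the port-forward from Option 2 is already running):
#   ENDPOINT="$(resolve_endpoint "${LB_IP}")"
```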

Send a request to the example app through the Envoy proxy:

curl http://${ENDPOINT}/v1/chat/completions -k \
  -H "Authorization: Bearer sk-test123456" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello, who are you?"}]}'

Expected response

{
  "id":"xxxxxxxxx",
  "object":"chat.completion",
  "created": 12332454,
  "model":"qwen-7b",
  "choices":[{
    "index":0,
    "message":{
      "role":"assistant",
      "content":"I'm a large language model created by Alibaba Cloud. I go by the name Qwen.",
      "reasoning_content":null,
      "tool_calls":null
    },
    "logprobs":null,
    "finish_reason":"stop",
    "matched_stop":151645
  }],
  "usage":{
    "prompt_tokens":25,
    "total_tokens":45,
    "completion_tokens":20,
    "prompt_tokens_details":null
  }
}
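For automated checks, the request above can be wrapped in a small smoke test that looks only at the HTTP status code. This is a sketch under stated assumptions: `chat_smoke_test` is a hypothetical helper, the default token and model are the quickstart values from the curl example, and any status other than 200 is treated as a failure because the codes Arks returns for authentication or rate-limit rejections are not documented here.

```shell
# Sketch: succeed only if the chat endpoint answers with HTTP 200.
# chat_smoke_test is a hypothetical helper; the default token and model
# are the quickstart values used in the curl example above.
chat_smoke_test() {
  local endpoint="$1"
  local token="${2:-sk-test123456}"
  local model="${3:-qwen-7b}"
  local http_code
  # -s silences progress output, -o /dev/null discards the body,
  # -w '%{http_code}' prints only the response status code.
  http_code="$(curl -s -o /dev/null -w '%{http_code}' \
    "http://${endpoint}/v1/chat/completions" \
    -H "Authorization: Bearer ${token}" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]}")"
  echo "HTTP ${http_code}"
  [ "${http_code}" = "200" ]
}

# Usage: chat_smoke_test "${ENDPOINT}" && echo "gateway is serving requests"
```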

Clean-Up

kubectl delete -f examples/quickstart/quickstart.yaml --ignore-not-found=true
kubectl delete -f dist/gateway.yaml
kubectl delete -f dist/operator.yaml

Build

We recommend building Arks with Docker:

make docker-build-operator
make docker-build-gateway
make docker-build-scripts

License

Arks is licensed under the Apache 2.0 License.

Community, discussion, contribution, and support

For feedback, questions, or contributions, feel free to:

Open an issue on GitHub
Submit a pull request
