Kubernetes AI Toolchain Operator (KAITO)


| 🔔 What is NEW! |
| --------------- |
| All vLLM-supported models can now be run in KAITO; check the latest release. |
| Latest Release: Feb 26th, 2026. KAITO v0.9.0. |
| First Release: Nov 15th, 2023. KAITO v0.1.0. |

KAITO is an operator suite that automates LLM model inference, fine-tuning, and RAG (Retrieval Augmented Generation) engine deployment in a Kubernetes cluster. KAITO has the following key differentiators compared to other model inference deployment methodologies:

  • Simplifies the CRD API by removing detailed deployment parameters. The controller provides optimized preset configurations for key inference engine scheduling parameters such as pipeline parallelism (PP), data parallelism (DP), tensor parallelism (TP), and max model length.
  • Uses a node auto provisioner (NAP) to provision GPU resources with accurate model memory estimation, enabling the controller to pick the optimal node count for distributed inference.
  • Leverages the GPU node's built-in local NVMe as model storage; no extra storage is required for inference.
  • Supports any vLLM-supported HuggingFace model.

Architecture

KAITO follows the classic Kubernetes Custom Resource Definition (CRD)/controller design pattern for workload orchestration and integrates with the Gateway API Inference Extension to support LLM-based routing.

<div align="center"> <img src="website/static/img/arch.png" width=100% title="KAITO architecture" alt="KAITO architecture"> </div>
  • Workspace: The CRD that serves as the basic building block for managing LLM inference/tuning workloads. The API provides a greatly simplified experience for deploying an LLM in Kubernetes: the user provides the GPU instance type and the HuggingFace model ID, and the controller will:

    • Estimate the GPU memory requirement based on the GPU instance type and model metadata, and calculate the required GPU count;
    • Trigger GPU node auto-provisioning by integrating with Karpenter APIs (NodePool);
    • Configure the inference engine parameters for single-node or multi-node inference with optimized scheduling based on the GPU hardware topology.

    Currently, only the vLLM engine is supported. LoRA adapters are supported. KVCache offloading is enabled by default.

  • InferenceSet: The CRD designed for managing the number of replicas of Workspace instances for the same model. It is primarily used to autoscale the Workspace based on inference request load. It reacts to scale-up/down actions determined by a KEDA autoscaler that consumes vLLM metrics collected by a KEDA plugin.

  • InferencePool: KAITO integrates with the Gateway API Inference Extension by creating a corresponding InferencePool object and an EPP (Endpoint Picker, which enables KVCache-aware routing) per InferenceSet. It can work with any external gateway that supports the inference extension.

Note: In this repo, an open-source gpu-provisioner is used in the E2E tests and is referenced in various documents. KAITO can work with any other node provisioner that supports the Karpenter-core APIs.
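The Workspace flow described above can be sketched as a minimal manifest. This is a hedged example, not a verbatim excerpt from the docs: the preset name, instance type, and API version below are illustrative assumptions; consult the quick start guide for the exact fields supported by your KAITO version.

```yaml
# Illustrative Workspace sketch (assumed kaito.sh/v1beta1 API).
# The user supplies only the GPU instance type and a model preset;
# the controller estimates GPU memory, provisions nodes via Karpenter
# APIs, and configures the vLLM engine parameters.
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-4-mini        # hypothetical name
resource:
  instanceType: "Standard_NC24ads_A100_v4"   # GPU SKU; drives memory estimation
  labelSelector:
    matchLabels:
      apps: phi-4-mini
inference:
  preset:
    name: phi-4-mini-instruct       # hypothetical preset ID
```

Applying a manifest like this with `kubectl apply -f` is what triggers the estimation, auto-provisioning, and engine-configuration steps listed above.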

KAITO also supports a RAGEngine operator, which streamlines the process of managing a Retrieval Augmented Generation (RAG) service.

<div align="center"> <img src="website/static/img/ragarch.png" width=90% title="KAITO RAGEngine architecture" alt="KAITO RAGEngine architecture"> </div>
  • RAGEngine: The CRD that defines the components that compose a RAG service, including the LLM endpoint (optional), the embedding service, and the vector DB. The controller will create all required components.
  • Vector database: Supports a built-in FAISS in-memory vector database (the default), as well as Qdrant/Milvus persistent databases if specified.
  • Embedding: Supports both local and remote embedding services for embedding documents into the vector database.
  • RAGService: The core service, which leverages LlamaIndex orchestration. It supports commonly used APIs such as /index for indexing documents, /v1/chat/completions for intercepting LLM calls to append retrieved context automatically, and /retrieve for integrating with MCP servers. The /retrieve API uses the Reciprocal Rank Fusion (RRF) hybrid search algorithm to combine the results of BM25 sparse retrieval and vector dense retrieval.
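The components above are declared in a single custom resource. The sketch below is an assumption-laden illustration (field names, API version, model IDs, and the service URL are all hypothetical); refer to the RAGEngine usage documents for the authoritative schema.

```yaml
# Illustrative RAGEngine sketch (assumed kaito.sh/v1alpha1 API).
apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: ragengine-example
spec:
  compute:
    instanceType: "Standard_NC6s_v3"        # node SKU for the RAG service
    labelSelector:
      matchLabels:
        apps: ragengine-example
  embedding:
    local:
      modelID: "BAAI/bge-small-en-v1.5"     # hypothetical local embedding model
  inferenceService:
    url: "http://workspace-example/v1/chat/completions"   # optional LLM endpoint
```

With no vector database specified, the controller falls back to the built-in FAISS in-memory store described above.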

The details of the service APIs can be found in this document.
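To make the hybrid search concrete, here is a minimal sketch of Reciprocal Rank Fusion, the algorithm /retrieve uses to merge the BM25 and dense-retrieval rankings. The constant `k=60` is the value commonly used in the RRF literature; KAITO's actual constant and tie-breaking behavior may differ, and `rrf_merge` is an illustrative helper, not a KAITO API.

```python
def rrf_merge(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document's fused score is the sum over lists of 1 / (k + rank),
    where rank is its 1-based position in that list. Documents ranked
    highly by several retrievers accumulate the largest scores.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked first by both retrievers wins overall, even though
# each list also contains documents the other never saw.
bm25  = ["doc-a", "doc-b", "doc-c"]   # sparse retrieval ranking
dense = ["doc-a", "doc-d", "doc-b"]   # dense retrieval ranking
print(rrf_merge([bm25, dense]))  # → ['doc-a', 'doc-b', 'doc-d', 'doc-c']
```

Because RRF only uses ranks, not raw scores, it needs no score normalization between the BM25 and vector retrievers, which is why it is a popular choice for hybrid search.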

Getting Started

  • Installation: Please check the guidance here for installing core components (Workspace, InferenceSet) using helm and here for installation using Terraform.
  • Quick Start: Please check the quick start guidance here for running your first model using KAITO!
  • AutoScaling: Please check this doc for configuring KAITO and KEDA to enable autoscaling of inference workloads.
  • BYO models using HuggingFace runtime: If you plan to run any BYO models using the HuggingFace runtime, check this doc. Note: KAITO only supports BYO models hosted on HuggingFace.
  • CPU models: Please check this doc for running CPU models using aikit.
  • RAGEngine: Please check the installation guidance and usage documents here.

Contributing

Read more

<!-- markdown-link-check-disable -->

This project welcomes contributions and suggestions. Contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit CLAs for CNCF.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the CLAs for CNCF; please electronically sign the CLA via https://easycla.lfx.linuxfoundation.org. If you encounter issues, you can submit a ticket to the Linux Foundation ID group through the Linux Foundation Support website.

Get Involved!

License

See Apache License 2.0.


Code of Conduct

KAITO has adopted the Cloud Native Computing Foundation Code of Conduct. For more information, see the KAITO Code of Conduct.

<!-- markdown-link-check-enable -->

Contact
