
Olla

High-performance lightweight proxy and load balancer for LLM infrastructure. Intelligent routing, automatic failover and unified model discovery across local and remote inference backends.


<div align="center"> <img src="assets/images/banner.png" width="480" height="249" alt="Olla - Smart LLM Load Balancer & Proxy" /> <p> <a href="https://github.com/thushan/olla/blob/master/LICENSE"><img src="https://img.shields.io/github/license/thushan/olla" alt="License"></a> <a href="https://golang.org/"><img src="https://img.shields.io/github/go-mod/go-version/thushan/olla" alt="Go"></a> <a href="https://github.com/thushan/olla/actions/workflows/ci.yml"><img src="https://github.com/thushan/olla/actions/workflows/ci.yml/badge.svg?branch=main" alt="CI"></a> <a href="https://goreportcard.com/report/github.com/thushan/olla"><img src="https://goreportcard.com/badge/github.com/thushan/olla" alt="Go Report Card"></a> <a href="https://github.com/thushan/olla/releases/latest"><img src="https://img.shields.io/github/release/thushan/olla" alt="Latest Release"></a> <br /> <a href="https://github.com/ggerganov/llama.cpp"><img src="https://img.shields.io/badge/llama.cpp-native-lightgreen.svg" alt="llama.cpp: Native Support"></a> <a href="https://github.com/vllm-project/vllm"><img src="https://img.shields.io/badge/vLLM-native-lightgreen.svg" alt="vLLM: Native Support"></a> <a href="https://github.com/sgl-project/sglang"><img src="https://img.shields.io/badge/SGLang-native-lightgreen.svg" alt="SGLang: Native Support"></a> <a href="https://github.com/BerriAI/litellm"><img src="https://img.shields.io/badge/LiteLLM-native-lightgreen.svg" alt="LiteLLM: Native Support"></a> <a href="https://github.com/InternLM/lmdeploy"><img src="https://img.shields.io/badge/LM Deploy-openai-lightblue.svg" alt="LM Deploy: OpenAI Compatible"></a> <br/> <a href="https://github.com/waybarrios/vllm-mlx/"><img src="https://img.shields.io/badge/vLLM--MLX-native-lightgreen.svg" alt="vLLM-MLX: Native Support"></a> <a href="https://docs.docker.com/ai/model-runner/"><img src="https://img.shields.io/badge/Docker Model Runner-native-lightgreen.svg" alt="Docker Model Runner: Native Support"></a><br/> <a 
href="https://ollama.com"><img src="https://img.shields.io/badge/Ollama-native-lightgreen.svg" alt="Ollama: Native Support"></a> <a href="https://lmstudio.ai/"><img src="https://img.shields.io/badge/LM Studio-native-lightgreen.svg" alt="LM Studio: Native Support"></a> <a href="https://github.com/lemonade-sdk/lemonade"><img src="https://img.shields.io/badge/LemonadeSDK-native-lightgreen.svg" alt="LemonadeSDK: Native Support"></a> </p> <p> <div align="center"> <img src="./docs/content/assets/demos/olla-v1.0.x-demo.gif" /><br/> <small>Recorded with <a href="https://vhs.charm.sh/">VHS</a> - see <a href="./docs/vhs/demo.tape">demo tape</a></small><br/><br/> </div> <a href="https://thushan.github.io/olla/"><img src="https://img.shields.io/badge/📖_Documentation-0078D4?style=for-the-badge&logoColor=white" height="32" alt="Documentation"></a> &nbsp; <a href="https://github.com/thushan/olla/issues"><img src="https://img.shields.io/badge/🐛_Issues-D73502?style=for-the-badge&logoColor=white" height="32" alt="Issues"></a> &nbsp; <a href="https://github.com/thushan/olla/releases"><img src="https://img.shields.io/badge/🚀_Releases-6f42c1?style=for-the-badge&logoColor=white" height="32" alt="Releases"></a> </p> </div>

> [!IMPORTANT]
> Olla is currently in active development. While it is usable, we are still finalising some features and optimisations. Your feedback is invaluable! Open <a href="https://github.com/thushan/olla/issues">an issue</a> and let us know which features you'd like to see.

Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes, with native support for a wide variety of endpoints and an extensible design for adding others. Olla provides model discovery and a unified model catalogue within each provider, enabling seamless routing to available models on compatible endpoints.

Olla works alongside API gateways like LiteLLM or orchestration platforms like GPUStack, focusing on making your existing LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: Sherpa for simplicity and maintainability or Olla for maximum performance with advanced features like circuit breakers and connection pooling.
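The engine is selected in Olla's YAML configuration. The snippet below is a sketch only — the key names are assumptions based on the project's documented config layout, so treat the configuration reference in the docs as authoritative:

```yaml
# config.yaml (sketch; verify key names against the Olla documentation)
proxy:
  engine: sherpa   # simple, maintainable engine; use "olla" for the
                   # high-performance engine (circuit breakers, pooling)
```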


A single CLI application and config file is all you need to run Olla!

Key Features

Platform Support

Olla runs on multiple platforms and architectures:

| Platform | AMD64 | ARM64 | Notes |
|----------|-------|-------|-------|
| Linux | ✅ | ✅ | Full support including Raspberry Pi 4+ |
| macOS | ✅ | ✅ | Intel and Apple Silicon (M1/M2/M3/M4) |
| Windows | ✅ | ✅ | Windows 10/11 and Windows on ARM |
| Docker | ✅ | ✅ | Multi-architecture images (amd64/arm64) |

Quick Start

Installation

```bash
# Download latest release (auto-detects your platform)
bash <(curl -s https://raw.githubusercontent.com/thushan/olla/main/install.sh)

# Docker (automatically pulls correct architecture)
docker run -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Or explicitly specify platform (e.g., for ARM64)
docker run --platform linux/arm64 -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Install via Go
go install github.com/thushan/olla@latest

# Build from source
git clone https://github.com/thushan/olla.git && cd olla && make build-release

# Run Olla
./bin/olla
```

Verification

When you have everything running, you can check it's all working with:

```bash
# Check health of Olla
curl http://localhost:40114/internal/health

# Check endpoints
curl http://localhost:40114/internal/status/endpoints

# Check models available
curl http://localhost:40114/internal/status/models
```

For detailed installation and deployment options, see the Getting Started Guide.

Querying Olla

Olla exposes multiple API paths depending on your use case:

| Path | Format | Use Case |
|------|--------|----------|
| /olla/proxy/ | OpenAI | Routes to any backend — universal endpoint |
| /olla/openai/ | OpenAI | Routes to any backend — universal endpoint |
| /olla/anthropic/ | Anthropic | Claude-compatible clients (passthrough or translated) |
| /olla/{provider}/ | OpenAI | Target a specific backend type (e.g. /olla/vllm/, /olla/ollama/) |
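Because the path scheme is mechanical, clients can construct routes programmatically. A small illustrative Python helper — the function name and constant are our own; only the `/olla/{route}/` prefix and default port 40114 come from the documentation above:

```python
OLLA_BASE = "http://localhost:40114"  # Olla's default port

def olla_url(route: str, path: str = "v1/chat/completions") -> str:
    """Build a request URL under Olla's /olla/{route}/ prefix.

    route: "proxy" or "openai" for the universal OpenAI-format endpoint,
    "anthropic" for the Messages API, or a provider name such as
    "ollama" or "vllm" to target a specific backend type.
    """
    return f"{OLLA_BASE}/olla/{route}/{path}"

print(olla_url("proxy"))                     # universal endpoint
print(olla_url("ollama"))                    # provider-specific
print(olla_url("anthropic", "v1/messages"))  # Anthropic Messages API
```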

OpenAI-Compatible (Universal Proxy)

You can use /olla/openai or /olla/proxy interchangeably:

```bash
# Chat completion (routes to best available backend)
curl http://localhost:40114/olla/proxy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'

# Streaming
curl http://localhost:40114/olla/proxy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100, "stream": true}'

# List all models across backends
curl http://localhost:40114/olla/proxy/v1/models
```
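The same universal route works from any OpenAI-compatible HTTP client. A minimal Python sketch using only the standard library — `chat_request` is our own helper name; the endpoint, payload shape, and `llama3.2` model mirror the curl examples above:

```python
import json
import urllib.request

def chat_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build an OpenAI-format chat request for Olla's universal proxy route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
        "stream": stream,
    }
    return urllib.request.Request(
        "http://localhost:40114/olla/proxy/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Requires a running Olla instance with at least one backend attached.
    with urllib.request.urlopen(chat_request("llama3.2", "Hello")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```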

Anthropic Messages API

```bash
# Chat completion (passthrough for supported backends, translated for others)
curl http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "llama3.2", "max_tokens": 100, "messages": [{"role": "user", "content": "Hello"}]}'

# Streaming
curl http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "llama3.2", "max_tokens": 100, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
```

Provider-Specific Endpoints

```bash
# Target a specific backend type directly
curl http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
```