# Olla
High-performance lightweight proxy and load balancer for LLM infrastructure. Intelligent routing, automatic failover and unified model discovery across local and remote inference backends.
> [!IMPORTANT]
> Olla is currently in active development. While it is usable, we are still finalising some features and optimisations. Your feedback is invaluable! Open [an issue](https://github.com/thushan/olla/issues) and let us know what features you'd like to see in the future.
Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes, supports a wide variety of endpoints natively, and is extensible enough to support others. Olla provides model discovery and unified model catalogues within each provider, enabling seamless routing to available models on compatible endpoints.
Olla works alongside API gateways like LiteLLM or orchestration platforms like GPUStack, focusing on making your existing LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: Sherpa for simplicity and maintainability or Olla for maximum performance with advanced features like circuit breakers and connection pooling.

A single CLI application and a config file are all you need to get going with Olla!
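For illustration, a minimal config might look like the sketch below. The field names here are assumptions for illustration only; consult the Olla documentation for the actual schema.

```yaml
# config.yaml - illustrative sketch only; field names are assumptions,
# check the Olla documentation for the real schema.
server:
  host: 0.0.0.0
  port: 40114

proxy:
  engine: sherpa   # or "olla" for the high-performance engine

discovery:
  static:
    endpoints:
      - url: http://localhost:11434
        type: ollama
        priority: 100
```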
## Key Features
- 🔄 Smart Load Balancing: Priority-based routing with automatic failover and connection retry
- 🔍 Smart Model Unification: Per-provider unification + OpenAI-compatible cross-provider routing
- ⚡ Dual Proxy Engines: Sherpa (simple) and Olla (high-performance)
- 🎯 Advanced Filtering: Profile and model filtering with glob patterns for precise control
- 💊 Health Monitoring: Continuous endpoint health checks with circuit breakers and automatic recovery
- 🔁 Intelligent Retry: Automatic retry on connection failures with immediate transparent endpoint failover
- 🔧 Self-Healing: Automatic model discovery refresh when endpoints recover
- 📊 Request Tracking: Detailed response headers and statistics
- ⚡🔄 Anthropic Messages API: Passthrough for backends with native support; automatic translation for others
- 🛡️ Production Ready: Rate limiting, request size limits, graceful shutdown
- ⚡ High Performance: Sub-millisecond endpoint selection with lock-free atomic stats
- 🎯 LLM-Optimised: Streaming-first design with optimised timeouts for long inference
- ⚙️ Lightweight: Designed to be very lightweight and efficient; runs in under 50 MB of RAM
## Platform Support
Olla runs on multiple platforms and architectures:
| Platform | AMD64 | ARM64 | Notes |
|----------|-------|-------|-------|
| Linux | ✅ | ✅ | Full support including Raspberry Pi 4+ |
| macOS | ✅ | ✅ | Intel and Apple Silicon (M1/M2/M3/M4) |
| Windows | ✅ | ✅ | Windows 10/11 and Windows on ARM |
| Docker | ✅ | ✅ | Multi-architecture images (amd64/arm64) |
## Quick Start

### Installation
```bash
# Download latest release (auto-detects your platform)
bash <(curl -s https://raw.githubusercontent.com/thushan/olla/main/install.sh)

# Docker (automatically pulls correct architecture)
docker run -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Or explicitly specify platform (e.g., for ARM64)
docker run --platform linux/arm64 -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Install via Go
go install github.com/thushan/olla@latest

# Build from source
git clone https://github.com/thushan/olla.git && cd olla && make build-release

# Run Olla
./bin/olla
```
### Verification
When you have everything running, you can check it's all working with:
```bash
# Check health of Olla
curl http://localhost:40114/internal/health

# Check endpoints
curl http://localhost:40114/internal/status/endpoints

# Check models available
curl http://localhost:40114/internal/status/models
```
For detailed installation and deployment options, see the Getting Started Guide.
## Querying Olla
Olla exposes multiple API paths depending on your use case:
| Path | Format | Use Case |
|------|--------|----------|
| /olla/proxy/ | OpenAI | Routes to any backend — universal endpoint |
| /olla/openai/ | OpenAI | Routes to any backend — universal endpoint |
| /olla/anthropic/ | Anthropic | Claude-compatible clients (passthrough or translated) |
| /olla/{provider}/ | OpenAI | Target a specific backend type (e.g. /olla/vllm/, /olla/ollama/) |
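The curl examples below translate directly to any HTTP client. As a minimal Python sketch, building an OpenAI-format request for the universal proxy path (the model name and port assume the defaults used throughout this README):

```python
import json
from urllib import request

# Build an OpenAI-format chat completion request for Olla's universal
# proxy path. The model name is an example; use any model Olla discovers.
OLLA = "http://localhost:40114"

def chat_request(model: str, prompt: str, max_tokens: int = 100,
                 stream: bool = False) -> request.Request:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,
    }).encode()
    return request.Request(
        f"{OLLA}/olla/proxy/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = chat_request("llama3.2", "Hello")
# request.urlopen(req) would send it to a running Olla instance
```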
### OpenAI-Compatible (Universal Proxy)

You can use either `/olla/proxy` or `/olla/openai`:
```bash
# Chat completion (routes to best available backend)
curl http://localhost:40114/olla/proxy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'

# Streaming
curl http://localhost:40114/olla/proxy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100, "stream": true}'

# List all models across backends
curl http://localhost:40114/olla/proxy/v1/models
```
### Anthropic Messages API
```bash
# Chat completion (passthrough for supported backends, translated for others)
curl http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "llama3.2", "max_tokens": 100, "messages": [{"role": "user", "content": "Hello"}]}'

# Streaming
curl http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "llama3.2", "max_tokens": 100, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
```
### Provider-Specific Endpoints
```bash
# Target a specific backend type directly
curl http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
```
