OneLLM
A "drop-in" replacement for OpenAI's client that offers a unified interface for interacting with large language models from various providers, with support for hundreds of models, intelligent semantic caching, built-in fallback mechanisms, and enhanced reliability features.
📚 Table of Contents
- Overview
- Getting Started
- Key Features
- Supported Providers
- Architecture
- API Design
- Advanced Features
- Migration from OpenAI
- Model Naming Convention
- Configuration
- Versioning
- Documentation
- Call for Contributions
- License
- Acknowledgements
👉 Overview
OneLLM is a lightweight, provider-agnostic Python library that offers a unified interface for interacting with large language models (LLMs) from various providers. It simplifies the integration of LLMs into applications by providing a consistent API while abstracting away provider-specific implementation details.
The library follows the OpenAI client API design pattern, making it familiar to developers already using OpenAI and enabling easy migration for existing applications. Simply change your import statements and instantly gain access to hundreds of models across dozens of providers while maintaining your existing code structure.
With support for 22 implemented providers (and more planned), OneLLM gives you access to approximately 300+ unique language models through a single, consistent interface - from the latest proprietary models to open-source alternatives, all accessible through familiar OpenAI-compatible patterns.
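The `provider/model-name` convention mentioned above can be illustrated with plain string handling. This is a conceptual sketch of how such an identifier splits into its two parts, not OneLLM's actual routing code; the function name is hypothetical:

```python
def parse_model_id(model_id: str) -> tuple[str, str]:
    """Split a 'provider/model-name' identifier into its two parts.

    Model names may themselves contain slashes (e.g. gateway-style paths),
    so only the first '/' separates the provider from the model.
    """
    provider, _, model = model_id.partition("/")
    if not provider or not model:
        raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
    return provider, model

print(parse_model_id("openai/gpt-4o-mini"))        # ('openai', 'gpt-4o-mini')
print(parse_model_id("anthropic/claude-3-haiku"))  # ('anthropic', 'claude-3-haiku')
```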
> [!NOTE]
> Ready for Use: OneLLM now supports 22 providers with 300+ models! From cloud APIs to local models, you can access them all through a single, unified interface. Contributions are welcome to help add even more providers!
🚀 Getting Started
Installation
```bash
# Basic installation (includes OpenAI compatibility and download utility)
pip install OneLLM

# For all providers (includes dependencies for future provider support)
pip install "OneLLM[all]"
```
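Most cloud providers authenticate via API keys read from the environment. Assuming OneLLM picks up each provider's conventional environment variable (an assumption — verify the exact names in the Configuration section), setup might look like:

```shell
# Hypothetical: provider API keys read from standard environment variables
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."
```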
Download Models for Local Use
OneLLM includes a built-in utility for downloading GGUF models:
```bash
# Download a model from HuggingFace (saves to ~/llama_models by default)
onellm download --repo-id "TheBloke/Llama-2-7B-GGUF" --filename "llama-2-7b.Q4_K_M.gguf"

# Download to a custom directory
onellm download -r "microsoft/Phi-3-mini-4k-instruct-gguf" -f "Phi-3-mini-4k-instruct-q4.gguf" -o /path/to/models
```
Quick Win: Your First LLM Call
```python
# Basic usage with OpenAI-compatible syntax
from onellm import ChatCompletion

response = ChatCompletion.create(
    model="openai/gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

print(response.choices[0].message["content"])
# Output: I'm doing well, thank you for asking! I'm here and ready to help you...
```
For more detailed examples, check out the examples directory.
✨ Key Features
| Feature | Description |
|---------|-------------|
| 📦 Drop-in replacement | Use your existing OpenAI code with minimal changes |
| 🔄 Provider-agnostic | Support for 300+ models across 22 implemented providers |
| ⚡ Blazing-fast semantic cache | 42,000-143,000x faster responses, 50-80% cost savings with streaming support & TTL |
| 🔌 Connection pooling | Reuse HTTP connections for 100-300ms faster sequential calls |
| 🔁 Automatic fallback | Seamlessly switch to alternative models when needed |
| 🔄 Auto-retry mechanism | Retry the same model multiple times before failing |
| 🧩 OpenAI-compatible | Familiar interface for developers used to OpenAI |
| 📺 Streaming support | Real-time streaming responses from supported providers |
| 🖼️ Multi-modal capabilities | Support for text, images, audio across compatible models |
| 🏠 Local LLM support | Run models locally via Ollama and llama.cpp |
| ⬇️ Model downloads | Built-in CLI to download GGUF models from HuggingFace |
| 🧹 Unicode artifact cleaning | Automatic removal of invisible characters to prevent AI detection |
| 🏷️ Consistent naming | Clear provider/model-name format for attribution |
| 🧪 Comprehensive tests | Extensive unit and integration test suite |
| 📄 Apache-2.0 license | Open-source license that protects contributions |
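The retry-then-fallback behaviour described in the table can be sketched in plain Python. This is a conceptual illustration of the pattern, not OneLLM's actual implementation; the function and parameter names are hypothetical:

```python
import time


def call_with_fallback(models, call_fn, retries=2, delay=0.0):
    """Try each model up to `retries` times before moving to the next.

    `call_fn(model)` stands in for a real completion call; any exception
    counts as a failure. Only after every model is exhausted does the
    helper give up.
    """
    errors = []
    for model in models:
        for attempt in range(retries):
            try:
                return call_fn(model)
            except Exception as exc:
                errors.append((model, attempt, exc))
                time.sleep(delay)
    raise RuntimeError(f"all models failed: {errors}")


# Simulate a primary model that is down and a fallback that works.
def fake_call(model):
    if model == "openai/gpt-4o":
        raise ConnectionError("primary unavailable")
    return f"answer from {model}"


print(call_with_fallback(["openai/gpt-4o", "anthropic/claude-3-haiku"], fake_call))
# Output: answer from anthropic/claude-3-haiku
```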
🌐 Supported Providers
OneLLM currently supports 22 providers with more on the way:
Cloud API Providers (20)
- Anthropic - Claude family of models
- Anyscale - Configurable AI platform
- AWS Bedrock - Access to multiple model families
- Azure OpenAI - Microsoft-hosted OpenAI models
- Cohere - Command models with RAG
- DeepSeek - Chinese LLM provider
- Fireworks - Fast inference platform
- Moonshot - Kimi models with long-context capabilities
- Google AI Studio - Gemini models via API key
- Groq - Ultra-fast inference for Llama, Mixtral
- GLM (Z.AI) - OpenAI-compatible GLM-4 family
- Mistral - Mistral Large, Medium, Small
- OpenAI - GPT-4o, o3-mini, DALL-E, Whisper, etc.
- OpenRouter - Gateway to 100+ models
- Perplexity - Search-augmented models
- Together AI - Open-source model hosting
- Vercel AI Gateway - Gateway to 100+ models from multiple providers
- Vertex AI - Google Cloud's enterprise Gemini
- X.AI - Grok models
- MiniMax - M2 model series with advanced reasoning
Local Providers (2)
- Ollama - Run models locally with easy management
- llama.cpp - Direct GGUF model execution
Notable Models Available
Through these providers, you gain access to hundreds of models, including:
<div align="center">
<!-- Model categories -->
<table>
  <tr>
    <th>Model Family</th>
    <th>Notable Models</th>
  </tr>
  <tr>
    <td><strong>OpenAI Family</strong></td>
    <td>GPT-4o, GPT-4 Turbo, o3</td>
  </tr>
  <tr>
    <td><strong>Claude Family</strong></td>
    <td>Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku</td>
  </tr>
  <tr>
    <td><strong>Llama Family</strong></td>
    <td>Llama 3 70B, Llama 3 8B, Code Llama</td>
  </tr>
  <tr>
    <td><strong>Mistral Family</strong></td>
    <td>Mistral Large, Mistral 7B, Mixtral</td>
  </tr>
  <tr>
    <td><strong>Gemini Family</strong></td>
    <td>Gemini Pro, Gemini Ultra, Gemini Flash</td>
  </tr>
  <tr>
    <td><strong>Embeddings</strong></td>
    <td>Ada-002, text-embedding-3-small/large, Cohere embeddings</td>
  </tr>
  <tr>
    <td><strong>Multimodal</strong></td>
    <td>GPT-4 Vision, Claude 3 Vision, Gemini Pro Vision</td>
  </tr>
</table>
</div>

🏗️ Architecture
OneLLM follows a modular architecture with clear separation of concerns:
---
config:
look: handDrawn
theme: mc
themeVariables:
background: 'transparent'
primaryColor: '#fff0'
secondaryColor: 'transparent'
tertiaryColor: 'transparent'
mainBkg: 'transparent'
flowchart:
layout: fixed
---
flowchart TD
%% User API Layer
User(User Application) --> ChatCompletion
User --> Completion
User --> Embedding
User --> OpenAIClient["OpenAI Client Interface"]
subgraph API["Public API Layer"]
ChatCompletion["ChatCompletion\n.create() / .acreate()"]
Completion["Completion\n.create() / .acreate()"]
Embedding["Embedding\n.create() / .acreate()"]
OpenAIClient
end
%% Core logic
subgraph Core["Core Logic"]
Router["Provider Router"]
Config["Configuration\nEnvironment Variables\nAPI Keys"]
FallbackManager["Fallback Manager"]
RetryManager["Retry Manager"]
end
%% Provider Layer
BaseProvider["Provider Interface<br>(Base Class)"]
subgraph Implementations["Provider Implementations"]
OpenAI["OpenAI"]
Anthropic["Anthropic"]
GoogleProvider["Google"]
Groq["Groq"]
Ollama["Local LLMs"]
OtherProviders["Other Providers"]
end
%% Utilities
subgraph Utilities["Utilities"]
Streaming["Streaming<br>Handlers"]
TokenCounting["Token<br>Counter"]
ErrorHandling["Error<br>Handling"]
