OneLLM
A "drop-in" replacement for OpenAI's client that offers a unified interface for interacting with large language models from various providers, with support for hundreds of models, intelligent semantic caching, built-in fallback mechanisms, and enhanced reliability features.
📚 Table of Contents
- Overview
- Getting Started
- Key Features
- Supported Providers
- Architecture
- API Design
- Advanced Features
- Migration from OpenAI
- Model Naming Convention
- Configuration
- Versioning
- Documentation
- Call for Contributions
- License
- Acknowledgements
👉 Overview
OneLLM is a lightweight, provider-agnostic Python library that offers a unified interface for interacting with large language models (LLMs) from various providers. It simplifies the integration of LLMs into applications by providing a consistent API while abstracting away provider-specific implementation details.
The library follows the OpenAI client API design pattern, making it familiar to developers already using OpenAI and enabling easy migration for existing applications. Simply change your import statements and instantly gain access to hundreds of models across dozens of providers while maintaining your existing code structure.
With support for 22 implemented providers (and more planned), OneLLM gives you access to approximately 300+ unique language models through a single, consistent interface - from the latest proprietary models to open-source alternatives, all accessible through familiar OpenAI-compatible patterns.
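The `provider/model-name` convention mentioned above can be illustrated with plain string handling. This is a conceptual sketch of how such an identifier splits into its two parts, not OneLLM's actual routing code; the function name is hypothetical:

```python
def parse_model_id(model_id: str) -> tuple[str, str]:
    """Split a 'provider/model-name' identifier into its two parts.

    Model names may themselves contain slashes (e.g. gateway-style paths),
    so only the first '/' separates the provider from the model.
    """
    provider, _, model = model_id.partition("/")
    if not provider or not model:
        raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
    return provider, model

print(parse_model_id("openai/gpt-4o-mini"))        # ('openai', 'gpt-4o-mini')
print(parse_model_id("anthropic/claude-3-haiku"))  # ('anthropic', 'claude-3-haiku')
```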
> [!NOTE]
> Ready for Use: OneLLM now supports 22 providers with 300+ models! From cloud APIs to local models, you can access them all through a single, unified interface. Contributions are welcome to help add even more providers!
🚀 Getting Started
Installation
```bash
# Basic installation (includes OpenAI compatibility and download utility)
pip install OneLLM

# For all providers (includes dependencies for future provider support)
pip install "OneLLM[all]"
```
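Most cloud providers authenticate via API keys read from the environment. Assuming OneLLM picks up each provider's conventional environment variable (an assumption — verify the exact names in the Configuration section), setup might look like:

```shell
# Hypothetical: provider API keys read from standard environment variables
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."
```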
Download Models for Local Use
OneLLM includes a built-in utility for downloading GGUF models:
```bash
# Download a model from HuggingFace (saves to ~/llama_models by default)
onellm download --repo-id "TheBloke/Llama-2-7B-GGUF" --filename "llama-2-7b.Q4_K_M.gguf"

# Download to a custom directory
onellm download -r "microsoft/Phi-3-mini-4k-instruct-gguf" -f "Phi-3-mini-4k-instruct-q4.gguf" -o /path/to/models
```
Quick Win: Your First LLM Call
```python
# Basic usage with OpenAI-compatible syntax
from onellm import ChatCompletion

response = ChatCompletion.create(
    model="openai/gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

print(response.choices[0].message["content"])
# Output: I'm doing well, thank you for asking! I'm here and ready to help you...
```
For more detailed examples, check out the examples directory.
✨ Key Features
| Feature | Description |
|---------|-------------|
| 📦 Drop-in replacement | Use your existing OpenAI code with minimal changes |
| 🔄 Provider-agnostic | Support for 300+ models across 22 implemented providers |
| ⚡ Blazing-fast semantic cache | 42,000-143,000x faster responses, 50-80% cost savings with streaming support & TTL |
| 🔌 Connection pooling | Reuse HTTP connections for 100-300ms faster sequential calls |
| 🔁 Automatic fallback | Seamlessly switch to alternative models when needed |
| 🔄 Auto-retry mechanism | Retry the same model multiple times before failing |
| 🧩 OpenAI-compatible | Familiar interface for developers used to OpenAI |
| 📺 Streaming support | Real-time streaming responses from supported providers |
| 🖼️ Multi-modal capabilities | Support for text, images, audio across compatible models |
| 🏠 Local LLM support | Run models locally via Ollama and llama.cpp |
| ⬇️ Model downloads | Built-in CLI to download GGUF models from HuggingFace |
| 🧹 Unicode artifact cleaning | Automatic removal of invisible characters to prevent AI detection |
| 🏷️ Consistent naming | Clear provider/model-name format for attribution |
| 🧪 Comprehensive tests | Extensive unit and integration test suite |
| 📄 Apache-2.0 license | Open-source license that protects contributions |
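The retry-then-fallback behaviour described in the table can be sketched in plain Python. This is a conceptual illustration of the pattern, not OneLLM's actual implementation; the function and parameter names are hypothetical:

```python
import time


def call_with_fallback(models, call_fn, retries=2, delay=0.0):
    """Try each model up to `retries` times before moving to the next.

    `call_fn(model)` stands in for a real completion call; any exception
    counts as a failure. Only after every model is exhausted does the
    helper give up.
    """
    errors = []
    for model in models:
        for attempt in range(retries):
            try:
                return call_fn(model)
            except Exception as exc:
                errors.append((model, attempt, exc))
                time.sleep(delay)
    raise RuntimeError(f"all models failed: {errors}")


# Simulate a primary model that is down and a fallback that works.
def fake_call(model):
    if model == "openai/gpt-4o":
        raise ConnectionError("primary unavailable")
    return f"answer from {model}"


print(call_with_fallback(["openai/gpt-4o", "anthropic/claude-3-haiku"], fake_call))
# Output: answer from anthropic/claude-3-haiku
```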
🌐 Supported Providers
OneLLM currently supports 22 providers with more on the way:
Cloud API Providers (20)
- Anthropic - Claude family of models
- Anyscale - Configurable AI platform
- AWS Bedrock - Access to multiple model families
- Azure OpenAI - Microsoft-hosted OpenAI models
- Cohere - Command models with RAG
- DeepSeek - Chinese LLM provider
- Fireworks - Fast inference platform
- Moonshot - Kimi models with long-context capabilities
- Google AI Studio - Gemini models via API key
- Groq - Ultra-fast inference for Llama, Mixtral
- GLM (Z.AI) - OpenAI-compatible GLM-4 family
- Mistral - Mistral Large, Medium, Small
- OpenAI - GPT-4o, o3-mini, DALL-E, Whisper, etc.
- OpenRouter - Gateway to 100+ models
- Perplexity - Search-augmented models
- Together AI - Open-source model hosting
- Vercel AI Gateway - Gateway to 100+ models from multiple providers
- Vertex AI - Google Cloud's enterprise Gemini
- X.AI - Grok models
- MiniMax - M2 model series with advanced reasoning
Local Providers (2)
- Ollama - Run models locally with easy management
- llama.cpp - Direct GGUF model execution
Notable Models Available
Through these providers, you gain access to hundreds of models, including:
<div align="center">
<!-- Model categories -->
<table>
  <tr>
    <th>Model Family</th>
    <th>Notable Models</th>
  </tr>
  <tr>
    <td><strong>OpenAI Family</strong></td>
    <td>GPT-4o, GPT-4 Turbo, o3</td>
  </tr>
  <tr>
    <td><strong>Claude Family</strong></td>
    <td>Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku</td>
  </tr>
  <tr>
    <td><strong>Llama Family</strong></td>
    <td>Llama 3 70B, Llama 3 8B, Code Llama</td>
  </tr>
  <tr>
    <td><strong>Mistral Family</strong></td>
    <td>Mistral Large, Mistral 7B, Mixtral</td>
  </tr>
  <tr>
    <td><strong>Gemini Family</strong></td>
    <td>Gemini Pro, Gemini Ultra, Gemini Flash</td>
  </tr>
  <tr>
    <td><strong>Embeddings</strong></td>
    <td>Ada-002, text-embedding-3-small/large, Cohere embeddings</td>
  </tr>
  <tr>
    <td><strong>Multimodal</strong></td>
    <td>GPT-4 Vision, Claude 3 Vision, Gemini Pro Vision</td>
  </tr>
</table>
</div>

🏗️ Architecture
OneLLM follows a modular architecture with clear separation of concerns:
---
config:
look: handDrawn
theme: mc
themeVariables:
background: 'transparent'
primaryColor: '#fff0'
secondaryColor: 'transparent'
tertiaryColor: 'transparent'
mainBkg: 'transparent'
flowchart:
layout: fixed
---
flowchart TD
%% User API Layer
User(User Application) --> ChatCompletion
User --> Completion
User --> Embedding
User --> OpenAIClient["OpenAI Client Interface"]
subgraph API["Public API Layer"]
ChatCompletion["ChatCompletion\n.create() / .acreate()"]
Completion["Completion\n.create() / .acreate()"]
Embedding["Embedding\n.create() / .acreate()"]
OpenAIClient
end
%% Core logic
subgraph Core["Core Logic"]
Router["Provider Router"]
Config["Configuration\nEnvironment Variables\nAPI Keys"]
FallbackManager["Fallback Manager"]
RetryManager["Retry Manager"]
end
%% Provider Layer
BaseProvider["Provider Interface<br>(Base Class)"]
subgraph Implementations["Provider Implementations"]
OpenAI["OpenAI"]
Anthropic["Anthropic"]
GoogleProvider["Google"]
Groq["Groq"]
Ollama["Local LLMs"]
OtherProviders["Other Providers"]
end
%% Utilities
subgraph Utilities["Utilities"]
Streaming["Streaming<br>Handlers"]
TokenCounting["Token<br>Counter"]
ErrorHandling["Error<br>Handling"]
