# mimir

LLM Semantic Cache

mimir is a drop-in proxy that caches LLM API responses using semantic similarity, reducing costs and latency for repeated or similar queries.
## Features

- **Semantic Caching** - Cache hits for semantically similar prompts, not just exact matches
- **Free Local Embeddings** - Use Ollama for embeddings with zero API costs
- **OpenAI-Compatible** - Drop-in replacement proxy for the OpenAI API
- **Configurable Threshold** - Tune similarity sensitivity (0.0-1.0)
- **TTL Support** - Time-based cache expiration
- **Zero Dependencies** - Single binary, no external database required
- **Docker Ready** - Simple containerized deployment
## How It Works

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│    mimir    │────▶│   LLM API   │
│  (app/pod)  │◀────│   (proxy)   │◀────│ (OpenAI/..) │
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────▼──────┐
                    │ Vector Store│
                    │ (embeddings)│
                    └─────────────┘
```
1. Incoming request is converted to an embedding
2. Cache is searched for semantically similar previous requests
3. If similarity exceeds the threshold → return the cached response
4. Otherwise → forward to the upstream API and cache the response
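The lookup step above can be sketched as follows. This is an illustrative model of threshold-based semantic caching with cosine similarity, not mimir's actual implementation; the in-memory `cache` list and toy embeddings are hypothetical stand-ins.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(query_embedding, cache, threshold=0.95):
    """Return the cached response whose stored embedding is most similar
    to the query, but only if that similarity clears the threshold."""
    best_score, best_response = 0.0, None
    for cached_embedding, cached_response in cache:
        score = cosine_similarity(query_embedding, cached_embedding)
        if score > best_score:
            best_score, best_response = score, cached_response
    if best_score >= threshold:
        return best_response, best_score  # cache HIT
    return None, best_score               # cache MISS -> forward upstream

# Toy 3-dimensional "embeddings" for illustration only
cache = [([1.0, 0.0, 0.0], "Paris"), ([0.0, 1.0, 0.0], "42")]
print(lookup([0.99, 0.1, 0.0], cache))  # close to the first entry -> HIT
```

Real embedding vectors have hundreds or thousands of dimensions (see the model table below), but the decision rule is the same.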
## Quick Start
### Option 1: Local Embeddings with Ollama (Free)

```bash
# Install Ollama (if not already installed)
brew install ollama  # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh  # Linux

# Start Ollama and pull the embedding model
ollama serve &
ollama pull nomic-embed-text

# Clone and run mimir
git clone https://github.com/aqstack/mimir.git
cd mimir
make build
./bin/mimir
```
### Option 2: OpenAI Embeddings

```bash
# Clone and build
git clone https://github.com/aqstack/mimir.git
cd mimir
make build

# Run with OpenAI
export OPENAI_API_KEY=sk-...
./bin/mimir
```
### Using Docker

```bash
# With Ollama (requires Ollama running on the host)
docker run -p 8080:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/aqstack/mimir:latest

# With OpenAI
docker run -p 8080:8080 -e OPENAI_API_KEY=$OPENAI_API_KEY ghcr.io/aqstack/mimir:latest
```
## Usage
Point your OpenAI client to mimir instead of the OpenAI API:
```python
from openai import OpenAI

# Point to the mimir proxy
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key",  # or use the OPENAI_API_KEY env var
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

# Check cache status in the response headers:
#   X-Mimir-Cache: HIT or MISS
#   X-Mimir-Similarity: 0.9823 (if HIT)
```
## Configuration

| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| `MIMIR_EMBEDDING_PROVIDER` | `ollama` | Embedding provider: `ollama` or `openai` |
| `MIMIR_EMBEDDING_MODEL` | `nomic-embed-text` | Embedding model name |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OPENAI_API_KEY` | - | OpenAI API key (auto-switches provider if set) |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | Upstream API URL |
| `MIMIR_PORT` | `8080` | Server port |
| `MIMIR_HOST` | `0.0.0.0` | Server host |
| `MIMIR_SIMILARITY_THRESHOLD` | `0.95` | Minimum similarity for a cache hit (0.0-1.0) |
| `MIMIR_CACHE_TTL` | `24h` | Cache entry time-to-live |
| `MIMIR_MAX_CACHE_SIZE` | `10000` | Maximum number of cache entries |
| `MIMIR_LOG_JSON` | `false` | Emit logs in JSON format |
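For example, defaults can be overridden before starting the binary. The variable names come from the table above; the values below are arbitrary illustrations, not recommended settings.

```bash
# Looser matching, shorter-lived entries, JSON logs
export MIMIR_SIMILARITY_THRESHOLD=0.90
export MIMIR_CACHE_TTL=1h
export MIMIR_LOG_JSON=true
./bin/mimir
```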
## Embedding Models

Ollama (free, local):

- `nomic-embed-text` (768 dims, recommended)
- `mxbai-embed-large` (1024 dims)
- `all-minilm` (384 dims, fastest)

OpenAI (paid):

- `text-embedding-3-small` (1536 dims, recommended)
- `text-embedding-3-large` (3072 dims)
- `text-embedding-ada-002` (1536 dims)
## API Endpoints

| Endpoint | Description |
|----------|-------------|
| `POST /v1/chat/completions` | Chat completions (cached) |
| `GET /health` | Health check |
| `GET /stats` | Cache statistics |
| `* /v1/*` | Other OpenAI endpoints (passthrough) |
## Cache Statistics

```bash
curl http://localhost:8080/stats
```

```json
{
  "total_entries": 150,
  "total_hits": 1234,
  "total_misses": 567,
  "hit_rate": 0.685,
  "estimated_saved_usd": 1.234
}
```
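As a sanity check on the example payload, `hit_rate` is simply hits divided by total lookups. The sketch below recomputes it from the numbers shown above; it does not call the endpoint, and the field names are taken from the example response.

```python
# Values from the example /stats payload above
stats = {"total_hits": 1234, "total_misses": 567}

def hit_rate(stats):
    """Fraction of lookups served from cache."""
    total = stats["total_hits"] + stats["total_misses"]
    return stats["total_hits"] / total if total else 0.0

print(round(hit_rate(stats), 3))  # 0.685, matching the example
```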
## Tuning the Similarity Threshold

`MIMIR_SIMILARITY_THRESHOLD` controls how similar a query must be to trigger a cache hit:
| Threshold | Behavior |
|-----------|----------|
| 0.99 | Nearly exact matches only |
| 0.95 | Very similar queries (recommended) |
| 0.90 | Moderate similarity |
| 0.85 | Loose matching (may return less relevant responses) |
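The threshold acts as a simple decision boundary on the similarity score. For a query scoring, say, 0.93 against its nearest cached neighbor (a hypothetical value, not a measurement), the settings in the table above behave differently:

```python
def is_cache_hit(similarity, threshold):
    """A hit occurs only when similarity meets or exceeds the threshold."""
    return similarity >= threshold

similarity = 0.93  # hypothetical nearest-neighbor score
for threshold in (0.99, 0.95, 0.90, 0.85):
    status = "HIT" if is_cache_hit(similarity, threshold) else "MISS"
    print(f"threshold={threshold:.2f} -> {status}")
```

Raising the threshold trades a lower hit rate for more faithful responses; lowering it saves more upstream calls at the risk of returning answers to merely related questions.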
## Roadmap
- [x] Local embeddings with Ollama
- [ ] Redis/Qdrant backend for persistence
- [ ] Prometheus metrics
- [ ] Cache warming
- [ ] Support for Anthropic, Gemini APIs
## Contributing
Contributions are welcome! Please open an issue or submit a pull request.
## License
MIT License - see LICENSE for details.