# mimir

LLM Semantic Cache

mimir is a drop-in proxy that caches LLM API responses using semantic similarity, reducing costs and latency for repeated or similar queries.
## Features

- **Semantic Caching** - Cache hits for semantically similar prompts, not just exact matches
- **Free Local Embeddings** - Use Ollama for embeddings with zero API costs
- **OpenAI-Compatible** - Drop-in replacement proxy for the OpenAI API
- **Configurable Threshold** - Tune similarity sensitivity (0.0-1.0)
- **TTL Support** - Time-based cache expiration
- **Zero Dependencies** - Single binary, no external database required
- **Docker Ready** - Simple containerized deployment
## How It Works

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│    mimir    │────▶│   LLM API   │
│  (app/pod)  │◀────│   (proxy)   │◀────│ (OpenAI/..) │
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────▼──────┐
                    │ Vector Store│
                    │ (embeddings)│
                    └─────────────┘
```
1. Incoming request is converted to an embedding
2. Cache is searched for semantically similar previous requests
3. If similarity exceeds the threshold → return the cached response
4. Otherwise → forward to the upstream API and cache the response
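The lookup step above can be sketched as follows. This is an illustrative model of threshold-based semantic caching with cosine similarity, not mimir's actual implementation; the in-memory `cache` list and toy embeddings are hypothetical stand-ins.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(query_embedding, cache, threshold=0.95):
    """Return the cached response whose stored embedding is most similar
    to the query, but only if that similarity clears the threshold."""
    best_score, best_response = 0.0, None
    for cached_embedding, cached_response in cache:
        score = cosine_similarity(query_embedding, cached_embedding)
        if score > best_score:
            best_score, best_response = score, cached_response
    if best_score >= threshold:
        return best_response, best_score  # cache HIT
    return None, best_score               # cache MISS -> forward upstream

# Toy 3-dimensional "embeddings" for illustration only
cache = [([1.0, 0.0, 0.0], "Paris"), ([0.0, 1.0, 0.0], "42")]
print(lookup([0.99, 0.1, 0.0], cache))  # close to the first entry -> HIT
```

Real embedding vectors have hundreds or thousands of dimensions (see the model table below), but the decision rule is the same.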
## Quick Start
### Option 1: Local Embeddings with Ollama (Free)

```bash
# Install Ollama (if not already installed)
brew install ollama  # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh  # Linux

# Start Ollama and pull the embedding model
ollama serve &
ollama pull nomic-embed-text

# Clone and run mimir
git clone https://github.com/aqstack/mimir.git
cd mimir
make build
./bin/mimir
```
### Option 2: OpenAI Embeddings

```bash
# Clone and build
git clone https://github.com/aqstack/mimir.git
cd mimir
make build

# Run with OpenAI
export OPENAI_API_KEY=sk-...
./bin/mimir
```
### Using Docker

```bash
# With Ollama (requires Ollama running on the host)
docker run -p 8080:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/aqstack/mimir:latest

# With OpenAI
docker run -p 8080:8080 -e OPENAI_API_KEY=$OPENAI_API_KEY ghcr.io/aqstack/mimir:latest
```
## Usage
Point your OpenAI client to mimir instead of the OpenAI API:
```python
from openai import OpenAI

# Point to the mimir proxy
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key",  # or use the OPENAI_API_KEY env var
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

# Check cache status in the response headers:
#   X-Mimir-Cache: HIT or MISS
#   X-Mimir-Similarity: 0.9823 (if HIT)
```
## Configuration

| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| `MIMIR_EMBEDDING_PROVIDER` | `ollama` | Embedding provider: `ollama` or `openai` |
| `MIMIR_EMBEDDING_MODEL` | `nomic-embed-text` | Embedding model name |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OPENAI_API_KEY` | - | OpenAI API key (auto-switches provider if set) |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | Upstream API URL |
| `MIMIR_PORT` | `8080` | Server port |
| `MIMIR_HOST` | `0.0.0.0` | Server host |
| `MIMIR_SIMILARITY_THRESHOLD` | `0.95` | Minimum similarity for a cache hit (0.0-1.0) |
| `MIMIR_CACHE_TTL` | `24h` | Cache entry time-to-live |
| `MIMIR_MAX_CACHE_SIZE` | `10000` | Maximum number of cache entries |
| `MIMIR_LOG_JSON` | `false` | Emit logs in JSON format |
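For example, defaults can be overridden before starting the binary. The variable names come from the table above; the values below are arbitrary illustrations, not recommended settings.

```bash
# Looser matching, shorter-lived entries, JSON logs
export MIMIR_SIMILARITY_THRESHOLD=0.90
export MIMIR_CACHE_TTL=1h
export MIMIR_LOG_JSON=true
./bin/mimir
```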
## Embedding Models

Ollama (free, local):

- `nomic-embed-text` (768 dims, recommended)
- `mxbai-embed-large` (1024 dims)
- `all-minilm` (384 dims, fastest)

OpenAI (paid):

- `text-embedding-3-small` (1536 dims, recommended)
- `text-embedding-3-large` (3072 dims)
- `text-embedding-ada-002` (1536 dims)
## API Endpoints

| Endpoint | Description |
|----------|-------------|
| `POST /v1/chat/completions` | Chat completions (cached) |
| `GET /health` | Health check |
| `GET /stats` | Cache statistics |
| `* /v1/*` | Other OpenAI endpoints (passthrough) |
## Cache Statistics

```bash
curl http://localhost:8080/stats
```

```json
{
  "total_entries": 150,
  "total_hits": 1234,
  "total_misses": 567,
  "hit_rate": 0.685,
  "estimated_saved_usd": 1.234
}
```
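As a sanity check on the example payload, `hit_rate` is simply hits divided by total lookups. The sketch below recomputes it from the numbers shown above; it does not call the endpoint, and the field names are taken from the example response.

```python
# Values from the example /stats payload above
stats = {"total_hits": 1234, "total_misses": 567}

def hit_rate(stats):
    """Fraction of lookups served from cache."""
    total = stats["total_hits"] + stats["total_misses"]
    return stats["total_hits"] / total if total else 0.0

print(round(hit_rate(stats), 3))  # 0.685, matching the example
```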
## Tuning the Similarity Threshold

`MIMIR_SIMILARITY_THRESHOLD` controls how similar a query must be to trigger a cache hit:
| Threshold | Behavior |
|-----------|----------|
| 0.99 | Nearly exact matches only |
| 0.95 | Very similar queries (recommended) |
| 0.90 | Moderate similarity |
| 0.85 | Loose matching (may return less relevant responses) |
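The threshold acts as a simple decision boundary on the similarity score. For a query scoring, say, 0.93 against its nearest cached neighbor (a hypothetical value, not a measurement), the settings in the table above behave differently:

```python
def is_cache_hit(similarity, threshold):
    """A hit occurs only when similarity meets or exceeds the threshold."""
    return similarity >= threshold

similarity = 0.93  # hypothetical nearest-neighbor score
for threshold in (0.99, 0.95, 0.90, 0.85):
    status = "HIT" if is_cache_hit(similarity, threshold) else "MISS"
    print(f"threshold={threshold:.2f} -> {status}")
```

Raising the threshold trades a lower hit rate for more faithful responses; lowering it saves more upstream calls at the risk of returning answers to merely related questions.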
## Roadmap
- [x] Local embeddings with Ollama
- [ ] Redis/Qdrant backend for persistence
- [ ] Prometheus metrics
- [ ] Cache warming
- [ ] Support for Anthropic, Gemini APIs
## Contributing
Contributions are welcome! Please open an issue or submit a pull request.
## License
MIT License - see LICENSE for details.