# LLM Router
Route cheap work away from premium models.
LLM Router is an MCP server and hook set that intercepts prompts and routes them to the cheapest model that can handle the task.
It is built for a common failure mode in AI coding tools: using your best model for everything. In Claude Code, that burns quota on simple explanations, file lookups, small edits, and repetitive prompts. In other MCP clients, it means paying premium-model prices for work that never needed them.
The goal is simple: keep cheap work on cheap or free models, keep hard work on Claude or other premium models, and remove the need to micromanage model selection. It works in Claude Code, Cursor, Windsurf, Zed, claw-code, and Agno.
## Why
Most sessions contain a lot of low-value turns: quick questions, repo lookups, boilerplate edits, and small follow-ups. Those are exactly the prompts that quietly burn through premium models.
LLM Router offloads that work first, then escalates when the task actually needs more capability.
- Cheap work stays cheap.
- Hard work still gets the best model.
- Your workflow stays the same.
It does not try to replace Claude or force weak models onto hard tasks. It removes the waste around them.
## Quick Start

```bash
pipx install claude-code-llm-router && llm-router install
```

`llm-router install` registers the MCP server and installs hooks so prompt routing starts automatically.

If you use Claude Code Pro/Max, you can start with zero API keys. Otherwise, add `GEMINI_API_KEY` for a cheap free-tier fallback.

```bash
GEMINI_API_KEY=AIza...                # optional free-tier fallback
LLM_ROUTER_CLAUDE_SUBSCRIPTION=true
```
## How It Works
- Intercept the prompt before your default premium model sees it.
- Classify the task and its complexity.
- Try the cheapest capable route first.
- Escalate or fall back when the task needs more capability.
Under the hood, every prompt goes through a `UserPromptSubmit` hook before your top-tier model sees it:

| Tier | Cost / latency | What it does |
|------|----------------|--------------|
| 0. Context inherit | instant, free | "yes/ok/go ahead" reuses the prior turn's route |
| 1. Heuristic scoring | instant, free | high-confidence patterns route immediately |
| 2. Ollama local LLM | free, ~1s | catches what heuristics miss |
| 3. Cheap API | ~$0.0001 | Gemini Flash / GPT-4o-mini fallback |
| Prompt | Classified as | Routed to |
|--------|---------------|-----------|
| "What does os.path.join do?" | query/simple | Gemini Flash ($0.000001) |
| "Fix the bug in auth.py" | code/moderate | Haiku / Sonnet |
| "Design the full auth system" | code/complex | Sonnet / Opus |
| "Research latest AI funding" | research | Perplexity Sonar Pro |
| "Generate a hero image" | image | Flux Pro via fal.ai |
Free-first chain (subscription mode): Ollama → Codex (free via OpenAI sub) → paid API
## MCP Tools
41 tools across 6 categories:
### Smart Routing
| Tool | What it does |
|------|-------------|
| llm_route | Auto-classify prompt → route to best model |
| llm_auto | Route + server-side savings tracking — designed for hook-less hosts (Codex CLI, Claude Desktop, Copilot) |
| llm_classify | Classify complexity + recommend model |
| llm_select_agent | Pick agent CLI (claude_code / codex) + model for a session |
| llm_stream | Stream LLM response for long-running tasks |
### Text & Code
| Tool | What it does |
|------|-------------|
| llm_query | General questions — routed to cheapest capable model |
| llm_research | Web-grounded answers via Perplexity Sonar |
| llm_generate | Creative writing, summaries, brainstorming |
| llm_analyze | Deep reasoning — analysis, debugging, design review |
| llm_code | Code generation, refactoring, algorithms |
| llm_edit | Route edit reasoning to cheap model → returns {file, old, new} patch pairs |
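The `{file, old, new}` patch-pair shape lends itself to a simple exact-match apply step on the caller's side. A minimal sketch assuming that schema (the tool's real output format may differ):

```python
from pathlib import Path

def apply_patches(patches: list[dict], root: Path) -> list[str]:
    """Apply {file, old, new} pairs by exact one-shot string replacement."""
    changed = []
    for p in patches:
        path = root / p["file"]
        text = path.read_text()
        if p["old"] not in text:
            # The file drifted since the patch was generated; refuse to guess.
            raise ValueError(f"stale patch for {p['file']}")
        path.write_text(text.replace(p["old"], p["new"], 1))
        changed.append(p["file"])
    return changed
```

Requiring an exact `old` match is what makes cheap-model edits safe to apply: a stale or hallucinated patch fails loudly instead of silently corrupting the file.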
### Filesystem
| Tool | What it does |
|------|-------------|
| llm_fs_find | Describe files to find → cheap model returns glob/grep commands |
| llm_fs_rename | Describe a rename → returns mv/git mv commands (dry_run by default) |
| llm_fs_edit_many | Bulk edits across files → returns all patch pairs |
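The dry_run convention in the table (return shell commands for review rather than executing them) can be sketched as follows; the function name and tuple input are illustrative assumptions, not the tool's real interface:

```python
import shlex
import subprocess

def rename_commands(pairs: list[tuple[str, str]], dry_run: bool = True) -> list[str]:
    """Build mv commands; execute them only when dry_run is explicitly disabled."""
    cmds = [f"mv {shlex.quote(src)} {shlex.quote(dst)}" for src, dst in pairs]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, shell=True, check=True)
    return cmds
```

Defaulting to dry_run means a misclassified prompt can at worst print wrong commands, never move files unreviewed.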
### Media
| Tool | What it does |
|------|-------------|
| llm_image | Image generation — Flux, DALL-E, Gemini Imagen |
| llm_video | Video generation — Runway, Kling, Veo 2 |
| llm_audio | TTS/voice — ElevenLabs, OpenAI |
### Orchestration
| Tool | What it does |
|------|-------------|
| llm_orchestrate | Multi-step pipeline across multiple models |
| llm_pipeline_templates | List available pipeline templates |
### Monitoring & Admin
| Tool | What it does |
|------|-------------|
| llm_usage | Unified dashboard — Claude sub, Codex, APIs, savings |
| llm_savings | Cross-session savings breakdown by period, host, and task type |
| llm_check_usage | Live Claude subscription usage (session %, weekly %) |
| llm_health | Provider availability + circuit breaker status |
| llm_providers | List all configured providers and models |
| llm_set_profile | Switch profile: budget / balanced / premium |
| llm_setup | Interactive provider wizard — add keys, validate, install hooks |
| llm_quality_report | Routing accuracy, savings metrics, classifier stats |
| llm_rate | Rate last response 👍/👎 — logged for quality tracking |
| llm_codex | Route task to local Codex desktop agent (free) |
| llm_save_session | Persist session summary for cross-session context |
| llm_cache_stats | Cache hit rate, entries, evictions |
| llm_cache_clear | Clear classification cache |
| llm_refresh_claude_usage | Force-refresh subscription data via OAuth |
| llm_update_usage | Feed usage data from claude.ai into the router |
| llm_track_usage | Report Claude Code token usage for budget tracking |
| llm_dashboard | Open web dashboard at localhost:7337 |
| llm_team_report | Team-wide routing savings report |
| llm_team_push | Push local savings data to shared team store |
| llm_policy | Show active org/repo routing policy + last 10 policy decisions |
| llm_digest | Savings digest with spend-spike detection; push to Slack/Discord webhook |
| llm_benchmark | Per-task-type routing accuracy from llm_rate feedback |
## Routing Profiles

Three profiles — switch anytime with `llm_set_profile`:
| Profile | Use case | Chain |
|---------|----------|-------|
| budget | Dev, drafts, exploration | Ollama → Haiku → Gemini Flash |
| balanced | Production work (default) | Codex → Sonnet → GPT-4o |
| premium | Critical tasks, max quality | Codex → Opus → o3 |
Complexity overrides the profile: simple prompts always use the budget chain and complex prompts always escalate to the premium chain, regardless of the active profile setting.
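The override rule reduces to a small lookup. A sketch assuming three complexity labels (simple / moderate / complex); the chain contents come from the table above, but the selection logic is an assumption:

```python
# Chains copied from the profiles table; the dict and function are illustrative.
CHAINS = {
    "budget":   ["Ollama", "Haiku", "Gemini Flash"],
    "balanced": ["Codex", "Sonnet", "GPT-4o"],
    "premium":  ["Codex", "Opus", "o3"],
}

def pick_chain(profile: str, complexity: str) -> list[str]:
    if complexity == "simple":    # simple work always stays on the budget chain
        return CHAINS["budget"]
    if complexity == "complex":   # complex work always escalates to premium
        return CHAINS["premium"]
    return CHAINS[profile]        # moderate work follows the active profile
```

So even on the premium profile, a trivial prompt never touches Opus, and even on the budget profile, a hard design task is not left to a local model.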
## Providers

| Provider | Models | Free tier | Best for |
|----------|--------|-----------|----------|
| Ollama | Any local model | Yes (forever) | Privacy, zero cost, offline |
| Google Gemini | 2.5 Flash, 2.5 Pro | Yes (1M tokens/day) | Generation, long context |
| Groq | Llama 3.3, Mixtral | Yes | Ultra-fast inference |
| OpenAI | GPT-4o, o3, DALL-E | No | Code, reasoning, images |
| Perplexity | Sonar, Sonar Pro | No | Research, current events |
| Anthropic | Haiku, Sonnet, Opus | No | Writing, analysis, safety |
| DeepSeek | V3, Reasoner | Limited | Cost-effective reasoning |
| Mistral | Large, Small | Limited | Multilingual |
| fal.ai | Flux, Kling, Veo | No | Images, video, audio |
| ElevenLabs | Voice models | Limited | High-quality TTS |
| Runway | Gen-3 | No | Professional video |
Full setup guides: `docs/PROVIDERS.md`
## Works With
### Claude Code

Auto-installed by `llm-router install`. Hooks intercept every prompt — you never need to call tools manually unless you want explicit control.

```bash
pipx install claude-code-llm-router && llm-router install
```

A live status bar shows routing stats before every prompt and in the persistent bottom statusline:

```
📊 CC 13%s · 24%w │ sub:0 · free:305 · paid:27 │ $1.59 saved (35%)
```
### claw-code

Add to `~/.claw-code/mcp.json`:

```json
{
  "mcpServers": {
    "llm-router": { "command": "llm-router", "args": [] }
  }
}
```

Every API call in claw-code is paid — the free-first chain (Ollama → Codex → Gemini Flash) saves more here than in Claude Code.
### Cursor / Windsurf / Zed

Add to your IDE's MCP config:

```json
{
  "mcpServers": {
    "llm-router": { "command": "llm-router", "args": [] }
  }
}
```
### Agno (multi-agent)

Two integration modes:

Option 1 — `RouteredModel` (v2.0+): use llm-router as a first-class Agno model. Every agent call is automatically routed to the cheapest capable provider.

```bash
pip install "claude-code-llm-router[agno]"
```

```python
from agno.agent import Agent
from llm_router.integrations.agno import RouteredModel, RouteredTeam

# Single agent — routes each call intelligently
coder = Agent(
    model=RouteredModel(task_type="code", profile="balanced"),
    instructions="You are a coding assistant.",
)
coder.print_response("Write a Python quicksort.")
```