
Cascadeflow

Cascading runtime for AI agents. Optimize cost, latency, quality, and policy decisions inside the agent loop.


<div align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="./.github/assets/CF_logo_bright.svg"> <source media="(prefers-color-scheme: light)" srcset="./.github/assets/CF_logo_dark.svg"> <img alt="cascadeflow Logo" src="./.github/assets/CF_logo_dark.svg" width="80%" style="margin: 20px auto;"> </picture>

Agent Runtime Intelligence Layer


<br>

Cost savings: 69% (MT-Bench), 93% (GSM8K), 52% (MMLU), and 80% (TruthfulQA), while retaining 96% of GPT-5 quality.

<br>

<img src=".github/assets/CF_python_color.svg" width="22" height="22" alt="Python" style="vertical-align: middle;"/> Python<img src=".github/assets/CF_ts_color.svg" width="22" height="22" alt="TypeScript" style="vertical-align: middle;"/> TypeScript<picture><source media="(prefers-color-scheme: dark)" srcset="./.github/assets/LC-logo-bright.png"><source media="(prefers-color-scheme: light)" srcset="./.github/assets/LC-logo-dark.png"><img src=".github/assets/LC-logo-dark.png" height="22" alt="LangChain" style="vertical-align: middle;"></picture> LangChain<img src=".github/assets/CF_n8n_color.svg" width="22" height="22" alt="n8n" style="vertical-align: middle;"/> n8n<picture><source media="(prefers-color-scheme: dark)" srcset="./.github/assets/CF_vercel_bright.svg"><source media="(prefers-color-scheme: light)" srcset="./.github/assets/CF_vercel_dark.svg"><img src=".github/assets/CF_vercel_dark.svg" width="22" height="22" alt="Vercel AI" style="vertical-align: middle;"></picture> Vercel AI<img src=".github/assets/CF_openclaw_color.svg" width="22" height="22" alt="OpenClaw" style="vertical-align: middle;"/> OpenClaw<img src=".github/assets/CF_google_adk_color.svg" width="22" height="22" alt="Google ADK" style="vertical-align: middle;"/> Google ADK📖 Docs💡 Examples

</div>

The in-process intelligence layer for AI agents. Optimize cost, latency, quality, budget, compliance, and energy — inside the execution loop, not at the HTTP boundary.

cascadeflow works where external proxies can't: per-step model decisions based on agent state, per-tool-call budget gating, runtime stop/continue/escalate actions, and business KPI injection during agent loops. It accumulates insight from every model call, tool result, and quality score — the agent gets smarter the more it runs. Sub-5ms overhead. Works with LangChain, OpenAI Agents SDK, CrewAI, Google ADK, n8n, and Vercel AI SDK.

pip install cascadeflow
npm install @cascadeflow/core

Why cascadeflow?

Proxy vs In-Process Harness

| Dimension | External Proxy | cascadeflow Harness |
|---|---|---|
| Scope | HTTP request boundary | Inside agent execution loop |
| Dimensions | Cost only | Cost + quality + latency + budget + compliance + energy |
| Latency overhead | 10-50ms network RTT | <5ms in-process |
| Business logic | None | KPI weights and targets |
| Enforcement | None (observe only) | stop, deny_tool, switch_model |
| Auditability | Request logs | Per-step decision traces |

cascadeflow is a library and agent harness: an intelligent model-cascading package that dynamically selects the optimal model for each query or tool call through speculative execution. It builds on research showing that 40-70% of queries don't require slow, expensive flagship models, and that domain-specific smaller models often outperform large general-purpose ones on specialized tasks. Queries that do need advanced reasoning are automatically escalated to flagship models.

<details> <summary><b>Use Cases</b></summary>
  • Inside-the-Loop Control. Influence decisions at every agent step — model call, tool call, sub-agent handoff — where most cost, delay, and failure actually happen. External proxies only see request boundaries; cascadeflow sees decision boundaries.
  • Multi-Dimensional Optimization. Optimize across cost, latency, quality, budget, compliance/risk, and energy simultaneously — relevant to engineering, finance, security, operations, and sustainability stakeholders.
  • Business Logic Injection. Embed KPI weights and policy intent directly into agent behavior at runtime. Shift AI control from static prompt design to live business governance.
  • Runtime Enforcement. Directly steer outcomes with four actions: allow, switch_model, deny_tool, stop — based on current context and policy state. Closes the gap between analytics and execution.
  • Auditability & Transparency. Every runtime decision is traceable and attributable. Supports audit requirements, faster tuning cycles, and trust in regulated or high-stakes workflows.
  • Measurable Value. Prove impact with reproducible metrics on realistic agent workflows — better economics and latency while preserving quality thresholds.
  • Latency Advantage. Proxy-based optimization adds 40-60ms per call. In a 10-step agent loop, that is 400-600ms of avoidable overhead. cascadeflow runs in-process with sub-5ms overhead — critical for real-time UX, task throughput, and enterprise SLAs.
  • Framework & Provider Neutral. Works with LangChain, OpenAI Agents SDK, CrewAI, Google ADK, Vercel AI SDK, n8n, and custom frameworks. Unified API across OpenAI, Anthropic, Groq, Ollama, vLLM, Together, and more.
  • Self-Improving Agent Intelligence. Because cascadeflow runs inside the agent loop, it accumulates deep insight into every model call, tool result, quality score, and routing decision over time. This enables cascadeflow to learn which models perform best for which tasks, adapt routing strategies, and continuously improve cost-quality tradeoffs — without manual tuning. The agent gets smarter the more it runs.
  • Edge & Local-Hosted AI. Handle most queries with local models (vLLM, Ollama), automatically escalate complex queries to cloud providers only when needed.

ℹ️ Note: SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks. Research paper

</details>
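The four runtime actions (allow, switch_model, deny_tool, stop) can be sketched as a minimal in-process policy gate. Everything below, including the `Step` fields, the thresholds, and the `decide()` helper, is a hypothetical illustration of the technique, not cascadeflow's actual API:

```python
# Hypothetical sketch of an in-process policy gate returning one of the
# four runtime actions. The Step fields and thresholds are illustrative
# assumptions, not cascadeflow's real types.
from dataclasses import dataclass

@dataclass
class Step:
    kind: str             # "model_call" or "tool_call"
    est_cost: float       # estimated cost of this step, USD
    spent: float          # budget consumed so far, USD
    budget: float         # total budget for the run, USD
    quality: float = 1.0  # rolling quality score, 0..1

def decide(step: Step) -> str:
    # Hard stop: the run has already exhausted its budget.
    if step.spent >= step.budget:
        return "stop"
    # Deny a tool call that would overshoot the remaining budget.
    if step.kind == "tool_call" and step.spent + step.est_cost > step.budget:
        return "deny_tool"
    # Quality is slipping: escalate model calls to a stronger model.
    if step.kind == "model_call" and step.quality < 0.6:
        return "switch_model"
    return "allow"

print(decide(Step("tool_call", est_cost=0.50, spent=0.80, budget=1.00)))   # deny_tool
print(decide(Step("model_call", est_cost=0.01, spent=0.10, budget=1.00,
                  quality=0.4)))                                           # switch_model
```

Because the gate runs inside the agent loop, it sees each step's state before the step executes, which is what distinguishes enforcement from after-the-fact request logging.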

How cascadeflow Works

cascadeflow uses speculative execution with quality validation:

  1. Speculatively executes small, fast models first (optimistic execution, $0.15-0.30/1M tokens)
  2. Validates quality of responses using configurable thresholds (completeness, confidence, correctness)
  3. Dynamically escalates to larger models only when quality validation fails ($1.25-3.00/1M tokens)
  4. Learns patterns to optimize future cascading decisions and domain-specific routing
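The four steps above can be sketched generically. The model names, prices, and the `validate()` heuristic below are illustrative assumptions, not cascadeflow's real API:

```python
# Illustrative sketch of speculative cascading with quality validation.
# Model names, rates, and the validate() heuristic are assumptions for
# demonstration only; they are not cascadeflow's actual API.

def validate(response: str, min_length: int = 20) -> bool:
    """Toy quality check: non-empty, long enough, no refusal marker."""
    return len(response) >= min_length and "i don't know" not in response.lower()

def cascade(query: str, call_model) -> dict:
    # 1. Speculatively run the small, cheap drafter first.
    draft = call_model("small-drafter", query)          # ~$0.20 / 1M tokens
    # 2. Validate the draft against quality thresholds.
    if validate(draft):
        return {"answer": draft, "model": "small-drafter", "escalated": False}
    # 3. Validation failed: escalate to the flagship verifier.
    final = call_model("flagship-verifier", query)      # ~$2.00 / 1M tokens
    return {"answer": final, "model": "flagship-verifier", "escalated": True}

# Stub provider so the sketch is runnable end to end.
def fake_call(model: str, query: str) -> str:
    return "Paris is the capital of France."

result = cascade("What is the capital of France?", fake_call)
print(result["model"], result["escalated"])
```

Step 4 (learning routing patterns) would sit on top of this loop, recording which model answered which kind of query and biasing future drafter selection accordingly.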

Zero configuration. Works with YOUR existing models (>17 providers currently supported).

In practice, 60-70% of queries are handled by small, efficient models (an 8-20x cost difference) without requiring escalation.

Result: 40-85% cost reduction, 2-10x faster responses, zero quality loss.
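As a back-of-the-envelope check on those figures, assume illustrative rates of $0.20/1M tokens for the small model and $2.00/1M for the flagship, with 65% of queries resolved without escalation (escalated queries pay for both the failed draft and the flagship retry). These rates are assumptions for arithmetic, not benchmark results:

```python
# Back-of-the-envelope blended cost under assumed per-1M-token rates.
small_rate, flagship_rate = 0.20, 2.00  # USD per 1M tokens (assumed)
small_share = 0.65                      # fraction handled without escalation

# Escalated queries pay for the failed draft plus the flagship retry.
blended = small_share * small_rate + (1 - small_share) * (small_rate + flagship_rate)
baseline = flagship_rate                # routing everything to the flagship

savings = 1 - blended / baseline
print(f"blended ${blended:.2f}/1M tokens, savings {savings:.0%}")
```

With these assumed rates the blended cost works out to about $0.90/1M tokens, roughly a 55% saving, which lands inside the 40-85% range claimed above.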

┌─────────────────────────────────────────────────────────────┐
│                      cascadeflow Stack                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascade Agent                                        │  │
│  │                                                       │  │
│  │  Orchestrates the entire cascade execution            │  │
│  │  • Query routing & model selection                    │  │
│  │  • Drafter -> Verifier coordination                   │  │
│  │  • Cost tracking & telemetry                          │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
No findings