GEPA
Optimize prompts, code, and more with AI-powered Reflective Text Evolution
What is GEPA?
GEPA (Genetic-Pareto) is a framework for optimizing any system with textual parameters against any evaluation metric. Unlike RL or gradient-based methods that collapse execution traces into a single scalar reward, GEPA uses LLMs to read full execution traces — error messages, profiling data, reasoning logs — to diagnose why a candidate failed and propose targeted fixes. Through iterative reflection, mutation, and Pareto-aware selection, GEPA evolves high-performing variants with minimal evaluations.
If you can measure it, you can optimize it: prompts, code, agent architectures, scheduling policies, vector graphics, and more.
Key Results
| Result | Context |
|---|---|
| 90x cheaper | Open-source models + GEPA beat Claude Opus 4.1 at Databricks |
| 35x faster than RL | 100–500 evaluations vs. 5,000–25,000+ for GRPO (paper) |
| 32% → 89% | ARC-AGI agent accuracy via architecture discovery |
| 40.2% cost savings | Cloud scheduling policy discovered by GEPA, beating expert heuristics |
| 55% → 82% | Coding agent resolve rate on Jinja via auto-learned skills |
| 50+ production uses | Across Shopify, Databricks, Dropbox, OpenAI, Pydantic, MLflow, Comet ML, and more |
"Both DSPy and (especially) GEPA are currently severely under hyped in the AI context engineering world" — Tobi Lutke, CEO, Shopify
Installation

    pip install gepa

To install the latest from main:

    pip install git+https://github.com/gepa-ai/gepa.git

Quick Start
Simple Prompt Optimization
Optimize a system prompt for math problems from the AIME benchmark in a few lines of code (full tutorial):

    import gepa

    trainset, valset, _ = gepa.examples.aime.init_dataset()

    seed_prompt = {
        "system_prompt": "You are a helpful assistant. Answer the question. "
                         "Put your final answer in the format '### <answer>'"
    }

    result = gepa.optimize(
        seed_candidate=seed_prompt,
        trainset=trainset,
        valset=valset,
        task_lm="openai/gpt-4.1-mini",
        max_metric_calls=150,
        reflection_lm="openai/gpt-5",
    )

    print("Optimized prompt:", result.best_candidate['system_prompt'])

Result: GPT-4.1 Mini goes from 46.6% → 56.6% on AIME 2025 (+10 percentage points).
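The seed prompt above asks the model to end with `### <answer>`. As a rough illustration of how a response in that format could be scored, here is a minimal sketch. The helpers `extract_final_answer` and `score_response` are hypothetical names written for this example and are not part of the `gepa` package; the AIME example may score answers differently.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last '###' marker, or None if there is none."""
    matches = re.findall(r"###\s*(.+)", response)
    return matches[-1].strip() if matches else None


def score_response(response: str, expected: str) -> float:
    """Hypothetical exact-match scorer: 1.0 if the final answer matches, else 0.0."""
    answer = extract_final_answer(response)
    return 1.0 if answer is not None and answer == expected.strip() else 0.0
```

Exact string matching is the simplest choice here; a real metric might normalize numbers or accept equivalent forms.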
With DSPy (Recommended for AI Pipelines)
The most powerful way to use GEPA for prompt optimization is within DSPy, where it's available as dspy.GEPA. See dspy.GEPA tutorials for executable notebooks.

    import dspy

    optimizer = dspy.GEPA(
        metric=your_metric,
        max_metric_calls=150,
        reflection_lm="openai/gpt-5",
    )
    optimized_program = optimizer.compile(student=MyProgram(), trainset=trainset, valset=valset)

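
The snippet above leaves `your_metric` undefined. A minimal sketch of what it could look like follows, assuming the five-argument metric signature described in the DSPy GEPA documentation and assuming both the gold example and the prediction expose an `answer` field (the field name is an assumption for illustration). DSPy also lets a metric return textual feedback alongside the score; that is not shown here.

```python
def your_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Hypothetical exact-match metric: 1.0 when the predicted answer equals the gold answer.

    The extra arguments are accepted but unused in this sketch.
    """
    return float(str(pred.answer).strip() == str(gold.answer).strip())
```

Check the `dspy.GEPA` reference for the exact signature and return types supported by your installed version before relying on this shape.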
optimize_anything: Beyond Prompts
The optimize_anything API optimizes any text artifact — code, agent architectures, configurations, SVGs — not just prompts. You provide an evaluator; the system handles the search.

    import gepa.optimize_anything as oa
    from gepa.optimize_anything import optimize_anything, GEPAConfig, EngineConfig

    def evaluate(candidate: str) -> float:
        result = run_my_system(candidate)
        oa.log(f"Output: {result.output}")  # Actionable Side Information
        oa.log(f"Error: {result.error}")    # feeds back into reflection
        return result.score

    result = optimize_anything(
        seed_candidate="<your initial artifact>",
        evaluator=evaluate,
        objective="Describe what you want to optimize for.",
        config=GEPAConfig(engine=EngineConfig(max_metric_calls=100)),
    )

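
To make the evaluator contract more concrete, here is a self-contained sketch of an evaluator for a code artifact. It assumes the candidate is Python source defining a function `solve(a, b)`, and it takes a `log` callable as a stand-in for `oa.log` so it runs without `gepa` installed. `TEST_CASES`, `solve`, and the `log` parameter are all hypothetical choices for this example.

```python
import traceback
from typing import Callable

# Hypothetical (args, expected) pairs for the function under optimization.
TEST_CASES = [((2, 3), 5), ((-1, 1), 0)]


def evaluate(candidate: str, log: Callable[[str], None] = print) -> float:
    """Run candidate source defining `solve(a, b)` and return the fraction of cases passed."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)  # in practice, run untrusted code in a sandbox
        solve = namespace["solve"]
    except Exception:
        log(f"Error: candidate failed to load:\n{traceback.format_exc()}")
        return 0.0

    passed = 0
    for args, expected in TEST_CASES:
        try:
            actual = solve(*args)
        except Exception as exc:
            log(f"Error: solve{args} raised {exc!r}")
            continue
        if actual == expected:
            passed += 1
        else:
            log(f"Output: solve{args} returned {actual!r}, expected {expected!r}")
    return passed / len(TEST_CASES)
```

The point of the design is that every failure path logs a specific, readable message, so the reflection step has something more useful than a bare score to work from.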
How It Works
Traditional optimizers know that a candidate failed but not why. GEPA takes a different approach:
- Select a candidate from the Pareto frontier (candidates excelling on different task subsets)
- Execute on a minibatch, capturing full execution traces
- Reflect — an LLM reads the traces (error messages, profiler output, reasoning logs) and diagnoses failures
- Mutate — generate an improved candidate informed by accumulated lessons from all ancestors
- Accept — add to the pool if improved, update the Pareto front
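The steps above can be sketched as a toy loop. This is an illustration under strong simplifying assumptions, not GEPA's implementation: the "tasks" are keywords, the evaluator only checks whether the candidate text mentions them, and `reflect_and_mutate` is a deterministic stand-in for the reflection LLM. All names here are hypothetical.

```python
import random
from typing import Dict, List, Tuple

TASKS = ["units", "format", "edge cases", "sign"]  # hypothetical task ids


def evaluate(candidate: str, task: str) -> Tuple[float, str]:
    """Toy evaluator: a task passes if the candidate text mentions it. Returns (score, trace)."""
    if task in candidate:
        return 1.0, f"ok: handled '{task}'"
    return 0.0, f"failed: prompt never mentions '{task}'"


def reflect_and_mutate(candidate: str, traces: List[str]) -> str:
    """Stand-in for the reflection LLM: read the traces and patch the first failure."""
    for trace in traces:
        if trace.startswith("failed"):
            missing = trace.split("'")[1]
            return f"{candidate} Check {missing}."
    return candidate


def dominates(a: List[float], b: List[float]) -> bool:
    """True if a is at least as good as b on every task and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Candidates whose per-task scores are not dominated by any other candidate."""
    return [c for c, s in scores.items()
            if not any(dominates(other, s) for other in scores.values())]


def optimize(seed: str, budget: int = 20, batch_size: int = 2, rng_seed: int = 0) -> str:
    rng = random.Random(rng_seed)
    scores = {seed: [evaluate(seed, t)[0] for t in TASKS]}
    for _ in range(budget):
        parent = rng.choice(pareto_front(scores))          # select
        batch = rng.sample(TASKS, batch_size)              # execute on a minibatch
        results = [evaluate(parent, t) for t in batch]
        child = reflect_and_mutate(parent, [trace for _, trace in results])  # reflect + mutate
        if child in scores:
            continue
        child_batch_score = sum(evaluate(child, t)[0] for t in batch)
        if child_batch_score > sum(score for score, _ in results):  # accept if improved
            scores[child] = [evaluate(child, t)[0] for t in TASKS]
    return max(scores, key=lambda c: sum(scores[c]))
```

Calling `optimize("Solve the problem.")` returns the candidate with the highest total score found within the budget. The structure to notice is that the mutation step consumes the traces, not just the scores.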
GEPA also supports system-aware merge — combining strengths of two Pareto-optimal candidates excelling on different tasks. The key concept is Actionable Side Information (ASI): diagnostic feedback returned by evaluators that serves as the text-optimization analogue of a gradient.
For details, see the paper and the documentation.
Adapters: Plug GEPA into Any System
GEPA connects to your system via the GEPAAdapter interface — implement evaluate and make_reflective_dataset, and GEPA handles the rest.
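As a rough sketch of the two methods named above, the following toy adapter runs without `gepa` installed. The argument lists, the `EvaluationBatch` stand-in, and the record keys (`Inputs`, `Generated Outputs`, `Feedback`) are assumptions modeled on the adapter guide; confirm them against the real `GEPAAdapter` definition before building on this. `run_system` and `ToyAdapter` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class EvaluationBatch:
    """Local stand-in for the batch result type; the real one lives in the gepa package."""
    outputs: List[Any]
    scores: List[float]
    trajectories: Optional[List[Any]] = None


def run_system(system_prompt: str, text: str) -> str:
    """Toy system under optimization: uppercases input only if the prompt asks for it."""
    return text.upper() if "uppercase" in system_prompt.lower() else text


class ToyAdapter:
    def evaluate(self, batch, candidate, capture_traces=False):
        """Run the candidate on each example and score exact matches."""
        outputs = [run_system(candidate["system_prompt"], ex["input"]) for ex in batch]
        scores = [float(out == ex["expected"]) for out, ex in zip(outputs, batch)]
        trajectories = None
        if capture_traces:
            trajectories = [
                {"input": ex["input"], "output": out, "expected": ex["expected"]}
                for ex, out in zip(batch, outputs)
            ]
        return EvaluationBatch(outputs, scores, trajectories)

    def make_reflective_dataset(self, candidate, eval_batch, components_to_update) -> Dict[str, list]:
        """Turn captured trajectories into per-component records for the reflection step."""
        records = []
        for traj, score in zip(eval_batch.trajectories or [], eval_batch.scores):
            feedback = "Correct." if score == 1.0 else f"Expected {traj['expected']!r}."
            records.append({
                "Inputs": traj["input"],
                "Generated Outputs": traj["output"],
                "Feedback": feedback,
            })
        return {name: records for name in components_to_update}
```

The split mirrors the description above: `evaluate` produces scores and raw traces, and `make_reflective_dataset` converts those traces into the text the reflection step reads.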
Built-in adapters:
| Adapter | Description |
|---|---|
| DefaultAdapter | System prompt optimization for single-turn LLM tasks |
| DSPy Full Program | Evolves entire DSPy programs (signatures, modules, control flow). 67% → 93% on MATH. |
| Generic RAG | Vector store-agnostic RAG optimization (ChromaDB, Weaviate, Qdrant, Pinecone) |
| MCP Adapter | Optimize MCP tool descriptions and system prompts |
| TerminalBench | Optimize the Terminus terminal-use agent |
| AnyMaths | Mathematical problem-solving and reasoning tasks |
See the adapters guide for how to build your own, and DSPy's adapter as a reference.
Integrations
GEPA is integrated into several major frameworks:
- DSPy: `dspy.GEPA` for optimizing DSPy programs. Tutorials.
- MLflow: `mlflow.genai.optimize_prompts()` for automatic prompt improvement.
- Comet ML Opik: Core optimization algorithm in Opik Agent Optimizer.
- Pydantic: Prompt optimization for Pydantic AI.
- OpenAI Cookbook: Self-evolving agents with GEPA.
- HuggingFace Cookbook: Prompt optimization guide.
- Google ADK: Optimizing Google Agent Development Kit agents.
Example Optimized Prompts
GEPA can be thought of as precomputing reasoning during optimization to produce a plan for future task instances. Here are examples of the detailed prompts GEPA discovers:
<table>
<tr> <td colspan="2" align="center">Example GEPA Prompts</td> </tr>
<tr> <td align="center">HotpotQA (multi-hop QA) Prompt</td> <td align="center">AIME Prompt</td> </tr>
<tr> <td width="52%" valign="top"> <img src="https://raw.githubusercontent.com/gepa-ai/gepa/refs/heads/main/assets/gepa_prompt_hotpotqa.png" alt="HotpotQA Prompt" width="1400"> </td> </tr>
</table>
