# FsGepa

FsGepa is an F# implementation of GEPA (Genetic-Pareto) prompt optimization for compound AI systems, based on the paper "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning". The repo includes the standard GEPA optimizer and a VISTA optimizer mode for hypothesis-driven reflective updates and restart behavior.
- Original paper: "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning"
- Project write-up on LinkedIn (GEPA only)
- GEPA vs VISTA comparison
## Start here
The easiest way to understand the library is through the included samples:
- `FsgSample.Fvr`: a richer multi-step FEVEROUS claim-verification flow with multiple modules
- `FsgSample.Gsm8k`: a lighter-weight GSM8K benchmark sample for direct GEPA vs VISTA comparison
If you want the simplest benchmark harness first, start with the GSM8K sample. If you want to understand how to wire a more realistic multi-step system into FsGepa, start with the FEVEROUS sample.
## Overview
Automated prompt tuning can become expensive because optimization requires many repeated model calls. GEPA is designed to be relatively frugal while still improving prompt quality for a compound system rather than a single isolated prompt.
In FsGepa, a candidate system is a `GeSystem<'input,'output>`:
- it contains one or more prompt-bearing modules
- it exposes a `flow` function that runs the full task; the flow may call external tools and may invoke one or more modules multiple times
Optimization operates over tasks, where each task contains:
- the task input
- an evaluation function that scores the resulting flow output
- optional feedback used during reflective updates
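As a rough sketch, the shapes above can be pictured as the following F# records. The type and field names here are illustrative only (not the actual FsGepa API); consult the sample projects for the real definitions.

```fsharp
// Hypothetical sketch: names and shapes are illustrative, not the FsGepa API.

// A prompt-bearing module of the compound system
type GeModule = { Name: string; Prompt: string }

// A candidate system: its modules plus a flow that runs the full task.
// The flow may call external tools and may invoke modules multiple times.
type GeSystem<'input, 'output> =
    { Modules  : GeModule list
      Flow     : GeModule list -> 'input -> Async<'output> }

// A task: the input, an evaluation function that scores the flow output,
// and optional feedback text used during reflective updates
type GeTask<'input, 'output> =
    { Input    : 'input
      Evaluate : 'output -> float
      Feedback : ('output -> string) option }
```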
## Optimizer modes
FsGepa currently supports two optimizer modes:
- GEPA: reflective updates plus system-aware merges over a candidate pool
- VISTA: hypothesis-driven diagnosis and validation, with configurable restart behavior

Both modes share the same core abstractions and can optimize the same `GeSystem`.
## How GEPA works
Inspired by genetic algorithms, GEPA evolves new candidate systems from an existing population:
- In a reflective update, prompts are revised by reflecting on existing prompts together with sampled task inputs, outputs, feedback, and reasoning traces when available.
- In a system-aware merge, a new candidate is proposed by combining prompts from related candidates and their parent systems.
Important ideas carried into this implementation:
- Pareto frontier: candidate selection is based on quality-diversity over a fixed pareto task set rather than always picking the single best candidate
- Reflection: new candidates are guided by task-level feedback instead of relying only on few-shot exemplars
- Frugality: proposed candidates are screened on mini-batches before receiving more expensive evaluation
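To make the Pareto-frontier idea concrete, here is a minimal, self-contained sketch (not the FsGepa implementation) of selecting non-dominated candidates from per-task score vectors: a candidate survives unless some other candidate scores at least as well on every pareto task and strictly better on at least one.

```fsharp
// Minimal sketch of Pareto-frontier selection over per-task scores.
// Each float[] holds one candidate's scores on the fixed pareto task set.
let paretoFrontier (scores: float[] list) =
    // 'a dominates b' = a is at least as good everywhere, strictly better somewhere
    let dominates (a: float[]) (b: float[]) =
        Array.forall2 (>=) a b && Array.exists2 (>) a b
    scores
    |> List.filter (fun c ->
        scores |> List.forall (fun other -> not (dominates other c)))

// Example: with three pareto tasks, [|0.9;0.2;0.5|] and [|0.4;0.8;0.5|] are
// both kept (neither dominates the other), while [|0.3;0.1;0.4|] is dropped.
```

Selecting parents from this frontier, rather than from the single best candidate, is what preserves quality-diversity in the pool.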
## Included benchmarks and samples
Two sample projects are included in the repo today:
`src/FsgSample.Fvr` (based on FEVEROUS):
- demonstrates a multi-module flow that summarizes evidence and then classifies a claim
- useful as an end-to-end example of wiring a realistic compound system into FsGepa
`src/FsgSample.Gsm8k` (based on GSM8K):
- demonstrates a tighter exact-match benchmark for comparing GEPA and VISTA
- includes both `defective` and `minimal` seeds modeled after the VISTA paper appendix
- Note: stronger models (e.g. gpt-oss-20b and above) are not deterred by the `defective` seed. They produce good baseline results on GSM8K, which are harder to improve upon much with further optimization. The VISTA paper uses a much smaller Qwen3-4B model to demonstrate the algorithm's effectiveness.
## Performance
The GEPA paper already establishes strong benchmark performance and includes ablation studies. This repo focuses on providing a practical F# implementation and runnable samples.
In the FEVEROUS sample, FsGepa can substantially improve holdout accuracy over the seed prompts with a modest budget. The GSM8K sample serves a different purpose: it gives a cleaner benchmark harness for comparing optimizer behavior under the same model, seed, and evaluation setup.
Results depend heavily on the backend model, prompt seed, and service stability, so sample documentation should be treated as directional rather than universal.
## Implementation notes
This implementation aims to stay faithful to the algorithms described in the papers, while still making practical engineering tradeoffs around transport, retries, telemetry, and concurrency.
One important operational setting is `flow_parallelism`, the per-run cap on concurrent outbound model calls. This matters especially when running against local or unstable backends.
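A per-run concurrency cap like this is commonly implemented with a semaphore gate around each outbound call. The sketch below illustrates the idea (FsGepa's internals may differ; the names here are not its API):

```fsharp
open System.Threading

// Illustrative sketch: cap concurrent model calls at flowParallelism.
let flowParallelism = 4
let gate = new SemaphoreSlim(flowParallelism)

// Wrap any async model call so at most flowParallelism run at once
let throttled (call: unit -> Async<'a>) : Async<'a> =
    async {
        do! gate.WaitAsync() |> Async.AwaitTask
        try
            return! call ()
        finally
            gate.Release() |> ignore
    }
```

Against a local GPU backend, a small cap keeps the server responsive; against a hosted API it doubles as a simple rate limiter.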
## Practical notes
- You need access to one or more LLM backends, either local or hosted
- Optimization can be expensive because it repeatedly evaluates and rewrites prompts
- Local GPU-backed models are often the most convenient way to experiment cheaply
- The sample projects are the best reference for how to construct `Config`, tasks, systems, and evaluation logic
