FsGEPA

An F# implementation of GEPA (Genetic-Pareto) for optimizing compound AI systems. It is based on the paper "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning."

FsGepa is an F# implementation of prompt optimization for compound AI systems. The repo includes the standard GEPA optimizer and a VISTA optimizer mode for hypothesis-driven reflective updates and restart behavior.

Start here

The easiest way to understand the library is through the included samples:

- FsgSample.Fvr: a richer multi-step FEVEROUS claim-verification flow with multiple modules
- FsgSample.Gsm8k: a lighter-weight GSM8K benchmark sample for direct GEPA vs VISTA comparison

If you want the simplest benchmark harness first, start with the GSM8K sample. If you want to understand how to wire a more realistic multi-step system into FsGepa, start with the FEVEROUS sample.

Overview

Automated prompt tuning can become expensive because optimization requires many repeated model calls. GEPA is designed to be relatively frugal while still improving prompt quality for a compound system rather than a single isolated prompt.

In FsGepa, a candidate system is a GeSystem<'input,'output>:

- it contains one or more prompt-bearing modules
- it exposes a flow function that runs the full task
- the flow may call external tools and may invoke one or more modules multiple times
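
To make the shape described above concrete, here is a minimal sketch of a two-module system. The names `SketchModule`, `SketchSystem`, and `fakeModel` are hypothetical and exist only for this illustration; the real `GeSystem<'input,'output>` definition in the repo may differ in its fields and signatures.

```fsharp
// Hypothetical shapes for illustration only; not the actual FsGepa types.
type SketchModule = { Id : string; Prompt : string }

type SketchSystem<'input,'output> =
    { Modules : Map<string, SketchModule>
      // The flow receives the current modules so it can read their prompts.
      Flow : Map<string, SketchModule> -> 'input -> Async<'output> }

// Stand-in for a model call: echoes the prompt and text instead of calling an LLM.
let fakeModel (prompt : string) (text : string) : Async<string> =
    async { return sprintf "[%s] %s" prompt text }

let sampleSystem : SketchSystem<string, string> =
    { Modules =
        Map.ofList
            [ "summarize", { Id = "summarize"; Prompt = "Summarize the evidence." }
              "classify", { Id = "classify"; Prompt = "Classify the claim." } ]
      Flow =
        fun (modules : Map<string, SketchModule>) (claim : string) ->
            async {
                // Step 1: the first module condenses the input.
                let! summary = fakeModel modules.["summarize"].Prompt claim
                // Step 2: the second module consumes the first module's output.
                return! fakeModel modules.["classify"].Prompt summary
            } }

// Run the whole flow once with the system's current prompts.
let runSample (claim : string) : string =
    sampleSystem.Flow sampleSystem.Modules claim |> Async.RunSynchronously
```

Passing the modules into the flow, rather than closing over them, is one way to let an optimizer swap prompts without rebuilding the flow function.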

Optimization operates over tasks, where each task contains:

- the task input
- an evaluation function that scores the resulting flow output
- optional feedback used during reflective updates
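
The task structure above could be modeled roughly as follows. This is a sketch under assumptions: `Evaluation`, `SketchTask`, `exactMatchTask`, and `averageScore` are hypothetical names, and attaching the optional feedback to the evaluation result (rather than to the task itself) is a choice made for this example, not something the README specifies.

```fsharp
// Hypothetical task shape for illustration; field names are assumptions.
type Evaluation = { Score : float; Feedback : string option }

type SketchTask<'input,'output> =
    { Input : 'input
      Evaluate : 'output -> Evaluation }

// Exact-match scoring: 1.0 for a correct answer, 0.0 otherwise,
// with feedback text produced only on failure.
let exactMatchTask (question : string) (expected : string) : SketchTask<string, string> =
    { Input = question
      Evaluate =
        fun (actual : string) ->
            if actual.Trim() = expected then
                { Score = 1.0; Feedback = None }
            else
                { Score = 0.0
                  Feedback = Some (sprintf "Expected '%s' but got '%s'." expected (actual.Trim())) } }

// Average score of a flow over a non-empty list of tasks.
let averageScore (flow : 'input -> 'output) (tasks : SketchTask<'input,'output> list) : float =
    tasks |> List.averageBy (fun t -> (t.Evaluate (flow t.Input)).Score)
```

The feedback string is the kind of signal a reflective update can use: it says what went wrong on a specific task instead of reporting only a number.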

Optimizer modes

FsGepa currently supports two optimizer modes:

- GEPA: reflective updates plus system-aware merges over a candidate pool
- VISTA: hypothesis-driven diagnosis and validation, with configurable restart behavior

Both modes share the same core abstractions and can optimize the same GeSystem.

How GEPA works

Inspired by genetic algorithms, GEPA evolves new candidate systems from an existing population:

- In a reflective update, prompts are revised by reflecting on existing prompts together with sampled task inputs, outputs, feedback, and reasoning traces when available.
- In a system-aware merge, a new candidate is proposed by combining prompts from related candidates and their parent systems.
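
As a rough illustration of the merge idea, the sketch below combines two candidates module by module, relative to a shared ancestor. The rule used here (keep whichever side changed a module, and break ties by score) is a simplification assumed for this example; `Prompts` and `mergePrompts` are hypothetical names and this is not FsGepa's actual merge logic.

```fsharp
// Simplified merge rule; an assumption for illustration only.
type Prompts = Map<string, string>   // module id -> prompt text

/// Merge two candidates that descend from a common ancestor, module by module.
let mergePrompts (ancestor : Prompts) (a : Prompts, scoreA : float) (b : Prompts, scoreB : float) : Prompts =
    ancestor
    |> Map.map (fun moduleId basePrompt ->
        // A module missing from a candidate is treated as unchanged.
        let pa = a |> Map.tryFind moduleId |> Option.defaultValue basePrompt
        let pb = b |> Map.tryFind moduleId |> Option.defaultValue basePrompt
        match pa <> basePrompt, pb <> basePrompt with
        | true, false -> pa                                   // only A evolved this module
        | false, true -> pb                                   // only B evolved this module
        | true, true -> if scoreA >= scoreB then pa else pb   // both evolved: keep the higher scorer's prompt
        | false, false -> basePrompt)                         // neither changed it
```

The point of merging per module is that two lineages may each have improved a different part of the system, and a merged candidate can inherit both improvements.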

Important ideas carried into this implementation:

- Pareto frontier: candidate selection is based on quality-diversity over a fixed Pareto task set rather than always picking the single best candidate
- Reflection: new candidates are guided by task-level feedback instead of relying only on few-shot exemplars
- Frugality: proposed candidates are screened on mini-batches before receiving more expensive evaluation
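
The Pareto and frugality ideas above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the score-matrix layout, the function names, and the "beat the parent on the mini-batch" gate are all hypothetical, and refinements such as pruning dominated candidates are omitted.

```fsharp
// scores.[c].[t] is the score of candidate c on Pareto task t (hypothetical layout).
// Assumes at least one candidate and equal-length rows.
let paretoWinCounts (scores : float[][]) : int[] =
    let taskCount = scores.[0].Length
    let wins : int[] = Array.zeroCreate scores.Length
    for t in 0 .. taskCount - 1 do
        let best = scores |> Array.map (fun row -> row.[t]) |> Array.max
        // Every candidate that ties for the best score on task t gets a win.
        scores |> Array.iteri (fun c row -> if row.[t] = best then wins.[c] <- wins.[c] + 1)
    wins

/// Sample a candidate index with probability proportional to the number of tasks it wins.
let sampleCandidate (rng : System.Random) (wins : int[]) : int =
    let ticket = rng.Next(Array.sum wins)   // uniform in [0, total wins)
    // Walk the cumulative win counts until the ticket falls inside a candidate's share.
    let cumulative = wins |> Array.scan (+) 0 |> Array.tail
    cumulative |> Array.findIndex (fun upper -> ticket < upper)

/// Frugality gate (assumed rule): only a child that beats its parent on the same
/// mini-batch goes on to the more expensive full evaluation.
let passesScreen (parentScores : float[]) (childScores : float[]) : bool =
    Array.sum childScores > Array.sum parentScores
```

Because a candidate only needs to be best on some task to stay eligible, a specialist that solves one hard task can still be selected as a parent even if its average score is mediocre.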

Included benchmarks and samples

Two sample projects are included in the repo today:

- src/FsgSample.Fvr
  - based on FEVEROUS
  - demonstrates a multi-module flow that summarizes evidence and then classifies a claim
  - useful as an end-to-end example of wiring a realistic compound system into FsGepa
- src/FsgSample.Gsm8k
  - based on GSM8K
  - demonstrates a tighter exact-match benchmark for comparing GEPA and VISTA
  - includes both defective and minimal seeds modeled after the VISTA paper appendix
  - Note: stronger models (e.g. gpt-oss-20b and above) are not deterred by the defective seed. They produce good baseline results on GSM8K, which leaves little room for further optimization. The VISTA paper uses a much smaller model, Qwen3-4B, to demonstrate the algorithm's effectiveness.

Performance

The GEPA paper already establishes strong benchmark performance and includes ablation studies. This repo focuses on providing a practical F# implementation and runnable samples.

In the FEVEROUS sample, FsGepa can substantially improve holdout accuracy over the seed prompts with a modest budget. The GSM8K sample serves a different purpose: it gives a cleaner benchmark harness for comparing optimizer behavior under the same model, seed, and evaluation setup.

Results depend heavily on the backend model, prompt seed, and service stability, so treat the results reported in the sample documentation as directional rather than universal.

Implementation notes

This implementation aims to stay faithful to the algorithms described in the papers, while still making practical engineering tradeoffs around transport, retries, telemetry, and concurrency.

One important operational setting is flow_parallelism, which is used as the per-run cap for concurrent outbound model calls. This matters especially when running against local or unstable backends.
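
The effect of such a cap can be sketched with the standard `Async.Parallel` overload that takes `maxDegreeOfParallelism` (available in FSharp.Core 4.7 and later). The helper names `runCapped` and `measurePeak` are hypothetical, and this is only an illustration of the idea, not how FsGepa applies flow_parallelism internally.

```fsharp
/// Run all calls, allowing at most `cap` in flight at once.
/// `cap` plays the role that flow_parallelism plays in the text above.
let runCapped (cap : int) (calls : Async<'T> list) : 'T[] =
    Async.Parallel(calls, maxDegreeOfParallelism = cap) |> Async.RunSynchronously

/// Demo helper: launch n fake model calls and report the highest
/// number that were observed running at the same time.
let measurePeak (cap : int) (n : int) : int =
    let gate = obj ()
    let inFlight = ref 0
    let peak = ref 0
    let fakeCall (i : int) : Async<int> =
        async {
            lock gate (fun () ->
                inFlight.Value <- inFlight.Value + 1
                peak.Value <- max peak.Value inFlight.Value)
            do! Async.Sleep 20   // simulated model latency
            lock gate (fun () -> inFlight.Value <- inFlight.Value - 1)
            return i * i
        }
    runCapped cap (List.init n fakeCall) |> ignore
    peak.Value
```

With a local backend that can serve only one or two requests at a time, a low cap trades wall-clock time for fewer timeouts and retries.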

Practical notes

- You need access to one or more LLM backends, either local or hosted
- Optimization can be expensive because it repeatedly evaluates and rewrites prompts
- Local GPU-backed models are often the most convenient way to experiment cheaply
- The sample projects are the best reference for how to construct Config, tasks, systems, and evaluation logic
