AutoEvaluation

Evals that fix themselves.

Give it a prompt, a set of test scenarios, and a scoring rubric. It then runs autonomously: it generates outputs, scores them, reads the judge's reasoning, finds the weakest metric, rewrites the prompt to fix it, re-scores, and keeps or reverts the change. Hill-climbing on prompt engineering, fully hands-off.

I pointed it at a writing style guide and let it run overnight. It made 20 attempts, kept 2, and improved the composite score from 0.9508 to 0.9692. The changes it made: strengthened contraction rules, added concrete before/after examples for em dash replacement. Every other LLM prompt optimiser (DSPy, TextGrad, MIPRO) requires you to write Python. This one works on plain markdown files.

Point it at any LLM instruction set. Go to bed. Wake up with a measurably better prompt.

How it works

graph LR
    A["Analyse<br/>weakness + judge reasoning"] --> B["Modify<br/>SKILL.md"]
    B --> C["Evaluate<br/>samples"]
    C --> D["Decide<br/>keep / revert"]
    D --> A

Analyse — reads the weakest metrics AND the actual sample outputs that scored poorly, including the judge's reasoning for each score. The modifier sees why scores are low, not just numbers.
Modify — makes ONE targeted change to the skill instructions, grounded in concrete failure examples.
Evaluate — generates outputs using the modified skill, scores them against your rubric.
Decide — if the score improved above the noise threshold, keep the change; otherwise revert. Small deltas that could be random noise are filtered out.
Repeat — until the iteration, time, or cost limit is hit (or indefinitely).
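The loop above can be sketched in a few lines of Python. This is a toy illustration, not the code in tools/run_loop.py: the names `hill_climb`, `toy_evaluate`, and `make_toy_modify` are hypothetical, the evaluator is a stand-in for the LLM judge, and the threshold is assumed to be an absolute score difference.

```python
from typing import Callable

def hill_climb(
    skill: str,
    evaluate: Callable[[str], float],
    modify: Callable[[str], str],
    max_iterations: int = 10,
    noise_threshold: float = 0.01,  # assumption: absolute score delta
) -> tuple[str, float, list[str]]:
    """Keep a candidate only if it beats the best score by more than the threshold."""
    best_skill = skill
    best_score = evaluate(skill)
    decisions = []
    for _ in range(max_iterations):
        candidate = modify(best_skill)      # Modify: one change at a time
        score = evaluate(candidate)         # Evaluate: score the candidate
        if score - best_score > noise_threshold:
            best_skill, best_score = candidate, score
            decisions.append("KEEP")
        else:
            decisions.append("REVERT")      # Decide: discard noise-level deltas
    return best_skill, best_score, decisions

# Toy stand-ins for the LLM judge and the LLM modifier (hypothetical).
RULES = ["use contractions", "avoid em dashes", "short sentences"]

def toy_evaluate(skill: str) -> float:
    # Score = fraction of the target rules the skill text mentions.
    return sum(rule in skill for rule in RULES) / len(RULES)

def make_toy_modify() -> Callable[[str], str]:
    edits = iter(["\n- be nice", "\n- use contractions", "\n- avoid em dashes"])
    def toy_modify(skill: str) -> str:
        return skill + next(edits, "")
    return toy_modify

final_skill, final_score, decisions = hill_climb(
    "# Style guide", toy_evaluate, make_toy_modify(), max_iterations=3
)
print(decisions, round(final_score, 4))
```

Because each candidate is built from the best skill so far, a reverted edit leaves no trace in later iterations.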

What makes this different

Every other prompt optimiser treats prompts as parameters to optimise computationally. DSPy requires a Python DSL. AutoPrompt needs labelled datasets. OpenAI's optimizer is platform-locked. Meta's prompt-ops is Llama-only.

AutoEvaluation treats prompts as prose documents that an LLM reads, critiques, and rewrites. No DSL. No compilation step. No framework lock-in. Just a markdown file and test prompts. It's "editor doing revision" vs "compiler doing gradient descent."

Real results

I ran AutoEvaluation on an anti-AI writing style guide (the included example) for 20 iterations using Gemini 2.5 Flash:

Iteration   Score    Decision   What the AI changed
─────────   ─────    ────────   ────────────────────────────────────────────
baseline    0.9508   —          Starting point
exp_002     0.9600   KEEP       Strengthened contraction rule with emphasis
exp_005     0.9692   KEEP       Added concrete em-dash before/after example

18 of 20 attempts were discarded (score didn't improve enough to pass the noise threshold). The 2 that stuck made targeted, specific changes. Total run time: ~2 hours. Total API cost: <$2.

The full experiment history is in examples/writing-style/sample-results.tsv.

[Image: AutoEvaluation dashboard showing score trend and per-metric cards]

Quick start

Prerequisites

Python 3.10+
An API key for your preferred LLM provider (Gemini, OpenAI, or Anthropic)

One command start

git clone https://github.com/AdenCJM/AutoEvaluation.git
cd AutoEvaluation
echo "GEMINI_API_KEY=your-key" > .env
./start.sh

start.sh handles everything: checks your Python version, creates a virtual environment, installs only the provider SDK you need (not all three), validates your API key, runs setup if needed, and starts the optimisation loop. If anything is wrong, it tells you immediately.

Try the included example

The repo ships with a complete working example (a writing style guide):

echo "GEMINI_API_KEY=your-key" > .env
cp examples/writing-style/SKILL.md SKILL.md
cp examples/writing-style/config.yaml config.yaml
cp examples/writing-style/prompts.json prompts/prompts.json
cp examples/writing-style/eval_deterministic.py tools/eval_deterministic.py
./start.sh

Point at your own skill

Already have a skill file you want to optimise? Two options:

Quick (no interactive prompts, all defaults):

echo "GEMINI_API_KEY=your-key" > .env
python3 setup.py --defaults --skill-file /path/to/your/SKILL.md --generate-prompts
./start.sh

These commands validate your API key, use AI to generate test prompts from your skill file, apply sensible defaults (3 evaluation dimensions, 10 iterations), and start the optimisation loop.

Guided (interactive wizard):

python3 tools/run_loop.py --skill path/to/your/SKILL.md --provider gemini --iterations 10

This auto-generates config.yaml with sensible defaults and starts optimising immediately.

Setup wizard

python3 setup.py

The wizard walks you through:

Provider + model — pick Gemini, OpenAI, or Anthropic (API key validated instantly)
Your skill — paste or describe the instructions you want to optimise
Test prompts — AI generates prompts from your skill description, or enter manually
Eval rubric — set 2-5 quality dimensions (or use the defaults)
Run duration — max iterations, max hours, or unlimited

It generates: config.yaml, SKILL.md, prompts/prompts.json, .env, and .claude/settings.json.

Skip all interactive prompts:

# All defaults: Gemini, default rubric, 5 generic prompts, 10 iterations
python3 setup.py --defaults

# Defaults with a custom skill and AI-generated prompts
python3 setup.py --defaults --skill-file SKILL.md --generate-prompts

# Defaults with OpenAI instead of Gemini
python3 setup.py --defaults --provider openai

Already have a skill file? Skip the paste step:

python3 setup.py --skill-file /path/to/your/SKILL.md
python3 setup.py --skill-file SKILL.md --prompts-file my-prompts.json

With Claude Code (autonomous)

If you have Claude Code installed, it can drive the optimisation loop autonomously:

python3 setup.py    # or use --defaults
claude -p program.md

Claude reads program.md, which contains the loop instructions. It autonomously runs experiments, modifies your skill, and tracks results. All bash commands are auto-approved via .claude/settings.json.

Watch scores in real time

Open another terminal:

python3 tools/dashboard_server.py

Then open http://localhost:8050 in your browser.

How the optimiser thinks

The optimisation loop doesn't just look at score numbers. For each iteration, it:

Reads the judge's reasoning for the 2 worst-scoring samples. Not "task_accuracy = 0.72" but "the output ignored the instruction to avoid em dashes in paragraph 3."
Reads the actual sample text that scored poorly, so it can see the concrete failure.
Makes one targeted change based on that specific failure, not a guess from numbers.
Validates the returned skill hasn't been truncated or corrupted (checks frontmatter, section headers).
Filters noise: only keeps changes where the score improvement exceeds a configurable threshold (default 1%), so random variance doesn't pollute the skill.

This means the headless mode (run_loop.py) is just as effective as the Claude Code mode. Both see the same signal.
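The truncation check mentioned above could look roughly like this. It is a sketch under assumptions, not the tool's own validator: `has_frontmatter`, `extract_headers`, and `looks_intact` are hypothetical names, frontmatter is assumed to be a `---` delimited block at the top of the file, and headers are assumed to be markdown `#` headings.

```python
import re

def has_frontmatter(markdown: str) -> bool:
    """True if the text opens with a '---' line and closes the block later."""
    lines = [line.strip() for line in markdown.splitlines()]
    return len(lines) > 2 and lines[0] == "---" and "---" in lines[1:]

def extract_headers(markdown: str) -> list[str]:
    """Collect markdown headings such as '# Rules' or '## Contractions'."""
    pattern = re.compile(r"^#{1,6} .+$", flags=re.MULTILINE)
    return [match.group(0).strip() for match in pattern.finditer(markdown)]

def looks_intact(original: str, candidate: str) -> bool:
    """Reject a rewrite that lost its frontmatter or any original heading."""
    if has_frontmatter(original) and not has_frontmatter(candidate):
        return False
    missing = set(extract_headers(original)) - set(extract_headers(candidate))
    return not missing

original = "---\nname: style\n---\n# Rules\n## Contractions\nUse them.\n"
good = original + "Prefer \"don't\" over \"do not\".\n"
truncated = "---\nname: style\n---\n# Rules\n"

print(looks_intact(original, good), looks_intact(original, truncated))
```

A check this strict would also reject a rewrite that deliberately renames a heading, so a real validator may need to be more lenient.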

Test prompts

Test prompts are realistic tasks that exercise your skill. Create prompts/prompts.json:

[
  {
    "id": "intro_email",
    "genre": "cold outreach",
    "prompt": "Write a 200-word cold email to a VP of Engineering introducing our product."
  },
  {
    "id": "follow_up",
    "genre": "cold outreach",
    "prompt": "Write a 150-word follow-up email after no response to the initial outreach."
  }
]

Each prompt needs:

id — short identifier (alphanumeric, underscores, hyphens; auto-sanitised)
genre — category (used for context in evaluation)
prompt — the actual task the LLM will perform using your skill

Aim for 5-10 prompts that cover different aspects of your skill. More prompts = more reliable scores, but each one costs an LLM call per iteration.
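If you write prompts.json by hand, a quick check like the one below can catch missing fields before a run. It is only a sketch: `load_prompts` and `sanitise_id` are hypothetical helpers, and the replacement character used for sanitising ids is an assumption rather than the tool's documented behaviour.

```python
import json
import re

REQUIRED_KEYS = ("id", "genre", "prompt")

def sanitise_id(raw: str) -> str:
    """Replace anything other than letters, digits, '_' and '-' with '_'."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", raw.strip())

def load_prompts(text: str) -> list[dict]:
    """Parse a prompts.json payload and verify each entry has the required keys."""
    prompts = json.loads(text)
    if not isinstance(prompts, list):
        raise ValueError("prompts.json must contain a JSON array")
    seen: set[str] = set()
    cleaned = []
    for index, entry in enumerate(prompts):
        if not isinstance(entry, dict):
            raise ValueError(f"prompt #{index} is not an object")
        missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
        if missing:
            raise ValueError(f"prompt #{index} is missing: {', '.join(missing)}")
        prompt_id = sanitise_id(entry["id"])
        if prompt_id in seen:
            raise ValueError(f"duplicate prompt id: {prompt_id}")
        seen.add(prompt_id)
        cleaned.append({**entry, "id": prompt_id})
    return cleaned

sample = json.dumps([
    {"id": "intro email", "genre": "cold outreach",
     "prompt": "Write a 200-word cold email to a VP of Engineering."},
    {"id": "follow_up", "genre": "cold outreach",
     "prompt": "Write a 150-word follow-up email after no response."},
])
for item in load_prompts(sample):
    print(item["id"], "|", item["genre"])
```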

BYO model

AutoEvaluation ships with support for Gemini, OpenAI, and Anthropic, and can be extended to other providers. Set your provider in config.yaml:

# Gemini
provider: gemini
model: gemini-2.5-flash
api_key_env: GEMINI_API_KEY

# OpenAI
provider: openai
model: gpt-4o
api_key_env: OPENAI_API_KEY

# Anthropic
provider: anthropic
model: claude-sonnet-4-20250514
api_key_env: ANTHROPIC_API_KEY

Add your API key to .env:

OPENAI_API_KEY=sk-abc123...

To add a custom provider, edit tools/model_client.py. It's a single file with an elif block per provider.

Run duration

Control how long the loop runs via CLI flags or config.yaml:

python3 tools/run_loop.py --iterations 20
python3 tools/run_loop.py --hours 2.5

Or in config.yaml:

max_iterations: 20    # stop after 20 experiments
max_hours: 2.5        # stop after 2.5 hours

If both are set, whichever limit is hit first stops the loop. Set both to 0 for unlimited.
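The stopping rule described above amounts to a small predicate. The function below is an illustrative sketch, not the loop's actual implementation; `should_stop` and its parameter names are hypothetical and simply mirror the `max_iterations` and `max_hours` config keys.

```python
def should_stop(
    iterations_done: int,
    elapsed_hours: float,
    max_iterations: int = 0,
    max_hours: float = 0.0,
) -> bool:
    """Stop when either limit is reached; a limit of 0 means 'no limit'."""
    hit_iterations = max_iterations > 0 and iterations_done >= max_iterations
    hit_hours = max_hours > 0 and elapsed_hours >= max_hours
    return hit_iterations or hit_hours

# Both limits set: the first one reached ends the run.
print(should_stop(20, 1.0, max_iterations=20, max_hours=2.5))
print(should_stop(5, 2.6, max_iterations=20, max_hours=2.5))
# Both limits at 0: the loop never stops on its own.
print(should_stop(10_000, 500.0))
```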

Custom deterministic metrics (advanced)

By default, AutoEvaluation uses LLM-as-judge for all evaluation. If you want rule-based metrics too:

Create a custom tools/eval_deterministic.py that returns JSON:

{"metric_name": {"score": 0.85, ...}, "another_metric": {"score": 0.92, ...}}

Add them to config.yaml:

deterministic_metrics:
  - name: metric_name
    weight: 0.15
  - name: another_metric
    weight: 0.10

See examples/writing-style/ for a full example with 9 deterministic metrics.
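As a rough idea of what a rule-based evaluator might contain, here is a minimal sketch with two made-up metrics. The metric names (`no_em_dash`, `sentence_length`), the extra fields next to `score`, and the way text reaches the functions are all assumptions; check the shipped example for the interface the loop actually expects.

```python
import json
import re

def no_em_dash(text: str) -> dict:
    """Penalise em dashes relative to the number of sentence endings."""
    count = text.count("\u2014")
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    score = max(0.0, 1.0 - count / sentences)
    return {"score": round(score, 4), "em_dashes": count}

def sentence_length(text: str, max_words: int = 25) -> dict:
    """Share of sentences that stay within max_words words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return {"score": 0.0, "sentences": 0}
    within_limit = sum(len(s.split()) <= max_words for s in sentences)
    return {"score": round(within_limit / len(sentences), 4),
            "sentences": len(sentences)}

def evaluate(text: str) -> dict:
    """Return one entry per metric, each with a 0-1 'score' field."""
    return {
        "no_em_dash": no_em_dash(text),
        "sentence_length": sentence_length(text),
    }

print(json.dumps(evaluate("We shipped it. It works \u2014 mostly."), indent=2))
```

Each metric name returned here would need a matching entry under `deterministic_metrics` in config.yaml to carry a weight.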

Advanced features

Separate judge model

By default, the same model generates outputs and evaluates them. This creates self-judging bias (the tool will warn you about this). Use a different model for evaluation:

judge_provider: openai
judge_model: gpt-4o
judge_api_key_env: OPENAI_API_KEY

If these keys are absent, the primary provider is used as a fallback, with a warning.
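The fallback behaviour can be pictured as follows. This is a sketch of the described logic only: `resolve_judge` is a hypothetical helper, the config is assumed to be an already-parsed dict, and the fallback is assumed to apply whenever any of the three judge keys is missing.

```python
import warnings

JUDGE_KEYS = ("judge_provider", "judge_model", "judge_api_key_env")

def resolve_judge(config: dict) -> dict:
    """Pick the judge settings, falling back to the primary provider."""
    if all(config.get(key) for key in JUDGE_KEYS):
        return {
            "provider": config["judge_provider"],
            "model": config["judge_model"],
            "api_key_env": config["judge_api_key_env"],
        }
    warnings.warn(
        "No separate judge configured; the generator will score its own "
        "outputs, which risks self-judging bias."
    )
    return {
        "provider": config["provider"],
        "model": config["model"],
        "api_key_env": config["api_key_env"],
    }

config = {
    "provider": "gemini",
    "model": "gemini-2.5-flash",
    "api_key_env": "GEMINI_API_KEY",
    "judge_provider": "openai",
    "judge_model": "gpt-4o",
    "judge_api_key_env": "OPENAI_API_KEY",
}
print(resolve_judge(config))
```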

Semi-blind judge

The judge normally evaluates outputs blind. Enable se
