<h1 align="center">
  <b>SkyDiscover</b>
</h1>

<p align="center">A Flexible Framework for AI-Driven Scientific and Algorithmic Discovery</p>

<p align="center">
  <a href="https://skydiscover-ai.github.io/blog.html"><img src="https://img.shields.io/badge/blog-SkyDiscover-orange?style=flat-square" alt="Blog" /></a>
  <a href="https://arxiv.org/abs/2602.20133"><img src="https://img.shields.io/badge/paper-AdaEvolve-red?style=flat-square" alt="AdaEvolve Paper" /></a>
  <a href="https://arxiv.org/abs/2602.23413"><img src="https://img.shields.io/badge/paper-EvoX-lightblue?style=flat-square" alt="EvoX Paper" /></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-Apache--2.0-green?style=flat-square" alt="License" /></a>
</p>

<p align="center">
  <img src="assets/architecture.png" width="720" alt="SkyDiscover architecture">
</p>

SkyDiscover is a modular framework for AI-driven scientific and algorithmic discovery, providing a unified interface for implementing, running, and fairly comparing discovery algorithms across 200+ optimization tasks.
SkyDiscover introduces two new adaptive optimization algorithms:
- AdaEvolve, which dynamically adjusts its optimization behavior based on observed progress.
- EvoX, which dynamically evolves the optimization (evolution) strategy itself using LLMs on the fly.
SkyDiscover can also run OpenEvolve, ShinkaEvolve, and GEPA directly from their own source code, making it easy to benchmark against them. In addition, it hosts native reimplementations of OpenEvolve and GEPA, the `openevolve_native` and `gepa_native` algorithms, built on the modular interface.
SkyDiscover natively supports Harbor-format benchmarks, so you can run external benchmark suites out of the box, including AlgoTune, EvoEval, HumanEvalFix, BigCodeBench, LiveCodeBench, USACO, CRUSTBench, and CodePDE.
🚧 This project is under active development.
## 🏆 Benchmark Performance
Across ~200 optimization benchmarks, AdaEvolve and EvoX achieve the strongest open-source results — matching or exceeding AlphaEvolve and human SOTA, and outperforming OpenEvolve, GEPA, and ShinkaEvolve under identical generation budgets.
- Frontier-CS (172 problems): ~34% median score improvement over OpenEvolve, GEPA, and ShinkaEvolve
- Math + Systems Optimization (12 tasks): Matches or exceeds AlphaEvolve and human-designed SOTA on 6/6 systems and 6/8 math tasks
- Real-world systems impact: 41% lower cross-cloud transfer cost, 14% better GPU load balance for MoE serving, and 29% lower KV-cache pressure via GPU model placement
For a detailed breakdown of results for each algorithm, see the respective papers.
<p align="center">
  <img src="assets/benchmarks.png" width="900" alt="SkyDiscover benchmarks">
</p>

<details>
<summary><b>Task breakdown across math, systems, and programming challenges</b></summary>

|    | Benchmark | Domain | Tasks | Description |
|----|-----------|--------|------:|-------------|
| 🔢 | math/ | Math | 14 | Circle packing, Erdos problems, geometric optimization |
| 🖥️ | ADRS/ | Systems | 5 | Cloud scheduling, load balancing, MoE expert placement |
| ⚡ | gpu_mode/ | Systems | 4 | GPU kernel optimization |
| 🧩 | frontier-cs-eval/ | Algorithms | 172 | Frontier-CS competitive programming |
| 🧠 | arc_benchmark/ | Reasoning | — | ARC-AGI visual reasoning |
| 💻 | ale_bench/ | Algorithms | 10 | Algorithmic programming contests |
| 🎨 | image_gen/ | Creative | 1 | AI image generation evolution |
| 💬 | prompt_optimization/ | NLP | 1 | HotPotQA prompt evolution |

See Dependency extras for install commands per benchmark.

</details>

## 🚀 Quick Start
Prerequisites: Python >= 3.10, uv
```bash
# Install
uv sync
export OPENAI_API_KEY="<your-key>"

# Try the circle packing benchmark
uv sync --extra math
uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
    benchmarks/math/circle_packing/evaluator.py \
    --config benchmarks/math/circle_packing/config.yaml \
    --search evox \
    --iterations 100
uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
    benchmarks/math/circle_packing/evaluator.py \
    --config benchmarks/math/circle_packing/config.yaml \
    --search adaevolve \
    --iterations 100
```
```bash
# Or run on your own problem
# algo can be "evox", "adaevolve", "openevolve", "gepa", "shinkaevolve"
uv run skydiscover-run initial_program.py evaluator.py \
    --search <algo> \
    --model gpt-5 \
    --iterations 100

# initial_program is optional — omit it to let the LLM start from scratch
uv run skydiscover-run evaluator.py \
    --search <algo> \
    --model gpt-5 \
    --iterations 100
```
```bash
# Run a Harbor benchmark (e.g. AlgoTune) — no seed program needed
pip install harbor
harbor datasets download algotune@1.0 -o /tmp/algotune
uv run skydiscover-run /tmp/algotune/<id>/algotune-set-cover \
    --model anthropic/claude-sonnet-4-6 \
    --search best_of_n -i 10
```
Or use the Python API:
```python
from skydiscover import run_discovery

result = run_discovery(
    initial_program="initial_program.py",
    evaluator="evaluator.py",
    search="adaevolve",  # or "evox", "openevolve", "gepa", "shinkaevolve"
    model="gpt-5",
    iterations=100,
)
print(result.best_score, result.best_solution)
```
## ✏️ What You Write

### Scoring Function (required)
SkyDiscover supports three evaluator formats — pick whichever fits your use case:
| Format | When to use | What you point `evaluation_file` at |
|:---|:---|:---|
| Python function | Simple tasks, no system deps | `evaluator.py` |
| Containerized | Custom deps, data files, isolation | `evaluator/` directory (must contain `Dockerfile` + `evaluate.sh`) |
| Harbor task | External benchmark suites (AlgoTune, EvoEval, HumanEvalFix, BigCodeBench, LiveCodeBench, USACO, CRUSTBench, CodePDE, and more) | Task directory (must contain `instruction.md` + `tests/` + `environment/Dockerfile`) |
SkyDiscover auto-detects the format. See benchmarks/README.md for full setup instructions.
Python evaluator — a file with an evaluate(program_path) function:
def evaluate(program_path):
score = run_and_grade(program_path)
return {
"combined_score": score, # primary optimization target (maximized)
"artifacts": { # optional — stored with the solution for future context
"feedback": "Off by one in the loop boundary",
},
}
Containerized evaluator — a directory with a `Dockerfile` and an `evaluate.sh` that writes JSON to stdout. It runs in Docker, so it can have arbitrary dependencies.
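A minimal sketch of that JSON-to-stdout contract (the hard-coded score is a placeholder; a real evaluate.sh would run and grade the candidate program it is given):

```shell
#!/usr/bin/env bash
# evaluator/evaluate.sh (illustrative sketch): SkyDiscover reads one JSON
# object from stdout. The score below is a placeholder; a real evaluator
# would execute and grade the candidate program here.
set -euo pipefail
score="0.5"
printf '{"combined_score": %s}\n' "$score"
```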
Harbor task — a directory following the Harbor format (`instruction.md`, `environment/Dockerfile`, `tests/test.sh`). Works out of the box with 8+ tested benchmark suites (see benchmarks/README.md for the full list).
- `combined_score` drives evolution. If omitted, SkyDiscover averages all numeric values in the dict.
- `artifacts` is optional — entries are injected into the next LLM prompt as context.
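For instance, a metrics-only evaluator (the metric names here are illustrative, not part of the API) would, under the fallback rule above, be scored by the mean of its numeric values:

```python
def evaluate(program_path):
    # No "combined_score" key: per the fallback described above, SkyDiscover
    # averages the numeric values, giving an effective score of 0.7 here.
    return {"accuracy": 0.8, "speed": 0.6}
```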
For `search.type: adaevolve`, you can also enable explicit Pareto optimization by configuring `search.database.pareto_objectives` and returning those objective metrics directly from the evaluator. In that mode, `combined_score` becomes optional and is only used as a scalar fallback/proxy when configured.
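A hypothetical config sketch of that mode (only `search.type` and `search.database.pareto_objectives` are named above; the objective names and surrounding YAML structure are assumptions):

```yaml
search:
  type: adaevolve
  database:
    # Objectives the evaluator must return as numeric metrics.
    # "latency" and "accuracy" are illustrative names.
    pareto_objectives: ["latency", "accuracy"]
```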
### Starting Solution (optional)
The initial program is optional. When omitted, the LLM generates a solution from scratch. If you provide one, mark the region to mutate with EVOLVE-BLOCK markers; everything outside them is left untouched.
```python
# EVOLVE-BLOCK-START
def solve(input_data):
    return input_data  # baseline — SkyDiscover will improve this
# EVOLVE-BLOCK-END
```
If no markers are present, the entire file is treated as mutable.
## 🧬 Pick an Algorithm
| Algorithm | Flag | Description |
|:---|:---|:---|
| ⭐ AdaEvolve | `--search adaevolve` | Multi-island adaptive search with UCB, migration, and paradigm breakthroughs |
| 🧠 EvoX | `--search evox` | Self-evolving paradigm that co-adapts solution generation and experience management |
| 📊 Top-K | `--search topk` | Selects the top-K solutions to refine |
| 🔍 Beam Search | `--search beam_search` | Breadth-first expansion of a beam of top solutions |
| 🎲 Best-of-N | `--search best_of_n` | Generates N variants per iteration, keeps the best |
| 🧪 GEPA Native | `--search gepa_native` | Pareto-efficient search with reflective prompting and LLM-mediated merge |
| 🗺️ OpenEvolve Native | `--search openevolve_native` | MAP-Elites + island-based evolutionary search |
### External backends

Install with `uv sync --extra external`, then use the corresponding flag:
| Backend | Flag | Source |
|:---|:---|:---|
| OpenEvolve | `--search openevolve` | [codelion/openevolve](https://github.com/codelion/openevolve) |
| GEPA | `--search gepa` | [gepa-ai/gepa](https://github.com/gepa-ai/gepa) |
| ShinkaEvolve | `--search shinkaevolve` | [SakanaAI/ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) |