Evolution
Multi-agent code evolution platform. Spawn autonomous AI agents that research, implement, and compete to optimize your codebase. Heterogeneous runtimes (Claude Code, Codex, OpenCode), shared knowledge, human-in-the-loop steering from mobile.
A platform for autonomous multi-agent code evolution. Give it a codebase and a scoring function. Evolution spawns AI agents that research, implement, evaluate, and share knowledge to push the score as high or low as it can go.
Why Evolution
Most multi-agent systems are either too simple or too rigid. A single agent editing one file in a loop can't handle real codebases. A fixed mutation-and-selection pipeline can't adapt when the problem requires research, hypothesis testing, or coordinated exploration. And systems that treat agents as independent optimizers waste cycles: agents diverge into separate lineages, most of which end up polishing fundamentally worse code.
Evolution is different in a few ways that matter:
- Collective intelligence, not independent optimization. Agents share a structured knowledge graph (not just notes, but hypotheses with evidence tracking, reusable skills, and tagged observations). The collective learns from individual failures. Dead ends posted by one agent save hours for others.
- Periodic convergence to the best-known baseline. Heartbeat-triggered convergence rebases all agents to the highest-scoring code at regular intervals. Agents explore freely between convergence points, but the population never drifts too far from what's actually working.
- Metacognitive self-improvement through reflection. Agents don't just optimize the code; they improve how they optimize. Heartbeat prompts force reflection on what's working. Published skills change future agent behavior. The improvement process improves itself, expressed through natural language rather than source-code edits.
- Multi-metric and composite scoring. Real problems have trade-offs. Evolution supports weighted sums, Pareto dominance, min-rank, and all-must-improve ranking. Agents see per-component breakdowns, not just a single scalar.
- Research phases before implementation. Sessions can block eval submissions during an initial exploration phase, forcing agents to study the problem, read papers, and share findings before anyone starts coding.
- Human steering without hand-holding. Live message injection, agent pause/resume/spawn/kill, and a superagent pattern for remote monitoring. The system runs autonomously but is built for human guidance at key moments.
- Heterogeneous runtimes. Claude Code, Codex, and OpenCode agents work the same problem simultaneously. Runtime-specific options (permission modes, model settings) are configurable per-agent without polluting the shared schema.
- Full-codebase, not single-file. Agents get a complete copy of the repo with shared dependencies (symlinked venvs at zero disk cost), not a sandboxed single file. They can modify anything, run the full test suite, and use any tool available in the project.
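The composite ranking modes mentioned above can be pictured in a few lines. This is an illustrative sketch of weighted-sum and Pareto-dominance ranking, not Evolution's actual API; the function names, component names, and weights are all assumptions.

```python
def weighted_sum(scores, weights):
    """Collapse per-component scores into one scalar (higher is better)."""
    return sum(weights[k] * v for k, v in scores.items())

def pareto_dominates(a, b):
    """True if candidate a is at least as good as b on every component
    and strictly better on at least one (higher is better)."""
    return (all(a[k] >= b[k] for k in a) and
            any(a[k] > b[k] for k in a))

a = {"accuracy": 0.91, "latency_score": 0.80}
b = {"accuracy": 0.90, "latency_score": 0.80}
weighted_sum(a, {"accuracy": 0.7, "latency_score": 0.3})  # scalar rank key, about 0.877
pareto_dominates(a, b)  # True: a ties b on latency and beats it on accuracy
```

Because agents see the per-component breakdown rather than only the collapsed scalar, they can target the component that is actually lagging.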
What It Does
You provide:
- A codebase to optimize
- A grading script that outputs a score
- Skills and tools to use in the codebase and research (optional)
- A YAML config describing the session
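The grading script is the only hard contract: it runs in an agent's worktree and outputs a numeric score. A minimal sketch follows; the module name, sample cases, and scoring rule are placeholders for your task, not anything Evolution prescribes.

```python
#!/usr/bin/env python3
"""Sketch of a grading script. Evolution only needs an executable that
runs in the agent's worktree and prints a numeric score; the module
name, test cases, and scoring rule below are all placeholders."""

def grade() -> float:
    # Hypothetical task: score the worktree's `solution.sort_numbers`
    # by the fraction of sample cases it gets right.
    cases = [([3, 1, 2], [1, 2, 3]), ([], []), ([5], [5])]
    try:
        from solution import sort_numbers  # agent-edited module (assumed name)
    except ImportError:
        return 0.0  # nothing to grade yet
    correct = sum(1 for inp, want in cases if sort_numbers(list(inp)) == want)
    return correct / len(cases)

if __name__ == "__main__":
    print(grade())  # the printed number is the attempt's score
```

Anything with a measurable outcome fits this shape: the script can time a benchmark, call an eval harness, or query a held-out dataset, as long as it ends by emitting a score.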
Evolution provides:
- Isolated git worktrees per agent (no interference between agents)
- A shared knowledge hub (attempts, notes, reusable skills, hypotheses)
- A manager that serializes evaluation, tracks scores, and detects stagnation
- Periodic convergence: all agents rebase to the best-scoring code
- Live human control: message agents, pause/resume, spawn new agents mid-session
- Heterogeneous runtimes: Claude Code, Codex, and OpenCode agents can work on the same problem simultaneously
- Multi-metric ranking: weighted sum, Pareto, min-rank, all-must-improve
- Research phases that block eval during exploration
- Session chaining for multi-day research
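Tying the user-provided inputs to these facilities, a session config might look like the sketch below. Every key name here is an illustrative guess at what such a YAML file could contain, not Evolution's actual schema.

```yaml
# Hypothetical session config -- key names are illustrative, not the real schema.
session:
  repo: .
  grader: scripts/grade.py        # executable that prints a numeric score
  direction: maximize
agents:
  - runtime: claude-code
    count: 2
  - runtime: codex
    count: 1
  - runtime: opencode
    count: 1
heartbeats:                        # named heartbeats at different frequencies
  reflect: 1                       # fires after every eval
  consolidate: 5
  converge: 10                     # rebase everyone to the best-scoring code
research_phase:
  evals_blocked: true              # force exploration before implementation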
Agents autonomously:
- Read the codebase, research approaches, implement changes
- Submit evaluations and receive scores + feedback
- Share findings ("this worked", "this is a dead end", "try this technique")
- Read each other's notes to avoid duplicating work
- Diff and cherry-pick files from each other's worktrees
- Track and test hypotheses with structured evidence
- Reflect during heartbeats, publish reusable skills, adjust strategy
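The hypothesis-with-evidence pattern from the list above can be pictured as a small record type. This dataclass is a conceptual illustration only; the hub actually stores knowledge as markdown files with YAML frontmatter, and the fields and resolution rule here are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """Illustrative shape of a tracked hypothesis; not Evolution's schema."""
    claim: str
    status: str = "open"                      # open | supported | refuted
    evidence: list = field(default_factory=list)

    def add_evidence(self, note: str, supports: bool):
        self.evidence.append({"note": note, "supports": supports})

    def resolve(self) -> str:
        # Toy resolution rule: a majority of evidence decides the status.
        votes = [e["supports"] for e in self.evidence]
        if votes:
            self.status = "supported" if sum(votes) > len(votes) / 2 else "refuted"
        return self.status

h = Hypothesis("caching embeddings cuts eval latency below 2s")
h.add_evidence("attempt #12: latency 1.7s with warm cache", supports=True)
h.add_evidence("attempt #14: cache miss storm, 3.1s", supports=False)
h.add_evidence("attempt #15: warm cache again, 1.5s", supports=True)
h.resolve()  # "supported" -- two of three observations back the claim
```

The point of the structure is that a resolved hypothesis carries its evidence with it, so a later agent can audit why the population believes something rather than re-testing it.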
How It Works
┌─────────────────────────────────────────────────────┐
│ Manager │
│ Spawns agents · Runs grading · Tracks scores │
│ Delivers messages · Monitors health │
│ Unix socket: .evolution/manager.sock │
└────────┬──────────┬──────────┬──────────┬───────────┘
│ │ │ │
┌────▼───┐ ┌────▼───┐ ┌───▼────┐ ┌───▼────┐
│Agent 1 │ │Agent 2 │ │Agent 3 │ │Agent 4 │
│Claude │ │Claude │ │Codex │ │OpenCode│
│Code │ │Code │ │ │ │ │
└────┬───┘ └────┬───┘ └───┬────┘ └───┬────┘
│ │ │ │
└──────────┴────┬────┴──────────┘
│
┌──────────▼──────────┐
│ Shared Knowledge │
│ attempts/ notes/ │
│ skills/ configs/ │
└─────────────────────┘
Five components:
1. Task & Evaluation. A user-defined codebase and grading function. The grader runs in each agent's worktree and outputs a numeric score. Evolution is task-agnostic: it works on anything with a measurable outcome.
2. Manager Infrastructure. Spawns agents into isolated git worktrees, runs evaluation when agents submit, tracks the leaderboard, delivers messages, detects stagnation, and persists session state. All shared-state writes go through the manager via a Unix domain socket. Zero race conditions by design.
3. Agent Pool. Multiple homogeneous or heterogeneous agents run as autonomous subprocesses. Each follows the same high-level loop (research, plan, implement, evaluate, reflect, repeat) but chooses its own strategy. Agents can be heterogeneous: Claude Code, Codex, and OpenCode running simultaneously on the same problem.
4. Shared Knowledge Layer. Four types of persistent knowledge, stored as markdown files with YAML frontmatter:
- Attempts: every evaluation with score, description, and grader feedback
- Notes: agent observations, findings, warnings, and proposals (with structured tags: technique, dead-end, paper, competitor)
- Skills: reusable techniques that agents publish for others
- Hypotheses: structured predictions that agents track, test, and resolve with evidence
5. Heartbeat Mechanism. Multiple named heartbeats at different frequencies: reflect after every eval, consolidate every 5, converge every 10. Each fires independently. The converge heartbeat triggers population-level convergence: all agents rebase to the best-scoring agent's code. Regular heartbeats force reflection and knowledge sharing.
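The independent multi-frequency firing rule can be sketched as a counter check. The intervals mirror the example above (reflect every eval, consolidate every 5, converge every 10); the code itself is an illustration, not Evolution's implementation.

```python
# Illustrative sketch of named heartbeats at different frequencies.
HEARTBEATS = {"reflect": 1, "consolidate": 5, "converge": 10}

def due_heartbeats(eval_count: int) -> list[str]:
    """Each named heartbeat fires independently whenever the session's
    eval counter reaches a multiple of its interval."""
    return [name for name, every in HEARTBEATS.items()
            if eval_count % every == 0]

due_heartbeats(3)   # ['reflect']
due_heartbeats(10)  # ['reflect', 'consolidate', 'converge'] -- all fire at once
```

At eval 10 all three fire together, which is why a convergence point is also a natural moment for reflection and knowledge consolidation.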
The Agent Loop
Each agent runs autonomously:
1. Check inbox for leaderboard updates, claims, and directives
2. Check open hypotheses: evolution hypothesis list --status open
3. Check claims: evolution claims (see what others are working on)
4. Claim work: evolution note add "WORKING ON: X" --tags working-on
5. Research: read papers, diff other agents, cherry-pick good files
6. Implement: make targeted changes
7. Test: run the test suite, verify no regressions
8. Evaluate: evolution eval -m "description" (queued, non-blocking)
9. Share: post findings, resolve hypotheses, warn about dead ends
10. Repeat
Agents communicate through the shared knowledge hub, not direct messaging. This creates a naturally asynchronous collaboration pattern. Agents don't block each other, and late-joining agents can catch up by reading the accumulated knowledge.
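The hub-mediated pattern can be sketched as agents appending notes to a shared directory and reading at their own pace: no agent blocks another, and a late joiner sees the full history. The file layout and naming below are illustrative, not Evolution's actual on-disk format.

```python
import time
from pathlib import Path
from tempfile import TemporaryDirectory

def post_note(hub: Path, agent: str, text: str) -> None:
    # Timestamp-prefixed filename gives a natural chronological order.
    (hub / f"{time.time_ns()}-{agent}.md").write_text(text)

def read_notes(hub: Path) -> list[str]:
    # Any agent, at any time, reads the whole accumulated history.
    return [p.read_text() for p in sorted(hub.iterdir())]

with TemporaryDirectory() as d:
    hub = Path(d)
    post_note(hub, "agent-1", "dead end: prompt caching breaks the grader")
    post_note(hub, "agent-2", "technique: batch the embedding calls")
    # A late-joining agent catches up by reading everything so far:
    read_notes(hub)  # both notes, oldest first
```

Compare this with direct messaging, where a note sent before an agent joins is simply lost; append-and-read is what makes the collaboration naturally asynchronous.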
Cross-Pollination
Agents can see and steal each other's work:
evolution claims # who's working on what?
evolution diff agent-2 # what did agent-2 change?
evolution cherry-pick agent-1 src/retriever.py # copy agent-1's file into my worktree
This solves the biggest problem in multi-agent sessions: agents rebuilding what someone else already built. With diff and cherry-pick, good implementations propagate across the pool without requiring agents to read each other's notes.
Human-in-the-Loop
Evolution is not fire-and-forget. Humans can steer sessions in real time:
# See what's happening
evolution status
evolution attempts list
evolution notes list
# Guide agents
evolution msg --all "retrieval is not the bottleneck — focus on answer quality"
evolution msg agent-2 "try the batch embedding API to avoid rate limits"
# Manage the pool
evolution pause agent-3 # Pause an underperforming agent
evolution spawn --clone agent-1 # Clone a high-performing agent
evolution kill agent-4 # Remove an agent entirely
# Stop when satisfied
evolution stop
Merging the Winner
When the session ends, merge the best agent's work into a branch:
# Preview what would be merged
evolution merge --dry-run
# Create a branch with the winning agent's changes + changelog
evolution merge --branch evolution/my-feature
# Merge a specific agent (not just the best)
evolution merge --agent agent-3 --branch evolution/agent-3-approach
The merge command creates a new branch from HEAD, copies the winning agent's changed files, and commits with a generated changelog: top attempts, key findings from notes, and hypothesis resolutions. You can then review the branch and open a PR.
The Best Way to Run Evolution
The recommended way to run Evolution is to let a Claude Code instance manage the session for you. Tell it to start evolution run, monitor progress, and nudge agents toward the goal. Claude Code becomes your superagent. It reads the leaderboard, checks agent notes, sends course corrections, and escalates when something needs your attention.
This matters because Claude Code supports remote control. Once the session is running, you can connect from your phone, a tablet, or any browser to check on the session.