Cavekit
A Claude Code plugin that turns natural language into specs, specs into parallel build plans, and build plans into working software — with automated iteration, validation, and dual-model adversarial review.
You describe what you want. Cavekit writes the contract. Agents build from the contract. Every line of code traces to a requirement. Every requirement has acceptance criteria. Nothing gets lost, nothing gets guessed.
Before / After
<table> <tr> <td width="50%">Without Cavekit
> Build me a task management API
(agent writes 2000 lines)
(no tests)
(forgot the auth middleware)
(wrong database schema)
(you spend 3 hours fixing it)
One shot. No validation. No traceability. The agent guessed what you wanted.
</td> <td width="50%">With Cavekit
> /ck:sketch
4 kits, 22 requirements, 69 criteria
> /ck:map
34 tasks across 5 dependency tiers
> /ck:make
18 iterations — each validated against
the spec before committing
CAVEKIT COMPLETE
Every requirement traced. Every criterion checked.
</td> </tr> </table>

Same feature. Zero guesswork. Full traceability.
The Problem
AI coding agents are powerful, but they fail the same way every time:
| Failure | What Happens |
|---------|--------------|
| Context loss | Agent forgets what it said three steps ago |
| No validation | Code written, never verified against intent |
| No parallelism | One agent, one task, one branch — even when work is independent |
| No iteration | Single pass produces a rough draft, not production code |
Cavekit fixes all four.
The Idea
Instead of "prompt and pray," Cavekit puts a specification layer between your intent and the code.
                                                       ┌─── Task 1 ─── Agent A ───┐
                                                       │                          │
You ── /ck:sketch ──► Kits ── /ck:map ──► Build Site ──┤─── Task 2 ─── Agent B ───┤──► done
                                                       │                          │
                                                       └─── Task 3 ─── Agent C ───┘
Kits are the source of truth. Agents read them, build from them, validate against them. When something breaks, the system traces the failure back to the kit — not the code.
Spec is the product. Code is the derivative.
Install
git clone https://github.com/JuliusBrussee/cavekit.git ~/.cavekit
cd ~/.cavekit && ./install.sh
Registers the plugin with Claude Code, syncs it into the Codex marketplace, and installs the cavekit CLI. Restart Claude Code after installing.
Requires: Claude Code, git, macOS/Linux.
Optional: Codex (npm install -g @openai/codex) — adds adversarial review. Cavekit works without it, but Codex makes it significantly harder to ship flawed specs and broken code.
How It Works
Four numbered phases, each one a slash command, plus optional research and design steps.
RESEARCH          DRAFT              ARCHITECT            BUILD             INSPECT
────────          ─────              ─────────            ─────             ───────
(optional)        "What are we       Break into tasks,    Auto-parallel:    Gap analysis:
Multi-agent       building?"         map dependencies,    /ck:make          built vs.
codebase +                           organize into        groups work       intended.
web research      Produces:          tiered build site    into adaptive     Peer review.
                  kits with          + dependency graph   subagent packets  Trace to specs.
Produces:         R-numbered                              tier by tier
research brief    requirements       Produces:                              Produces:
                                     task graph           Codex reviews     findings report
                  Codex challenges                        every tier gate
                  the design
0. Research — ground the design (optional)
/ck:research "build a C+ compiler"
Dispatches 2–8 parallel subagents to explore the codebase and search the web for best practices, library landscape, reference implementations, and common pitfalls. A synthesizer agent cross-validates findings and produces a research brief in context/refs/.
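The fan-out/fan-in shape of this step can be sketched with ordinary concurrency primitives. A minimal Python sketch, where `explore` and `synthesize` are hypothetical stand-ins for the real subagents:

```python
from concurrent.futures import ThreadPoolExecutor

def research(topics, explore, synthesize):
    """Fan out one explorer per topic, then fan in through a synthesizer.
    `explore` and `synthesize` are hypothetical stand-ins for the agents."""
    with ThreadPoolExecutor(max_workers=8) as pool:  # up to 8 parallel subagents
        findings = list(pool.map(explore, topics))   # preserves topic order
    return synthesize(findings)                      # cross-validate and merge

brief = research(
    ["library landscape", "common pitfalls"],
    explore=lambda t: f"notes on {t}",
    synthesize=lambda fs: " | ".join(fs),
)
print(brief)  # notes on library landscape | notes on common pitfalls
```

The synthesizer sees every explorer's output at once, which is what lets it cross-validate findings rather than trust any single agent.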
/ck:design — establish the design system
/ck:design
Creates or imports a DESIGN.md design system — a cross-cutting constraint layer enforced across the entire pipeline. Every kit references its design tokens, every task carries a Design Ref, every build result is audited for violations.
| Sub-command | What it does |
|------------|-------------|
| /ck:design create | Generate new DESIGN.md via guided Q&A |
| /ck:design import | Extract DESIGN.md from existing codebase |
| /ck:design audit | Check implementation against DESIGN.md |
| /ck:design update | Revise DESIGN.md, log to changelog |
1. Draft — define the what
/ck:sketch
Describe what you're building in natural language. Cavekit decomposes it into domain kits — structured documents with numbered requirements (R1, R2, ...) and testable acceptance criteria. Stack-independent. Human-readable.
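A kit in this scheme might look something like the sketch below (hypothetical layout and ids — the actual format is whatever /ck:sketch emits):

```markdown
# Kit: tasks

R1. Users can create tasks with a title, priority, and optional due date.
  - AC1.1: Creating a task with a valid body returns the created task.
  - AC1.2: A missing title is rejected with a validation error.

R2. Every task belongs to exactly one project.
  - AC2.1: Creating a task against an unknown project id fails cleanly.
```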
After internal review, kits go to Codex for a design challenge — adversarial review that catches decomposition flaws, missing requirements, and ambiguous criteria before any code is written.
For existing codebases: /ck:sketch --from-code reverse-engineers kits from your code and identifies gaps.
2. Architect — plan the order
/ck:map
Reads all kits. Breaks requirements into tasks. Maps dependencies. Organizes into a tiered build site — a dependency graph where Tier 0 has no deps, Tier 1 depends only on Tier 0, and so on. Includes a Coverage Matrix mapping every acceptance criterion to its task(s). Nothing specified gets lost in translation.
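The tier rule is plain topological layering: a task's tier is one more than the highest tier among its dependencies. A minimal sketch in Python, over a made-up task graph:

```python
def tier_tasks(deps):
    """deps: {task: [tasks it depends on]}. Returns {task: tier},
    where a tier is one more than the highest tier of its dependencies."""
    tiers = {}
    while len(tiers) < len(deps):
        progressed = False
        for task, requires in deps.items():
            if task not in tiers and all(r in tiers for r in requires):
                tiers[task] = max((tiers[r] + 1 for r in requires), default=0)
                progressed = True
        if not progressed:
            raise ValueError("dependency cycle")  # no valid build order
    return tiers

# Hypothetical task graph: schema first, then models, then routes.
deps = {"schema": [], "models": ["schema"], "routes": ["models"], "auth": ["schema"]}
print(tier_tasks(deps))  # {'schema': 0, 'models': 1, 'routes': 2, 'auth': 1}
```

Everything in the same tier is independent by construction, which is what makes tier-by-tier parallel builds safe.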
3. Build — run the loop
/ck:make
Pre-flight coverage check validates all acceptance criteria are covered. Then the loop runs:
┌──────────────────────────────────────────────────────┐
│ │
│ Read build site → Find next unblocked task │
│ │ │
│ ▼ │
│ Load relevant kit + acceptance criteria │
│ │ │
│ ▼ │
│ Implement the task │
│ │ │
│ ▼ │
│ Validate (build + tests + acceptance criteria) │
│ │ │
│ ├── PASS → commit → mark done → next ──┐ │
│ │ │ │
│ └── FAIL → diagnose → fix → revalidate │ │
│ │ │
│ ◄────────────────────────────────────────────┘ │
│ │
│ Loop until: all tasks done OR limit reached │
└──────────────────────────────────────────────────────┘
At every tier boundary, Codex adversarial review gates advancement. P0/P1 findings must be fixed before the next tier starts. With speculative review (default), this adds near-zero latency.
Post-flight verification cross-references what was built against original kits. Gaps get remediation tasks.
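Stripped of the agent machinery, the loop is a bounded retry over unblocked tasks. A toy Python sketch, where `validate` stands in for build + tests + acceptance checks:

```python
def make_loop(tasks, validate, max_iterations=20):
    """tasks: ordered task ids. validate(task) -> bool (build, tests,
    acceptance criteria). Returns (done, iterations used)."""
    done, iterations = [], 0
    pending = list(tasks)
    while pending and iterations < max_iterations:
        task = pending[0]              # next unblocked task
        iterations += 1
        if validate(task):
            done.append(pending.pop(0))  # PASS: commit, mark done
        # FAIL: diagnose/fix happens here, then revalidate next iteration
    return done, iterations

# Toy run: task "b" fails validation once before passing.
attempts = {"a": 0, "b": 0}
def validate(t):
    attempts[t] += 1
    return not (t == "b" and attempts[t] == 1)

print(make_loop(["a", "b"], validate))  # (['a', 'b'], 3)
```

The iteration cap is what keeps a stuck task from looping forever: the run ends when all tasks are done or the limit is reached, whichever comes first.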
4. Inspect — verify the result
/ck:check
Gap analysis: built vs. specified. Peer review: bugs, security, missed requirements. Everything traced back to kit requirements.
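At its core the gap analysis is a set difference between what the kits specify and what the build covered. A minimal sketch with hypothetical criterion ids:

```python
# Hypothetical criterion ids; in practice these come from the kits
# and the Coverage Matrix.
specified = {"AC1.1", "AC1.2", "AC2.1", "AC2.2"}
covered = {"AC1.1", "AC1.2", "AC2.1"}

gaps = sorted(specified - covered)  # criteria specified but never satisfied
print(gaps)  # ['AC2.2']
```

Each gap maps back to a numbered requirement, which is what turns a finding into a concrete remediation task.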
Quick Start
Greenfield:
> /ck:sketch
What are you building?
> A REST API for task management. Users, projects, tasks
with priorities and due dates. PostgreSQL.
Created 4 kits (22 requirements, 69 acceptance criteria)
Next: /ck:map
> /ck:map
Generated build site: 34 tasks, 5 tiers
Next: /ck:make
> /ck:make
Loop activated — 34 tasks, 20 max iterations.
...
All tasks done. Build passes. Tests pass.
CAVEKIT COMPLETE — 34 tasks in 18 iterations.
Existing codebase:
> /ck:sketch --from-code
Exploring codebase... Next.js 14, Prisma, NextAuth.
Created 6 kits — 4 requirements are gaps (not yet implemented).
> /ck:map --filter collaboration
Generated build site: 8 tasks, 3 tiers
> /ck:make
CAVEKIT COMPLETE — 8 tasks in 8 iterations.
See example.md for a full walkthrough.