Cavekit
A Claude Code plugin that turns natural language into specs, specs into parallel build plans, and build plans into working software — with automated iteration, validation, and dual-model adversarial review.
You describe what you want. Cavekit writes the contract. Agents build from the contract. Every line of code traces to a requirement. Every requirement has acceptance criteria. Nothing gets lost, nothing gets guessed.
Before / After
<table> <tr> <td width="50%">Without Cavekit
> Build me a task management API
(agent writes 2000 lines)
(no tests)
(forgot the auth middleware)
(wrong database schema)
(you spend 3 hours fixing it)
One shot. No validation. No traceability. The agent guessed what you wanted.
</td> <td width="50%">With Cavekit
> /ck:sketch
4 kits, 22 requirements, 69 criteria
> /ck:map
34 tasks across 5 dependency tiers
> /ck:make
18 iterations — each validated against
the spec before committing
CAVEKIT COMPLETE
Every requirement traced. Every criterion checked.
</td> </tr> </table>

Same feature. Zero guesswork. Full traceability.
The Problem
AI coding agents are powerful, but they fail the same way every time:
| Failure | What Happens |
|---------|--------------|
| Context loss | Agent forgets what it said three steps ago |
| No validation | Code written, never verified against intent |
| No parallelism | One agent, one task, one branch — even when work is independent |
| No iteration | Single pass produces a rough draft, not production code |
Cavekit fixes all four.
The Idea
Instead of "prompt and pray," Cavekit puts a specification layer between your intent and the code.
                                                       ┌─── Task 1 ─── Agent A ───┐
                                                       │                          │
You ── /ck:sketch ──► Kits ── /ck:map ──► Build Site ──┤─── Task 2 ─── Agent B ───┤──► done
                                                       │                          │
                                                       └─── Task 3 ─── Agent C ───┘
Kits are the source of truth. Agents read them, build from them, validate against them. When something breaks, the system traces the failure back to the kit — not the code.
Spec is the product. Code is the derivative.
Install
git clone https://github.com/JuliusBrussee/cavekit.git ~/.cavekit
cd ~/.cavekit && ./install.sh
Registers the plugin with Claude Code, syncs it into the Codex marketplace, and installs the cavekit CLI. Restart Claude Code after installing.
Requires: Claude Code, git, macOS/Linux.
Optional: Codex (npm install -g @openai/codex) — adds adversarial review. Cavekit works without it, but Codex makes it significantly harder to ship flawed specs and broken code.
How It Works
Four numbered phases, each one a slash command, plus optional research and design steps.
RESEARCH          DRAFT              ARCHITECT            BUILD             INSPECT
────────          ─────              ─────────            ─────             ───────
(optional)        "What are we       Break into tasks,    Auto-parallel:    Gap analysis:
Multi-agent       building?"         map dependencies,    /ck:make          built vs.
codebase +                           organize into        groups work       intended.
web research      Produces:          tiered build site    into adaptive     Peer review.
                  kits with          + dependency graph   subagent packets  Trace to specs.
Produces:         R-numbered                              tier by tier
research brief    requirements       Produces:                              Produces:
                                     task graph           Codex reviews     findings report
                  Codex challenges                        every tier gate
                  the design
0. Research — ground the design (optional)
/ck:research "build a C+ compiler"
Dispatches 2–8 parallel subagents to explore the codebase and search the web for best practices, library landscape, reference implementations, and common pitfalls. A synthesizer agent cross-validates findings and produces a research brief in context/refs/.
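The fan-out/fan-in shape of this step can be sketched with ordinary concurrency primitives. A minimal Python sketch, where `explore` and `synthesize` are hypothetical stand-ins for the real subagents:

```python
from concurrent.futures import ThreadPoolExecutor

def research(topics, explore, synthesize):
    """Fan out one explorer per topic, then fan in through a synthesizer.
    `explore` and `synthesize` are hypothetical stand-ins for the agents."""
    with ThreadPoolExecutor(max_workers=8) as pool:  # up to 8 parallel subagents
        findings = list(pool.map(explore, topics))   # preserves topic order
    return synthesize(findings)                      # cross-validate and merge

brief = research(
    ["library landscape", "common pitfalls"],
    explore=lambda t: f"notes on {t}",
    synthesize=lambda fs: " | ".join(fs),
)
print(brief)  # notes on library landscape | notes on common pitfalls
```

The synthesizer sees every explorer's output at once, which is what lets it cross-validate findings rather than trust any single agent.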
/ck:design — establish the design system
/ck:design
Creates or imports a DESIGN.md design system — a cross-cutting constraint layer enforced across the entire pipeline. Every kit references its design tokens, every task carries a Design Ref, every build result is audited for violations.
| Sub-command | What it does |
|------------|-------------|
| /ck:design create | Generate new DESIGN.md via guided Q&A |
| /ck:design import | Extract DESIGN.md from existing codebase |
| /ck:design audit | Check implementation against DESIGN.md |
| /ck:design update | Revise DESIGN.md, log to changelog |
1. Draft — define the what
/ck:sketch
Describe what you're building in natural language. Cavekit decomposes it into domain kits — structured documents with numbered requirements (R1, R2, ...) and testable acceptance criteria. Stack-independent. Human-readable.
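A kit in this scheme might look something like the sketch below (hypothetical layout and ids — the actual format is whatever /ck:sketch emits):

```markdown
# Kit: tasks

R1. Users can create tasks with a title, priority, and optional due date.
  - AC1.1: Creating a task with a valid body returns the created task.
  - AC1.2: A missing title is rejected with a validation error.

R2. Every task belongs to exactly one project.
  - AC2.1: Creating a task against an unknown project id fails cleanly.
```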
After internal review, kits go to Codex for a design challenge — adversarial review that catches decomposition flaws, missing requirements, and ambiguous criteria before any code is written.
For existing codebases: /ck:sketch --from-code reverse-engineers kits from your code and identifies gaps.
2. Architect — plan the order
/ck:map
Reads all kits. Breaks requirements into tasks. Maps dependencies. Organizes into a tiered build site — a dependency graph where Tier 0 has no deps, Tier 1 depends only on Tier 0, and so on. Includes a Coverage Matrix mapping every acceptance criterion to its task(s). Nothing specified gets lost in translation.
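The tier rule is plain topological layering: a task's tier is one more than the highest tier among its dependencies. A minimal sketch in Python, over a made-up task graph:

```python
def tier_tasks(deps):
    """deps: {task: [tasks it depends on]}. Returns {task: tier},
    where a tier is one more than the highest tier of its dependencies."""
    tiers = {}
    while len(tiers) < len(deps):
        progressed = False
        for task, requires in deps.items():
            if task not in tiers and all(r in tiers for r in requires):
                tiers[task] = max((tiers[r] + 1 for r in requires), default=0)
                progressed = True
        if not progressed:
            raise ValueError("dependency cycle")  # no valid build order
    return tiers

# Hypothetical task graph: schema first, then models, then routes.
deps = {"schema": [], "models": ["schema"], "routes": ["models"], "auth": ["schema"]}
print(tier_tasks(deps))  # {'schema': 0, 'models': 1, 'routes': 2, 'auth': 1}
```

Everything in the same tier is independent by construction, which is what makes tier-by-tier parallel builds safe.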
3. Build — run the loop
/ck:make
Pre-flight coverage check validates all acceptance criteria are covered. Then the loop runs:
┌──────────────────────────────────────────────────────┐
│ │
│ Read build site → Find next unblocked task │
│ │ │
│ ▼ │
│ Load relevant kit + acceptance criteria │
│ │ │
│ ▼ │
│ Implement the task │
│ │ │
│ ▼ │
│ Validate (build + tests + acceptance criteria) │
│ │ │
│ ├── PASS → commit → mark done → next ──┐ │
│ │ │ │
│ └── FAIL → diagnose → fix → revalidate │ │
│ │ │
│ ◄────────────────────────────────────────────┘ │
│ │
│ Loop until: all tasks done OR limit reached │
└──────────────────────────────────────────────────────┘
At every tier boundary, Codex adversarial review gates advancement. P0/P1 findings must be fixed before the next tier starts. With speculative review (default), this adds near-zero latency.
Post-flight verification cross-references what was built against original kits. Gaps get remediation tasks.
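Stripped of the agent machinery, the loop is a bounded retry over unblocked tasks. A toy Python sketch, where `validate` stands in for build + tests + acceptance checks:

```python
def make_loop(tasks, validate, max_iterations=20):
    """tasks: ordered task ids. validate(task) -> bool (build, tests,
    acceptance criteria). Returns (done, iterations used)."""
    done, iterations = [], 0
    pending = list(tasks)
    while pending and iterations < max_iterations:
        task = pending[0]              # next unblocked task
        iterations += 1
        if validate(task):
            done.append(pending.pop(0))  # PASS: commit, mark done
        # FAIL: diagnose/fix happens here, then revalidate next iteration
    return done, iterations

# Toy run: task "b" fails validation once before passing.
attempts = {"a": 0, "b": 0}
def validate(t):
    attempts[t] += 1
    return not (t == "b" and attempts[t] == 1)

print(make_loop(["a", "b"], validate))  # (['a', 'b'], 3)
```

The iteration cap is what keeps a stuck task from looping forever: the run ends when all tasks are done or the limit is reached, whichever comes first.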
4. Inspect — verify the result
/ck:check
Gap analysis: built vs. specified. Peer review: bugs, security, missed requirements. Everything traced back to kit requirements.
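At its core the gap analysis is a set difference between what the kits specify and what the build covered. A minimal sketch with hypothetical criterion ids:

```python
# Hypothetical criterion ids; in practice these come from the kits
# and the Coverage Matrix.
specified = {"AC1.1", "AC1.2", "AC2.1", "AC2.2"}
covered = {"AC1.1", "AC1.2", "AC2.1"}

gaps = sorted(specified - covered)  # criteria specified but never satisfied
print(gaps)  # ['AC2.2']
```

Each gap maps back to a numbered requirement, which is what turns a finding into a concrete remediation task.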
Quick Start
Greenfield:
> /ck:sketch
What are you building?
> A REST API for task management. Users, projects, tasks
with priorities and due dates. PostgreSQL.
Created 4 kits (22 requirements, 69 acceptance criteria)
Next: /ck:map
> /ck:map
Generated build site: 34 tasks, 5 tiers
Next: /ck:make
> /ck:make
Loop activated — 34 tasks, 20 max iterations.
...
All tasks done. Build passes. Tests pass.
CAVEKIT COMPLETE — 34 tasks in 18 iterations.
Existing codebase:
> /ck:sketch --from-code
Exploring codebase... Next.js 14, Prisma, NextAuth.
Created 6 kits — 4 requirements are gaps (not yet implemented).
> /ck:map --filter collaboration
Generated build site: 8 tasks, 3 tiers
> /ck:make
CAVEKIT COMPLETE — 8 tasks in 8 iterations.
See example.md for a full walkthrough.