# Temper

Temper is a plugin for Claude Code that closes the quality gap in AI-generated code.
**Install / Use:** `/learn @galando/Temper`
**Category:** Development & Engineering
Your AI writes fast. Temper makes it last.
Intent-driven development with behavioral testing, security analysis, and quality gates for AI-generated code
<img src="docs/temper.png" alt="Temper Dashboard" width="100%">

Website | Getting Started | Releases
## The Problem
AI writes code fast. But "fast" without "right" creates bugs, technical debt, and features that miss the point.
### "Why not just tell Claude to be careful?"
You can. And it helps. But AI-generated code has structural failure patterns that "be careful" doesn't address. These aren't sloppiness — they're limitations of how LLMs generate code:
- Missing behaviors — AI builds the happy path, skips edge cases. Rate limiting? Error recovery? Never implemented.
- Wrong problem solved — Feature works perfectly, but nobody asked for it. All tests pass, wrong thing built.
- Over-engineering — AI creates factories, strategies, and abstractions for something used exactly once.
- Hallucinated APIs — AI calls methods that don't exist. It's confident they do.
- Missing wiring — New code never registered in routing, DI, or config. The code itself is correct; the integration is missing.
These map to three unanswered questions:
| Question | What Goes Wrong Without It |
|----------|----------------------------|
| Did we solve the problem? | Feature works but nobody uses it. Wrong problem solved. |
| Does it do the right things? | Happy path works, edge cases ship broken. |
| Does the code work? | Tests pass, but they test implementation details, not behaviors. |
Most AI tools answer only the third. Temper answers all three.
## IDD + BDD + TDD: Three Layers, One File
Temper combines three development methodologies in a single artifact called intent.md. Each layer answers a different question and is enforced at a different stage of the pipeline:
```
intent.md
|
+-- Intent Section (IDD)        WHY are we building this?
|     Problem statement
|     Success criteria (each with a Validate: type)
|     Constraints
|
+-- Scenarios Section (BDD)     WHAT should it do?
|     Gherkin Given/When/Then
|     Derived BEFORE architecture
|     Every planned file traces to a scenario
|
+-- /temper:build (TDD)         HOW do we build it?
      Tests written from scenarios
      RED -> GREEN -> REFACTOR
```
### IDD: Intent-Driven Development

**Question:** Did we solve the problem?
**When:** Defined during /temper:plan, validated during /temper:review
IDD captures the why behind a feature. Not "add a password reset endpoint" but "users should be able to reset their password without contacting support, completing the flow in under 2 minutes."
The Intent section of intent.md contains:
- Problem — What problem are we solving? For whom?
- Success Criteria — Measurable outcomes, each with a `Validate:` type that tells review how to check it
- Constraints — Technical or business limitations
- Target Users — Who benefits
#### Validate Types
Each success criterion gets a validation type. This is what makes IDD mechanical instead of subjective:
| Type | What It Means | How Review Checks It | Example |
|------|--------------|---------------------|---------|
| scenario | Criterion is satisfied when a linked BDD scenario's test passes | Finds the test, runs it, checks PASS | "Users can reset password" -> linked to scenario "Successful password reset" |
| code | Criterion is satisfied when specific code exists | Greps the codebase for the pattern | "POST /api/reset endpoint exists" -> greps for route definition |
| metric | Can't be verified before deployment | Flags for post-deploy monitoring | "Support tickets decrease 30%" -> requires production data |
| manual | Requires human judgment | Flags for human review, non-blocking | "Reset flow feels intuitive" -> UX review needed |
Why this matters: Without validate types, "intent validation" means the AI reads your success criteria and subjectively judges "yeah, this looks met." With validate types, most criteria are mechanically verified — a test passes or it doesn't, code exists or it doesn't. Only metric and manual require judgment.
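As a sketch, an Intent section using these validate types might look like this (the exact intent.md syntax, criteria, and constraint shown here are invented for illustration):

```markdown
## Intent

**Problem:** Users cannot reset their password without contacting support.

**Success Criteria:**
- Users can reset password without support
  Validate: scenario -> "Successful password reset"
- Reset endpoint exists at POST /api/reset
  Validate: code -> route definition for POST /api/reset
- Support ticket volume decreases 30%
  Validate: metric -> post-deploy monitoring
- Reset flow feels intuitive
  Validate: manual -> UX review

**Constraints:** Reset tokens expire after 15 minutes.

**Target Users:** End users locked out of their accounts.
```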
#### Intent Validation in Review
When /temper:review runs, it produces:
```
Intent Validation (IDD): 4/5 (3 mechanical, 1 deferred, 1 manual)

Problem: Users unable to reset passwords without support

[x] Users can reset password without support
    validate: scenario -> test_successful_reset PASS
[x] Reset endpoint exists at POST /api/reset
    validate: code -> route found in AuthController.ts:23
[x] Rate limiting prevents abuse
    validate: scenario -> test_rate_limiting PASS
[ ] Support ticket volume decreases 30%
    validate: metric -> post-deploy monitoring required
[ ] Reset flow completes in under 2 minutes
    validate: manual -> requires human review

Confidence: 3/5 mechanically verified
```
The higher the ratio of scenario/code criteria, the more confidence you have that the feature actually solves the stated problem.
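How this tally could be computed is sketched below; the types, function names, and criteria are illustrative, not Temper's actual internals:

```typescript
// Hypothetical sketch of the "mechanically verified" tally.
type ValidateType = "scenario" | "code" | "metric" | "manual";

interface Criterion {
  text: string;
  validate: ValidateType;
  passed?: boolean; // set by review for scenario/code criteria
}

// A criterion counts as mechanically verified when its validate type
// can be checked directly: a test ran and passed (scenario), or the
// expected code was found (code). metric/manual need human judgment.
function mechanicallyVerified(criteria: Criterion[]): number {
  return criteria.filter(
    (c) => (c.validate === "scenario" || c.validate === "code") && c.passed
  ).length;
}

const criteria: Criterion[] = [
  { text: "Users can reset password without support", validate: "scenario", passed: true },
  { text: "Reset endpoint exists at POST /api/reset", validate: "code", passed: true },
  { text: "Rate limiting prevents abuse", validate: "scenario", passed: true },
  { text: "Support ticket volume decreases 30%", validate: "metric" },
  { text: "Reset flow completes in under 2 minutes", validate: "manual" },
];

console.log(`Confidence: ${mechanicallyVerified(criteria)}/${criteria.length} mechanically verified`);
```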
### BDD: Behavior-Driven Development

**Question:** Does it do the right things?
**When:** Scenarios derived during /temper:plan (before architecture), enforced during /temper:build
BDD in Temper isn't an afterthought — scenarios are derived before the architecture exists. This is the key design decision. The flow is:
1. Blast radius analysis -> identifies affected files and risk areas
2. Scenario derivation -> behaviors from requirements + blast radius
3. Architecture from scenarios -> file list justified by scenarios
Not the other way around. This prevents the AI from planning 15 files and then writing scenarios that justify them. Instead, scenarios define what the system must do, and the file list follows.
#### Where Scenarios Come From
Scenarios aren't invented — they're derived from concrete sources:
| Source | Becomes |
|--------|---------|
| Feature description | Happy path scenarios |
| Acceptance criteria (Jira/GitHub issue) | Validation scenarios |
| Blast radius: risk areas | Edge case and error scenarios |
| Blast radius: affected consumers | Regression guard scenarios ("existing X still works") |
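As an illustration of this derivation, a feature description plus a blast-radius risk area might yield scenarios like these (Gherkin sketch; the feature and details are invented):

```gherkin
Feature: Password reset

  # Happy path, derived from the feature description
  Scenario: Successful password reset
    Given a registered user with a valid reset token
    When they submit a new password
    Then the password is updated and the token is invalidated

  # Edge case, derived from blast radius risk area "token expiry"
  Scenario: Expired token rejected
    Given a reset token older than 15 minutes
    When the user submits a new password
    Then the request is rejected with an explanatory error
```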
#### File-to-Scenario Traceability
Every file in the plan must justify its existence:
```
Scenario-traced files:
  src/services/PasswordResetService.ts -> Scenario: "Successful password reset"
  src/middleware/RateLimiter.ts        -> Scenario: "Rate limiting enforced"

Infrastructure files (no scenario needed, but must state dependency):
  db/migrations/001_add_reset_tokens.sql -> Required by PasswordResetService
  config/email.ts                        -> Required by PasswordResetService
```
If the AI plans a file that no scenario needs and isn't infrastructure — that file shouldn't exist. This is how Temper prevents over-engineering structurally, not by hoping the AI "keeps it simple."
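The same rule can be expressed as a simple check. The structures and names below are hypothetical, not Temper's actual plan format:

```typescript
// Sketch: every planned file must trace to a scenario or declare an
// infrastructure dependency. Structures are illustrative.
interface PlannedFile {
  path: string;
  scenario?: string;   // behavior that justifies the file
  requiredBy?: string; // infrastructure dependency, if no scenario
}

// Files with neither justification are flagged as over-engineering.
function unjustifiedFiles(plan: PlannedFile[]): string[] {
  return plan.filter((f) => !f.scenario && !f.requiredBy).map((f) => f.path);
}

const plan: PlannedFile[] = [
  { path: "src/services/PasswordResetService.ts", scenario: "Successful password reset" },
  { path: "db/migrations/001_add_reset_tokens.sql", requiredBy: "PasswordResetService" },
  { path: "src/factories/ResetStrategyFactory.ts" }, // nothing needs this
];

console.log(unjustifiedFiles(plan)); // the factory is flagged
```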
#### Scenario Coverage Gate
After all tasks complete, /temper:build runs the scenario coverage gate:
```
Scenario Coverage: 5/5
[x] Successful password reset -> test_successful_reset (PASS)
[x] Expired token rejected    -> test_expired_token (PASS)
[x] Rate limiting enforced    -> test_rate_limiting (PASS)
[x] Invalid email format      -> test_invalid_email (PASS)
[x] Non-existent user handled -> test_nonexistent_user (PASS)
```
If any scenario lacks a passing test, the build cannot proceed: build writes the missing test, runs it, and implements the behavior until the test passes. This is how the rate-limiting example works — the scenario existed in intent.md, no test covered it, so build caught the gap.
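The gate's core logic can be sketched roughly as follows (hypothetical names; Temper's real implementation may differ):

```typescript
// Rough sketch of a scenario coverage gate. Names are illustrative.
interface TestResult {
  scenario: string; // the scenario this test is mapped to
  passed: boolean;
}

// Returns the scenarios that have no passing test; the build may
// proceed only when this list is empty.
function uncoveredScenarios(scenarios: string[], results: TestResult[]): string[] {
  return scenarios.filter(
    (s) => !results.some((r) => r.scenario === s && r.passed)
  );
}

const scenarios = ["Successful password reset", "Rate limiting enforced"];
const results: TestResult[] = [
  { scenario: "Successful password reset", passed: true },
  // No test covers "Rate limiting enforced" -> gate blocks the build.
];

const missing = uncoveredScenarios(scenarios, results);
console.log(missing.length === 0 ? "Gate: PASS" : `Gate: BLOCKED, missing: ${missing.join(", ")}`);
```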
### TDD: Test-Driven Development

**Question:** Does the code work?
**When:** During /temper:build, per scenario
TDD in Temper is scenario-driven. Instead of the AI deciding what to test, tests are derived from BDD scenarios:
| BDD Scenario | Becomes TDD |
|-------------|-------------|
| Given (preconditions) | Test setup |
| When (action) | Method/endpoint call |
| Then (expected outcome) | Assertions |
| Scenario name | Test name |
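The mapping above can be sketched for one scenario. The service and its API below are invented for illustration, not Temper's output:

```typescript
// Hypothetical service under test; invented for illustration.
class PasswordResetService {
  private tokens = new Map<string, number>(); // token -> issued-at (ms)

  issueToken(token: string, issuedAt: number): void {
    this.tokens.set(token, issuedAt);
  }

  // Tokens older than 15 minutes are rejected.
  reset(token: string, now: number): boolean {
    const issuedAt = this.tokens.get(token);
    if (issuedAt === undefined) return false;
    return now - issuedAt <= 15 * 60 * 1000;
  }
}

// Scenario: "Expired token rejected" -> test name from scenario name
function test_expired_token(): void {
  // Given: a reset token older than 15 minutes
  const service = new PasswordResetService();
  service.issueToken("abc", 0);
  // When: the user submits a new password after 16 minutes
  const accepted = service.reset("abc", 16 * 60 * 1000);
  // Then: the request is rejected
  if (accepted) throw new Error("expired token should be rejected");
}

test_expired_token();
```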
The cycle per scenario:
- RED — Write test mapped to scenario name. Run it. Must fail (proves the test actually tests something).
- GREEN — Write minimal code to make the test pass. Nothing more.
- REFACTOR — Clean up only if safe and obvious. All tests must still pass.
#### How TDD and BDD Work Together
When both intent.md and the TDD pack are active:
- intent.md drives WHAT to test — scenarios define the test cases
- TDD pack drives HOW to test — RED-GREEN-REFACTOR discipline, naming conventions, test structure
When only TDD pack is active (no intent.md — trivial/simple features):
- TDD pack drives both what and how — freestyle test-first development
When neither is active:
- No enforced test-first — implement, then test
This priority chain means intent.md and the TDD pack aren't competing methodologies: each layer answers its own question, and together they cover all three.
