Chronos
Kodezi Chronos is a debugging-first language model that achieves state-of-the-art results on SWE-bench Lite (80.33%) and 67.3% real-world fix accuracy, nearly five times better than GPT-4.1. Built with Adaptive Graph-Guided Retrieval and Persistent Debug Memory. Model available Q1 2026 via Kodezi OS.
Install / Use
`/learn @Kodezi/ChronosREADME`
Kodezi Chronos
<div align="center">

<p align="center"> <img src="results/figures/chronos_hero.png" alt="Introducing Kodezi Chronos-1" width="100%"> </p>

The World's First Debugging-First Language Model for Repository-Scale Code Understanding
Performance Badges
<img src="https://img.shields.io/badge/SWE--bench%20Lite-80.33%25-gold?style=for-the-badge" alt="SWE-bench Lite"> <img src="https://img.shields.io/badge/Debug%20Success-67.3%25-brightgreen?style=for-the-badge" alt="Debug Success Rate"> <img src="https://img.shields.io/badge/Human%20Preference-89%25-blue?style=for-the-badge" alt="Human Preference"> <img src="https://img.shields.io/badge/Improvement-4--5x-yellow?style=for-the-badge" alt="Improvement over GPT-4.1"> <img src="https://img.shields.io/badge/Time%20Reduction-40%25-orange?style=for-the-badge" alt="Time Reduction">

Key Achievements
80.33% SWE-bench Lite • 67.3% Autonomous Debugging • 89% Human Preference • 40% Time Reduction
<p align="center"> <img src="results/figures/architecture_overview.svg" alt="Chronos Architecture" width="800"> </p> </div>

Table of Contents
- State-of-the-Art Results
- MRR Benchmark Results
- Key Innovations
- Architecture
- Benchmarks & Evaluation
- Research Paper
- Getting Started
- Repository Structure
- Research Highlights
- Detailed Performance
- Documentation
- Contributing
- Citation
- License
Model Access Notice
<div align="center">

Chronos is proprietary and available exclusively through Kodezi OS

| Timeline | Access | Details |
|:--------:|:------:|:-------:|
| Q4 2025 | Beta | Limited enterprise access |
| Q1 2026 | GA | Via Kodezi OS |

This repository contains the research paper, benchmarks, and evaluation results only.
Get Early Access • Read Paper • View Leaderboard • Documentation
</div>

🏅 State-of-the-Art Results
📈 SWE-bench Lite Performance
<div align="center">

Industry-Standard Benchmark Results

| Rank | System | Success Rate | Instances | Lead | Year |
|:----:|:-------|:------------:|:---------:|:----:|:----:|
| 1 | Kodezi Chronos | 80.33% | 241/300 | +20.0pp | 2025 |
| 2 | ExpeRepair-v1.0 + Claude 4.5 Sonnet | 60.33% | 181/300 | - | 2025 |
| 3 | Claude 4.5 Sonnet (Bash Only) | ~14% | ~42/300 | -66.3pp | 2025 |
| 4 | Claude 4.1 Opus (Bash Only) | 14.2% | 43/300 | -66.1pp | 2025 |
| 5 | GPT-4.1 | 13.8% | 41/300 | -66.5pp | 2025 |
| 6 | Gemini 2.0 Pro | 13.4% | 40/300 | -67.0pp | 2025 |
20 percentage point absolute lead over second place
</div>

The Debugging Gap

<div align="center">

General-Purpose Models: Code Generation vs Debugging Performance

| Model | SWE-bench Full<br/>(Code Gen) | SWE-bench Lite<br/>(Debugging) | Performance Gap |
|:------|:-----------------------------:|:------------------------------:|:---------------:|
| Claude 4.5 Sonnet | 72.7% | ~14% | -58.7pp |
| Claude 4.1 Opus | 72.5% | 14.2% | -58.3pp |
| Claude 4.1 Opus (Bash) | 67.60% | 14.2% | -53.4pp |
| GPT-4.1 | 54.6% | 13.8% | -40.8pp |
| Kodezi Chronos | N/A | 80.33% | Specialized |
Key Insight: Even models achieving 70%+ on code generation drop to <15% on debugging tasks, revealing a 50+ percentage point gap. Chronos, purpose-built for debugging, achieves 80.33%—demonstrating that debugging requires specialized architectures, not just larger context windows.
</div>

Repository-Specific Results

<div align="center">

SWE-bench Lite: Domain-Specific Performance

| Repository | Domain | Chronos Success | Instances | Significance |
|:-----------|:-------|:---------------:|:---------:|:-------------|
| sympy | Symbolic Mathematics | 96.1% | 51/53 | Near-perfect mathematical reasoning |
| sphinx | Documentation Systems | 93.8% | 60/64 | Exceptional on doc-generation bugs |
| django | Web Frameworks | 90.4% | 104/115 | Complex framework debugging |
| Overall | Mixed Domains | 80.33% | 241/300 | State-of-the-art |
</div>

🔬 MRR Benchmark Results

<div align="center">

📊 Overall Performance (5,000 Multi-Random Retrieval Scenarios; Sample Dataset of 500 Available)

| Metric | Chronos | GPT-4.1 | Claude 4.1 Opus | Gemini 2.0 Pro | Improvement |
|:-------|:-------:|:-------:|:---------------:|:--------------:|:-----------:|
| Debug Success Rate | 67.3% ± 2.1% | 13.8% | 14.2% | 15.0% | 4.5x |
| Root Cause Accuracy | 89%* | 12.3% ± 1.8% | 11.7% ± 2.0% | 15.8% ± 1.5% | 5.6-7.6x |
| Retrieval Precision | 92%* | 68% ± 2.3% | 67% ± 2.4% | 74% ± 1.8% | 1.2-1.4x |
| Retrieval Recall | 85% | 32% ± 2.1% | 34% ± 2.0% | 42% ± 1.9% | 2.0-2.7x |
| Avg Fix Iterations | 7.8 | 1-2 | 1-2 | 1-2 | More thorough |
| Time Reduction | 40% | - | - | - | 40% faster |

\*p < 0.001 compared to best baseline (two-tailed t-test, n=5,000) • Sample dataset (n=500) available now; full benchmark Q1 2026
</div>

🐛 Performance by Bug Category

<div align="center">

| Bug Category | Chronos | GPT-4.1 | Claude 4.1 Opus | Gemini 2.0 Pro | Chronos Advantage |
|:-------------|:-------:|:-------:|:---------------:|:--------------:|:-----------------:|
| Syntax Errors | 94.2% | 82.3% | 79.8% | 85.1% | 1.1x |
| Logic Bugs | 72.8% | 12.1% | 10.7% | 15.3% | 6.0x |
| Concurrency Issues | 58.3% | 3.2% | 2.8% | 4.1% | 18.2x |
| Memory Problems | 61.7% | 5.7% | 4.3% | 6.9% | 10.8x |
| API Misuse | 79.1% | 18.9% | 16.2% | 22.4% | 4.2x |
| Performance Bugs | 65.4% | 7.4% | 6.1% | 9.8% | 8.8x |
</div>

📏 Repository Scale Performance

<div align="center">

| Repository Size | Chronos Success | Best Baseline | Baseline Model | Improvement |
|:---------------:|:---------------:|:-------------:|:--------------:|:-----------:|
| <10K LOC | 71.2% ± 2.8% | 21.3% ± 3.5% | Gemini 2.0 Pro | 3.3x |
| 10K-100K LOC | 68.9% ± 2.5% | 14.7% ± 3.2% | Gemini 2.0 Pro | 4.7x |
| 100K-1M LOC | 64.3% ± 2.9% | 8.9% ± 2.8% | Gemini 2.0 Pro | 7.2x |
| >1M LOC | 59.7% ± 3.1% | 3.8% ± 1.9% | Gemini 2.0 Pro | 15.7x |
</div>

💡 Key Innovations
1. Debugging-First Architecture
- Trained on 42.5M real debugging examples (not code completion)
- Specialized for root cause analysis and multi-file patches
- 89% root cause accuracy vs 15.8% best baseline
- 7-layer architecture optimized for debugging workflows
2. Persistent Debug Memory (PDM)
- Repository-specific learning from 15M+ debugging sessions
- Improves from 35% → 65% success rate over time
- Cross-session pattern recognition and learning
- 87% cache hit rate for similar bugs
- Temporal pattern learning across project lifecycles
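As a rough illustration of the caching behavior described above, consider a tiny repository-scoped store keyed by a normalized bug signature. This is a minimal sketch, not the Chronos implementation: the signature scheme, the `fix_summary` payload, and the hit/miss counters are all invented for the example.

```python
import hashlib

class PersistentDebugMemory:
    """Toy repository-scoped memory: caches fix summaries by bug signature."""

    def __init__(self):
        self._store = {}   # signature -> known-good fix summary
        self.hits = 0
        self.misses = 0

    @staticmethod
    def signature(error_type, stack_frames):
        # Normalize the failure into a stable key: error class + top 3 frames,
        # so superficially different occurrences of the same bug collide
        key = error_type + "|" + "|".join(stack_frames[:3])
        return hashlib.sha256(key.encode()).hexdigest()[:16]

    def lookup(self, error_type, stack_frames):
        sig = self.signature(error_type, stack_frames)
        if sig in self._store:
            self.hits += 1
            return self._store[sig]
        self.misses += 1
        return None

    def record(self, error_type, stack_frames, fix_summary):
        self._store[self.signature(error_type, stack_frames)] = fix_summary
```

A real system would persist this store across sessions and use learned, fuzzier matching; the point here is only the lookup-before-debugging pattern that makes repeated bugs cheap.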
3. Adaptive Graph-Guided Retrieval (AGR)
- O(k log d) complexity with dynamic k-hop expansion
- 92% precision, 85% recall on multi-file context
- Handles unlimited repository scale intelligently
- Multi-hop traversal with confidence-based termination
- 3.8x faster than traditional retrieval methods
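A minimal sketch of the retrieval idea, assuming the code graph is an adjacency map with weighted edges. The `confidence_fn` stands in for the model's learned sufficiency score, and the 0.9 threshold and `top_k` cap are arbitrary choices for illustration, not documented Chronos parameters.

```python
def adaptive_graph_retrieval(graph, seeds, confidence_fn, max_hops=4, top_k=8):
    """Expand context hop by hop from seed nodes (e.g. files in a stack trace).

    graph: dict mapping node -> list of (neighbor, edge_weight)
    confidence_fn: scores the retrieved set in [0, 1]; expansion stops
                   once the score is high enough (confidence-based termination)
    """
    retrieved = set(seeds)
    frontier = list(seeds)
    for _hop in range(max_hops):
        next_frontier = []
        for node in frontier:
            # Follow only the top-k strongest edges out of each frontier node
            neighbors = sorted(graph.get(node, []), key=lambda e: -e[1])[:top_k]
            for neighbor, _weight in neighbors:
                if neighbor not in retrieved:
                    retrieved.add(neighbor)
                    next_frontier.append(neighbor)
        # Terminate early when the context already looks sufficient
        if not next_frontier or confidence_fn(retrieved) >= 0.9:
            break
        frontier = next_frontier
    return retrieved
```

Bounding each hop to the top-k edges is what keeps the work proportional to k·log d rather than to repository size: the graph can be arbitrarily large, but each expansion only touches a fixed fan-out around the current frontier.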
4. Output-Optimized Design
- Optimized for ~3K output tokens (fixes, tests, docs)
- 47.2% output entropy density vs 12.8% for completion models
- Designed for complex patch generation
- Template-aware generation for consistency
- Confidence-guided output strategy
5. Autonomous Debugging Loop
- Average 7.8 iterations to successful fix
- Propose → Test → Analyze → Refine cycles
- 67.3% fully autonomous success rate
- Execution sandbox with real-time feedback
- Iterative refinement until validation succeeds
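The cycle above can be sketched as a simple control loop. All four callbacks are hypothetical stand-ins for model and sandbox components, and the iteration budget is arbitrary; nothing here reflects Chronos internals.

```python
def debugging_loop(propose_fix, apply_fix, run_tests, analyze_failure,
                   max_iterations=10):
    """Propose -> Test -> Analyze -> Refine until tests pass or budget runs out.

    propose_fix(context) -> candidate patch
    apply_fix(patch)     -> applies the patch to a sandboxed checkout
    run_tests()          -> (passed: bool, log: str)
    analyze_failure(log) -> feedback folded into the next proposal
    """
    context = {"feedback": None}
    for iteration in range(1, max_iterations + 1):
        patch = propose_fix(context)    # Propose a candidate fix
        apply_fix(patch)                # Apply it in the execution sandbox
        passed, log = run_tests()       # Test against the real suite
        if passed:
            return {"status": "fixed", "iterations": iteration, "patch": patch}
        # Analyze the failing run; the feedback refines the next proposal
        context["feedback"] = analyze_failure(log)
    return {"status": "unresolved", "iterations": max_iterations}
```

The key design point is that validation, not generation, decides termination: a fix only counts once the sandboxed tests pass, which is why average iteration counts (7.8 here) are higher than single-shot baselines.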
🏗️ Architecture
Seven-Layer System Design
```
┌───────────────────────────────────────────┐
│ 7. Explainability Layer                   │  Human-readable root cause analysis
├───────────────────────────────────────────┤
│ 6. Execution Sandbox                      │  Isolated test validation
├───────────────────────────────────────────┤
│ 5. Persistent Debug Memory (PDM)          │  Repository-specific learning
├───────────────────────────────────────────┤
│ 4. Orchestration Controller               │  Autonomous debugging loop
├───────────────────────────────────────────┤
│ 3. Debug-Tuned LLM Core                   │  42.5M debugging examples
├───────────────────────────────────────────┤
│ 2. Adaptive Retrieval Engine (AGR)        │  Dynamic k-hop graph traversal
├───────────────────────────────────────────┤
│ 1. Multi-Source Input Layer               │  Code, logs, traces, tests, docs
└───────────────────────────────────────────┘
```
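One way to picture the stack is as a pipeline in which each layer transforms a shared state as a task flows from layer 1 up to layer 7. The layer functions below are illustrative stand-ins named after the diagram, not real Chronos APIs.

```python
def run_stack(bug_report, layers):
    """Pass a debugging task upward through the layers in order (1 -> 7)."""
    state = {"report": bug_report, "trace": []}
    for name, layer in layers:
        state = layer(state)         # each layer enriches the shared state
        state["trace"].append(name)  # record which layers have run
    return state

# Stand-in layers mirroring the diagram, bottom (1) to top (7)
LAYERS = [
    ("input",          lambda s: {**s, "sources": ["code", "logs", "tests"]}),
    ("retrieval",      lambda s: {**s, "context": "k-hop subgraph"}),
    ("llm_core",       lambda s: {**s, "patch": "candidate fix"}),
    ("orchestrator",   lambda s: {**s, "iterations": 1}),
    ("memory",         lambda s: {**s, "cached": False}),
    ("sandbox",        lambda s: {**s, "tests_passed": True}),
    ("explainability", lambda s: {**s, "root_cause": "summary"}),
]
```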
Layer Descriptions
- Multi-Source Input Layer: Processes code, logs, traces, tests, docs simultaneously
- **Adaptive Retrieval E
