Chronos
Kodezi Chronos is a debugging-first language model that achieves state-of-the-art results on SWE-bench Lite (80.33%) and 67.3% real-world fix accuracy, nearly five times better than GPT-4.1. Built with Adaptive Graph-Guided Retrieval and Persistent Debug Memory. Model available Q1 2026 via Kodezi OS.
Install / Use
`/learn @Kodezi/ChronosREADME`
Kodezi Chronos
<div align="center">

<p align="center"> <img src="results/figures/chronos_hero.png" alt="Introducing Kodezi Chronos-1" width="100%"> </p>

The World's First Debugging-First Language Model for Repository-Scale Code Understanding
Performance Badges
<img src="https://img.shields.io/badge/SWE--bench%20Lite-80.33%25-gold?style=for-the-badge" alt="SWE-bench Lite"> <img src="https://img.shields.io/badge/Debug%20Success-67.3%25-brightgreen?style=for-the-badge" alt="Debug Success Rate"> <img src="https://img.shields.io/badge/Human%20Preference-89%25-blue?style=for-the-badge" alt="Human Preference"> <img src="https://img.shields.io/badge/Improvement-4--5x-yellow?style=for-the-badge" alt="Improvement over GPT-4.1"> <img src="https://img.shields.io/badge/Time%20Reduction-40%25-orange?style=for-the-badge" alt="Time Reduction">

Key Achievements
80.33% SWE-bench Lite • 67.3% Autonomous Debugging • 89% Human Preference • 40% Time Reduction
<p align="center"> <img src="results/figures/architecture_overview.svg" alt="Chronos Architecture" width="800"> </p> </div>

Table of Contents
- State-of-the-Art Results
- MRR Benchmark Results
- Key Innovations
- Architecture
- Benchmarks & Evaluation
- Research Paper
- Getting Started
- Repository Structure
- Research Highlights
- Detailed Performance
- Documentation
- Contributing
- Citation
- License
Model Access Notice
<div align="center">

Chronos is proprietary and available exclusively through Kodezi OS

| Timeline | Access | Details |
|:--------:|:------:|:-------:|
| Q4 2025 | Beta | Limited enterprise access |
| Q1 2026 | GA | Via Kodezi OS |

This repository contains the research paper, benchmarks, and evaluation results only.
Get Early Access • Read Paper • View Leaderboard • Documentation
</div>

🏅 State-of-the-Art Results
📈 SWE-bench Lite Performance
<div align="center">

Industry-Standard Benchmark Results

| Rank | System | Success Rate | Instances | Lead | Year |
|:----:|:-------|:------------:|:---------:|:----:|:----:|
| 1 | Kodezi Chronos | 80.33% | 241/300 | +20.0pp | 2025 |
| 2 | ExpeRepair-v1.0 + Claude 4.5 Sonnet | 60.33% | 181/300 | - | 2025 |
| 3 | Claude 4.5 Sonnet (Bash Only) | ~14% | ~42/300 | -66.3pp | 2025 |
| 4 | Claude 4.1 Opus (Bash Only) | 14.2% | 43/300 | -66.1pp | 2025 |
| 5 | GPT-4.1 | 13.8% | 41/300 | -66.5pp | 2025 |
| 6 | Gemini 2.0 Pro | 13.4% | 40/300 | -67.0pp | 2025 |
20 percentage point absolute lead over second place
</div>

The Debugging Gap

<div align="center">

General-Purpose Models: Code Generation vs Debugging Performance

| Model | SWE-bench Full<br/>(Code Gen) | SWE-bench Lite<br/>(Debugging) | Performance Gap |
|:------|:-----------------------------:|:------------------------------:|:---------------:|
| Claude 4.5 Sonnet | 72.7% | ~14% | -58.7pp |
| Claude 4.1 Opus | 72.5% | 14.2% | -58.3pp |
| Claude 4.1 Opus (Bash) | 67.60% | 14.2% | -53.4pp |
| GPT-4.1 | 54.6% | 13.8% | -40.8pp |
| Kodezi Chronos | N/A | 80.33% | Specialized |
Key Insight: Even models achieving 70%+ on code generation drop to <15% on debugging tasks, revealing a 50+ percentage point gap. Chronos, purpose-built for debugging, achieves 80.33%—demonstrating that debugging requires specialized architectures, not just larger context windows.
</div>

Repository-Specific Results

<div align="center">

SWE-bench Lite: Domain-Specific Performance

| Repository | Domain | Chronos Success | Instances | Significance |
|:-----------|:-------|:---------------:|:---------:|:-------------|
| sympy | Symbolic Mathematics | 96.1% | 51/53 | Near-perfect mathematical reasoning |
| sphinx | Documentation Systems | 93.8% | 60/64 | Exceptional on doc-generation bugs |
| django | Web Frameworks | 90.4% | 104/115 | Complex framework debugging |
| Overall | Mixed Domains | 80.33% | 241/300 | State-of-the-art |
</div>

🔬 MRR Benchmark Results

<div align="center">

📊 Overall Performance (5,000 Multi-Random Retrieval Scenarios; Sample Dataset of 500 Available)

| Metric | Chronos | GPT-4.1 | Claude 4.1 Opus | Gemini 2.0 Pro | Improvement |
|:-------|:-------:|:-------:|:---------------:|:--------------:|:-----------:|
| Debug Success Rate | 67.3% ± 2.1% | 13.8% | 14.2% | 15.0% | 4.5x |
| Root Cause Accuracy | 89%* | 12.3% ± 1.8% | 11.7% ± 2.0% | 15.8% ± 1.5% | 5.6-7.6x |
| Retrieval Precision | 92%* | 68% ± 2.3% | 67% ± 2.4% | 74% ± 1.8% | 1.2-1.4x |
| Retrieval Recall | 85% | 32% ± 2.1% | 34% ± 2.0% | 42% ± 1.9% | 2.0-2.7x |
| Avg Fix Iterations | 7.8 | 1-2 | 1-2 | 1-2 | More thorough |
| Time Reduction | 40% | - | - | - | 40% faster |

\*p < 0.001 compared to best baseline (two-tailed t-test, n=5,000) • Sample dataset (n=500) available now; full benchmark Q1 2026
</div>

🐛 Performance by Bug Category

<div align="center">

| Bug Category | Chronos | GPT-4.1 | Claude 4.1 Opus | Gemini 2.0 Pro | Chronos Advantage |
|:-------------|:-------:|:-------:|:---------------:|:--------------:|:-----------------:|
| Syntax Errors | 94.2% | 82.3% | 79.8% | 85.1% | 1.1x |
| Logic Bugs | 72.8% | 12.1% | 10.7% | 15.3% | 6.0x |
| Concurrency Issues | 58.3% | 3.2% | 2.8% | 4.1% | 18.2x |
| Memory Problems | 61.7% | 5.7% | 4.3% | 6.9% | 10.8x |
| API Misuse | 79.1% | 18.9% | 16.2% | 22.4% | 4.2x |
| Performance Bugs | 65.4% | 7.4% | 6.1% | 9.8% | 8.8x |
</div>

📏 Repository Scale Performance

<div align="center">

| Repository Size | Chronos Success | Best Baseline | Baseline Model | Improvement |
|:---------------:|:---------------:|:-------------:|:--------------:|:-----------:|
| <10K LOC | 71.2% ± 2.8% | 21.3% ± 3.5% | Gemini 2.0 Pro | 3.3x |
| 10K-100K LOC | 68.9% ± 2.5% | 14.7% ± 3.2% | Gemini 2.0 Pro | 4.7x |
| 100K-1M LOC | 64.3% ± 2.9% | 8.9% ± 2.8% | Gemini 2.0 Pro | 7.2x |
| >1M LOC | 59.7% ± 3.1% | 3.8% ± 1.9% | Gemini 2.0 Pro | 15.7x |
</div>

💡 Key Innovations
1. Debugging-First Architecture
- Trained on 42.5M real debugging examples (not code completion)
- Specialized for root cause analysis and multi-file patches
- 89% root cause accuracy vs 15.8% best baseline
- 7-layer architecture optimized for debugging workflows
2. Persistent Debug Memory (PDM)
- Repository-specific learning from 15M+ debugging sessions
- Improves from 35% → 65% success rate over time
- Cross-session pattern recognition and learning
- 87% cache hit rate for similar bugs
- Temporal pattern learning across project lifecycles
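As a rough illustration of the caching behavior described above, consider a tiny repository-scoped store keyed by a normalized bug signature. This is a minimal sketch, not the Chronos implementation: the signature scheme, the `fix_summary` payload, and the hit/miss counters are all invented for the example.

```python
import hashlib

class PersistentDebugMemory:
    """Toy repository-scoped memory: caches fix summaries by bug signature."""

    def __init__(self):
        self._store = {}   # signature -> known-good fix summary
        self.hits = 0
        self.misses = 0

    @staticmethod
    def signature(error_type, stack_frames):
        # Normalize the failure into a stable key: error class + top 3 frames,
        # so superficially different occurrences of the same bug collide
        key = error_type + "|" + "|".join(stack_frames[:3])
        return hashlib.sha256(key.encode()).hexdigest()[:16]

    def lookup(self, error_type, stack_frames):
        sig = self.signature(error_type, stack_frames)
        if sig in self._store:
            self.hits += 1
            return self._store[sig]
        self.misses += 1
        return None

    def record(self, error_type, stack_frames, fix_summary):
        self._store[self.signature(error_type, stack_frames)] = fix_summary
```

A real system would persist this store across sessions and use learned, fuzzier matching; the point here is only the lookup-before-debugging pattern that makes repeated bugs cheap.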
3. Adaptive Graph-Guided Retrieval (AGR)
- O(k log d) complexity with dynamic k-hop expansion
- 92% precision, 85% recall on multi-file context
- Handles unlimited repository scale intelligently
- Multi-hop traversal with confidence-based termination
- 3.8x faster than traditional retrieval methods
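A minimal sketch of the retrieval idea, assuming the code graph is an adjacency map with weighted edges. The `confidence_fn` stands in for the model's learned sufficiency score, and the 0.9 threshold and `top_k` cap are arbitrary choices for illustration, not documented Chronos parameters.

```python
def adaptive_graph_retrieval(graph, seeds, confidence_fn, max_hops=4, top_k=8):
    """Expand context hop by hop from seed nodes (e.g. files in a stack trace).

    graph: dict mapping node -> list of (neighbor, edge_weight)
    confidence_fn: scores the retrieved set in [0, 1]; expansion stops
                   once the score is high enough (confidence-based termination)
    """
    retrieved = set(seeds)
    frontier = list(seeds)
    for _hop in range(max_hops):
        next_frontier = []
        for node in frontier:
            # Follow only the top-k strongest edges out of each frontier node
            neighbors = sorted(graph.get(node, []), key=lambda e: -e[1])[:top_k]
            for neighbor, _weight in neighbors:
                if neighbor not in retrieved:
                    retrieved.add(neighbor)
                    next_frontier.append(neighbor)
        # Terminate early when the context already looks sufficient
        if not next_frontier or confidence_fn(retrieved) >= 0.9:
            break
        frontier = next_frontier
    return retrieved
```

Bounding each hop to the top-k edges is what keeps the work proportional to k·log d rather than to repository size: the graph can be arbitrarily large, but each expansion only touches a fixed fan-out around the current frontier.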
4. Output-Optimized Design
- Optimized for ~3K output tokens (fixes, tests, docs)
- 47.2% output entropy density vs 12.8% for completion models
- Designed for complex patch generation
- Template-aware generation for consistency
- Confidence-guided output strategy
5. Autonomous Debugging Loop
- Average 7.8 iterations to successful fix
- Propose → Test → Analyze → Refine cycles
- 67.3% fully autonomous success rate
- Execution sandbox with real-time feedback
- Iterative refinement until validation succeeds
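The cycle above can be sketched as a simple control loop. All four callbacks are hypothetical stand-ins for model and sandbox components, and the iteration budget is arbitrary; nothing here reflects Chronos internals.

```python
def debugging_loop(propose_fix, apply_fix, run_tests, analyze_failure,
                   max_iterations=10):
    """Propose -> Test -> Analyze -> Refine until tests pass or budget runs out.

    propose_fix(context) -> candidate patch
    apply_fix(patch)     -> applies the patch to a sandboxed checkout
    run_tests()          -> (passed: bool, log: str)
    analyze_failure(log) -> feedback folded into the next proposal
    """
    context = {"feedback": None}
    for iteration in range(1, max_iterations + 1):
        patch = propose_fix(context)    # Propose a candidate fix
        apply_fix(patch)                # Apply it in the execution sandbox
        passed, log = run_tests()       # Test against the real suite
        if passed:
            return {"status": "fixed", "iterations": iteration, "patch": patch}
        # Analyze the failing run; the feedback refines the next proposal
        context["feedback"] = analyze_failure(log)
    return {"status": "unresolved", "iterations": max_iterations}
```

The key design point is that validation, not generation, decides termination: a fix only counts once the sandboxed tests pass, which is why average iteration counts (7.8 here) are higher than single-shot baselines.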
🏗️ Architecture
Seven-Layer System Design
```
┌───────────────────────────────────────────┐
│ 7. Explainability Layer                   │  Human-readable root cause analysis
├───────────────────────────────────────────┤
│ 6. Execution Sandbox                      │  Isolated test validation
├───────────────────────────────────────────┤
│ 5. Persistent Debug Memory (PDM)          │  Repository-specific learning
├───────────────────────────────────────────┤
│ 4. Orchestration Controller               │  Autonomous debugging loop
├───────────────────────────────────────────┤
│ 3. Debug-Tuned LLM Core                   │  42.5M debugging examples
├───────────────────────────────────────────┤
│ 2. Adaptive Retrieval Engine (AGR)        │  Dynamic k-hop graph traversal
├───────────────────────────────────────────┤
│ 1. Multi-Source Input Layer               │  Code, logs, traces, tests, docs
└───────────────────────────────────────────┘
```
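One way to picture the stack is as a pipeline in which each layer transforms a shared state as a task flows from layer 1 up to layer 7. The layer functions below are illustrative stand-ins named after the diagram, not real Chronos APIs.

```python
def run_stack(bug_report, layers):
    """Pass a debugging task upward through the layers in order (1 -> 7)."""
    state = {"report": bug_report, "trace": []}
    for name, layer in layers:
        state = layer(state)         # each layer enriches the shared state
        state["trace"].append(name)  # record which layers have run
    return state

# Stand-in layers mirroring the diagram, bottom (1) to top (7)
LAYERS = [
    ("input",          lambda s: {**s, "sources": ["code", "logs", "tests"]}),
    ("retrieval",      lambda s: {**s, "context": "k-hop subgraph"}),
    ("llm_core",       lambda s: {**s, "patch": "candidate fix"}),
    ("orchestrator",   lambda s: {**s, "iterations": 1}),
    ("memory",         lambda s: {**s, "cached": False}),
    ("sandbox",        lambda s: {**s, "tests_passed": True}),
    ("explainability", lambda s: {**s, "root_cause": "summary"}),
]
```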
Layer Descriptions
- Multi-Source Input Layer: Processes code, logs, traces, tests, docs simultaneously
- **Adaptive Retrieval E
