
Chronos

Kodezi Chronos is a debugging-first language model that achieves state-of-the-art results on SWE-bench Lite (80.33%) and a 67.3% autonomous debugging success rate on real-world bugs, nearly five times better than GPT-4.1. Built with Adaptive Graph-Guided Retrieval (AGR) and Persistent Debug Memory (PDM). Model available Q1 2026 via Kodezi OS.

Install / Use

/learn @Kodezi/Chronos

README

<div align="center">

Kodezi Chronos

<p align="center"> <img src="results/figures/chronos_hero.png" alt="Introducing Kodezi Chronos-1" width="100%"> </p>

The World's First Debugging-First Language Model for Repository-Scale Code Understanding

arXiv • Model Access • Research • Benchmark • Leaderboard

Performance Badges

<img src="https://img.shields.io/badge/SWE--bench%20Lite-80.33%25-gold?style=for-the-badge" alt="SWE-bench Lite"> <img src="https://img.shields.io/badge/Debug%20Success-67.3%25-brightgreen?style=for-the-badge" alt="Debug Success Rate"> <img src="https://img.shields.io/badge/Human%20Preference-89%25-blue?style=for-the-badge" alt="Human Preference"> <img src="https://img.shields.io/badge/Improvement-4--5x-yellow?style=for-the-badge" alt="Improvement over GPT-4.1"> <img src="https://img.shields.io/badge/Time%20Reduction-40%25-orange?style=for-the-badge" alt="Time Reduction">

Key Achievements

80.33% SWE-bench Lite • 67.3% Autonomous Debugging • 89% Human Preference • 40% Time Reduction

<p align="center"> <img src="results/figures/architecture_overview.svg" alt="Chronos Architecture" width="800"> </p> </div>

Model Access Notice

<div align="center">

Chronos is proprietary and available exclusively through Kodezi OS

| Timeline | Access | Details |
|:--------:|:------:|:-------:|
| Q4 2025 | Beta | Limited enterprise access |
| Q1 2026 | GA | Via Kodezi OS |

This repository contains the research paper, benchmarks, and evaluation results only.

Get Early Access • Read Paper • View Leaderboard • Documentation

</div>

🏅 State-of-the-Art Results

📈 SWE-bench Lite Performance

<div align="center">

Industry-Standard Benchmark Results

| Rank | System | Success Rate | Instances | Lead | Year |
|:----:|:-------|:------------:|:---------:|:----:|:----:|
| 1 | Kodezi Chronos | 80.33% | 241/300 | +20.0pp | 2025 |
| 2 | ExpeRepair-v1.0 + Claude 4.5 Sonnet | 60.33% | 181/300 | - | 2025 |
| 3 | Claude 4.5 Sonnet (Bash Only) | ~14% | ~42/300 | -66.3pp | 2025 |
| 4 | Claude 4.1 Opus (Bash Only) | 14.2% | 43/300 | -66.1pp | 2025 |
| 5 | GPT-4.1 | 13.8% | 41/300 | -66.5pp | 2025 |
| 6 | Gemini 2.0 Pro | 13.4% | 40/300 | -67.0pp | 2025 |

20 percentage point absolute lead over second place

</div>

The Debugging Gap

<div align="center">

General-Purpose Models: Code Generation vs Debugging Performance

| Model | SWE-bench Full<br/>(Code Gen) | SWE-bench Lite<br/>(Debugging) | Performance Gap |
|:------|:-----------------------------:|:------------------------------:|:---------------:|
| Claude 4.5 Sonnet | 72.7% | ~14% | -58.7pp |
| Claude 4.1 Opus | 72.5% | 14.2% | -58.3pp |
| Claude 4.1 Opus (Bash) | 67.60% | 14.2% | -53.4pp |
| GPT-4.1 | 54.6% | 13.8% | -40.8pp |
| Kodezi Chronos | N/A | 80.33% | Specialized |

Key Insight: Even models achieving 70%+ on code generation drop to <15% on debugging tasks, revealing a 50+ percentage point gap. Chronos, purpose-built for debugging, achieves 80.33%—demonstrating that debugging requires specialized architectures, not just larger context windows.

</div>

Repository-Specific Results

<div align="center">

SWE-bench Lite: Domain-Specific Performance

| Repository | Domain | Chronos Success | Instances | Significance |
|:-----------|:-------|:---------------:|:---------:|:-------------|
| sympy | Symbolic Mathematics | 96.1% | 51/53 | Near-perfect mathematical reasoning |
| sphinx | Documentation Systems | 93.8% | 60/64 | Exceptional doc generation bugs |
| django | Web Frameworks | 90.4% | 104/115 | Complex framework debugging |
| Overall | Mixed Domains | 80.33% | 241/300 | State-of-the-art |

</div>

🔬 MRR Benchmark Results

<div align="center">

📊 Overall Performance (5,000 Multi-Random Retrieval Scenarios - Sample Dataset of 500 Available)

| Metric | Chronos | GPT-4.1 | Claude 4.1 Opus | Gemini 2.0 Pro | Improvement |
|:-------|:-----------:|:-------:|:---------------:|:--------------:|:-----------:|
| Debug Success Rate | 67.3% ± 2.1% | 13.8% | 14.2% | 15.0% | 4.5x |
| Root Cause Accuracy | 89%\* | 12.3% ± 1.8% | 11.7% ± 2.0% | 15.8% ± 1.5% | 5.6-7.6x |
| Retrieval Precision | 92%\* | 68% ± 2.3% | 67% ± 2.4% | 74% ± 1.8% | 1.2-1.4x |
| Retrieval Recall | 85% | 32% ± 2.1% | 34% ± 2.0% | 42% ± 1.9% | 2.0-2.7x |
| Avg Fix Iterations | 7.8 | 1-2 | 1-2 | 1-2 | More thorough |
| Time Reduction | 40% | - | - | - | 40% faster |

\*p < 0.001 compared to best baseline (two-tailed t-test, n=5,000) • Sample dataset (n=500) available now; full benchmark Q1 2026

</div>

🐛 Performance by Bug Category

<div align="center">

| Bug Category | Chronos | GPT-4.1 | Claude 4.1 Opus | Gemini 2.0 Pro | Chronos Advantage |
|:-------------|:-------:|:-------:|:---------------:|:--------------:|:-----------------:|
| Syntax Errors | 94.2% | 82.3% | 79.8% | 85.1% | 1.1x |
| Logic Bugs | 72.8% | 12.1% | 10.7% | 15.3% | 6.0x |
| Concurrency Issues | 58.3% | 3.2% | 2.8% | 4.1% | 18.2x |
| Memory Problems | 61.7% | 5.7% | 4.3% | 6.9% | 10.8x |
| API Misuse | 79.1% | 18.9% | 16.2% | 22.4% | 4.2x |
| Performance Bugs | 65.4% | 7.4% | 6.1% | 9.8% | 8.8x |

</div>

📏 Repository Scale Performance

<div align="center">

| Repository Size | Chronos Success | Best Baseline | Baseline Model | Improvement |
|:---------------:|:---------------:|:-------------:|:--------------:|:-----------:|
| <10K LOC | 71.2% ± 2.8% | 21.3% ± 3.5% | Gemini 2.0 Pro | 3.3x |
| 10K-100K LOC | 68.9% ± 2.5% | 14.7% ± 3.2% | Gemini 2.0 Pro | 4.7x |
| 100K-1M LOC | 64.3% ± 2.9% | 8.9% ± 2.8% | Gemini 2.0 Pro | 7.2x |
| >1M LOC | 59.7% ± 3.1% | 3.8% ± 1.9% | Gemini 2.0 Pro | 15.7x |

</div>

💡 Key Innovations

1. Debugging-First Architecture

  • Trained on 42.5M real debugging examples (not code completion)
  • Specialized for root cause analysis and multi-file patches
  • 89% root cause accuracy vs 15.8% best baseline
  • 7-layer architecture optimized for debugging workflows

2. Persistent Debug Memory (PDM)

  • Repository-specific learning from 15M+ debugging sessions
  • Improves from 35% → 65% success rate over time
  • Cross-session pattern recognition and learning
  • 87% cache hit rate for similar bugs
  • Temporal pattern learning across project lifecycles
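
The caching behavior described above can be illustrated with a minimal sketch. All names here (`DebugSession`, `PersistentDebugMemory`, the digit-stripping normalization) are hypothetical stand-ins, not Chronos's actual API: the idea is simply that error signatures are normalized before lookup, so a structurally similar bug at a different line still produces a cache hit.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class DebugSession:
    error_signature: str  # e.g. exception type + location
    root_cause: str
    fix_summary: str

@dataclass
class PersistentDebugMemory:
    """Hypothetical repository-scoped store of past debugging sessions."""
    sessions: dict = field(default_factory=dict)

    def _key(self, error_signature: str) -> str:
        # Strip volatile details (line numbers, addresses) before hashing,
        # so structurally similar bugs map to the same cache entry.
        normalized = "".join(c for c in error_signature if not c.isdigit())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def record(self, session: DebugSession) -> None:
        self.sessions[self._key(session.error_signature)] = session

    def recall(self, error_signature: str):
        # Cache hit: a structurally similar bug was fixed before.
        return self.sessions.get(self._key(error_signature))
```

For example, after recording a fix for `NullPointerException at Foo.java:42`, a later `recall("NullPointerException at Foo.java:17")` returns the same session, mirroring the cross-session pattern reuse behind the reported 87% cache hit rate.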

3. Adaptive Graph-Guided Retrieval (AGR)

  • O(k log d) complexity with dynamic k-hop expansion
  • 92% precision, 85% recall on multi-file context
  • Handles unlimited repository scale intelligently
  • Multi-hop traversal with confidence-based termination
  • 3.8x faster than traditional retrieval methods
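
A minimal sketch of the k-hop expansion idea, under stated assumptions: the repository is a dependency graph (adjacency dict), a `relevance` scorer is supplied by the caller, and expansion stops once accumulated confidence crosses a threshold or the hop budget runs out. Function and parameter names are illustrative, not the actual AGR implementation.

```python
def adaptive_graph_retrieve(graph, seeds, relevance,
                            confidence_threshold=0.9, max_hops=5):
    """Expand outward from seed nodes hop by hop, terminating early
    once enough relevant context has been gathered."""
    retrieved = set(seeds)
    frontier = set(seeds)
    confidence = sum(relevance(n) for n in seeds)
    for _hop in range(max_hops):
        if confidence >= confidence_threshold:
            break  # confidence-based termination: enough context gathered
        next_frontier = set()
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if neighbor not in retrieved:
                    retrieved.add(neighbor)
                    next_frontier.add(neighbor)
                    confidence += relevance(neighbor)
        if not next_frontier:
            break  # graph exhausted
        frontier = next_frontier
    return retrieved
```

With a toy graph `{"bug.py": ["util.py"], "util.py": ["config.py"]}` and a flat relevance of 0.3 per file, retrieval expands two hops from `bug.py` and then terminates, having gathered all three files. The dynamic hop count is what distinguishes this from fixed-radius retrieval.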

4. Output-Optimized Design

  • Optimized for ~3K output tokens (fixes, tests, docs)
  • 47.2% output entropy density vs 12.8% for completion models
  • Designed for complex patch generation
  • Template-aware generation for consistency
  • Confidence-guided output strategy

5. Autonomous Debugging Loop

  • Average 7.8 iterations to successful fix
  • Propose → Test → Analyze → Refine cycles
  • 67.3% fully autonomous success rate
  • Execution sandbox with real-time feedback
  • Iterative refinement until validation succeeds
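
The propose → test → analyze → refine cycle can be sketched as a short loop. Here `propose_fix` and `run_tests` are hypothetical callables standing in for the model and the execution sandbox; the key point is that failing test output is fed back into the next proposal rather than discarded.

```python
def autonomous_debug_loop(bug_report, propose_fix, run_tests, max_iterations=10):
    """Iterate proposed patches against a test sandbox until one validates.

    propose_fix(bug_report, feedback) -> patch   (feedback is None on first try)
    run_tests(patch) -> (passed: bool, feedback: str)
    """
    feedback = None
    for iteration in range(1, max_iterations + 1):
        patch = propose_fix(bug_report, feedback)   # propose
        passed, feedback = run_tests(patch)         # test in sandbox
        if passed:
            return patch, iteration                 # validated fix
        # analyze/refine: failing output becomes context for the next attempt
    return None, max_iterations                     # gave up within budget
```

A default budget of ~10 iterations is an assumption chosen here to be consistent with the reported average of 7.8 iterations per successful fix.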

🏗️ Architecture

Seven-Layer System Design

┌─────────────────────────────────────────────┐
│   7. Explainability Layer                   │  Human-readable root cause analysis
├─────────────────────────────────────────────┤
│   6. Execution Sandbox                      │  Isolated test validation
├─────────────────────────────────────────────┤
│   5. Persistent Debug Memory (PDM)          │  Repository-specific learning
├─────────────────────────────────────────────┤
│   4. Orchestration Controller               │  Autonomous debugging loop
├─────────────────────────────────────────────┤
│   3. Debug-Tuned LLM Core                   │  42.5M debugging examples
├─────────────────────────────────────────────┤
│   2. Adaptive Retrieval Engine (AGR)        │  Dynamic k-hop graph traversal
├─────────────────────────────────────────────┤
│   1. Multi-Source Input Layer               │  Code, logs, traces, tests, docs
└─────────────────────────────────────────────┘

Layer Descriptions

  1. Multi-Source Input Layer: Processes code, logs, traces, tests, docs simultaneously
  2. Adaptive Retrieval Engine (AGR): Dynamic k-hop graph traversal over the repository
  3. Debug-Tuned LLM Core: Language model trained on 42.5M debugging examples
  4. Orchestration Controller: Drives the autonomous debugging loop
  5. Persistent Debug Memory (PDM): Repository-specific learning across sessions
  6. Execution Sandbox: Isolated validation of proposed fixes against tests
  7. Explainability Layer: Human-readable root cause analysis