Tachyon
English Doc (Current) | 中文文档
AI empowered CUDA kernel profiler
Tachyon /ˈtakēˌän/ (tachyon — a theoretical particle that travels faster than light) is a CUDA kernel performance analysis and optimization toolkit. It combines LLM-powered agents with traditional rule-based analysis to bridge the full path from NCU metrics to CUDA source code to low-level instructions (PTX/SASS), so performance analysis doesn't stop at aggregate counters — it traces back through the instruction level all the way to your source lines. Beyond analysis, the evolve mode automates the optimization loop: an agent reads profiling data, edits source code, rebuilds, re-profiles, and iterates until convergence — no manual tuning required.
Fully implemented in Python. Supports end-to-end profiling (run an executable directly under tachyon, just like ncu), interactive AI-driven analysis, and fully automated iterative optimization. Multiple LLM agent vendors are supported.
Features
- Three-way mapping: NCU metrics <-> source lines <-> SASS instructions, pinpointing the exact code behind each bottleneck.
- Smart two-stage profiling: quick scan finds the hottest top-K kernels, then deep dive collects detailed metrics only where it matters.
- Rule engine + AI agent: 7 built-in analyzers (roofline, memory, occupancy, warp stall, ...) produce structured findings; an LLM agent with 9 specialized tools supports interactive follow-up.
- Profile diff: compare two .ncu-rep files side by side, highlighting regressions.
- Evolve mode: automated iterative optimization. An LLM agent reads NCU profiling data, edits CUDA source, compiles, re-profiles, and accepts or rolls back each iteration. No manual tuning loops.
- MCP server: expose all analysis tools over MCP (stdio transport) for seamless integration with Claude Code, Ducc, Cursor, and custom agents.
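The evolve loop described in the features above can be sketched as a plain accept/rollback iteration. This is an illustrative sketch only: `profile`, `propose_edit`, and `build` are hypothetical callables standing in for Tachyon's actual profiling, editing, and build steps.

```python
# Illustrative sketch of an evolve-style accept/rollback loop.
# profile(), propose_edit(), and build() are hypothetical stand-ins,
# not Tachyon's real API.

def evolve(source, profile, propose_edit, build, max_iterations=10):
    """Iteratively edit `source`, keeping only edits that improve runtime."""
    best_time = profile(source)
    for _ in range(max_iterations):
        candidate = propose_edit(source)   # agent proposes a code change
        if not build(candidate):           # rebuild; skip candidate on failure
            continue
        t = profile(candidate)             # re-profile the candidate
        if t < best_time:                  # accept only measured improvements
            source, best_time = candidate, t
    return source, best_time
```

The key property is that a candidate is only kept when a fresh profile confirms it is faster, so a bad LLM edit can never make the accepted kernel slower.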
Quick Start
Installation
# Rule-based analysis only (no LLM needed)
pip install -e .
# With AI agent (OpenAI / Anthropic / LiteLLM)
pip install -e ".[ai]"
# With MCP server
pip install -e ".[mcp]"
# Full development environment
pip install -e ".[dev,ai,mcp]"
Basic Usage
tachyon analyze report.ncu-rep
tachyon chat report.ncu-rep --model claude-sonnet-4-20250514
tachyon profile ./my_app --strategy radical
tachyon diff before.ncu-rep after.ncu-rep
tachyon evolve ./my_app --build "make -j8" --max-iterations 10
tachyon serve --mcp --report report.ncu-rep
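The "smart two-stage profiling" behind tachyon profile (quick scan first, then a deep dive on the top-K kernels) can be illustrated with a small sketch. The function names here are hypothetical, not Tachyon's internal API.

```python
# Hypothetical sketch of two-stage profiling: a cheap timing pass ranks
# all kernels, then expensive metric collection runs only on the top K.

def two_stage_profile(kernels, quick_time, deep_metrics, k=3):
    """Rank kernels by a cheap timing pass, then deep-profile the top K."""
    ranked = sorted(kernels, key=quick_time, reverse=True)  # hottest first
    return {name: deep_metrics(name) for name in ranked[:k]}
```

This mirrors why the two-stage approach is cheap: the detailed (and slow) metric collection cost scales with K, not with the total number of kernels in the application.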
CLI Commands
| Command | Description |
|---------|-------------|
| tachyon analyze | Parse .ncu-rep, run rule-based analyzers, output findings. |
| tachyon chat | Interactive AI analysis with 9 specialized tools. |
| tachyon profile | End-to-end: profile a CUDA executable, then analyze. |
| tachyon diff | Compare two reports, flag performance changes. |
| tachyon evolve | Automated iterative kernel optimization via LLM agent. |
| tachyon serve | Start MCP server for external agents. |
Run tachyon <command> --help for full options. See the CLI Reference.
Multi-Vendor LLM Switching
# ~/.tachyon/config.toml
[llm]
provider = "anthropic" # openai / anthropic / litellm
model = "claude-sonnet-4-20250514"
api_key_env = "ANTHROPIC_API_KEY"
# base_url = "http://localhost:8000/v1" # local models
# Switch via environment variables
export TACHYON_LLM_PROVIDER=litellm
export TACHYON_MODEL=qianfan/ERNIE-Bot-4
# Single-session CLI override
tachyon chat report.ncu-rep --provider openai --model gpt-4o
Priority: CLI > env vars > config.toml > defaults. Graceful degradation to Rule-Only mode when no LLM is available.
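The priority chain can be expressed as a first-match lookup over ordered layers. The resolver below is a hypothetical sketch of that layering, not Tachyon's actual implementation; each layer is passed in as a plain dict.

```python
# Sketch of CLI > env vars > config.toml > defaults resolution.
# Hypothetical illustration only, not Tachyon's actual resolver.

def resolve(key, cli_args, env, toml_cfg, defaults):
    """Return the value from the highest-priority layer that defines `key`."""
    for layer in (cli_args, env, toml_cfg, defaults):
        if layer.get(key) is not None:
            return layer[key]
    return None
```

Because layers are searched in priority order, a --model flag always wins over an exported variable, which in turn wins over config.toml.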
Ducc Integration
// .claude/settings.json
{
"mcpServers": {
"tachyon": {
"command": "tachyon",
"args": ["serve", "--mcp", "--report", "./report.ncu-rep"]
}
}
}
Once Ducc starts, all 9 tools are auto-registered: list_kernels, get_kernel_metrics, get_source_hotspots, run_analysis, etc.
SDK Usage
Tachyon also works as a Python library:
from tachyon import NcuProfiler, ToolPathResolver, TachyonConfig
config = TachyonConfig.load()
resolver = ToolPathResolver(config)
profiler = NcuProfiler(config, resolver)
result = profiler.profile_basic("./my_cuda_app")
See SDK Guide for detailed examples.
Configuration
Configuration lives in ~/.tachyon/config.toml, with environment variable and CLI overrides.
[llm]
provider = "anthropic"
model = "claude-sonnet-4-20250514"
temperature = 0.1
[profiling]
strategy = "conservative"
[output]
lang = "en"
format = "terminal"
[tools]
ncu_path = "/usr/local/cuda/bin/ncu"
See Configuration for all options.
Architecture (8 Layers)
CLI/Chat > Agent Loop > LLM Backends > Tools(9) > Analyzers(7) > Correlator > Reader > Models
Data flows bottom-up: .ncu-rep → Reader parses into KernelReport → Correlator builds three-way mapping → Analyzers produce Findings → CLI renders / Agent explores interactively.
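The bottom-up flow can be sketched with minimal stand-in types. All class and function names below are illustrative assumptions, not Tachyon's real models or APIs.

```python
# Minimal stand-ins for the Reader -> Analyzer stages of the pipeline.
# KernelReport, Finding, and both functions are illustrative, not the real API.
from dataclasses import dataclass


@dataclass
class KernelReport:
    name: str
    metrics: dict


@dataclass
class Finding:
    kernel: str
    message: str


def read_report(raw_records):
    """Reader stage: parse raw records into KernelReport objects."""
    return [KernelReport(r["name"], r["metrics"]) for r in raw_records]


def analyze(reports):
    """Analyzer stage: emit a Finding for kernels with low achieved occupancy."""
    return [Finding(r.name, "low occupancy")
            for r in reports if r.metrics.get("occupancy", 1.0) < 0.5]
```

Each layer consumes the structured output of the one below it, which is why analyzers and agent tools never touch the raw .ncu-rep file directly.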
See Architecture for the full overview.
Development
Prerequisites
- Python 3.10+
- NVIDIA Nsight Compute (for profiling and report parsing)
Tests
pip install -e ".[dev,ai]"
pytest
pytest --cov=tachyon --cov-report=term-missing
mypy src/tachyon/
ruff check src/
Project Structure
src/tachyon/
├── cli/ Click commands (analyze, chat, profile, diff, serve)
├── agent/ Multi-turn agent loop, persona, context manager
├── llm/ LLM backend abstraction (OpenAI, Anthropic, LiteLLM)
├── tools/ 9 agent-callable tools (data query, source, analysis)
├── analyzers/ 7 rule-based analyzers (roofline, memory, occupancy, ...)
├── correlator/ Three-way mapping engine (metrics <-> source <-> SASS)
├── reader/ NCU .ncu-rep binary report parser
├── profiler/ Two-stage smart profiling, tool path resolver
├── evolve/ Iterative optimization orchestrator, tools, persona
├── models/ Core data models (KernelReport, Finding, OptTree, ...)
├── config/ TOML configuration with layered overrides
├── report/ Output renderers (terminal, markdown)
├── diff/ Profile comparison engine
├── server/ MCP server (stdio transport)
├── tree/ Optimization tree builder and pruning
├── errors/ Structured error handling (ToolResult, ErrorCode)
└── i18n/ Internationalization support
Metrics
733 tests | 88% coverage | 0 lint errors | 7 analyzers | 9 tools | 3 LLM backends
Documentation
| Document | Description |
|----------|-------------|
| Architecture | 8-layer architecture, data flow, module responsibilities, design decisions |
| CLI Reference | Full options and examples for all 5 commands |
| Configuration | config.toml settings, env vars, priority chain |
| SDK Guide | Python SDK patterns, common usage, API reference |
| Multi-Vendor LLM & Ducc Integration | LLM switching, Ducc MCP integration, custom providers |
| Evolve Guide | Automated iterative optimization: usage, config, real-world examples |
Preliminary Showcasing [WIP]
For this part, see: Showcasing
License
MIT
