Tachyon
English Doc (Current) | 中文文档
AI empowered CUDA kernel profiler
Tachyon /ˈtakēˌän/ (tachyon — a theoretical particle that travels faster than light) is a CUDA kernel performance analysis and optimization toolkit. It combines LLM-powered agents with traditional rule-based analysis to bridge the full path from NCU metrics to CUDA source code to low-level instructions (PTX/SASS), so performance analysis doesn't stop at aggregate counters — it traces back through the instruction level all the way to your source lines. Beyond analysis, the evolve mode automates the optimization loop: an agent reads profiling data, edits source code, rebuilds, re-profiles, and iterates until convergence — no manual tuning required.
Fully implemented in Python. Supports end-to-end profiling (run an executable directly under tachyon, just like ncu), interactive AI-driven analysis, and fully automated iterative optimization. Multiple LLM agent vendors are supported.
Features
- Three-way mapping: NCU metrics <-> source lines <-> SASS instructions, pinpointing the exact code behind each bottleneck.
- Smart two-stage profiling: quick scan finds the hottest top-K kernels, then deep dive collects detailed metrics only where it matters.
- Rule engine + AI agent: 7 built-in analyzers (roofline, memory, occupancy, warp stall, ...) produce structured findings; an LLM agent with 9 specialized tools supports interactive follow-up.
- Profile diff: compare two .ncu-rep files side by side, highlighting regressions.
- Evolve mode: automated iterative optimization. An LLM agent reads NCU profiling data, edits CUDA source, compiles, re-profiles, and accepts or rolls back each iteration. No manual tuning loops.
- MCP server: expose all analysis tools over MCP (stdio transport) for seamless integration with Claude Code, Ducc, Cursor, and custom agents.
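The evolve loop described in the features above can be sketched as a plain accept/rollback iteration. This is an illustrative sketch only: `profile`, `propose_edit`, and `build` are hypothetical callables standing in for Tachyon's actual profiling, editing, and build steps.

```python
# Illustrative sketch of an evolve-style accept/rollback loop.
# profile(), propose_edit(), and build() are hypothetical stand-ins,
# not Tachyon's real API.

def evolve(source, profile, propose_edit, build, max_iterations=10):
    """Iteratively edit `source`, keeping only edits that improve runtime."""
    best_time = profile(source)
    for _ in range(max_iterations):
        candidate = propose_edit(source)   # agent proposes a code change
        if not build(candidate):           # rebuild; skip candidate on failure
            continue
        t = profile(candidate)             # re-profile the candidate
        if t < best_time:                  # accept only measured improvements
            source, best_time = candidate, t
    return source, best_time
```

The key property is that a candidate is only kept when a fresh profile confirms it is faster, so a bad LLM edit can never make the accepted kernel slower.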
Quick Start
Installation
# Rule-based analysis only (no LLM needed)
pip install -e .
# With AI agent (OpenAI / Anthropic / LiteLLM)
pip install -e ".[ai]"
# With MCP server
pip install -e ".[mcp]"
# Full development environment
pip install -e ".[dev,ai,mcp]"
Basic Usage
tachyon analyze report.ncu-rep
tachyon chat report.ncu-rep --model claude-sonnet-4-20250514
tachyon profile ./my_app --strategy radical
tachyon diff before.ncu-rep after.ncu-rep
tachyon evolve ./my_app --build "make -j8" --max-iterations 10
tachyon serve --mcp --report report.ncu-rep
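The "smart two-stage profiling" behind tachyon profile (quick scan first, then a deep dive on the top-K kernels) can be illustrated with a small sketch. The function names here are hypothetical, not Tachyon's internal API.

```python
# Hypothetical sketch of two-stage profiling: a cheap timing pass ranks
# all kernels, then expensive metric collection runs only on the top K.

def two_stage_profile(kernels, quick_time, deep_metrics, k=3):
    """Rank kernels by a cheap timing pass, then deep-profile the top K."""
    ranked = sorted(kernels, key=quick_time, reverse=True)  # hottest first
    return {name: deep_metrics(name) for name in ranked[:k]}
```

This mirrors why the two-stage approach is cheap: the detailed (and slow) metric collection cost scales with K, not with the total number of kernels in the application.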
CLI Commands
| Command | Description |
|---------|-------------|
| tachyon analyze | Parse .ncu-rep, run rule-based analyzers, output findings. |
| tachyon chat | Interactive AI analysis with 9 specialized tools. |
| tachyon profile | End-to-end: profile a CUDA executable, then analyze. |
| tachyon diff | Compare two reports, flag performance changes. |
| tachyon evolve | Automated iterative kernel optimization via LLM agent. |
| tachyon serve | Start MCP server for external agents. |
Run tachyon <command> --help for full options. See the CLI Reference.
Multi-Vendor LLM Switching
# ~/.tachyon/config.toml
[llm]
provider = "anthropic" # openai / anthropic / litellm
model = "claude-sonnet-4-20250514"
api_key_env = "ANTHROPIC_API_KEY"
# base_url = "http://localhost:8000/v1" # local models
# Switch via environment variables
export TACHYON_LLM_PROVIDER=litellm
export TACHYON_MODEL=qianfan/ERNIE-Bot-4
# Single-session CLI override
tachyon chat report.ncu-rep --provider openai --model gpt-4o
Priority: CLI > env vars > config.toml > defaults. Graceful degradation to Rule-Only mode when no LLM is available.
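The priority chain can be expressed as a first-match lookup over ordered layers. The resolver below is a hypothetical sketch of that layering, not Tachyon's actual implementation; each layer is passed in as a plain dict.

```python
# Sketch of CLI > env vars > config.toml > defaults resolution.
# Hypothetical illustration only, not Tachyon's actual resolver.

def resolve(key, cli_args, env, toml_cfg, defaults):
    """Return the value from the highest-priority layer that defines `key`."""
    for layer in (cli_args, env, toml_cfg, defaults):
        if layer.get(key) is not None:
            return layer[key]
    return None
```

Because layers are searched in priority order, a --model flag always wins over an exported variable, which in turn wins over config.toml.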
Ducc Integration
// .claude/settings.json
{
"mcpServers": {
"tachyon": {
"command": "tachyon",
"args": ["serve", "--mcp", "--report", "./report.ncu-rep"]
}
}
}
Once Ducc starts, all 9 tools are auto-registered: list_kernels, get_kernel_metrics, get_source_hotspots, run_analysis, etc.
SDK Usage
Tachyon also works as a Python library:
from tachyon import NcuProfiler, ToolPathResolver, TachyonConfig
config = TachyonConfig.load()
resolver = ToolPathResolver(config)
profiler = NcuProfiler(config, resolver)
result = profiler.profile_basic("./my_cuda_app")
See SDK Guide for detailed examples.
Configuration
Configuration lives in ~/.tachyon/config.toml, with environment variable and CLI overrides.
[llm]
provider = "anthropic"
model = "claude-sonnet-4-20250514"
temperature = 0.1
[profiling]
strategy = "conservative"
[output]
lang = "en"
format = "terminal"
[tools]
ncu_path = "/usr/local/cuda/bin/ncu"
See Configuration for all options.
Architecture (8 Layers)
CLI/Chat > Agent Loop > LLM Backends > Tools(9) > Analyzers(7) > Correlator > Reader > Models
Data flows bottom-up: .ncu-rep → Reader parses into KernelReport → Correlator builds three-way mapping → Analyzers produce Findings → CLI renders / Agent explores interactively.
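The bottom-up flow can be sketched with minimal stand-in types. All class and function names below are illustrative assumptions, not Tachyon's real models or APIs.

```python
# Minimal stand-ins for the Reader -> Analyzer stages of the pipeline.
# KernelReport, Finding, and both functions are illustrative, not the real API.
from dataclasses import dataclass


@dataclass
class KernelReport:
    name: str
    metrics: dict


@dataclass
class Finding:
    kernel: str
    message: str


def read_report(raw_records):
    """Reader stage: parse raw records into KernelReport objects."""
    return [KernelReport(r["name"], r["metrics"]) for r in raw_records]


def analyze(reports):
    """Analyzer stage: emit a Finding for kernels with low achieved occupancy."""
    return [Finding(r.name, "low occupancy")
            for r in reports if r.metrics.get("occupancy", 1.0) < 0.5]
```

Each layer consumes the structured output of the one below it, which is why analyzers and agent tools never touch the raw .ncu-rep file directly.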
See Architecture for the full overview.
Development
Prerequisites
- Python 3.10+
- NVIDIA Nsight Compute (for profiling and report parsing)
Tests
pip install -e ".[dev,ai]"
pytest
pytest --cov=tachyon --cov-report=term-missing
mypy src/tachyon/
ruff check src/
Project Structure
src/tachyon/
├── cli/ Click commands (analyze, chat, profile, diff, serve)
├── agent/ Multi-turn agent loop, persona, context manager
├── llm/ LLM backend abstraction (OpenAI, Anthropic, LiteLLM)
├── tools/ 9 agent-callable tools (data query, source, analysis)
├── analyzers/ 7 rule-based analyzers (roofline, memory, occupancy, ...)
├── correlator/ Three-way mapping engine (metrics <-> source <-> SASS)
├── reader/ NCU .ncu-rep binary report parser
├── profiler/ Two-stage smart profiling, tool path resolver
├── evolve/ Iterative optimization orchestrator, tools, persona
├── models/ Core data models (KernelReport, Finding, OptTree, ...)
├── config/ TOML configuration with layered overrides
├── report/ Output renderers (terminal, markdown)
├── diff/ Profile comparison engine
├── server/ MCP server (stdio transport)
├── tree/ Optimization tree builder and pruning
├── errors/ Structured error handling (ToolResult, ErrorCode)
└── i18n/ Internationalization support
Metrics
733 tests | 88% coverage | 0 lint errors | 7 analyzers | 9 tools | 3 LLM backends
Documentation
| Document | Description |
|----------|-------------|
| Architecture | 8-layer architecture, data flow, module responsibilities, design decisions |
| CLI Reference | Full options and examples for all 5 commands |
| Configuration | config.toml settings, env vars, priority chain |
| SDK Guide | Python SDK patterns, common usage, API reference |
| Multi-Vendor LLM & Ducc Integration | LLM switching, Ducc MCP integration, custom providers |
| Evolve Guide | Automated iterative optimization: usage, config, real-world examples |
Preliminary Showcasing [WIP]
For this part, see: Showcasing
License
MIT
