SkillAgentSearch skills...

CUTracer

A dynamic binary instrumentation tool for tracing and analyzing CUDA kernel instructions.

Install / Use

/learn @facebookresearch/CUTracer

README

CUTracer

CUTracer is a CUDA binary instrumentation tool built on NVBit. It cleanly separates lightweight data collection (instrumentation) from host-side processing (analysis). Typical workflows include per-warp instruction histograms (delimited by GPU clock reads) and kernel hang detection.

Features

  • NVBit-powered, runtime attach via CUDA_INJECTION64_PATH (no app rebuild needed)
  • Multiple instrumentation modes: opcode-only, register trace, memory trace, random delay
  • Built-in analyses:
    • Instruction Histogram (for Proton/Triton workflows)
    • Deadlock/Hang Detection
    • Data Race Detection
  • CUDA Graph and stream-capture aware flows
  • Deterministic kernel log file naming and CSV outputs

Requirements

All requirements are aligned with NVBit.

Unique requirements:

  • libzstd: Required for trace compression

Installation

  1. Clone the repository:
cd ~
git clone git@github.com:facebookresearch/CUTracer.git
cd CUTracer

Note for Meta internal users: CUTracer is also available at fbcode/triton/tools/CUTracer/ within fbsource. You can build via buck2 build fbcode//triton/tools/CUTracer:cutracer.so instead of the Makefile workflow.

  1. Install system dependencies (libzstd static library for self-contained builds):
# Ubuntu/Debian
# On most Ubuntu/Debian systems, libzstd-dev provides both shared and static libs (libzstd.a).
# You can verify this with: dpkg -L libzstd-dev | grep 'libzstd.a'
# If your distribution does not ship the static library in libzstd-dev, you may need to
# build zstd from source or install a distro-specific static libzstd package.
sudo apt-get install libzstd-dev

# CentOS/RHEL/Fedora (static library for portable builds)
sudo dnf install libzstd-static

# If static library is not available, the build will fall back to dynamic linking
# and display a warning. The resulting binary will not be self-contained.
  1. Download third-party dependencies:
./install_third_party.sh

This will download:

  • NVBit (NVIDIA Binary Instrumentation Tool)
  • nlohmann/json (JSON library for C++)
  1. Build the tool:
make -j$(nproc)

Quickstart

1. Install the Python CLI

cd ~/CUTracer/python
pip install .

2. Run your CUDA app with CUTracer

# Option A: Set CUTRACER_LIB_PATH once (recommended)
export CUTRACER_LIB_PATH=~/CUTracer/lib
cutracer trace -i tma_trace -- ./your_app

# Option B: Specify cutracer.so explicitly
cutracer trace -i tma_trace --cutracer-so ~/CUTracer/lib/cutracer.so -- ./your_app

# Option C: Run from the CUTracer project root (auto-discovers ./lib/cutracer.so)
cd ~/CUTracer
cutracer trace -i tma_trace -- ./your_app

# Option D: Kernel launch logger only (no instrumentation, no trace files)
cutracer trace -- ./your_app

3. Analyze the output

cutracer analyze warp-summary output.ndjson
cutracer query output.ndjson --filter "warp=24"
cutracer validate output.ndjson

Note: You can also use CUTracer without the Python CLI by setting the CUDA_INJECTION64_PATH environment variable directly:

CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so ./your_app

Configuration (env vars)

  • CUTRACER_INSTRUMENT: comma-separated modes: opcode_only, reg_trace, mem_trace, random_delay
  • CUTRACER_ANALYSIS: comma-separated analyses: proton_instr_histogram, deadlock_detection, random_delay
    • Enabling proton_instr_histogram auto-enables opcode_only
    • Enabling deadlock_detection auto-enables reg_trace
    • Enabling random_delay auto-enables random_delay instrumentation; also requires CUTRACER_DELAY_NS to be set
  • KERNEL_FILTERS: comma-separated substrings matching unmangled or mangled kernel names
  • INSTR_BEGIN, INSTR_END: static instruction index gate during instrumentation
  • TOOL_VERBOSE: 0/1/2
  • CUTRACER_TRACE_FORMAT: trace output format. Accepts string names or numeric values (replaces the legacy TRACE_FORMAT_NDJSON env var, which is still accepted for backward compatibility)
    • ndjson or 2 (default): NDJSON uncompressed (.ndjson)
    • text (or 0): Plain text (.log, legacy format, verbose)
    • zstd (or 1): NDJSON+Zstd compressed (.ndjson.zst, ~12x compression, 92% space savings)
    • clp (or 3): CLP Archive (.clp)
  • CUTRACER_ZSTD_LEVEL: Zstd compression level (1-22, default 9)
    • Lower values (1-3): Faster compression, slightly larger output
    • Higher values (19-22): Maximum compression, slower but smallest output
    • Default of 9 provides balanced compression speed and ratio
  • CUTRACER_DELAY_NS: Max delay value in nanoseconds for random_delay analysis (required when random_delay is enabled)
  • CUTRACER_DELAY_MIN_NS: Minimum delay in nanoseconds — floor for random mode (default: 0). Must be ≤ CUTRACER_DELAY_NS
  • CUTRACER_DELAY_MODE: Delay mode: random (per-thread random delay in [min, max], default) or fixed (same delay for all threads, often masks races)
  • CUTRACER_DELAY_DUMP_PATH: Output path for delay config JSON file (for recording instrumentation patterns)
  • CUTRACER_DELAY_LOAD_PATH: Input path for delay config JSON file (for replay mode - deterministic reproduction)
  • CUTRACER_OUTPUT_DIR: Output directory for all CUTracer files (trace files and log files). Defaults to the current directory. The directory must exist and be writable.
  • CUTRACER_CPU_CALLSTACK: Enable/disable CPU call stack capture at each kernel launch (default: 1 = enabled)
    • When enabled, the kernel_metadata trace event includes a cpu_callstack array with demangled C++ frame names
  • CUTRACER_KERNEL_TIMEOUT_S: Kernel execution time limit in seconds (default: 0 = disabled)
    • Terminates the process with SIGTERM when a kernel runs longer than this value
    • Acts as a general safety valve, independent of deadlock detection (does not require -a deadlock_detection)
  • CUTRACER_NO_DATA_TIMEOUT_S: No-data hang detection timeout in seconds (default: 15)
    • Terminates the process with SIGTERM when no trace data arrives for this duration
    • Acts as a general safety valve, independent of deadlock detection (does not require -a deadlock_detection)
    • Catches "silent" hangs where all warps are blocked on synchronization primitives with zero trace output
    • Works whether the kernel went silent after producing some data, or never produced any data at all
    • When -a deadlock_detection is also active, prints detailed warp status summary before termination
    • Set to 0 to disable
  • CUTRACER_TRACE_SIZE_LIMIT_MB: Maximum trace file size in MB (default: 0 = disabled)
    • When any trace file exceeds this limit, tracing is stopped for that kernel; kernel execution continues normally
    • Useful for preventing runaway trace files from filling disk (e.g., during deadlocked kernels)

Notes:

  • The tool sets CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 to simplify channel memory handling.
  • Multiple analyses can be combined (e.g., CUTRACER_ANALYSIS=proton_instr_histogram,deadlock_detection). Each analysis auto-enables its required instrumentation mode.

Analyses

Instruction Histogram (proton_instr_histogram)

  • Counts SASS instruction mnemonics per warp within regions delimited by clock reads (start/stop model; nested regions not supported)
  • Output: one CSV per kernel launch with columns warp_id,region_id,instruction,count

Example (Triton/Proton + IPC):

cd ~/CUTracer/tests/proton_tests

# 1) Collect histogram with CUTracer
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py

# 2) Run without CUTracer to generate a clean Chrome trace
python ./vector-add-instrumented.py

# 3) Merge and compute IPC
python ~/CUTracer/scripts/parse_instr_hist_trace.py \
  --chrome-trace ./vector.chrome_trace \
  --cutracer-trace ./kernel_*_add_kernel_hist.csv \
  --cutracer-log ./cutracer_main_*.log \
  --output vectoradd_ipc.csv

Deadlock / Hang Detection (deadlock_detection)

  • Detects sustained hangs by identifying warps stuck in stable PC loops; logs and issues SIGTERM→SIGKILL if sustained
  • Requires reg_trace (auto-enabled)

Example (intentional loop):

cd ~/CUTracer/tests/hang_test
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py

Data Race Detection (random_delay)

  • Data races depend on thread scheduling and timing — buggy code may appear correct by luck. This analysis exposes hidden races by injecting random delays before synchronization-related SASS instructions (e.g., BAR, MEMBAR, ATOM, RED), disrupting the normal timing and forcing latent races to manifest as observable failures.
  • Each instrumentation point is randomly enabled/disabled (50% probability)
  • Two delay modes:
    • random (default): Each thread gets a random delay in [0, CUTRACER_DELAY_NS] using GPU-side xorshift32 PRNG seeded with threadIdx/blockIdx/clock. Creates per-thread timing skew that amplifies data races. Recommended.
    • fixed: All threads get the same delay. Preserves relative timing between threads and often masks races rather than exposing them. Not recommended for race detection.
  • Requires CUTRACER_DELAY_NS to be set. The random_delay instrumentation mode is auto-enabled.

Example:

CUTRACER_DELAY_NS=100000 \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python3 your_kernel.py

Delay Dump and Replay

CUTracer supports dumping delay configurations to JSON for deterministic reproduction of data races:

  • Dump mode: Set CUTRACER_DELAY_DUMP_PATH to save the random instrumentation pattern to a JSON file
  • Replay mode: Set CUTRACER_DELAY_LOAD_PATH to load a saved config and rep
View on GitHub
GitHub Stars37
CategoryDevelopment
Updated1d ago
Forks6

Languages

Python

Security Score

95/100

Audited on Mar 31, 2026

No findings