CUTracer

CUTracer is a CUDA binary instrumentation tool built on NVBit. It cleanly separates lightweight data collection (instrumentation) from host-side processing (analysis). Typical workflows include per-warp instruction histograms (delimited by GPU clock reads) and kernel hang detection.

Features

NVBit-powered, runtime attach via CUDA_INJECTION64_PATH (no app rebuild needed)
Multiple instrumentation modes: opcode-only, register trace, memory trace, random delay
Built-in analyses:
- Instruction Histogram (for Proton/Triton workflows)
- Deadlock/Hang Detection
- Data Race Detection
CUDA Graph and stream-capture aware flows
Deterministic kernel log file naming and CSV outputs

Requirements

All requirements are aligned with NVBit.

Unique requirements:

libzstd: Required for trace compression

Installation

Clone the repository:

cd ~
git clone git@github.com:facebookresearch/CUTracer.git
cd CUTracer

Note for Meta internal users: CUTracer is also available at fbcode/triton/tools/CUTracer/ within fbsource. You can build via buck2 build fbcode//triton/tools/CUTracer:cutracer.so instead of the Makefile workflow.

Install system dependencies (libzstd static library for self-contained builds):

# Ubuntu/Debian
# On most Ubuntu/Debian systems, libzstd-dev provides both shared and static libs (libzstd.a).
# You can verify this with: dpkg -L libzstd-dev | grep 'libzstd.a'
# If your distribution does not ship the static library in libzstd-dev, you may need to
# build zstd from source or install a distro-specific static libzstd package.
sudo apt-get install libzstd-dev

# CentOS/RHEL/Fedora (static library for portable builds)
sudo dnf install libzstd-static

# If static library is not available, the build will fall back to dynamic linking
# and display a warning. The resulting binary will not be self-contained.

Download third-party dependencies:

./install_third_party.sh

This will download:

NVBit (NVIDIA Binary Instrumentation Tool)
nlohmann/json (JSON library for C++)

Build the tool:

make -j$(nproc)

Quickstart

1. Install the Python CLI

cd ~/CUTracer/python
pip install .

2. Run your CUDA app with CUTracer

# Option A: Set CUTRACER_LIB_PATH once (recommended)
export CUTRACER_LIB_PATH=~/CUTracer/lib
cutracer trace -i tma_trace -- ./your_app

# Option B: Specify cutracer.so explicitly
cutracer trace -i tma_trace --cutracer-so ~/CUTracer/lib/cutracer.so -- ./your_app

# Option C: Run from the CUTracer project root (auto-discovers ./lib/cutracer.so)
cd ~/CUTracer
cutracer trace -i tma_trace -- ./your_app

# Option D: Kernel launch logger only (no instrumentation, no trace files)
cutracer trace -- ./your_app

3. Analyze the output

cutracer analyze warp-summary output.ndjson
cutracer query output.ndjson --filter "warp=24"
cutracer validate output.ndjson

Note: You can also use CUTracer without the Python CLI by setting the CUDA_INJECTION64_PATH environment variable directly:
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so ./your_app

Configuration (env vars)

CUTRACER_INSTRUMENT: comma-separated modes: opcode_only, reg_trace, mem_trace, random_delay
CUTRACER_ANALYSIS: comma-separated analyses: proton_instr_histogram, deadlock_detection, random_delay
- Enabling proton_instr_histogram auto-enables opcode_only
- Enabling deadlock_detection auto-enables reg_trace
- Enabling random_delay auto-enables random_delay instrumentation; also requires CUTRACER_DELAY_NS to be set
KERNEL_FILTERS: comma-separated substrings matching unmangled or mangled kernel names
INSTR_BEGIN, INSTR_END: static instruction index gate during instrumentation
TOOL_VERBOSE: 0/1/2
CUTRACER_TRACE_FORMAT: trace output format. Accepts string names or numeric values (replaces the legacy TRACE_FORMAT_NDJSON env var, which is still accepted for backward compatibility)
- ndjson or 2 (default): NDJSON uncompressed (.ndjson)
- text (or 0): Plain text (.log, legacy format, verbose)
- zstd (or 1): NDJSON+Zstd compressed (.ndjson.zst, ~12x compression, 92% space savings)
- clp (or 3): CLP Archive (.clp)
CUTRACER_ZSTD_LEVEL: Zstd compression level (1-22, default 9)
- Lower values (1-3): Faster compression, slightly larger output
- Higher values (19-22): Maximum compression, slower but smallest output
- Default of 9 provides balanced compression speed and ratio
CUTRACER_DELAY_NS: Max delay value in nanoseconds for random_delay analysis (required when random_delay is enabled)
CUTRACER_DELAY_MIN_NS: Minimum delay in nanoseconds — floor for random mode (default: 0). Must be ≤ CUTRACER_DELAY_NS
CUTRACER_DELAY_MODE: Delay mode: random (per-thread random delay in [min, max], default) or fixed (same delay for all threads, often masks races)
CUTRACER_DELAY_DUMP_PATH: Output path for delay config JSON file (for recording instrumentation patterns)
CUTRACER_DELAY_LOAD_PATH: Input path for delay config JSON file (for replay mode - deterministic reproduction)
CUTRACER_OUTPUT_DIR: Output directory for all CUTracer files (trace files and log files). Defaults to the current directory. The directory must exist and be writable.
CUTRACER_CPU_CALLSTACK: Enable/disable CPU call stack capture at each kernel launch (default: 1 = enabled)
- When enabled, the kernel_metadata trace event includes a cpu_callstack array with demangled C++ frame names
CUTRACER_KERNEL_TIMEOUT_S: Kernel execution time limit in seconds (default: 0 = disabled)
- Terminates the process with SIGTERM when a kernel runs longer than this value
- Acts as a general safety valve, independent of deadlock detection (does not require -a deadlock_detection)
CUTRACER_NO_DATA_TIMEOUT_S: No-data hang detection timeout in seconds (default: 15)
- Terminates the process with SIGTERM when no trace data arrives for this duration
- Acts as a general safety valve, independent of deadlock detection (does not require -a deadlock_detection)
- Catches "silent" hangs where all warps are blocked on synchronization primitives with zero trace output
- Works whether the kernel went silent after producing some data, or never produced any data at all
- When -a deadlock_detection is also active, prints detailed warp status summary before termination
- Set to 0 to disable
CUTRACER_TRACE_SIZE_LIMIT_MB: Maximum trace file size in MB (default: 0 = disabled)
- When any trace file exceeds this limit, tracing is stopped for that kernel; kernel execution continues normally
- Useful for preventing runaway trace files from filling disk (e.g., during deadlocked kernels)

Notes:

The tool sets CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 to simplify channel memory handling.
Multiple analyses can be combined (e.g., CUTRACER_ANALYSIS=proton_instr_histogram,deadlock_detection). Each analysis auto-enables its required instrumentation mode.

Analyses

Instruction Histogram (proton_instr_histogram)

Counts SASS instruction mnemonics per warp within regions delimited by clock reads (start/stop model; nested regions not supported)
Output: one CSV per kernel launch with columns warp_id,region_id,instruction,count

Example (Triton/Proton + IPC):

cd ~/CUTracer/tests/proton_tests

# 1) Collect histogram with CUTracer
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py

# 2) Run without CUTracer to generate a clean Chrome trace
python ./vector-add-instrumented.py

# 3) Merge and compute IPC
python ~/CUTracer/scripts/parse_instr_hist_trace.py \
  --chrome-trace ./vector.chrome_trace \
  --cutracer-trace ./kernel_*_add_kernel_hist.csv \
  --cutracer-log ./cutracer_main_*.log \
  --output vectoradd_ipc.csv

Deadlock / Hang Detection (deadlock_detection)

Detects sustained hangs by identifying warps stuck in stable PC loops; logs and issues SIGTERM→SIGKILL if sustained
Requires reg_trace (auto-enabled)

Example (intentional loop):

cd ~/CUTracer/tests/hang_test
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py

Data Race Detection (random_delay)

Data races depend on thread scheduling and timing — buggy code may appear correct by luck. This analysis exposes hidden races by injecting random delays before synchronization-related SASS instructions (e.g., BAR, MEMBAR, ATOM, RED), disrupting the normal timing and forcing latent races to manifest as observable failures.
Each instrumentation point is randomly enabled/disabled (50% probability)
Two delay modes:
- random (default): Each thread gets a random delay in [0, CUTRACER_DELAY_NS] using GPU-side xorshift32 PRNG seeded with threadIdx/blockIdx/clock. Creates per-thread timing skew that amplifies data races. Recommended.
- fixed: All threads get the same delay. Preserves relative timing between threads and often masks races rather than exposing them. Not recommended for race detection.
Requires CUTRACER_DELAY_NS to be set. The random_delay instrumentation mode is auto-enabled.

Example:

CUTRACER_DELAY_NS=100000 \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python3 your_kernel.py

Delay Dump and Replay

CUTracer supports dumping delay configurations to JSON for deterministic reproduction of data races:

Dump mode: Set CUTRACER_DELAY_DUMP_PATH to save the random instrumentation pattern to a JSON file
Replay mode: Set CUTRACER_DELAY_LOAD_PATH to load a saved config and rep

CUTracer

Install / Use

README

CUTracer

Features

Requirements

Installation

Quickstart

1. Install the Python CLI

2. Run your CUDA app with CUTracer

3. Analyze the output

Configuration (env vars)

Analyses

Instruction Histogram (proton_instr_histogram)

Deadlock / Hang Detection (deadlock_detection)

Data Race Detection (random_delay)

Delay Dump and Replay