CUTracer
A dynamic binary instrumentation tool for tracing and analyzing CUDA kernel instructions.
CUTracer is a CUDA binary instrumentation tool built on NVBit. It cleanly separates lightweight data collection (instrumentation) from host-side processing (analysis). Typical workflows include per-warp instruction histograms (delimited by GPU clock reads) and kernel hang detection.
Features
- NVBit-powered, runtime attach via CUDA_INJECTION64_PATH (no app rebuild needed)
- Multiple instrumentation modes: opcode-only, register trace, memory trace, random delay
- Built-in analyses:
  - Instruction Histogram (for Proton/Triton workflows)
  - Deadlock/Hang Detection
  - Data Race Detection
- CUDA Graph and stream-capture aware flows
- Deterministic kernel log file naming and CSV outputs
Requirements
CUTracer has the same requirements as NVBit.
Additional requirements:
- libzstd: Required for trace compression
Installation
- Clone the repository:
cd ~
git clone git@github.com:facebookresearch/CUTracer.git
cd CUTracer
Note for Meta internal users: CUTracer is also available at fbcode/triton/tools/CUTracer/ within fbsource. You can build via buck2 build fbcode//triton/tools/CUTracer:cutracer.so instead of the Makefile workflow.
- Install system dependencies (libzstd static library for self-contained builds):
# Ubuntu/Debian
# On most Ubuntu/Debian systems, libzstd-dev provides both shared and static libs (libzstd.a).
# You can verify this with: dpkg -L libzstd-dev | grep 'libzstd.a'
# If your distribution does not ship the static library in libzstd-dev, you may need to
# build zstd from source or install a distro-specific static libzstd package.
sudo apt-get install libzstd-dev
# CentOS/RHEL/Fedora (static library for portable builds)
sudo dnf install libzstd-static
# If static library is not available, the build will fall back to dynamic linking
# and display a warning. The resulting binary will not be self-contained.
- Download third-party dependencies:
./install_third_party.sh
This will download:
- NVBit (NVIDIA Binary Instrumentation Tool)
- nlohmann/json (JSON library for C++)
- Build the tool:
make -j$(nproc)
Quickstart
1. Install the Python CLI
cd ~/CUTracer/python
pip install .
2. Run your CUDA app with CUTracer
# Option A: Set CUTRACER_LIB_PATH once (recommended)
export CUTRACER_LIB_PATH=~/CUTracer/lib
cutracer trace -i tma_trace -- ./your_app
# Option B: Specify cutracer.so explicitly
cutracer trace -i tma_trace --cutracer-so ~/CUTracer/lib/cutracer.so -- ./your_app
# Option C: Run from the CUTracer project root (auto-discovers ./lib/cutracer.so)
cd ~/CUTracer
cutracer trace -i tma_trace -- ./your_app
# Option D: Kernel launch logger only (no instrumentation, no trace files)
cutracer trace -- ./your_app
3. Analyze the output
cutracer analyze warp-summary output.ndjson
cutracer query output.ndjson --filter "warp=24"
cutracer validate output.ndjson
Note: You can also use CUTracer without the Python CLI by setting the CUDA_INJECTION64_PATH environment variable directly:
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so ./your_app
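The NDJSON trace consumed by the analysis commands in step 3 is plain newline-delimited JSON, so it can also be inspected with a few lines of Python. The sketch below counts records per warp. It assumes each record is a JSON object carrying a `warp` field (inferred from the `--filter "warp=24"` example above); the sample records and the `pc` field are hypothetical, and the real schema may differ.

```python
import io
import json
from collections import Counter


def count_records_per_warp(lines, warp_key="warp"):
    """Count trace records per warp in an NDJSON stream (one JSON object per line)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Records without the assumed warp field (e.g. metadata events) are skipped.
        if warp_key in record:
            counts[record[warp_key]] += 1
    return counts


# Hypothetical sample records; real CUTracer records carry more fields.
sample = io.StringIO(
    '{"warp": 24, "pc": 16}\n'
    '{"warp": 24, "pc": 32}\n'
    '{"warp": 7, "pc": 16}\n'
    '{"type": "kernel_metadata"}\n'
)
per_warp = count_records_per_warp(sample)
print(per_warp.most_common())
```

For a real trace, pass an open file instead of the in-memory sample, for example `count_records_per_warp(open("output.ndjson"))`.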
Configuration (env vars)
- CUTRACER_INSTRUMENT: comma-separated modes: opcode_only, reg_trace, mem_trace, random_delay
- CUTRACER_ANALYSIS: comma-separated analyses: proton_instr_histogram, deadlock_detection, random_delay
  - Enabling proton_instr_histogram auto-enables opcode_only
  - Enabling deadlock_detection auto-enables reg_trace
  - Enabling random_delay auto-enables random_delay instrumentation; also requires CUTRACER_DELAY_NS to be set
- KERNEL_FILTERS: comma-separated substrings matching unmangled or mangled kernel names
- INSTR_BEGIN, INSTR_END: static instruction index gate during instrumentation
- TOOL_VERBOSE: 0/1/2
- CUTRACER_TRACE_FORMAT: trace output format. Accepts string names or numeric values (replaces the legacy TRACE_FORMAT_NDJSON env var, which is still accepted for backward compatibility)
  - ndjson or 2 (default): NDJSON uncompressed (.ndjson)
  - text (or 0): Plain text (.log, legacy format, verbose)
  - zstd (or 1): NDJSON+Zstd compressed (.ndjson.zst, ~12x compression, 92% space savings)
  - clp (or 3): CLP Archive (.clp)
- CUTRACER_ZSTD_LEVEL: Zstd compression level (1-22, default 9)
  - Lower values (1-3): Faster compression, slightly larger output
  - Higher values (19-22): Maximum compression, slower but smallest output
  - Default of 9 provides balanced compression speed and ratio
- CUTRACER_DELAY_NS: Max delay value in nanoseconds for random_delay analysis (required when random_delay is enabled)
- CUTRACER_DELAY_MIN_NS: Minimum delay in nanoseconds, the floor for random mode (default: 0). Must be ≤ CUTRACER_DELAY_NS
- CUTRACER_DELAY_MODE: Delay mode: random (per-thread random delay in [min, max], default) or fixed (same delay for all threads, often masks races)
- CUTRACER_DELAY_DUMP_PATH: Output path for delay config JSON file (for recording instrumentation patterns)
- CUTRACER_DELAY_LOAD_PATH: Input path for delay config JSON file (for replay mode - deterministic reproduction)
- CUTRACER_OUTPUT_DIR: Output directory for all CUTracer files (trace files and log files). Defaults to the current directory. The directory must exist and be writable.
- CUTRACER_CPU_CALLSTACK: Enable/disable CPU call stack capture at each kernel launch (default: 1 = enabled)
  - When enabled, the kernel_metadata trace event includes a cpu_callstack array with demangled C++ frame names
- CUTRACER_KERNEL_TIMEOUT_S: Kernel execution time limit in seconds (default: 0 = disabled)
  - Terminates the process with SIGTERM when a kernel runs longer than this value
  - Acts as a general safety valve, independent of deadlock detection (does not require -a deadlock_detection)
- CUTRACER_NO_DATA_TIMEOUT_S: No-data hang detection timeout in seconds (default: 15)
  - Terminates the process with SIGTERM when no trace data arrives for this duration
  - Acts as a general safety valve, independent of deadlock detection (does not require -a deadlock_detection)
  - Catches "silent" hangs where all warps are blocked on synchronization primitives with zero trace output
  - Works whether the kernel went silent after producing some data, or never produced any data at all
  - When -a deadlock_detection is also active, prints detailed warp status summary before termination
  - Set to 0 to disable
- CUTRACER_TRACE_SIZE_LIMIT_MB: Maximum trace file size in MB (default: 0 = disabled)
  - When any trace file exceeds this limit, tracing is stopped for that kernel; kernel execution continues normally
  - Useful for preventing runaway trace files from filling disk (e.g., during deadlocked kernels)
Notes:
- The tool sets CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 to simplify channel memory handling.
- Multiple analyses can be combined (e.g., CUTRACER_ANALYSIS=proton_instr_histogram,deadlock_detection). Each analysis auto-enables its required instrumentation mode.
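When launching an instrumented app from a script, the variables above can be assembled programmatically. The following is a minimal sketch, not part of CUTracer: `build_cutracer_env` is a hypothetical helper that only sets variables documented in this section and checks the two documented constraints on random_delay (CUTRACER_DELAY_NS must be set, and CUTRACER_DELAY_MIN_NS must not exceed it).

```python
import os


def build_cutracer_env(cutracer_so, analyses=(), delay_ns=None,
                       delay_min_ns=0, kernel_filters=(), base_env=None):
    """Return an environment dict for running an app under CUTracer.

    Hypothetical helper: it only assembles documented variables.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_INJECTION64_PATH"] = cutracer_so
    if analyses:
        env["CUTRACER_ANALYSIS"] = ",".join(analyses)
    if kernel_filters:
        env["KERNEL_FILTERS"] = ",".join(kernel_filters)
    if "random_delay" in analyses:
        # Documented constraints for the random_delay analysis.
        if delay_ns is None:
            raise ValueError("random_delay requires CUTRACER_DELAY_NS")
        if delay_min_ns > delay_ns:
            raise ValueError("CUTRACER_DELAY_MIN_NS must be <= CUTRACER_DELAY_NS")
        env["CUTRACER_DELAY_NS"] = str(delay_ns)
        env["CUTRACER_DELAY_MIN_NS"] = str(delay_min_ns)
    return env


env = build_cutracer_env(
    os.path.expanduser("~/CUTracer/lib/cutracer.so"),
    analyses=["proton_instr_histogram", "deadlock_detection"],
    kernel_filters=["add_kernel"],
    base_env={},  # start from an empty environment for the demo
)
print(env["CUTRACER_ANALYSIS"])
```

The resulting dict can be passed to `subprocess.run(["./your_app"], env=env, check=True)`; in that case leave `base_env` unset so the app inherits the current environment.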
Analyses
Instruction Histogram (proton_instr_histogram)
- Counts SASS instruction mnemonics per warp within regions delimited by clock reads (start/stop model; nested regions not supported)
- Output: one CSV per kernel launch with columns warp_id,region_id,instruction,count
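The histogram CSV described above is easy to post-process. As a rough sketch, the snippet below sums counts per instruction across all warps and regions. It assumes the file starts with a header row naming the four columns; the sample rows and mnemonics are made up for illustration.

```python
import csv
import io
from collections import Counter


def total_by_instruction(csv_file):
    """Sum the count column per instruction across all warps and regions."""
    totals = Counter()
    # Assumes a header row: warp_id,region_id,instruction,count
    for row in csv.DictReader(csv_file):
        totals[row["instruction"]] += int(row["count"])
    return totals


# Hypothetical sample data in the documented column layout.
sample = io.StringIO(
    "warp_id,region_id,instruction,count\n"
    "0,0,LDG,4\n"
    "0,0,FADD,2\n"
    "1,0,LDG,4\n"
)
totals = total_by_instruction(sample)
print(totals.most_common())
```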
Example (Triton/Proton + IPC):
cd ~/CUTracer/tests/proton_tests
# 1) Collect histogram with CUTracer
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py
# 2) Run without CUTracer to generate a clean Chrome trace
python ./vector-add-instrumented.py
# 3) Merge and compute IPC
python ~/CUTracer/scripts/parse_instr_hist_trace.py \
--chrome-trace ./vector.chrome_trace \
--cutracer-trace ./kernel_*_add_kernel_hist.csv \
--cutracer-log ./cutracer_main_*.log \
--output vectoradd_ipc.csv
Deadlock / Hang Detection (deadlock_detection)
- Detects sustained hangs by identifying warps stuck in stable PC loops; logs the hang and, if it persists, terminates the process with SIGTERM followed by SIGKILL
- Requires reg_trace (auto-enabled)
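To make the idea of a "stable PC loop" concrete, here is an illustrative sketch of one way to recognize it from a warp's recent program-counter history. This is not CUTracer's actual detection logic, which works on reg_trace data and may apply further checks; `find_stable_loop` and its thresholds are hypothetical.

```python
def find_stable_loop(pcs, min_repeats=3, max_period=8):
    """Return the period of the loop the PC history ends in, or None.

    A loop of a given period is reported when the tail of `pcs` consists of
    the same block of `period` PCs repeated at least `min_repeats` times.
    """
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(pcs) < needed:
            break  # not enough history for this or any longer period
        tail = pcs[-needed:]
        pattern = tail[:period]
        if all(tail[i] == pattern[i % period] for i in range(needed)):
            return period
    return None


# A warp that runs a few instructions, then spins between two PCs.
spinning = [0, 16, 32, 48, 32, 48, 32, 48]
# A warp that is still making forward progress.
progressing = [0, 16, 32, 48, 64, 80]
print(find_stable_loop(spinning), find_stable_loop(progressing))
```

A repeating PC pattern alone does not prove a hang, since long-running legitimate loops look the same; a detector would normally also require the pattern to persist over time before acting.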
Example (intentional loop):
cd ~/CUTracer/tests/hang_test
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py
Data Race Detection (random_delay)
- Data races depend on thread scheduling and timing, so buggy code may appear correct by luck. This analysis exposes hidden races by injecting random delays before synchronization-related SASS instructions (e.g., BAR, MEMBAR, ATOM, RED), disrupting the normal timing and forcing latent races to manifest as observable failures.
- Each instrumentation point is randomly enabled/disabled (50% probability)
- Two delay modes:
  - random (default): Each thread gets a random delay in [0, CUTRACER_DELAY_NS] (the lower bound can be raised with CUTRACER_DELAY_MIN_NS) using a GPU-side xorshift32 PRNG seeded with threadIdx/blockIdx/clock. Creates per-thread timing skew that amplifies data races. Recommended.
  - fixed: All threads get the same delay. Preserves relative timing between threads and often masks races rather than exposing them. Not recommended for race detection.
- Requires CUTRACER_DELAY_NS to be set. The random_delay instrumentation mode is auto-enabled.
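The per-thread delay selection in random mode can be pictured with a host-side sketch. The code below implements the standard xorshift32 step (shifts 13, 17, 5) in Python and maps its output into [min_ns, max_ns]. The real generator runs on the GPU; how CUTracer mixes threadIdx/blockIdx/clock into a seed and how it maps the output to a delay are not documented here, so the `seed` argument and the modulo mapping are assumptions.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(state):
    """One step of Marsaglia's xorshift32; state must be a non-zero 32-bit value."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state & MASK32


def random_delay_ns(seed, min_ns, max_ns):
    """Pick a delay in [min_ns, max_ns] from a per-thread seed (assumed mapping)."""
    if min_ns > max_ns:
        raise ValueError("min_ns must be <= max_ns")
    state = (seed & MASK32) or 1  # xorshift32 must not start from 0
    return min_ns + xorshift32(state) % (max_ns - min_ns + 1)


# Hypothetical per-thread seeds; different seeds give different delays.
delays = [random_delay_ns(seed, 0, 100000) for seed in range(1, 9)]
print(delays)
```

Because each thread draws its own value, neighbouring threads end up with different delays, which is the timing skew the random mode relies on; a fixed delay would shift all threads equally and leave their relative order unchanged.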
Example:
CUTRACER_DELAY_NS=100000 \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python3 your_kernel.py
Delay Dump and Replay
CUTracer supports dumping delay configurations to JSON for deterministic reproduction of data races:
- Dump mode: Set CUTRACER_DELAY_DUMP_PATH to save the random instrumentation pattern to a JSON file
- Replay mode: Set CUTRACER_DELAY_LOAD_PATH to load a saved config and rep
