Ingero - GPU Causal Observability


Featured in: awesome-ebpf · awesome-observability · awesome-sre-tools · awesome-cloud-native · awesome-profiling · Awesome-GPU · awesome-devops-mcp-servers · MCP Registry · Glama · mcpservers.org

Version: 0.8.2.13

The only GPU observability tool your AI assistant can talk to.

"What caused the GPU stall?" → "forward() at train.py:142 - cudaMalloc spiking 48ms during CPU contention. 9,829 calls, 847 scheduler preemptions."

Ingero is a production-grade eBPF agent that traces the full chain - from Linux kernel events through CUDA API calls to your Python source lines - with <2% overhead, zero code changes, and one binary.

<img src="docs/assets/readme-demo-incident.gif" width="800" alt="ingero demo incident — CPU contention causes GPU latency spike, full causal chain diagnosis with root cause and fix recommendation">

Quick Start

```shell
# Install (Linux amd64 — see below for arm64/Docker)
VERSION=0.8.2
curl -fsSL "https://github.com/ingero-io/ingero/releases/download/v${VERSION}/ingero_${VERSION}_linux_amd64.tar.gz" | tar xz
sudo mv ingero /usr/local/bin/

# Trace your GPU workload
sudo ingero trace

# Diagnose what happened
ingero explain --since 5m
```

- The "Why": Correlate a cudaStreamSync spike with sched_switch events - the host kernel preempted your thread.
- The "Where": Map CUDA calls back to Python source lines in your PyTorch forward() pass.
- The "Hidden Kernels": Trace the CUDA Driver API to see kernel launches by cuBLAS/cuDNN that bypass standard profilers.

No ClickHouse, no PostgreSQL, no MinIO - just one statically linked Go binary and embedded SQLite.

See a real AI investigation session - an AI assistant diagnosing GPU training issues on A100 and GH200 using only Ingero's MCP tools. No shell access, no manual SQL - just questions and answers.

What It Does

Ingero uses eBPF to trace GPU workloads at three layers, reads system metrics from /proc, and assembles causal chains that explain root causes:

  1. CUDA Runtime uprobes - traces cudaMalloc, cudaFree, cudaLaunchKernel, cudaMemcpy, cudaMemcpyAsync, cudaStreamSync / cudaDeviceSynchronize via uprobes on libcudart.so
  2. CUDA Driver uprobes - traces cuLaunchKernel, cuMemcpy, cuMemcpyAsync, cuCtxSynchronize, cuMemAlloc via uprobes on libcuda.so. Captures kernel launches from cuBLAS/cuDNN that bypass the runtime API.
  3. Host tracepoints - traces sched_switch, sched_wakeup, mm_page_alloc, oom_kill, sched_process_exec/exit/fork for CPU scheduling, memory pressure, and process lifecycle
  4. System context - reads CPU utilization, memory usage, load average, and swap from /proc (no eBPF, no root needed)
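The system-context layer is plain file parsing rather than eBPF. As a rough illustration (not Ingero's actual code), pulling the 1-minute load average out of /proc/loadavg looks like this:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLoadavg extracts the 1-minute load average from the contents
// of /proc/loadavg, e.g. "3.20 2.90 2.50 2/1437 9921".
func parseLoadavg(contents string) (float64, error) {
	fields := strings.Fields(contents)
	if len(fields) < 1 {
		return 0, fmt.Errorf("unexpected /proc/loadavg format: %q", contents)
	}
	return strconv.ParseFloat(fields[0], 64)
}

func main() {
	// A real agent would read os.ReadFile("/proc/loadavg"); a canned
	// sample keeps this sketch runnable on any OS, no root required.
	load, err := parseLoadavg("3.20 2.90 2.50 2/1437 9921")
	if err != nil {
		panic(err)
	}
	fmt.Printf("load1: %.2f\n", load) // load1: 3.20
}
```

Because this layer only reads /proc, it works even where attaching eBPF programs is not permitted.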

The causal engine correlates events across layers by timestamp and PID to produce automated root cause analysis with severity ranking and fix recommendations.
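That timestamp-and-PID join can be sketched in a few lines of Go. The `Event` type and `correlate` function below are illustrative stand-ins for the idea, not Ingero's actual schema or engine:

```go
package main

import "fmt"

// Event is a minimal stand-in for a traced event; the real agent's
// schema is richer (this type is hypothetical).
type Event struct {
	PID   int
	Name  string
	TsNs  int64 // timestamp, nanoseconds
	DurNs int64 // duration, nanoseconds
}

// correlate returns host events that fall inside a CUDA event's time
// window for the same PID — the basic join a causal engine performs.
func correlate(cuda Event, host []Event, slackNs int64) []Event {
	var hits []Event
	start, end := cuda.TsNs-slackNs, cuda.TsNs+cuda.DurNs+slackNs
	for _, h := range host {
		if h.PID == cuda.PID && h.TsNs >= start && h.TsNs <= end {
			hits = append(hits, h)
		}
	}
	return hits
}

func main() {
	sync := Event{PID: 4821, Name: "cudaStreamSync", TsNs: 1_000_000, DurNs: 142_000_000}
	host := []Event{
		{PID: 4821, Name: "sched_switch", TsNs: 5_000_000},   // same PID, in window
		{PID: 9999, Name: "sched_switch", TsNs: 6_000_000},   // other PID: ignored
		{PID: 4821, Name: "sched_switch", TsNs: 500_000_000}, // outside window
	}
	for _, h := range correlate(sync, host, 0) {
		fmt.Printf("%s at t=%dns overlaps %s\n", h.Name, h.TsNs, sync.Name)
	}
}
```

A sched_switch landing inside a long synchronize window is exactly the "GPU thread preempted during sync wait" signal shown in the trace output below.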

```
$ sudo ingero trace

  Ingero Trace  -  Live CUDA Event Stream
  Target: PID 4821 (python3)
  Library: /usr/lib/x86_64-linux-gnu/libcudart.so.12
  CUDA probes: 14 attached
  Driver probes: 10 attached
  Host probes: 7 attached

  System: CPU [████████░░░░░░░░░░░░] 47% | Mem [██████████████░░░░░░] 72% (11.2 GB free) | Load 3.2 | Swap 0 MB

  CUDA Runtime API                                               Events: 11,028
  ┌──────────────────────┬────────┬──────────┬──────────┬──────────┬─────────┐
  │ Operation            │ Count  │ p50      │ p95      │ p99      │ Flags   │
  ├──────────────────────┼────────┼──────────┼──────────┼──────────┼─────────┤
  │ cudaLaunchKernel     │ 11,009 │ 5.2 µs   │ 12.1 µs  │ 18.4 µs  │         │
  │ cudaMalloc           │     12 │ 125 µs   │ 2.1 ms   │ 8.4 ms   │ ⚠ p99  │
  │ cudaDeviceSynchronize│      7 │ 684 µs   │ 1.2 ms   │ 3.8 ms   │         │
  └──────────────────────┴────────┴──────────┴──────────┴──────────┴─────────┘

  CUDA Driver API                                                Events: 17,525
  ┌──────────────────────┬────────┬──────────┬──────────┬──────────┬─────────┐
  │ Operation            │ Count  │ p50      │ p95      │ p99      │ Flags   │
  ├──────────────────────┼────────┼──────────┼──────────┼──────────┼─────────┤
  │ cuLaunchKernel       │ 17,509 │ 4.8 µs   │ 11.3 µs  │ 16.2 µs  │         │
  │ cuMemAlloc           │     16 │ 98 µs    │ 1.8 ms   │ 7.1 ms   │         │
  └──────────────────────┴────────┴──────────┴──────────┴──────────┴─────────┘

  Host Context                                                   Events: 258
  ┌─────────────────┬────────┬──────────────────────────────────────────┐
  │ Event           │ Count  │ Detail                                   │
  ├─────────────────┼────────┼──────────────────────────────────────────┤
  │ mm_page_alloc   │    251 │ 1.0 MB allocated (order-0: 251)          │
  │ process_exit    │      7 │ 7 processes exited                       │
  └─────────────────┴────────┴──────────────────────────────────────────┘

  ⚠ cudaStreamSync p99 = 142ms  -  correlated with 23 sched_switch events
    (GPU thread preempted during sync wait, avg 2.1ms off-CPU)
```
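The p50/p95/p99 columns above are standard latency percentiles. A minimal nearest-rank sketch of that computation (Ingero's real implementation is presumably a rolling/streaming variant over live events) might look like:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (0–100) of latency samples
// by rounding to the nearest rank in the sorted slice.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	s := append([]float64(nil), samples...) // copy: don't mutate caller's data
	sort.Float64s(s)
	rank := int(p/100*float64(len(s))+0.5) - 1
	if rank < 0 {
		rank = 0
	}
	if rank >= len(s) {
		rank = len(s) - 1
	}
	return s[rank]
}

func main() {
	// Synthetic cudaMalloc latencies in µs: mostly steady, one spike.
	lat := []float64{120, 125, 130, 118, 122, 127, 124, 121, 126, 8400}
	fmt.Printf("p50=%.0fµs p99=%.0fµs\n", percentile(lat, 50), percentile(lat, 99))
	// p50=124µs p99=8400µs
}
```

Note how one outlier leaves p50 untouched while dominating p99, which is why the table flags p99 rather than averages.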

What You'll Discover

Things no other GPU tool can show you.

"cuBLAS was launching 17,509 kernels and you couldn't see any of them." Most profilers trace only the CUDA Runtime API - but cuBLAS calls cuLaunchKernel (driver API) directly, bypassing the runtime. Ingero traces both layers: 11,009 runtime + 17,509 driver = complete visibility into every kernel launch.

"Your training slowed because logrotate stole 4 CPU cores." System Context shows CPU at 94%, Load 12.1. The CUDA table shows cudaStreamSync p99 jumping from 16ms to 142ms. The Host Context shows 847 sched_switch events. ingero explain assembles the full causal chain: logrotate preempted the training process → CUDA sync stalled → training throughput dropped 30%. Fix: nice -n 19 logrotate, or pin training to dedicated cores.

"Your model spends 38% of wall-clock time on data movement, not compute." nvidia-smi says "GPU utilization 98%", but the GPU is busy doing cudaMemcpy, not compute. Ingero's time-fraction breakdown makes this obvious. The fix (pinned memory, async transfers, larger batches) saves 30-50% wall-clock time.

"Your host is swapping and your GPU doesn't know it." System Context shows Swap 2.1 GB. cudaMalloc p99 rises from 0.02ms to 8.4ms. No GPU tool shows this - nvidia-smi says GPU memory is fine, but host-side CUDA bookkeeping is hitting swap.

"Ask your AI: what line of my code caused the GPU stall?" Your AI assistant calls Ingero's MCP server and answers in one shot: "The issue is in forward() at train.py:142, calling cudaMalloc through PyTorch. 9,829 calls, avg 3.1ms but spiking to 48.3ms during CPU contention." Resolved Python source lines, native symbols, timing stats - no logs, no manual SQL, no hex addresses. The engineer asks questions in plain English and gets production root causes back.

See It In Action

<details><summary><code>sudo ingero check</code> — system readiness</summary>
<br>
<img src="docs/assets/readme-check.gif" width="800" alt="ingero check verifying kernel, BTF, NVIDIA driver, GPU model, CUDA libraries, and active processes">
</details>

<details><summary><code>sudo ingero trace</code> — live event stream</summary>
<br>
<img src="docs/assets/readme-trace.gif" width="800" alt="ingero trace showing live CUDA Runtime and Driver API statistics with rolling p50/p95/p99 latencies and host context">
</details>

<details><summary><code>ingero explain --since 5m</code> — automated diagnosis</summary>
<br>
<img src="docs/assets/readme-explain.gif" width="800" alt="ingero explain producing incident report with causal chains, root cause analysis, and fix recommendations">
</details>

<details><summary><code>ingero demo --no-gpu incident</code> — try without a GPU</summary>
<br>
<img src="docs/assets/readme-demo-nogpu.gif" width="800" alt="ingero demo running in synthetic mode without GPU, showing full causal chain diagnosis">
</details>

```shell
ingero demo                 # run all 6 scenarios (auto-detects GPU)
ingero demo incident        # full causal chain in 30 seconds
ingero demo --no-gpu        # synthetic mode (no root, no GPU needed)
sudo ingero demo --gpu      # real GPU + eBPF tracing
```

Scenarios

| Scenario | What It Reveals |
|----------|-----------------|
| incident | CPU spike + sched_switch storm → cudaStreamSync 8.5x latency spike → full causal chain with root cause and fix |
| cold-start | First CUDA calls take 50-200x longer than steady state (CUDA context init) |
| memcpy-bottleneck | cudaMemcpy dominates wall-clock time (38%), not compute - nvidia-smi lies |
| periodic-spike | cudaMalloc spikes 50x every ~200 batches (PyTorch caching allocator) |
| cpu-contention | Host CPU preemption causes CUD… |
