NVSentinel

NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated computing environments.

GPU Fault Detection and Remediation for Kubernetes

NVSentinel automatically detects, classifies, and remediates hardware and software faults in GPU nodes. It monitors GPU health, system logs, and cloud provider maintenance events, then takes action: cordoning faulty nodes, draining workloads, and triggering break-fix workflows.

Note (Beta / Stable): NVSentinel is ready for production testing and use. APIs, configurations, and features may change between releases. If you encounter issues, please open an issue or start a discussion.

🚀 Quick Start

Prerequisites

Kubernetes 1.25+
Helm 3.0+
NVIDIA GPU Operator (includes DCGM for GPU monitoring)
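
A quick preflight check can confirm the version floors above before installing. The sketch below is illustrative and not part of NVSentinel: `version_ge` is a hypothetical helper, and it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils).

```shell
# Hypothetical helper (not shipped with NVSentinel):
# succeeds when version $1 is greater than or equal to version $2.
# ASSUMPTION: `sort -V` (version sort) is available.
version_ge() {
  local have="${1#v}" want="${2#v}"   # strip an optional leading "v"
  # The lower version sorts first; if that is $want, then have >= want.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

# Example with versions read from `helm version` or `kubectl version`:
version_ge "v3.14.2" "3.0" && echo "Helm OK"
version_ge "1.24.9" "1.25" || echo "Kubernetes too old"
```

The version strings in the example are placeholders; substitute the values reported by your own cluster and client tools.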

Installation

NVSENTINEL_VERSION=v1.0.0
# Install from GitHub Container Registry
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version "$NVSENTINEL_VERSION" \
  --namespace nvsentinel \
  --create-namespace

# View chart information
helm show chart oci://ghcr.io/nvidia/nvsentinel --version "$NVSENTINEL_VERSION"

✨ Key Features

🔍 Comprehensive Monitoring: Real-time detection of GPU, NVSwitch, and system-level failures
🔧 Automated Remediation: Intelligent fault handling with cordon, drain, and break-fix workflows
📦 Modular Architecture: Pluggable health monitors with standardized gRPC interfaces
🔄 High Availability: Kubernetes-native design with replica support and leader election
⚡ Real-time Processing: Event-driven architecture with immediate fault response
📊 Persistent Storage: MongoDB-based event store with change streams for real-time updates
🛡️ Graceful Handling: Coordinated workload eviction with configurable timeouts
🏷️ Metadata Enrichment: Automatic augmentation of health events with cloud provider and node metadata

🧪 Complete Setup Guide

For a full installation with all dependencies, follow these steps:

1. Install cert-manager (for TLS)

helm repo add jetstack https://charts.jetstack.io --force-update
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.19.1 --set installCRDs=true \
  --wait

2. Install Prometheus (for metrics)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts --force-update
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.enabled=true \
  --set alertmanager.enabled=false \
  --set grafana.enabled=false \
  --set kubeStateMetrics.enabled=false \
  --set nodeExporter.enabled=false \
  --wait

3. Install NVSentinel

NVSENTINEL_VERSION=v1.0.0

helm upgrade --install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --namespace nvsentinel --create-namespace \
  --version "$NVSENTINEL_VERSION" \
  --timeout 15m \
  --wait

4. Verify Installation

kubectl get pods -n nvsentinel
kubectl get nodes  # Verify GPU nodes are visible

# Run comprehensive validation
./scripts/validate-nvsentinel.sh --version "$NVSENTINEL_VERSION" --verbose
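
For scripted installs, it can help to block until the pods report Ready instead of inspecting `kubectl get pods` by hand. A minimal sketch, assuming the `nvsentinel` namespace used in the install commands; `wait_for_nvsentinel` is a hypothetical helper name and the 300-second default is an arbitrary choice, not a project recommendation:

```shell
# Hypothetical helper: wait until every pod in the namespace is Ready.
# $1 = namespace (default: nvsentinel), $2 = timeout (default: 300s, an assumption).
wait_for_nvsentinel() {
  local ns="${1:-nvsentinel}"
  local timeout="${2:-300s}"
  kubectl wait --for=condition=Ready pod --all \
    --namespace "$ns" --timeout="$timeout"
}

# Usage (requires cluster access):
#   wait_for_nvsentinel                  # defaults
#   wait_for_nvsentinel nvsentinel 10m   # explicit namespace and timeout
```

Pods that belong to completed Jobs never report Ready, so if the namespace contains any, this call may time out even though the install is healthy.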

Testing: The example above uses default settings. For production, customize values for your environment.

Production: By default, only health monitoring is enabled. Enable fault quarantine and remediation modules via Helm values. See Configuration below.

🎮 Try the Demo

Want to see NVSentinel in action without GPU hardware? Try our Local Fault Injection Demo:

🚀 5-minute setup - runs entirely in a local KIND cluster
🔍 Real pipeline - see fault detection → quarantine → node cordon
🎯 No GPU required - uses simulated DCGM for testing

cd demos/local-fault-injection-demo
make demo  # Automated: creates cluster, installs NVSentinel, injects fault, verifies cordon

Perfect for learning, presentations, or CI/CD testing!

🏗️ Architecture

NVSentinel follows a microservices architecture with modular health monitors and core processing modules:

graph LR
    subgraph "Health Monitors"
        GPU["GPU Health Monitor<br/>(DCGM Integration)"]
        SYS["Syslog Health Monitor<br/>(Journalctl)"]
        CSP["CSP Health Monitor<br/>(CSP APIs)"]
        K8SOM["Kubernetes Object Monitor<br/>(CEL Policies)"]
    end
    
    subgraph "Core Processing"
        PC["Platform Connectors<br/>(gRPC Server)"]
        STORE[("MongoDB Store<br/>(Event Database)")]
        FQ["Fault Quarantine<br/>(Node Cordon)"]
        ND["Node Drainer<br/>(Workload Eviction)"]
        FR["Fault Remediation<br/>(Break-Fix Integration)"]
        HEA["Health Events Analyzer<br/>(Pattern Analysis)"]
        LBL["Labeler<br/>(Node Labels)"]
    end
    
    subgraph "Kubernetes Cluster"
        K8S["Kubernetes API<br/>(Nodes, Pods, Events)"]
    end
    
    GPU -->|gRPC| PC
    SYS -->|gRPC| PC
    CSP -->|gRPC| PC
    K8SOM -->|gRPC| PC

    PC -->|persist| STORE
    PC <-->|update status| K8S
    
    FQ -.->|watch changes| STORE
    FQ -->|cordon| K8S
    
    ND -.->|watch changes| STORE
    ND -->|drain| K8S
    
    FR -.->|watch changes| STORE
    FR -->|create CRDs| K8S
    
    HEA -.->|watch changes| STORE
    
    LBL -->|update labels| K8S

    K8SOM -.->|watch changes| K8S

Data Flow:

1. Health Monitors detect hardware/software faults and send events via gRPC to Platform Connectors.
2. Platform Connectors validate events, persist them to MongoDB, and update Kubernetes node conditions.
3. Core Modules independently watch MongoDB change streams for relevant events.
4. Modules interact with the Kubernetes API to cordon, drain, and label nodes, and to create remediation CRDs.
5. Labeler monitors pods to automatically label nodes with DCGM and driver versions.

Note: All modules operate independently without direct communication. Coordination happens through MongoDB change streams and Kubernetes API.
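
Because coordination runs through the Kubernetes API, the outcome of the pipeline is visible with plain `kubectl`: a cordoned node has `spec.unschedulable` set to true. A minimal sketch for spotting such nodes follows. `list_cordoned_nodes` and `count_cordoned_nodes` are hypothetical helper names, and the query returns every cordoned node, not only those cordoned by NVSentinel.

```shell
# Hypothetical helper: print cordoned nodes as "node/<name>", one per line.
list_cordoned_nodes() {
  kubectl get nodes --field-selector spec.unschedulable=true -o name
}

# Hypothetical helper: count cordoned nodes (non-empty output lines).
count_cordoned_nodes() {
  list_cordoned_nodes | grep -c .
}

# Usage (requires cluster access):
#   list_cordoned_nodes
#   count_cordoned_nodes
```

Running this before and after a fault is injected gives a simple way to confirm that the quarantine step took effect.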

⚙️ Configuration

NVSentinel is highly configurable with options for each module. For complete configuration documentation, see the Helm Chart README.

Quick Configuration Overview

global:
  dryRun: false  # Test mode - log actions without executing
  
  # Health Monitors (enabled by default)
  gpuHealthMonitor:
    enabled: true
  syslogHealthMonitor:
    enabled: true

  # Core Modules (disabled by default - enable for production)
  faultQuarantine:
    enabled: false
  nodeDrainer:
    enabled: false
  faultRemediation:
    enabled: false
  janitor:
    enabled: false
  mongodbStore:
    enabled: false
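
To turn on the remediation pipeline, the disabled keys above can be overridden at install time. The sketch below only prints the resulting `helm` command so it can be reviewed before running. It assumes the key paths mirror the overview above (`global.<module>.enabled`), which should be checked against the chart's values.yaml; `enable_flags` and `print_upgrade_cmd` are hypothetical helpers.

```shell
# Hypothetical helper: emit one --set flag per module name.
# ASSUMPTION: key paths follow the overview (global.<module>.enabled).
enable_flags() {
  local flags="" module
  for module in "$@"; do
    flags="${flags:+$flags }--set global.${module}.enabled=true"
  done
  printf '%s\n' "$flags"
}

# Hypothetical helper: print (not run) the upgrade command.
# $1 = chart version, remaining arguments = modules to enable.
print_upgrade_cmd() {
  local version="$1"
  shift
  printf 'helm upgrade --install nvsentinel oci://ghcr.io/nvidia/nvsentinel --namespace nvsentinel --version %s %s\n' \
    "$version" "$(enable_flags "$@")"
}

print_upgrade_cmd v1.0.0 mongodbStore faultQuarantine nodeDrainer
```

Since the core modules watch MongoDB change streams, the example enables `mongodbStore` alongside them. Appending `--set global.dryRun=true` to the printed command should, per the overview, log actions without executing them, which is a cautious first step on a live cluster.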

Configuration Resources:

Helm Chart Configuration Guide: Complete configuration reference
values-full.yaml: Detailed reference with all options
values.yaml: Default values

📦 Module Details

For detailed module configuration, see the Helm Chart Configuration Guide.

🔍 Health Monitors

GPU Health Monitor: Monitors GPU hardware health via DCGM - detects thermal issues, ECC errors, and XID events
Syslog Health Monitor: Analyzes system logs for hardware and software fault patterns via journalctl
CSP Health Monitor: Integrates with cloud provider APIs (GCP/AWS) for maintenance events
Kubernetes Object Monitor: Policy-based monitoring for any Kubernetes resource using CEL expressions

🏗️ Core Modules

Platform Connectors: Receives health events from monitors via gRPC, persists to MongoDB, and updates Kubernetes node status
Fault Quarantine: Watches MongoDB for health events and cordons nodes based on configurable CEL rules
Node Drainer: Gracefully evicts workloads from cordoned nodes with per-namespace eviction strategies
Fault Remediation: Triggers external break-fix systems by creating maintenance CRDs after drain completion
Janitor: Executes node reboots and terminations via cloud provider APIs
Health Events Analyzer: Analyzes event patterns and generates recommended actions
Event Exporter: Streams health events to external systems in CloudEvents format
MongoDB Store: Persistent storage for health events with real-time change streams
Labeler: Automatically labels nodes with DCGM and driver versions for self-configuration
Metadata Collector: Gathers GPU and NVSwitch topology information
Log Collection: Collects diagnostic logs and GPU reports for troubleshooting

📋 Requirements

Kubernetes: 1.25 or later
Helm: 3.0 or later
NVIDIA GPU Operator: For GPU monitoring capabilities (includes DCGM)
Storage: Persistent storage for MongoDB (recommended 10GB+)
Network: Cluster networking for inter-service communication

🤝 Contributing

We welcome contributions! Here's how to get started:

Ways to Contribute:

🐛 Report bugs and request features via issues
🧭 See what we're working on in the roadmap
📝 Improve documentation
