Bloodhound

A deterministic simulation testing platform for hunting bugs in distributed systems.

Bloodhound uses a modified QEMU hypervisor to provide perfect reproducibility for containerized applications, enabling systematic exploration of failure scenarios that are difficult or impossible to reproduce with traditional testing.

Version 0.2.0 - OCI image integration, registry auth, container auto-translation, property checking. See Release Notes.

Research Note: This project is co-authored with Claude (Anthropic) as an experiment in AI-assisted systems programming. See CLAUDE.md for project philosophy and AI collaboration guidelines.

Platform Support: Linux x86_64 fully tested with real VMs. macOS supports harness/simulation mode only. See Platform Support for details.

Features

Deterministic Execution: Same seed produces identical execution every time
Language Agnostic: Test any containerized application (Go, Rust, Java, Python, etc.)
Fault Injection: Network partitions, disk failures, process crashes, clock skew
Time-Travel Debugging: Full replay capability with GDB integration
Coverage-Guided Exploration: Prioritizes execution paths that reach new code coverage
Docker Compose Integration: Test multi-service stacks directly
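
To make "same seed, identical execution" concrete, here is a minimal Python sketch. It is not Bloodhound's implementation: it is a toy scheduler that draws every nondeterministic decision from one seeded generator, so the resulting trace depends only on the seed. The function name `run_simulation` and the latency range are hypothetical.

```python
import random

def run_simulation(seed: int, steps: int = 8) -> list[str]:
    """Toy simulation: every nondeterministic choice comes from one seeded RNG."""
    rng = random.Random(seed)            # private generator: no global state, no wall clock
    nodes = ["web", "redis", "postgres"]
    trace = []
    for step in range(steps):
        node = rng.choice(nodes)         # which node is scheduled next
        latency_ms = rng.randint(1, 50)  # simulated message latency (hypothetical range)
        trace.append(f"{step}:{node}:{latency_ms}ms")
    return trace

first = run_simulation(seed=42)
second = run_simulation(seed=42)
print(first == second)  # the two traces are compared element by element
```

The design point is that nothing in the loop reads real time or shared global randomness; any such read would break replayability.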

Architecture

┌─────────────────────────────────────────────────────────────┐
│                       BLOODHOUND                            │
│              (Modified QEMU Hypervisor)                     │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ Virtual Time │  │ Fault Inject │  │ State Snap   │      │
│  │ (TSC, HPET)  │  │ (Net, Disk)  │  │ (CoW, Tree)  │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
├─────────────────────────────────────────────────────────────┤
│                    GUEST VMs (Containers)                   │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐           │
│  │  web   │  │ redis  │  │postgres│  │  ...   │           │
│  └────────┘  └────────┘  └────────┘  └────────┘           │
└─────────────────────────────────────────────────────────────┘

Quick Start

Installation

Option 1: Download Release Binary (Recommended)

# Linux x86_64
curl -LO https://github.com/nerdsane/bloodhound/releases/latest/download/bloodhound-linux-amd64.tar.gz
tar -xzf bloodhound-linux-amd64.tar.gz
sudo mv bloodhound /usr/local/bin/

# macOS Intel
curl -LO https://github.com/nerdsane/bloodhound/releases/latest/download/bloodhound-darwin-amd64.tar.gz
tar -xzf bloodhound-darwin-amd64.tar.gz
sudo mv bloodhound /usr/local/bin/

# macOS Apple Silicon
curl -LO https://github.com/nerdsane/bloodhound/releases/latest/download/bloodhound-darwin-arm64.tar.gz
tar -xzf bloodhound-darwin-arm64.tar.gz
sudo mv bloodhound /usr/local/bin/

# Verify installation
bloodhound --version

Option 2: Install from Source

# Using cargo
cargo install --git https://github.com/nerdsane/bloodhound

# Or build manually
git clone https://github.com/nerdsane/bloodhound.git
cd bloodhound
cargo build --release

Option 3: Full Setup with Real VMs (Linux Only)

# Clone the repository
git clone https://github.com/nerdsane/bloodhound.git
cd bloodhound

# Build the CLI
cargo build --release

# Build the deterministic kernel
./scripts/build-guest.sh

# Clone and build the patched QEMU hypervisor
git clone https://github.com/nerdsane/qemu.git
cd qemu
mkdir build && cd build
../configure --target-list=x86_64-softmmu --enable-bloodhound
make -j$(nproc)
cd ../..

# Verify installation
./target/release/bloodhound --version

Basic Usage

# Run with specific seed (deterministic)
bloodhound run --compose docker-compose.yml --seed 42

# Explore for bugs (coverage-guided)
bloodhound explore --compose docker-compose.yml --seeds 10000 --timeout 1h

# Debug a failure with time-travel debugging
bloodhound debug --seed 42 --gdb-port 1234

# Run as part of CI/CD
bloodhound test --compose docker-compose.yml --coverage-threshold 80%

Configuration

Create a bloodhound.yaml file in your project directory:

# Docker Compose file to test
compose: docker-compose.yml

# Simulation settings
simulation:
  max_time: 5m        # Maximum simulation time
  time_step: 10ms     # Time step granularity
  seeds: 1000         # Number of seeds to explore

# Workload configuration
workload:
  driver: http
  config:
    target: http://lb:80
    qps: 100
    duration: 60s
    operations:
      - type: put
        weight: 40
        key_pattern: "key-{random:1-10000}"
      - type: get
        weight: 50
      - type: delete
        weight: 10

# Properties to verify
properties:
  - name: no-data-loss
    kind: safety
    description: "Acknowledged writes must not be lost"
    check:
      type: linearizability
      operations: [put, get]

  - name: partition-recovery
    kind: liveness
    description: "Cluster must recover within 30s after partition heals"
    timeout: 30s
    check:
      type: http
      endpoint: http://lb:80/health
      expect:
        status: 200

# Fault injection
faults:
  network:
    drop_rate: 0.01
    delay_ms: 50
    delay_rate: 0.05
    partition_probability: 0.001
    partition_duration_ms: [1000, 10000]

  disk:
    write_fail_rate: 0.001
    partial_write_rate: 0.0005
    fsync_fail_rate: 0.001

  process:
    crash_probability: 0.0001
    pause_probability: 0.0005
    oom_probability: 0.00001

# Exploration settings
exploration:
  strategy: coverage-guided  # bfs, dfs, random, coverage-guided
  max_depth: 1000
  max_states: 100000
  parallel_workers: 8
  prioritize_coverage: true
  prioritize_near_violation: true

# Output settings
output:
  dir: ./output
  save_violation_traces: true
  save_interesting_seeds: true
  coverage_format: html
  summary_report: true

# Debugging
debug:
  gdb_enabled: true
  gdb_port: 1234
  trace_level: normal
  trace_events:
    - network_send
    - network_recv
    - disk_write
    - process_crash

Property Types

Safety Properties

Safety properties assert that "bad things never happen." They are checked continuously throughout execution.

- name: no-data-loss
  kind: safety
  check:
    type: linearizability
    operations: [put, get]

Liveness Properties

Liveness properties assert that "good things eventually happen." They include a timeout.

- name: leader-election
  kind: liveness
  timeout: 5s
  check:
    type: custom
    script: ./checks/has-leader.sh

Invariants

Invariants are properties that must hold at every state.

- name: single-leader-per-term
  kind: invariant
  check:
    type: custom
    script: ./checks/single-leader.sh
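
The difference between the property kinds can be sketched over a recorded trace. The Python below is illustrative only and makes several assumptions: the `State` shape, the helper names, and the simplified "at most one leader" predicate (which ignores terms) are all hypothetical and do not reflect Bloodhound's checker interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    time_ms: int                # virtual time of the observation
    leaders: tuple[str, ...]    # nodes that currently believe they are leader

def check_invariant(trace, predicate):
    """Safety/invariant style: return the first violating state, or None."""
    for state in trace:
        if not predicate(state):
            return state
    return None

def check_liveness(trace, predicate, start_ms, timeout_ms):
    """Liveness style: predicate must hold at some state within the timeout."""
    deadline = start_ms + timeout_ms
    return any(predicate(s) for s in trace if start_ms <= s.time_ms <= deadline)

trace = [
    State(0, ()),
    State(1200, ("n1",)),
    State(4000, ("n1", "n2")),  # two leaders at once
]

at_most_one_leader = lambda s: len(s.leaders) <= 1
has_leader = lambda s: len(s.leaders) >= 1

violation = check_invariant(trace, at_most_one_leader)
elected = check_liveness(trace, has_leader, start_ms=0, timeout_ms=5000)
print(violation, elected)
```

An invariant check can stop at the first counterexample, while a liveness check needs a deadline to decide that the good thing did not happen in time.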

Fault Injection

Bloodhound can inject various types of faults deterministically:

Network Faults

Packet drop: Randomly drop network packets
Packet delay: Add latency to network communication
Packet corruption: Corrupt packet data
Network partitions: Isolate nodes from each other

Disk Faults

Write failures: Fail disk writes
Partial writes: Simulate torn writes (power failure during write)
fsync failures: Fail fsync calls
Read corruption: Return corrupted data on reads

Process Faults

Crashes: Kill processes (SIGKILL)
Pauses: Pause processes (SIGSTOP)
OOM kills: Simulate out-of-memory conditions
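
What "deterministic" fault injection can mean is easiest to see in code. The sketch below assumes that `drop_rate` and `delay_rate` are per-packet probabilities and that `delay_ms` is a fixed added latency; the source does not state these semantics, and the class names are hypothetical. Every decision is drawn from a seeded generator, so a given seed always yields the same fault schedule.

```python
import random
from dataclasses import dataclass

@dataclass
class NetworkFaults:
    # Values mirror the example configuration; their interpretation is assumed.
    drop_rate: float = 0.01
    delay_rate: float = 0.05
    delay_ms: int = 50

class FaultInjector:
    def __init__(self, seed: int, faults: NetworkFaults):
        self.rng = random.Random(seed)   # sole source of randomness
        self.faults = faults

    def on_packet(self) -> tuple[str, int]:
        """Return (action, extra_delay_ms) for the next packet."""
        roll = self.rng.random()
        if roll < self.faults.drop_rate:
            return ("drop", 0)
        if roll < self.faults.drop_rate + self.faults.delay_rate:
            return ("delay", self.faults.delay_ms)
        return ("deliver", 0)

def decisions(seed: int, n: int = 1000) -> list[tuple[str, int]]:
    injector = FaultInjector(seed, NetworkFaults())
    return [injector.on_packet() for _ in range(n)]

print(decisions(42) == decisions(42))  # same seed, same fault schedule
```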

Time-Travel Debugging

When a bug is found, Bloodhound can replay the exact execution:

# Start debugging session
bloodhound debug --seed 42 --gdb-port 1234

# In another terminal, connect with GDB
gdb -ex "target remote :1234"

The debugger supports:

Step forward/backward: Navigate through execution
Breakpoints: Set breakpoints that work across time
Watchpoints: Watch variables change over time
Reverse execution: Step backward to find bug origins
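
One common way to build backward stepping on top of deterministic replay is to restore the nearest earlier snapshot and re-execute forward to the target step. The Python sketch below illustrates that general technique only; whether Bloodhound implements reverse execution this way is an assumption, and `ReplayDebugger`, `step_fn`, and `snapshot_every` are hypothetical names.

```python
import copy

class ReplayDebugger:
    def __init__(self, initial_state: dict, step_fn, snapshot_every: int = 4):
        self.step_fn = step_fn                    # deterministic: (state, step_no) -> None
        self.snapshot_every = snapshot_every
        self.snapshots = {0: copy.deepcopy(initial_state)}
        self.state = copy.deepcopy(initial_state)
        self.position = 0

    def step_forward(self):
        self.step_fn(self.state, self.position)
        self.position += 1
        if self.position % self.snapshot_every == 0:
            # Keep a periodic snapshot so backward steps replay only a few steps.
            self.snapshots.setdefault(self.position, copy.deepcopy(self.state))

    def step_backward(self):
        if self.position == 0:
            return
        target = self.position - 1
        base = max(p for p in self.snapshots if p <= target)
        self.state = copy.deepcopy(self.snapshots[base])
        self.position = base
        while self.position < target:             # deterministic re-execution
            self.step_forward()

def step(state, n):
    state["counter"] += n + 1                     # toy deterministic transition

dbg = ReplayDebugger({"counter": 0}, step)
for _ in range(6):
    dbg.step_forward()
dbg.step_backward()
print(dbg.position, dbg.state["counter"])
```

This only works because replay is deterministic: re-running steps from a snapshot must reproduce exactly the state that existed the first time.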

CI/CD Integration

GitHub Actions

name: Bloodhound Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: datadog/bloodhound-action@v1
        with:
          compose: docker-compose.yml
          seeds: 1000
          timeout: 30m
          coverage-threshold: 80%

Examples

See the examples/ directory for complete examples:

redis-rust-demo/: Bug reproduction demo - Find and reproduce a CRDT consistency bug using async-VM mode. See Demo Documentation.
redis-rust/: Simple redis-rust cluster example
distributed-kv/: A distributed key-value store with Raft consensus
message-queue/: A distributed message queue
cache-cluster/: A distributed cache with consistent hashing

How It Works

Deterministic Hypervisor: Bloodhound uses a modified QEMU running in TCG (Tiny Code Generator) mode to ensure deterministic execution. All sources of non-determinism (time, random numbers, I/O ordering) are controlled.
Virtual Time: Time is virtualized, so simulations can run faster than real time while the guest still observes a consistent clock. A 5-minute simulation might complete in seconds.
Snapshot Tree: Bloodhound maintains a tree of VM snapshots using copy-on-write, enabling efficient exploration of different execution paths from the same starting point.
Coverage-Guided Exploration: Like fuzzing, Bloodhound prioritizes execution paths that discover new code coverage, effi
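
The virtual time idea described above can be sketched as a discrete-event clock: instead of sleeping, the clock jumps straight to the next scheduled event, which is why simulated minutes can pass in very little wall-clock time. This is a generic illustration, not Bloodhound's hypervisor-level mechanism, and `VirtualClock` with its methods is a hypothetical name.

```python
import heapq

class VirtualClock:
    def __init__(self):
        self.now_ms = 0
        self._queue = []   # entries are (due_ms, sequence, callback)
        self._seq = 0      # tie-breaker keeps same-time events in a fixed order

    def call_later(self, delay_ms, callback):
        heapq.heappush(self._queue, (self.now_ms + delay_ms, self._seq, callback))
        self._seq += 1

    def run_until(self, limit_ms):
        while self._queue and self._queue[0][0] <= limit_ms:
            due_ms, _, callback = heapq.heappop(self._queue)
            self.now_ms = due_ms   # jump to the event; no real waiting
            callback()
        self.now_ms = limit_ms

clock = VirtualClock()
beats = []

def heartbeat():
    beats.append(clock.now_ms)
    clock.call_later(1000, heartbeat)   # reschedule one virtual second later

clock.call_later(1000, heartbeat)
clock.run_until(5 * 60 * 1000)          # five virtual minutes
print(len(beats), clock.now_ms)         # prints "300 300000"
```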
