Bloodhound
Deterministic simulation testing platform for hunting bugs in distributed systems
Bloodhound uses a modified QEMU hypervisor to provide perfect reproducibility for containerized applications, enabling systematic exploration of failure scenarios that are impossible to reproduce with traditional testing.
Version 0.2.0 - OCI image integration, registry auth, container auto-translation, property checking. See Release Notes.
Research Note: This project is co-authored with Claude (Anthropic) as an experiment in AI-assisted systems programming. See CLAUDE.md for project philosophy and AI collaboration guidelines.
Platform Support: Linux x86_64 fully tested with real VMs. macOS supports harness/simulation mode only. See Platform Support for details.
Features
- Deterministic Execution: Same seed produces identical execution every time
- Language Agnostic: Test any containerized application (Go, Rust, Java, Python, etc.)
- Fault Injection: Network partitions, disk failures, process crashes, clock skew
- Time-Travel Debugging: Full replay capability with GDB integration
- Coverage-Guided Exploration: Prioritizes execution paths that reach new code coverage
- Docker Compose Integration: Test multi-service stacks directly
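As a rough illustration of the first feature (same seed, identical execution), the toy event loop below takes all of its randomness from a seed and advances a virtual clock instead of reading the wall clock. It is a minimal sketch, not Bloodhound code: the `simulate` function and its message names are hypothetical.

```python
import heapq
import random


def simulate(seed: int, steps: int = 5) -> list[tuple[int, str]]:
    """Run a toy event loop whose only entropy source is the seed."""
    rng = random.Random(seed)  # all randomness flows from the seed
    now_ms = 0                 # virtual clock; the wall clock is never read
    queue: list[tuple[int, int, str]] = []
    # Schedule a few message deliveries with seeded random latency.
    for seq in range(steps):
        heapq.heappush(queue, (rng.randint(1, 100), seq, f"msg-{seq}"))
    trace = []
    while queue:
        due_ms, _seq, name = heapq.heappop(queue)
        now_ms = due_ms        # jump straight to the next scheduled event
        trace.append((now_ms, name))
    return trace


if __name__ == "__main__":
    # Two runs with the same seed produce the same trace.
    print(simulate(42) == simulate(42))
```

Because the clock only moves when an event fires, the loop finishes immediately regardless of how much simulated time passes.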
Architecture
┌─────────────────────────────────────────────────────────────┐
│                         BLOODHOUND                          │
│                 (Modified QEMU Hypervisor)                  │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Virtual Time │  │ Fault Inject │  │ State Snap   │       │
│  │ (TSC, HPET)  │  │ (Net, Disk)  │  │ (CoW, Tree)  │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
├─────────────────────────────────────────────────────────────┤
│                   GUEST VMs (Containers)                    │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐             │
│  │  web   │  │ redis  │  │postgres│  │  ...   │             │
│  └────────┘  └────────┘  └────────┘  └────────┘             │
└─────────────────────────────────────────────────────────────┘
Quick Start
Installation
Option 1: Download Release Binary (Recommended)
# Linux x86_64
curl -LO https://github.com/nerdsane/bloodhound/releases/latest/download/bloodhound-linux-amd64.tar.gz
tar -xzf bloodhound-linux-amd64.tar.gz
sudo mv bloodhound /usr/local/bin/
# macOS Intel
curl -LO https://github.com/nerdsane/bloodhound/releases/latest/download/bloodhound-darwin-amd64.tar.gz
tar -xzf bloodhound-darwin-amd64.tar.gz
sudo mv bloodhound /usr/local/bin/
# macOS Apple Silicon
curl -LO https://github.com/nerdsane/bloodhound/releases/latest/download/bloodhound-darwin-arm64.tar.gz
tar -xzf bloodhound-darwin-arm64.tar.gz
sudo mv bloodhound /usr/local/bin/
# Verify installation
bloodhound --version
Option 2: Install from Source
# Using cargo
cargo install --git https://github.com/nerdsane/bloodhound
# Or build manually
git clone https://github.com/nerdsane/bloodhound.git
cd bloodhound
cargo build --release
Option 3: Full Setup with Real VMs (Linux Only)
# Clone the repository
git clone https://github.com/nerdsane/bloodhound.git
cd bloodhound
# Build the CLI
cargo build --release
# Build the deterministic kernel
./scripts/build-guest.sh
# Clone and build the patched QEMU hypervisor
git clone https://github.com/nerdsane/qemu.git
cd qemu
mkdir build && cd build
../configure --target-list=x86_64-softmmu --enable-bloodhound
make -j$(nproc)
cd ../..
# Verify installation
./target/release/bloodhound --version
Basic Usage
# Run with specific seed (deterministic)
bloodhound run --compose docker-compose.yml --seed 42
# Explore for bugs (coverage-guided)
bloodhound explore --compose docker-compose.yml --seeds 10000 --timeout 1h
# Debug a failure with time-travel debugging
bloodhound debug --seed 42 --gdb-port 1234
# Run as part of CI/CD
bloodhound test --compose docker-compose.yml --coverage-threshold 80%
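To re-run a handful of specific seeds, the `run` command above can be wrapped in a small script. The sketch below only builds the argument lists, using the `--compose` and `--seed` flags shown above; the `run_commands` helper is hypothetical, and the commands are printed rather than executed.

```python
import shlex


def run_commands(compose_file: str, seeds: range) -> list[list[str]]:
    """Build one `bloodhound run` argv per seed, using only flags shown above."""
    return [
        ["bloodhound", "run", "--compose", compose_file, "--seed", str(seed)]
        for seed in seeds
    ]


if __name__ == "__main__":
    for argv in run_commands("docker-compose.yml", range(40, 43)):
        # Printed for inspection; on a machine with bloodhound installed,
        # each argv could be passed to subprocess.run(argv, check=True).
        print(shlex.join(argv))
```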
Configuration
Create a bloodhound.yaml file in your project directory:
# Docker Compose file to test
compose: docker-compose.yml
# Simulation settings
simulation:
  max_time: 5m # Maximum simulation time
  time_step: 10ms # Time step granularity
  seeds: 1000 # Number of seeds to explore
# Workload configuration
workload:
  driver: http
  config:
    target: http://lb:80
    qps: 100
    duration: 60s
    operations:
      - type: put
        weight: 40
        key_pattern: "key-{random:1-10000}"
      - type: get
        weight: 50
      - type: delete
        weight: 10
# Properties to verify
properties:
  - name: no-data-loss
    kind: safety
    description: "Acknowledged writes must not be lost"
    check:
      type: linearizability
      operations: [put, get]
  - name: partition-recovery
    kind: liveness
    description: "Cluster must recover within 30s after partition heals"
    timeout: 30s
    check:
      type: http
      endpoint: http://lb:80/health
      expect:
        status: 200
# Fault injection
faults:
  network:
    drop_rate: 0.01
    delay_ms: 50
    delay_rate: 0.05
    partition_probability: 0.001
    partition_duration_ms: [1000, 10000]
  disk:
    write_fail_rate: 0.001
    partial_write_rate: 0.0005
    fsync_fail_rate: 0.001
  process:
    crash_probability: 0.0001
    pause_probability: 0.0005
    oom_probability: 0.00001
# Exploration settings
exploration:
  strategy: coverage-guided # bfs, dfs, random, coverage-guided
  max_depth: 1000
  max_states: 100000
  parallel_workers: 8
  prioritize_coverage: true
  prioritize_near_violation: true
# Output settings
output:
  dir: ./output
  save_violation_traces: true
  save_interesting_seeds: true
  coverage_format: html
  summary_report: true
# Debugging
debug:
  gdb_enabled: true
  gdb_port: 1234
  trace_level: normal
  trace_events:
    - network_send
    - network_recv
    - disk_write
    - process_crash
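The operation weights in the workload section read naturally as a percentage mix (40 + 50 + 10 = 100). The sketch below shows one way such a mix could be sampled reproducibly from a seed. It is an assumption about how weights might be interpreted, not Bloodhound's actual workload driver; `generate_ops` is a hypothetical name, and applying the key pattern to every operation type is a simplification.

```python
import random

# Operation weights copied from the workload section above.
OPERATIONS = {"put": 40, "get": 50, "delete": 10}


def generate_ops(seed: int, count: int) -> list[tuple[str, str]]:
    """Draw (operation, key) pairs in proportion to their weights."""
    rng = random.Random(seed)
    names = list(OPERATIONS)
    weights = list(OPERATIONS.values())
    ops = []
    for _ in range(count):
        op = rng.choices(names, weights=weights, k=1)[0]
        key = f"key-{rng.randint(1, 10000)}"  # mirrors "key-{random:1-10000}"
        ops.append((op, key))
    return ops


if __name__ == "__main__":
    for op, key in generate_ops(seed=42, count=5):
        print(op, key)
```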
Property Types
Safety Properties
Safety properties assert that "bad things never happen." They are checked continuously throughout execution.
- name: no-data-loss
  kind: safety
  check:
    type: linearizability
    operations: [put, get]
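To make the idea of a safety check concrete, the sketch below flags reads that contradict the most recent acknowledged write. It is deliberately simpler than a real linearizability checker, which must also reason about overlapping concurrent operations; the history format and `find_lost_writes` are hypothetical.

```python
def find_lost_writes(history):
    """Return reads that contradict the latest acknowledged write.

    `history` is a list of (op, key, value) tuples in completion order.
    Treating the history as strictly sequential is a simplification.
    """
    acked = {}
    violations = []
    for index, (op, key, value) in enumerate(history):
        if op == "put":
            acked[key] = value
        elif op == "get" and key in acked and value != acked[key]:
            # Record position, key, expected value, and observed value.
            violations.append((index, key, acked[key], value))
    return violations


if __name__ == "__main__":
    stale_read = [("put", "a", 1), ("put", "a", 2), ("get", "a", 1)]
    print(find_lost_writes(stale_read))
```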
Liveness Properties
Liveness properties assert that "good things eventually happen." They include a timeout.
- name: leader-election
  kind: liveness
  timeout: 5s
  check:
    type: custom
    script: ./checks/has-leader.sh
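A liveness check with a timeout boils down to asking whether the good condition was observed inside a time window. The sketch below expresses that over a list of virtual-time observations. The sample format and `recovers_in_time` are hypothetical, and how Bloodhound evaluates the timeout internally is not described here.

```python
def recovers_in_time(samples, start_s, timeout_s):
    """Check that a healthy sample appears within the timeout window.

    `samples` is a list of (virtual_time_s, is_healthy) observations.
    `start_s` is the moment the clock starts, e.g. when a partition heals.
    """
    deadline = start_s + timeout_s
    return any(
        healthy and start_s <= t <= deadline
        for t, healthy in samples
    )


if __name__ == "__main__":
    samples = [(10, False), (25, True)]
    print(recovers_in_time(samples, start_s=5, timeout_s=30))
```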
Invariants
Invariants are properties that must hold at every state.
- name: single-leader-per-term
  kind: invariant
  check:
    type: custom
    script: ./checks/single-leader.sh
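The logic behind a check like single-leader-per-term can be sketched in a few lines. The contents of `./checks/single-leader.sh` are not shown in this document, so the snapshot format and function below are assumptions for illustration only.

```python
from collections import defaultdict


def terms_with_multiple_leaders(states):
    """Return the terms that have more than one leader across all states.

    `states` is a list of snapshots, each a list of (node, role, term).
    """
    leaders = defaultdict(set)
    for snapshot in states:
        for node, role, term in snapshot:
            if role == "leader":
                leaders[term].add(node)
    return sorted(term for term, nodes in leaders.items() if len(nodes) > 1)


if __name__ == "__main__":
    split_brain = [[("n1", "leader", 3), ("n2", "leader", 3)]]
    print(terms_with_multiple_leaders(split_brain))
```

An empty result means the invariant held in every observed state.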
Fault Injection
Bloodhound can inject various types of faults deterministically:
Network Faults
- Packet drop: Randomly drop network packets
- Packet delay: Add latency to network communication
- Packet corruption: Corrupt packet data
- Network partitions: Isolate nodes from each other
Disk Faults
- Write failures: Fail disk writes
- Partial writes: Simulate torn writes (power failure during write)
- fsync failures: Fail fsync calls
- Read corruption: Return corrupted data on reads
Process Faults
- Crashes: Kill processes (SIGKILL)
- Pauses: Pause processes (SIGSTOP)
- OOM kills: Simulate out-of-memory conditions
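Injecting faults "deterministically" means that the decision for each event is a pure function of the seed. The sketch below applies the network `drop_rate`, `delay_rate`, and `delay_ms` values from the configuration above to a stream of packets. The decision procedure itself is an assumption for illustration, not Bloodhound's implementation.

```python
import random

# Rates copied from the network section of the configuration above.
DROP_RATE = 0.01
DELAY_RATE = 0.05
DELAY_MS = 50


def plan_packet_faults(seed: int, packets: int) -> list[tuple[str, int]]:
    """Decide per packet whether to drop, delay, or deliver it."""
    rng = random.Random(seed)
    plan = []
    for _ in range(packets):
        roll = rng.random()
        if roll < DROP_RATE:
            plan.append(("drop", 0))
        elif roll < DROP_RATE + DELAY_RATE:
            plan.append(("delay", DELAY_MS))
        else:
            plan.append(("deliver", 0))
    return plan


if __name__ == "__main__":
    plan = plan_packet_faults(seed=42, packets=1000)
    print(sum(1 for action, _ in plan if action != "deliver"), "faulted packets")
```

Re-running with the same seed yields the same fault schedule, which is what makes a failing run replayable.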
Time-Travel Debugging
When a bug is found, Bloodhound can replay the exact execution:
# Start debugging session
bloodhound debug --seed 42 --gdb-port 1234
# In another terminal, connect with GDB
gdb -ex "target remote :1234"
The debugger supports:
- Step forward/backward: Navigate through execution
- Breakpoints: Set breakpoints that work across time
- Watchpoints: Watch variables change over time
- Reverse execution: Step backward to find bug origins
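One common way to get reverse stepping on top of deterministic replay is to reset to an earlier known state and re-execute forward to just before the current step. The toy class below shows that idea with a seeded counter. It is a sketch of the general technique under that assumption, and says nothing about how Bloodhound or its GDB integration implements it.

```python
import random


class Replayer:
    """Toy replayer: step backward by re-running a seeded execution."""

    def __init__(self, seed: int):
        self.seed = seed
        self._reset()

    def _reset(self) -> None:
        # Return to the initial state; a real system might restore a snapshot.
        self._rng = random.Random(self.seed)
        self.step = 0
        self.state = 0

    def forward(self) -> int:
        self.state += self._rng.randint(1, 10)
        self.step += 1
        return self.state

    def backward(self) -> int:
        """Go back one step by replaying step - 1 steps from the start."""
        target = max(self.step - 1, 0)
        self._reset()
        for _ in range(target):
            self.forward()
        return self.state


if __name__ == "__main__":
    r = Replayer(seed=42)
    first = r.forward()
    r.forward()
    print(r.backward() == first)
```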
CI/CD Integration
GitHub Actions
name: Bloodhound Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: datadog/bloodhound-action@v1
        with:
          compose: docker-compose.yml
          seeds: 1000
          timeout: 30m
          coverage-threshold: 80%
Examples
See the examples/ directory for complete examples:
- redis-rust-demo/: Bug reproduction demo - Find and reproduce a CRDT consistency bug using async-VM mode. See Demo Documentation.
- redis-rust/: Simple redis-rust cluster example
- distributed-kv/: A distributed key-value store with Raft consensus
- message-queue/: A distributed message queue
- cache-cluster/: A distributed cache with consistent hashing
How It Works
- Deterministic Hypervisor: Bloodhound uses a modified QEMU with TCG (Tiny Code Generator) mode to ensure deterministic execution. All sources of non-determinism (time, random numbers, I/O ordering) are controlled.
- Virtual Time: Time is virtualized so simulations run faster than real-time while maintaining correct behavior. A 5-minute simulation might complete in seconds.
- Snapshot Tree: Bloodhound maintains a tree of VM snapshots using copy-on-write, enabling efficient exploration of different execution paths from the same starting point.
- Coverage-Guided Exploration: Like fuzzing, Bloodhound prioritizes execution paths that discover new code coverage, effi
