
RFX

GPU-accelerated Random Forest library with advanced interpretability, proximity analysis, and interactive visualization. Python + C++/CUDA. Scales to 1M+ samples.

Install / Use

/learn @chriskuchar/RFX

README

RFX: Random Forests X

PyPI · License: MIT · arXiv · Python 3.8+ · C++17 · CUDA

This work aims to honor the legacy of Leo Breiman and Adele Cutler by ensuring their Random Forest methodology is not forgotten and remains accessible to modern researchers.

RFX (Random Forests X) is a high-performance Python implementation of Breiman and Cutler's original Random Forest methodology with GPU acceleration. Provides complete interpretability: overall and local importance, proximity matrices, case-wise analysis, and interactive visualization. Scales proximity-based workflows to 200K+ samples (vs. ~60K limit) via QLORA compression (12,500× memory reduction).
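The memory arithmetic behind these numbers is easy to check: a dense float64 proximity matrix grows quadratically with sample count. A quick back-of-the-envelope calculation (plain Python, independent of RFX):

```python
def dense_proximity_bytes(n_samples, dtype_bytes=8):
    """Memory needed for a full n x n float64 proximity matrix."""
    return n_samples ** 2 * dtype_bytes

# ~60K samples is roughly where a dense matrix stops fitting in RAM
print(dense_proximity_bytes(60_000) / 1e9)   # 28.8 (GB)

# 200K samples uncompressed would need ~320 GB ...
full = dense_proximity_bytes(200_000)
print(full / 1e9)                            # 320.0 (GB)

# ... but a 12,500x compression brings that to ~25.6 MB
print(full / 12_500 / 1e6)                   # 25.6 (MB)
```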

Installation

Install from PyPI (Recommended)

GPU-Enabled Version (supports both GPU and CPU):

pip install rfx-ml

CPU-Only Version (lightweight, no CUDA dependencies):

pip install rfx-ml-cpu

Note: These packages are mutually exclusive. Both provide the rfx module. Choose based on your hardware:

  • Have a GPU and want acceleration? → rfx-ml
  • CPU-only system or want minimal dependencies? → rfx-ml-cpu

PyPI Packages: https://pypi.org/project/rfx-ml/ | https://pypi.org/project/rfx-ml-cpu/

Docker (Zero-Setup Installation)

Pre-built Docker images with all dependencies included:

GPU-Enabled Container (CUDA 12.8, supports both GPU and CPU):

# Pull from Docker Hub
docker pull ckuchar/rfx-gpu

# Run interactively
docker run --gpus all -it -v $(pwd):/workspace ckuchar/rfx-gpu

# Run with Jupyter Notebook
docker run --gpus all -p 8888:8888 -v $(pwd):/workspace ckuchar/rfx-gpu \
  jupyter notebook --ip=0.0.0.0 --allow-root

# Test GPU functionality
docker run --gpus all --rm ckuchar/rfx-gpu python3 /usr/local/bin/test_rfx.py

CPU-Only Container (lightweight, no CUDA required):

# Pull from Docker Hub
docker pull ckuchar/rfx-cpu

# Run interactively
docker run -it -v $(pwd):/workspace ckuchar/rfx-cpu

# Run with Jupyter Notebook
docker run -p 8888:8888 -v $(pwd):/workspace ckuchar/rfx-cpu \
  jupyter notebook --ip=0.0.0.0 --allow-root

Benefits:

  • No installation required - everything pre-configured
  • Reproducible environment across all systems
  • Isolated from host system
  • Includes MLflow, Jupyter, scikit-learn, pandas, plotly

Note: GPU container requires NVIDIA Container Toolkit for GPU access.

Dockerfiles: Available in integrations/docker/ for custom builds.

Prerequisites

Before installing, ensure you have:

  • CMake 3.12 or higher
  • Python 3.8+ (tested up to 3.13)
  • C++ compiler with C++17 support (GCC 7+, Clang 5+)
  • OpenMP (usually included with compiler)
  • CUDA toolkit 11.0+ (for GPU version only; not required for rfx-ml-cpu)

The pip install command will automatically build from source.
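The Python-side prerequisite can be confirmed before kicking off a build (the CMake and compiler checks happen during the build itself):

```python
import sys

# RFX requires Python 3.8+; fail fast before attempting the CMake build
assert sys.version_info >= (3, 8), "RFX requires Python 3.8 or newer"
print(f"Python {sys.version.split()[0]} OK")
```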

Install from Source (Development)

For development or installing from GitHub:

# Clone the repository
git clone https://github.com/chriskuchar/RFX.git
cd RFX

# Install with pip (handles CMake build automatically)
pip install -e .

# Or include the optional visualization and examples extras
pip install -e ".[viz,examples]"

The pip install command will automatically:

  1. Configure CMake
  2. Build the C++/CUDA extensions
  3. Install the Python package

Manual Build (Alternative)

If you prefer to build manually:

# Create build directory
mkdir -p build && cd build

# Configure with CMake
cmake ..

# Build (uses all available CPU cores)
make -j$(nproc)

# Install Python package
cd ..
pip install -e .

Verify Installation

import rfx as rf
print(f"RFX version: {rf.__version__}")
print(f"CUDA enabled: {rf.__cuda_enabled__}")
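If rfx may be missing (or the CPU-only build is in use), a defensive variant of this check can also pick the use_gpu flag for model construction. This is a sketch relying only on the __cuda_enabled__ attribute shown above:

```python
# Choose use_gpu based on the build that is actually installed.
try:
    import rfx as rf
    use_gpu = bool(getattr(rf, "__cuda_enabled__", False))
except ImportError:
    rf, use_gpu = None, False  # rfx not installed in this environment

print(f"GPU acceleration available: {use_gpu}")
```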

Why RFX?

RFX provides comprehensive interpretability for Random Forests beyond prediction accuracy. Unlike scikit-learn or cuML, RFX implements Breiman & Cutler's complete analytical toolkit:

Unique Capabilities:

  • 5× faster than scikit-learn - Efficient C++/CUDA implementation outperforms sklearn even with ALL interpretability features enabled
  • Local importance - Per-sample feature importance (like SHAP, but built-in)
  • Proximity matrices - Pairwise sample similarities for outlier detection, clustering, and visualization
  • Interactive visualization (rfviz) - 3D MDS, parallel coordinates, and linked brushing in Jupyter
  • GPU acceleration - Full CUDA support for trees, importance, and proximity (not just training)
  • QLORA compression - 12,500× memory reduction enabling analysis of 200K+ samples

Choose RFX when you need: Speed, interpretability, feature discovery, outlier detection, data exploration, or proximity-based analysis on large datasets.
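The proximity-to-visualization idea can be sketched without RFX itself: treat 1 − proximity as a distance and embed with classical MDS. This is a minimal NumPy illustration of the technique, not the rfviz implementation:

```python
import numpy as np

def mds_from_proximity(P, n_components=2):
    """Classical MDS on distances derived from a proximity matrix."""
    D2 = (1.0 - P) ** 2                   # squared distances
    n = P.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ D2 @ J                 # double-centered Gram matrix
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:n_components]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Toy proximity matrix: two tight pairs, far from each other
P = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
coords = mds_from_proximity(P)
print(coords.shape)  # (4, 2)
```

Points with high mutual proximity (samples 0 and 1) land close together in the embedding, which is what makes proximity matrices useful for outlier detection and cluster exploration.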

| Feature | RFX | scikit-learn | cuML | randomForest (R) |
|---------|-----|--------------|------|------------------|
| Local importance (per-sample) | ✓ | ✗ | ✗ | ✓ |
| Proximity matrices | ✓ | ✗ | ✗ | ✓ |
| Interactive visualization | ✓ | ✗ | ✗ | ~ |
| Full GPU acceleration | ✓ | ✗ | ~ | ✗ |
| QLORA compression (12,500×) | ✓ | ✗ | ✗ | ✗ |
| Scales to 200K+ samples | ✓ | ✗ | ✗ | ~60K |

Speed Comparison: RFX vs scikit-learn

Wine Dataset (178 samples, 13 features, 3 classes, 500 trees):

| Method | Mean Time (s) | Std Dev (s) | Trees/sec | OOB Accuracy | Speedup vs sklearn |
|--------|---------------|-------------|-----------|--------------|-------------------|
| RFX CPU | 0.098 | 0.000 | 5,114 | 0.9831 ± 0.0000 | 5.55× |
| RFX GPU | 0.181 | 0.067 | 2,761 | 0.9794 ± 0.0026 | 3.00× |
| scikit-learn | 0.544 | 0.001 | 919 | 0.9719 ± 0.0000 | 1.00× |
| sklearn (no OOB) | 0.495 | 0.005 | 1,010 | N/A | 1.10× |
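The Speedup column is simply sklearn's mean time divided by RFX's (500 trees per run); the Trees/sec column is computed per run and averaged, so it can differ slightly from a single division of the means. A quick sanity check against the first row:

```python
# Mean times (s) from the Wine benchmark table above (500 trees)
ntree = 500
sklearn_t, rfx_cpu_t = 0.544, 0.098

speedup = sklearn_t / rfx_cpu_t
trees_per_sec = ntree / rfx_cpu_t
print(f"speedup: {speedup:.2f}x")              # 5.55x
print(f"throughput: ~{trees_per_sec:.0f} trees/sec")
```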

With Feature Importance:

| Method | Mean Time (s) | Std Dev (s) | Trees/sec | OOB Accuracy | Speedup vs sklearn |
|--------|---------------|-------------|-----------|--------------|-------------------|
| RFX CPU | 0.117 | 0.007 | 4,278 | 0.9831 ± 0.0000 | 5.05× |
| RFX GPU | 0.328 | 0.008 | 1,525 | 0.9794 ± 0.0053 | 1.80× |
| scikit-learn | 0.590 | 0.007 | 848 | 0.9719 ± 0.0000 | 1.00× |

With Local Importance: RFX computes local importance; sklearn does NOT have this feature

| Method | Mean Time (s) | Std Dev (s) | Trees/sec | OOB Accuracy | Speedup vs sklearn |
|--------|---------------|-------------|-----------|--------------|-------------------|
| RFX CPU | 0.124 | 0.004 | 4,020 | 0.9831 ± 0.0000 | 5.74× |
| RFX GPU | 0.395 | 0.088 | 1,265 | 0.9813 ± 0.0026 | 1.81× |
| scikit-learn | 0.714 | 0.011 | 700 | 0.9719 ± 0.0000 | 1.00× |

With FULL Proximity Matrix: RFX computes full proximity matrix; sklearn does NOT have this feature

| Method | Mean Time (s) | Std Dev (s) | Trees/sec | OOB Accuracy | Features | Speedup vs sklearn |
|--------|---------------|-------------|-----------|--------------|----------|-------------------|
| RFX CPU | 0.142 | 0.005 | 3,529 | 0.9831 ± 0.0000 | Full 178×178 proximity | 4.99× |
| scikit-learn | 0.707 | 0.023 | 707 | 0.9719 ± 0.0000 | Basic only | 1.00× |

Notes:

  • RFX is ~5× faster while also computing local importance and proximity matrices, features sklearn does not offer at all
  • Even with these extra computations enabled, RFX still outperforms sklearn's basic implementation
  • RFX computes OOB automatically; sklearn requires oob_score=True
  • Run examples/benchmark_rfx_vs_sklearn.py to reproduce these results on your system
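The automatic OOB estimate works because each tree's bootstrap sample leaves out roughly 1/e ≈ 36.8% of the rows, which then serve as that tree's held-out test set. A NumPy sketch of the mechanism (an illustration, not RFX internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bootstrap = rng.integers(0, n, size=n)  # sample row indices with replacement
oob_mask = np.ones(n, dtype=bool)
oob_mask[bootstrap] = False             # rows never drawn are out-of-bag

print(oob_mask.mean())  # ~0.368, close to 1/e
```

Averaging each row's predictions over the trees where it was out-of-bag yields an unbiased error estimate without a separate validation split.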

Coming in v2.0: Regression and unsupervised learning modes.

Quick Start

Basic Classification

This comprehensive example demonstrates OOB evaluation, validation metrics, feature importance, and interactive visualization - all from a single model.

import numpy as np
import rfx as rf

# Feature names for Wine dataset
feature_names = [
    'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
    'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
    'Proanthocyanins', 'Color intensity', 'Hue',
    'OD280/OD315 of diluted wines', 'Proline'
]

# Load Wine dataset (built-in)
X, y = rf.load_wine()

# Simple train/validation split (80/20)
np.random.seed(123)
indices = np.random.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, val_idx = indices[:n_train], indices[n_train:]
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]

# Create and train model
model = rf.RandomForestClassifier(
    ntree=100,
    compute_importance=True,
    compute_local_importance=True,  # For instance-level explanations
    compute_proximity=True,  # For MDS visualization
    use_gpu=False  # Set to True for GPU acceleration
)

model.fit(X_train, y_train)

# ===== OOB Evaluation =====
oob_error = model.get_oob_error()
print(f"OOB Error: {oob_error:.4f}")
print(f"OOB Accuracy: {1 - oob_error:.4f}")

oob_pred = model.get_oob_predictions()
confusion = rf.confusion_matrix(y_train, oob_pred)
print("\nOOB Confusion Matrix:")
print(confusion)

# ===== Validation Set Evaluation =====
y_pred = model.predict(X_val)
val_accuracy = np.sum(y_val == y_pred) / len(y_val)
print(f"\nValidation Accuracy: {val_accuracy:.4f}")
print(rf.classification_report(y_val, y_pred))

# ===== Overall Feature Importance =====
importance = model.feature_importances_()
top_indices = np.argsort(importance)[-3:][::-1]
print("\nTop 3 Features (overall importance):")
for i in top_indices:
    print(f"  {feature_names[i]}: {importance[i]:.4f}")