RcppML
Rcpp Machine Learning: Fast robust NMF, divisive clustering, and more
Vignettes here: https://zdebruine.github.io/RcppML/articles/getting-started.html
RcppML is an R package for high-performance Non-negative Matrix Factorization (NMF), truncated SVD/PCA, and divisive clustering of large sparse and dense matrices. It supports six statistical distributions via IRLS, cross-validation for automatic rank selection, optional CUDA GPU acceleration, out-of-core streaming for datasets larger than memory, and a composable graph DSL for multi-modal and deep NMF.
Overview
RcppML decomposes a matrix A into lower-rank non-negative factors:
A ≈ W · diag(d) · H
where W (features × k) and H (k × samples) have columns/rows scaled to unit sum via a diagonal d, ensuring interpretable, scale-invariant factors.
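The scaling convention can be illustrated with plain base R matrices (a minimal sketch with toy values, not RcppML output): normalizing the columns of W and rows of H to unit sum, with the scale absorbed into d, leaves the product unchanged.

```r
set.seed(1)
k <- 3
W0 <- matrix(runif(10 * k), 10, k)  # features x k, non-negative
H0 <- matrix(runif(k * 8), k, 8)    # k x samples, non-negative

# Normalize columns of W0 and rows of H0 to unit sum; the scale moves into d
d <- colSums(W0) * rowSums(H0)
W  <- sweep(W0, 2, colSums(W0), "/")
H  <- sweep(H0, 1, rowSums(H0), "/")

# The scaled triple reproduces the original product exactly
all.equal(W %*% diag(d) %*% H, W0 %*% H0)  # TRUE
```

Because every column of W and row of H sums to 1, the entries of d carry all of the scale, making factors directly comparable.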
Key capabilities:
- Fast NMF with alternating least squares, coordinate descent or Cholesky NNLS, and OpenMP parallelism via the Eigen C++ library
- Six statistical distributions — Gaussian (MSE), Generalized Poisson, Negative Binomial, Gamma, Inverse Gaussian, Tweedie — fitted via Iteratively Reweighted Least Squares (IRLS)
- Cross-validation with speckled holdout masks for principled rank selection
- GPU acceleration via CUDA (cuBLAS/cuSPARSE) with automatic fallback to CPU
- Streaming NMF from StreamPress (.spz) files for datasets that exceed available memory
- FactorNet graph API for composable multi-modal, deep, and branching NMF pipelines
- Truncated SVD with five methods: deflation, Krylov, Lanczos, IRLBA, randomized
- Divisive clustering via recursive rank-2 factorizations
Installation
# Stable release from CRAN
install.packages("RcppML")
# Development version from GitHub (requires Rcpp and RcppEigen)
devtools::install_github("zdebruine/RcppML")
GPU support (optional): Requires CUDA Toolkit ≥ 11.0 installed on the system.
See vignette("gpu-acceleration") for build instructions.
Quick Start
library(RcppML)
# Load a built-in gene expression dataset
data(aml) # 824 genomic regions × 135 samples (dense matrix)
# Run NMF with rank k = 6
model <- nmf(aml, k = 6, seed = 42)
model
#> nmf model
#> k = 6
#> w: 824 x 6
#> d: 6
#> h: 6 x 135
# Inspect factor loadings
head(model@w)
model@d
model@h[, 1:5]
# Reconstruction error
evaluate(model, aml)
# Project onto new data
H_new <- predict(model, aml[, 1:50])
# Plot the model summary
plot(model)
Statistical Distributions
RcppML goes beyond standard Frobenius-norm NMF by supporting distribution-appropriate loss functions via IRLS. This is critical for count data (scRNA-seq, text mining) where Gaussian assumptions are wrong.
| loss= | Distribution | Variance V(μ) | Use Case |
|---------|-------------|----------------|----------|
| "mse" | Gaussian | constant | Dense/general data (default) |
| "gp" | Generalized Poisson | μ(1+θ)² | Overdispersed counts |
| "nb" | Negative Binomial | μ + μ²/r | scRNA-seq, quadratic variance-mean |
| "gamma" | Gamma | μ² | Positive continuous data |
| "inverse_gaussian" | Inverse Gaussian | μ³ | Heavy right-tailed positive |
| "tweedie" | Tweedie | μ^p | Flexible power-law variance |
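For intuition, the variance functions in the table can be written directly in base R (θ, r, and p denote the dispersion/shape parameters; this is an illustrative sketch, not RcppML's internal code):

```r
# Variance functions V(mu) for each supported loss
V <- list(
  mse              = function(mu, ...)   rep(1, length(mu)),
  gp               = function(mu, theta) mu * (1 + theta)^2,
  nb               = function(mu, r)     mu + mu^2 / r,
  gamma            = function(mu, ...)   mu^2,
  inverse_gaussian = function(mu, ...)   mu^3,
  tweedie          = function(mu, p)     mu^p
)

# IRLS weights are (up to the link function) inversely proportional to V(mu),
# so observations with high expected variance are down-weighted:
mu   <- c(0.5, 1, 2, 4)
w_nb <- 1 / V$nb(mu, r = 10)
```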
Example: Negative Binomial NMF for scRNA-seq
data(aml)
# NB is the natural choice for count data
model_nb <- nmf(aml, k = 6, loss = "nb")
# Fine-grained control via ... parameters
model_nb <- nmf(aml, k = 6,
loss = "nb",
nb_size_init = 10, # Initial size parameter r
nb_size_max = 1e6) # Upper bound for r
# Automatic distribution selection via BIC
auto <- auto_nmf_distribution(aml, k = 6)
auto$best # best distribution name
auto$comparison # BIC/AIC table for each distribution
Zero-Inflation Models
For data with excess zeros (e.g., droplet-based scRNA-seq), zero-inflated GP and NB models estimate structural dropout probabilities separately from the count model:
# Zero-inflated Negative Binomial with per-row dropout
model_zinb <- nmf(aml, k = 6, loss = "nb", zi = "row")
# Per-column dropout estimation
model <- nmf(aml, k = 6, loss = "nb", zi = "col")
Cross-Validation
Cross-validation uses a speckled holdout mask — a random subset of matrix entries are held out during training, and test error is computed on these entries. This enables principled rank selection.
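A speckled mask is simply a random Bernoulli selection of individual entries, scattered across the matrix rather than whole rows or columns. Roughly (a base R sketch of the idea, not RcppML's internal masking code):

```r
set.seed(42)
A <- matrix(rpois(20 * 10, lambda = 3), 20, 10)
test_fraction <- 0.1

# Hold out a random ~10% of entries, scattered across the matrix
mask <- matrix(runif(length(A)) < test_fraction, nrow(A), ncol(A))
A_train <- A
A_train[mask] <- NA          # hidden from the fit
test_entries <- A[mask]      # used only to score reconstruction error
```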
data(aml)
# Sweep k = 2 through 20, three replicates per rank
cv <- nmf(aml, k = 2:20, test_fraction = 0.1, cv_seed = 1:3)
# Visualize test error vs rank
plot(cv)
# Automatic rank selection (binary search)
model_auto <- nmf(aml, k = "auto")
model_auto@misc$rank_search$k_optimal
Cross-Validation Options
cv <- nmf(data, k = 2:15,
test_fraction = 0.1, # 10% holdout
mask = "zeros", # Only non-zero entries in test set (for sparse data)
patience = 5, # Early stopping patience
cv_seed = c(42, 123, 7)) # Multiple seeds for replicates
Regularization
RcppML supports a comprehensive suite of regularization methods that can be combined freely:
L1 (LASSO) — Sparsity
# Symmetric L1 on both W and H
model <- nmf(data, k = 10, L1 = 0.1)
# Asymmetric: sparse H (embedding), dense W (loadings)
model <- nmf(data, k = 10, L1 = c(0, 0.2))
L2 (Ridge) — Shrinkage
model <- nmf(data, k = 10, L2 = c(0.01, 0.01))
L21 (Group Sparsity) — Factor Selection
L21-norm regularization drives entire columns of W (and the corresponding rows of H) toward zero, effectively performing automatic factor selection:
model <- nmf(data, k = 20, L21 = c(0.1, 0))
# Some factors in W will be entirely zeroed out
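The L21 penalty sums the Euclidean norms of W's columns (one term per factor), so minimizing it zeroes whole factors at once rather than scattered entries. A base R sketch of the penalty itself (illustrative; not the package's internal objective):

```r
# L21 penalty on W: sum over factors of the Euclidean norm of each column
l21 <- function(W) sum(apply(W, 2, function(col) sqrt(sum(col^2))))

W <- cbind(c(1, 0, 0), c(0, 3, 4))
l21(W)  # 1 + 5 = 6
```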
Angular — Decorrelation
Encourages orthogonality between factors:
model <- nmf(data, k = 10, angular = c(0.1, 0.1))
Graph Laplacian — Spatial/Network Smoothness
If features or samples have a known graph structure (e.g., gene regulatory network, spatial coordinates), graph regularization encourages connected nodes to have similar factor representations:
# gene_graph: m × m sparse adjacency matrix
model <- nmf(data, k = 10, graph_W = gene_graph, graph_lambda = 0.5)
# Both feature and sample graphs
model <- nmf(data, k = 10,
graph_W = gene_graph, graph_H = cell_graph,
graph_lambda = c(0.5, 0.3))
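The penalty typically used for this is the Laplacian quadratic form tr(WᵀLW), which is zero when connected nodes carry identical loadings and grows with every disagreement across an edge. A base R sketch of the assumed form (not necessarily RcppML's exact objective):

```r
# Adjacency for a 3-node chain: 1 - 2 - 3
Adj <- matrix(c(0, 1, 0,
                1, 0, 1,
                0, 1, 0), 3, 3)
L <- diag(rowSums(Adj)) - Adj   # graph Laplacian

graph_penalty <- function(W, L) sum(diag(t(W) %*% L %*% W))

W_smooth <- matrix(1, 3, 2)                    # identical loadings on connected nodes
W_rough  <- matrix(c(1, 0, 1, 0, 1, 0), 3, 2)  # loadings flip across every edge
graph_penalty(W_smooth, L)  # 0
graph_penalty(W_rough, L)   # 4
```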
Upper Bound Constraints
# Box constraint: all entries in W and H between 0 and 1
model <- nmf(data, k = 10, upper_bound = c(1, 1))
Robust NMF (Huber Loss)
# Huber-type robustness (less sensitive to outliers)
model <- nmf(data, k = 10, robust = TRUE)
# Custom Huber delta
model <- nmf(data, k = 10, robust = 2.0)
# MAE (L1 loss)
model <- nmf(data, k = 10, robust = "mae")
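Huber loss is quadratic for small residuals and linear in the tails, which caps the influence of outliers. A base R sketch of the loss (standard definition; the delta parameter plays the role of `robust = 2.0` above):

```r
huber <- function(r, delta = 1) {
  ifelse(abs(r) <= delta,
         0.5 * r^2,                       # quadratic near zero (like MSE)
         delta * (abs(r) - 0.5 * delta))  # linear in the tails (like MAE)
}

# An outlier 2x further away costs ~2.25x more under Huber,
# versus 4x more under squared error:
huber(10, delta = 2) / huber(5, delta = 2)  # 2.25
```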
GPU Acceleration
RcppML optionally uses CUDA for GPU-accelerated NMF, delivering 10–20× speedups on large matrices.
# Check GPU availability
gpu_available()
#> [1] TRUE
gpu_info()
#> $device_name: "NVIDIA A100"
#> $total_memory_mb: 40960
# Sparse NMF on GPU
model <- nmf(data, k = 20, resource = "gpu")
# Dense NMF on GPU
model <- nmf(as.matrix(data), k = 20, resource = "gpu")
# Auto-dispatch (default): uses GPU if available, falls back to CPU
model <- nmf(data, k = 20, resource = "auto")
The GPU backend supports:
- Standard NMF (sparse and dense)
- Cross-validation NMF
- MSE loss (GPU-native), all other losses via automatic CPU fallback
- OpenMP + CUDA hybrid execution
- Zero-copy NMF with st_read_gpu() for pre-loaded GPU data
Streaming Large Data (StreamPress .spz Files)
For datasets that exceed available RAM, RcppML streams data from StreamPress (.spz) compressed files — a column-oriented binary format with 10–20× compression via rANS entropy coding.
Writing and Reading SPZ Files
library(RcppML)
data(pbmc3k)
# Write sparse matrix to .spz file
spz_path <- tempfile(fileext = ".spz")
st_write(pbmc3k, spz_path, include_transpose = TRUE)
# Read back into memory
mat <- st_read(spz_path)
identical(dim(mat), dim(pbmc3k)) # TRUE
# File size comparison
file.size(spz_path) # Much smaller than RDS/RDA
Streaming NMF
# NMF directly from .spz file — data never fully loaded into memory
model <- nmf(spz_path, k = 10)
# Streaming cross-validation
cv <- nmf(spz_path, k = 2:15, test_fraction = 0.1)
# Streaming with non-MSE distributions
model <- nmf(spz_path, k = 10, loss = "nb")
Streaming NMF processes data in column panels with double-buffered asynchronous I/O, maintaining O(m·k + n·k + chunk) memory regardless of total matrix size.
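The constant-memory behavior follows from the structure of alternating least squares: the update for W needs only the k × k Gram matrix H Hᵀ and the m × k product A Hᵀ, both of which accumulate panel by panel. A base R sketch of that accumulation (illustrative; RcppML does this in C++ with asynchronous I/O):

```r
set.seed(1)
m <- 50; n <- 200; k <- 5; chunk <- 32
A <- matrix(rpois(m * n, 2), m, n)   # stands in for data streamed from disk
H <- matrix(runif(k * n), k, n)

HHt <- matrix(0, k, k)
AHt <- matrix(0, m, k)
for (start in seq(1, n, by = chunk)) {
  cols <- start:min(start + chunk - 1, n)   # one column panel
  HHt  <- HHt + tcrossprod(H[, cols, drop = FALSE])
  AHt  <- AHt + tcrossprod(A[, cols, drop = FALSE], H[, cols, drop = FALSE])
}

# Matches the full-matrix products without ever needing all of A at once
all.equal(HHt, tcrossprod(H))     # TRUE
all.equal(AHt, tcrossprod(A, H))  # TRUE
```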
FactorNet Graph API
The factor_net() API composes complex factorization pipelines from simple building blocks. Multi-modal, deep, and branching NMF networks are expressed as directed graphs:
Multi-Modal NMF
Share a common embedding H across two data matrices (e.g., RNA + ATAC from the same cells):
# Define inputs
inp_rna <- factor_input(rna_matrix, "rna")
inp_atac <- factor_input(atac_matrix, "atac")
# Create shared input node
shared <- factor_shared(inp_rna, inp_atac)
# Build NMF layer with per-factor regularization
layer <- shared |> nmf_layer(k = 10, name = "shared",
W = W(L1 = 0.1),
H = H(L1 = 0.05))
# Compile and fit the network
net <- factor_net(
inputs = list(inp_rna, inp_atac),
output = layer,
config = factor_config(maxit = 100, seed = 42))
result <- fit(net)
# Access results (layer name = "shared")
result$shared$W$rna # RNA feature loadings
result$shared$W$atac # ATAC feature loadings
result$shared$H # Shared cell embedding (k × cells)
