RcppML

Rcpp Machine Learning: Fast robust NMF, divisive clustering, and more

License: GPL v3

Vignettes here: https://zdebruine.github.io/RcppML/articles/getting-started.html

RcppML is an R package for high-performance Non-negative Matrix Factorization (NMF), truncated SVD/PCA, and divisive clustering of large sparse and dense matrices. It supports six statistical distributions via IRLS, cross-validation for automatic rank selection, optional CUDA GPU acceleration, out-of-core streaming for datasets larger than memory, and a composable graph DSL for multi-modal and deep NMF.


Overview

RcppML decomposes a matrix A into lower-rank non-negative factors:

A ≈ W · diag(d) · H

where W (features × k) and H (k × samples) have columns/rows scaled to unit sum via a diagonal d, ensuring interpretable, scale-invariant factors.
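
The unit-sum scaling is just a reparameterization: any factor pair can be rescaled without changing the product. A quick check with simulated factors in plain base R (no RcppML required):

```r
set.seed(1)
m <- 100; n <- 40; k <- 6
W0 <- matrix(runif(m * k), m, k)
H0 <- matrix(runif(k * n), k, n)
A0 <- W0 %*% H0

# absorb column sums of W and row sums of H into a diagonal d
d <- colSums(W0) * rowSums(H0)
W <- sweep(W0, 2, colSums(W0), "/")  # columns of W now sum to 1
H <- sweep(H0, 1, rowSums(H0), "/")  # rows of H now sum to 1

A_hat <- W %*% diag(d) %*% H
max(abs(A_hat - A0))  # ~0: scaling leaves the reconstruction unchanged
```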

Key capabilities:

  • Fast NMF with alternating least squares, coordinate descent or Cholesky NNLS, and OpenMP parallelism via the Eigen C++ library
  • Six statistical distributions — Gaussian (MSE), Generalized Poisson, Negative Binomial, Gamma, Inverse Gaussian, Tweedie — fitted via Iteratively Reweighted Least Squares (IRLS)
  • Cross-validation with speckled holdout masks for principled rank selection
  • GPU acceleration via CUDA (cuBLAS/cuSPARSE) with automatic fallback to CPU
  • Streaming NMF from StreamPress (.spz) files for datasets that exceed available memory
  • FactorNet graph API for composable multi-modal, deep, and branching NMF pipelines
  • Truncated SVD with five methods: deflation, Krylov, Lanczos, IRLBA, randomized
  • Divisive clustering via recursive rank-2 factorizations
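
Divisive clustering has no example elsewhere in this README; the recursive rank-2 idea can be sketched in plain base R. Here a toy multiplicative-update NMF stands in for RcppML's solver, and `nmf2`/`dclust_sketch` are illustrative names, not package functions:

```r
set.seed(42)

# tiny multiplicative-update rank-2 NMF, enough to drive a bipartition
nmf2 <- function(A, iters = 200) {
  m <- nrow(A); n <- ncol(A)
  W <- matrix(runif(m * 2), m, 2); H <- matrix(runif(2 * n), 2, n)
  for (i in seq_len(iters)) {
    H <- H * (t(W) %*% A) / (t(W) %*% W %*% H + 1e-9)
    W <- W * (A %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
  }
  list(W = W, H = H)
}

# divisive clustering: recursively bipartition samples by the dominant
# row of a rank-2 H, stopping when a cluster gets too small
dclust_sketch <- function(A, min_samples = 3, path = "0") {
  if (ncol(A) < 2 * min_samples)
    return(setNames(rep(path, ncol(A)), colnames(A)))
  H <- nmf2(A)$H
  grp <- apply(H, 2, which.max)
  if (length(unique(grp)) < 2)
    return(setNames(rep(path, ncol(A)), colnames(A)))
  c(dclust_sketch(A[, grp == 1, drop = FALSE], min_samples, paste0(path, "L")),
    dclust_sketch(A[, grp == 2, drop = FALSE], min_samples, paste0(path, "R")))
}

# block-structured toy data: two obvious sample groups
A <- cbind(matrix(rpois(50 * 10, 5), 50, 10) * rep(c(1, 0), each = 25),
           matrix(rpois(50 * 10, 5), 50, 10) * rep(c(0, 1), each = 25))
colnames(A) <- paste0("s", 1:20)
labels <- dclust_sketch(A, min_samples = 5)
```

The first split should separate the two sample blocks; deeper splits subdivide until `min_samples` halts the recursion.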

Installation

# Stable release from CRAN
install.packages("RcppML")

# Development version from GitHub (requires Rcpp and RcppEigen)
devtools::install_github("zdebruine/RcppML")

GPU support (optional): Requires CUDA Toolkit ≥ 11.0 installed on the system. See vignette("gpu-acceleration") for build instructions.


Quick Start

library(RcppML)

# Load a built-in gene expression dataset
data(aml)  # 824 genomic regions × 135 samples (dense matrix)

# Run NMF with rank k = 6
model <- nmf(aml, k = 6, seed = 42)
model
#> nmf model
#>   k = 6
#>   w: 824 x 6
#>   d: 6
#>   h: 6 x 135

# Inspect factor loadings
head(model@w)
model@d
model@h[, 1:5]

# Reconstruction error
evaluate(model, aml)

# Project onto new data
H_new <- predict(model, aml[, 1:50])

# Plot the model summary
plot(model)

Statistical Distributions

RcppML goes beyond standard Frobenius-norm NMF by supporting distribution-appropriate loss functions via IRLS. This is critical for count data (scRNA-seq, text mining) where Gaussian assumptions are wrong.

| loss=              | Distribution        | Variance V(μ) | Use Case                           |
|--------------------|---------------------|---------------|------------------------------------|
| "mse"              | Gaussian            | constant      | Dense/general data (default)       |
| "gp"               | Generalized Poisson | μ(1+θ)²       | Overdispersed counts               |
| "nb"               | Negative Binomial   | μ + μ²/r      | scRNA-seq, quadratic variance-mean |
| "gamma"            | Gamma               | μ²            | Positive continuous data           |
| "inverse_gaussian" | Inverse Gaussian    | μ³            | Heavy right-tailed positive        |
| "tweedie"          | Tweedie             | μ^p           | Flexible power-law variance        |

Example: Negative Binomial NMF for scRNA-seq

data(aml)

# NB is the natural choice for count data
model_nb <- nmf(aml, k = 6, loss = "nb")

# Fine-grained control via ... parameters
model_nb <- nmf(aml, k = 6,
                loss = "nb",
                nb_size_init = 10,     # Initial size parameter r
                nb_size_max = 1e6)     # Upper bound for r

# Automatic distribution selection via BIC
auto <- auto_nmf_distribution(aml, k = 6)
auto$best       # best distribution name
auto$comparison  # BIC/AIC table for each distribution

Zero-Inflation Models

For data with excess zeros (e.g., droplet-based scRNA-seq), zero-inflated GP and NB models estimate structural dropout probabilities separately from the count model:

# Zero-inflated Negative Binomial with per-row dropout
model_zinb <- nmf(aml, k = 6, loss = "nb", zi = "row")

# Per-column dropout estimation
model <- nmf(aml, k = 6, loss = "nb", zi = "col")

Cross-Validation

Cross-validation uses a speckled holdout mask — a random subset of matrix entries is held out during training, and test error is computed on those entries. This enables principled rank selection.
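
"Speckled" means individual entries, not whole rows or columns, are flagged as test cells. Conceptually (plain base R, illustrative only — RcppML builds this mask internally from test_fraction):

```r
set.seed(1)
m <- 100; n <- 80; test_fraction <- 0.1

# each entry is independently held out with probability test_fraction
test_mask <- matrix(runif(m * n) < test_fraction, m, n)
mean(test_mask)  # close to 0.1

# training fits on entries where test_mask is FALSE;
# test error is computed only on entries where it is TRUE
```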

data(aml)

# Sweep k = 2 through 20, three replicates per rank
cv <- nmf(aml, k = 2:20, test_fraction = 0.1, cv_seed = 1:3)

# Visualize test error vs rank
plot(cv)

# Automatic rank selection (binary search)
model_auto <- nmf(aml, k = "auto")
model_auto@misc$rank_search$k_optimal

Cross-Validation Options

cv <- nmf(data, k = 2:15,
          test_fraction = 0.1,      # 10% holdout
          mask = "zeros",           # Only non-zero entries in test set (for sparse data)
          patience = 5,             # Early stopping patience
          cv_seed = c(42, 123, 7))  # Multiple seeds for replicates

Regularization

RcppML supports a comprehensive suite of regularization methods that can be combined freely:

L1 (LASSO) — Sparsity

# Symmetric L1 on both W and H
model <- nmf(data, k = 10, L1 = 0.1)

# Asymmetric: sparse H (embedding), dense W (loadings)
model <- nmf(data, k = 10, L1 = c(0, 0.2))

L2 (Ridge) — Shrinkage

model <- nmf(data, k = 10, L2 = c(0.01, 0.01))

L21 (Group Sparsity) — Factor Selection

L21-norm regularization drives entire rows of W or columns of H toward zero, effectively performing automatic factor selection:

model <- nmf(data, k = 20, L21 = c(0.1, 0))
# Some factors in W will be entirely zeroed out

Angular — Decorrelation

Encourages orthogonality between factors:

model <- nmf(data, k = 10, angular = c(0.1, 0.1))

Graph Laplacian — Spatial/Network Smoothness

If features or samples have a known graph structure (e.g., gene regulatory network, spatial coordinates), graph regularization encourages connected nodes to have similar factor representations:

# gene_graph: m × m sparse adjacency matrix
model <- nmf(data, k = 10, graph_W = gene_graph, graph_lambda = 0.5)

# Both feature and sample graphs
model <- nmf(data, k = 10,
             graph_W = gene_graph, graph_H = cell_graph,
             graph_lambda = c(0.5, 0.3))

Upper Bound Constraints

# Box constraint: all entries in W and H between 0 and 1
model <- nmf(data, k = 10, upper_bound = c(1, 1))

Robust NMF (Huber Loss)

# Huber-type robustness (less sensitive to outliers)
model <- nmf(data, k = 10, robust = TRUE)

# Custom Huber delta
model <- nmf(data, k = 10, robust = 2.0)

# MAE (L1 loss)
model <- nmf(data, k = 10, robust = "mae")

GPU Acceleration

RcppML optionally uses CUDA for GPU-accelerated NMF, delivering 10–20× speedups on large matrices.

# Check GPU availability
gpu_available()
#> [1] TRUE

gpu_info()
#> $device_name: "NVIDIA A100"
#> $total_memory_mb: 40960

# Sparse NMF on GPU
model <- nmf(data, k = 20, resource = "gpu")

# Dense NMF on GPU
model <- nmf(as.matrix(data), k = 20, resource = "gpu")

# Auto-dispatch (default): uses GPU if available, falls back to CPU
model <- nmf(data, k = 20, resource = "auto")

The GPU backend supports:

  • Standard NMF (sparse and dense)
  • Cross-validation NMF
  • MSE loss (GPU-native), all other losses via automatic CPU fallback
  • OpenMP + CUDA hybrid execution
  • Zero-copy NMF with st_read_gpu() for pre-loaded GPU data

Streaming Large Data (StreamPress .spz Files)

For datasets that exceed available RAM, RcppML streams data from StreamPress (.spz) compressed files — a column-oriented binary format with 10–20× compression via rANS entropy coding.

Writing and Reading SPZ Files

library(RcppML)
data(pbmc3k)

# Write sparse matrix to .spz file
spz_path <- tempfile(fileext = ".spz")
st_write(pbmc3k, spz_path, include_transpose = TRUE)

# Read back into memory
mat <- st_read(spz_path)
identical(dim(mat), dim(pbmc3k))  # TRUE

# File size comparison
file.size(spz_path)  # Much smaller than RDS/RDA

Streaming NMF

# NMF directly from .spz file — data never fully loaded into memory
model <- nmf(spz_path, k = 10)

# Streaming cross-validation
cv <- nmf(spz_path, k = 2:15, test_fraction = 0.1)

# Streaming with non-MSE distributions
model <- nmf(spz_path, k = 10, loss = "nb")

Streaming NMF processes data in column panels with double-buffered asynchronous I/O, maintaining O(m·k + n·k + chunk) memory regardless of total matrix size.
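
As a rough illustration of that memory bound (illustrative numbers, not benchmarks): for a 30,000 × 1,000,000 matrix at k = 50 streamed in 5,000-column panels, the working set is tiny compared to the full dense matrix:

```r
m <- 30000; n <- 1e6; k <- 50; chunk_cols <- 5000

# doubles: W (m x k) + H (k x n) + one buffered panel (m x chunk_cols);
# double it for the second buffer if you want to count both
working_bytes <- 8 * (m * k + n * k + m * chunk_cols)
dense_bytes   <- 8 * m * n  # full dense matrix, for comparison

working_bytes / 1e9  # ~1.6 GB
dense_bytes / 1e9    # 240 GB
```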


FactorNet Graph API

The factor_net() API composes complex factorization pipelines from simple building blocks. Multi-modal, deep, and branching NMF networks are expressed as directed graphs:

Multi-Modal NMF

Share a common embedding H across two data matrices (e.g., RNA + ATAC from the same cells):

# Define inputs
inp_rna  <- factor_input(rna_matrix, "rna")
inp_atac <- factor_input(atac_matrix, "atac")

# Create shared input node
shared <- factor_shared(inp_rna, inp_atac)

# Build NMF layer with per-factor regularization
layer <- shared |> nmf_layer(k = 10, name = "shared",
  W = W(L1 = 0.1),
  H = H(L1 = 0.05))

# Compile and fit the network
net <- factor_net(
  inputs = list(inp_rna, inp_atac),
  output = layer,
  config = factor_config(maxit = 100, seed = 42))

result <- fit(net)

# Access results (layer name = "shared")
result$shared$W$rna   # RNA feature loadings
result$shared$W$atac  # ATAC feature loadings
result$shared$H       # Shared cell embedding (k × samples)