
SparseMatrixStats

Implementation of the matrixStats API for sparse matrices


<!-- README.md is generated from README.Rmd. Please edit that file -->

sparseMatrixStats <a href='https://github.com/const-ae/sparseMatrixStats'><img src='man/figures/logo.png' align="right" height="209" /></a>


The goal of sparseMatrixStats is to make the API of matrixStats available for sparse matrices.

Installation

You can install the release version of sparseMatrixStats from Bioconductor:

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("sparseMatrixStats")
```

Alternatively, you can get the development version of the package from GitHub with:

```r
# install.packages("devtools")
devtools::install_github("const-ae/sparseMatrixStats")
```

If you have trouble with the installation, see the end of the README.

Example

```r
library(sparseMatrixStats)
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
mat <- matrix(0, nrow=10, ncol=6)
mat[sample(seq_len(60), 4)] <- 1:4
# Convert dense matrix to sparse matrix
sparse_mat <- as(mat, "dgCMatrix")
sparse_mat
#> 10 x 6 sparse Matrix of class "dgCMatrix"
#>                  
#>  [1,] 4 . . . . .
#>  [2,] . . . . . .
#>  [3,] . . . . . .
#>  [4,] 2 . . . . .
#>  [5,] . . . . . .
#>  [6,] . . . . . .
#>  [7,] . . . . . 1
#>  [8,] . . . . . .
#>  [9,] . . . 3 . .
#> [10,] . . . . . .
```

The package provides an interface for quickly performing common operations on the rows or columns of a matrix. For example, to calculate the variance of each column:

```r
apply(mat, 2, var)
#> [1] 1.822222 0.000000 0.000000 0.900000 0.000000 0.100000
matrixStats::colVars(mat)
#> [1] 1.822222 0.000000 0.000000 0.900000 0.000000 0.100000
sparseMatrixStats::colVars(sparse_mat)
#> [1] 1.822222 0.000000 0.000000 0.900000 0.000000 0.100000
```

On this small example data, all methods are basically equally fast, but if we have a much larger dataset, the optimizations for the sparse data start to show.

I generate a dataset with 10,000 rows and 50 columns in which 99% of the entries are zero:

```r
big_mat <- matrix(0, nrow=1e4, ncol=50)
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)
# Convert dense matrix to sparse matrix
big_sparse_mat <- as(big_mat, "dgCMatrix")
```

I use the bench package to benchmark the performance difference:

```r
bench::mark(
  sparseMatrixStats=sparseMatrixStats::colVars(big_sparse_mat),
  matrixStats=matrixStats::colVars(big_mat),
  apply=apply(big_mat, 2, var)
)
#> # A tibble: 3 x 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 sparseMatrixStats   37.3µs  42.71µs   20836.     2.93KB    14.6 
#> 2 matrixStats         1.48ms   1.65ms     584.    156.8KB     2.03
#> 3 apply              10.61ms  11.18ms      88.9    9.54MB    48.2
```

As you can see, sparseMatrixStats is ca. 35 times faster than matrixStats, which in turn is about 7 times faster than the apply() version.
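To give an intuition for where such a speed-up can come from, here is a minimal sketch of a column variance that only touches the stored non-zero entries of a dgCMatrix. The helper `col_vars_from_nonzeros()` is hypothetical and written purely for illustration; it is not part of sparseMatrixStats and I make no claim that the package is implemented this way. It relies only on the `x` and `p` slots of the column-compressed format from the Matrix package.

```r
library(Matrix)

# Hypothetical helper (NOT part of sparseMatrixStats): column variances
# computed from the stored non-zero entries of a dgCMatrix only.
col_vars_from_nonzeros <- function(x) {
  stopifnot(is(x, "dgCMatrix"))
  n <- nrow(x)
  vapply(seq_len(ncol(x)), function(j) {
    # x@p holds 0-based offsets into x@x; the values of column j
    # are stored at positions p[j] + 1, ..., p[j + 1]
    idx <- seq_len(x@p[j + 1] - x@p[j]) + x@p[j]
    vals <- x@x[idx]
    mu <- sum(vals) / n
    # Stored entries contribute (v - mu)^2,
    # each implicit zero contributes (0 - mu)^2 = mu^2
    (sum((vals - mu)^2) + (n - length(vals)) * mu^2) / (n - 1)
  }, numeric(1))
}

set.seed(1)
dense <- matrix(0, nrow = 1000, ncol = 20)
dense[sample(length(dense), 200)] <- rnorm(200)
sparse <- as(dense, "CsparseMatrix")

# Compare against the dense reference computation
all.equal(col_vars_from_nonzeros(sparse), apply(dense, 2, var))
```

The work per column is proportional to the number of stored entries rather than the number of rows, which is why the gap to the dense methods grows as the matrix gets sparser.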

API

The package now supports all functions from the matrixStats API for column-compressed sparse matrices (dgCMatrix). Thanks to the MatrixGenerics package, it can easily be used alongside matrixStats and DelayedMatrixStats. Note that the rowXXX() functions are generally implemented by transposing the input and calling the corresponding colXXX() function; specially optimized implementations are available for rowSums2(), rowMeans2(), and rowVars().
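The relationship between the rowXXX() and colXXX() functions can be checked from the outside. The following sketch assumes that sparseMatrixStats and matrixStats are installed and uses `rsparsematrix()` from the Matrix package to create random test data; the variable names are just for illustration.

```r
library(Matrix)
library(sparseMatrixStats)

set.seed(42)
# Random 200 x 30 sparse matrix with about 5% non-zero entries
x <- rsparsematrix(nrow = 200, ncol = 30, density = 0.05)

# Row-wise statistic versus the column-wise statistic of the transpose
row_med <- rowMedians(x)
col_med_t <- colMedians(t(x))
all.equal(row_med, col_med_t)

# The same result should come out of the dense matrixStats version
all.equal(row_med, matrixStats::rowMedians(as.matrix(x)))
```

Because both calls go through the MatrixGenerics generics, the same code works unchanged if `x` is replaced by an ordinary dense matrix.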

| Method               | matrixStats | sparseMatrixStats | Notes                                                              |
| :------------------- | :---------- | :---------------- | :----------------------------------------------------------------- |
| colAlls()            | ✔           | ✔                 |                                                                    |
| colAnyMissings()     | ✔           | ❌                | Not implemented because it is deprecated in favor of colAnyNAs()   |
| colAnyNAs()          | ✔           | ✔                 |                                                                    |
| colAnys()            | ✔           | ✔                 |                                                                    |
| colAvgsPerRowSet()   | ✔           | ✔                 |                                                                    |
| colCollapse()        | ✔           | ✔                 |                                                                    |
| colCounts()          | ✔           | ✔                 |                                                                    |
| colCummaxs()         | ✔           | ✔                 |                                                                    |
| colCummins()         | ✔           | ✔                 |                                                                    |
| colCumprods()        | ✔           | ✔                 |                                                                    |
| colCumsums()         | ✔           | ✔                 |                                                                    |
| colDiffs()           | ✔           | ✔                 |                                                                    |
| colIQRDiffs()        | ✔           | ✔                 |                                                                    |
| colIQRs()            | ✔           | ✔                 |                                                                    |
| colLogSumExps()      | ✔           | ✔                 |                                                                    |
| colMadDiffs()        | ✔           | ✔                 |                                                                    |
| colMads()            | ✔           | ✔                 |                                                                    |
| colMaxs()            | ✔           | ✔                 |                                                                    |
| colMeans2()          | ✔           | ✔                 |                                                                    |
| colMedians()         | ✔           | ✔                 |                                                                    |
| colMins()            | ✔           | ✔                 |                                                                    |
| colOrderStats()      | ✔           | ✔                 |                                                                    |
| colProds()           | ✔           | ✔                 |                                                                    |
| colQuantiles()       | ✔           | ✔                 |                                                                    |
| colRanges()          | ✔           | ✔                 |                                                                    |
| colRanks()           | ✔           | ✔                 |                                                                    |
| colSdDiffs()         | ✔           | ✔                 |                                                                    |
| colSds()             | ✔           | ✔                 |                                                                    |
| colsum()             | ✔           | ❌                | Base R function                                                    |
| colSums2()           | ✔           | ✔                 |                                                                    |
| colTabulates()       | ✔           | ✔                 |                                                                    |
| colVarDiffs()        | ✔           | ✔                 |
