GenomicDataStream
Read genomic data files (VCF, BCF, BGEN, PGEN, BED, H5AD, HDF5, DelayedArray) into R/Rcpp in chunks
A scalable interface between genomic data and analysis underneath R

The GenomicDataStream interface separates:
- data source
- streaming chunks of features into a data matrix
- downstream analysis
GenomicDataStream provides interfaces at both the C++ and R levels. The C++ interface prioritizes efficiency, while the R interface wraps the C++ backend for non-technical users.
See header-only C++ library documentation
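
The three-way separation described above can be sketched in a few lines of C++. This is a minimal illustration, not the GenomicDataStream API: `Chunk`, `ChunkSource`, `InMemorySource`, and `featureMeans` are hypothetical names invented here, and the in-memory source stands in for a real file-backed reader. The point is that the analysis function only sees chunks and never needs to know the file format.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// NOTE: hypothetical types for illustration only; not the GenomicDataStream API.

// A chunk of features: values[i] holds feature i across all samples.
struct Chunk {
    std::vector<std::string> ids;
    std::vector<std::vector<double>> values;
};

// 1) Data source: anything that can hand out features chunk by chunk.
class ChunkSource {
public:
    virtual ~ChunkSource() = default;
    // Fills `out` with the next features; returns false once exhausted.
    virtual bool nextChunk(Chunk& out) = 0;
};

// In-memory stand-in for a file-backed reader (VCF, BGEN, ...).
class InMemorySource : public ChunkSource {
public:
    InMemorySource(std::vector<std::string> ids,
                   std::vector<std::vector<double>> data,
                   std::size_t chunkSize)
        : ids_(std::move(ids)), data_(std::move(data)), chunkSize_(chunkSize) {
        assert(chunkSize_ > 0 && ids_.size() == data_.size());
    }

    // 2) Streaming: copy at most chunkSize features into `out`.
    bool nextChunk(Chunk& out) override {
        out.ids.clear();
        out.values.clear();
        while (pos_ < data_.size() && out.ids.size() < chunkSize_) {
            out.ids.push_back(ids_[pos_]);
            out.values.push_back(data_[pos_]);
            ++pos_;
        }
        return !out.ids.empty();
    }

private:
    std::vector<std::string> ids_;
    std::vector<std::vector<double>> data_;
    std::size_t chunkSize_;
    std::size_t pos_ = 0;
};

// 3) Downstream analysis: per-feature mean, computed one chunk at a time.
std::vector<double> featureMeans(ChunkSource& src) {
    std::vector<double> means;
    Chunk chunk;
    while (src.nextChunk(chunk)) {
        for (const auto& row : chunk.values) {
            double sum = 0.0;
            for (double v : row) sum += v;
            means.push_back(row.empty() ? 0.0 : sum / row.size());
        }
    }
    return means;
}
```

Swapping in a different source only requires another `ChunkSource` subclass; `featureMeans` stays unchanged, and peak memory is bounded by the chunk size rather than the file size.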
Install
```r
# Install latest version of GenomicDataStream and dependencies
BiocManager::install("GabrielHoffman/GenomicDataStream")
```
Supported formats
Genetic data
| Format | Version | Support |
| -- | --- | --------- |
| BGEN | 1.1 | biallelic variants |
| BGEN | 1.2, 1.3 | phased or unphased biallelic variants |
| PGEN | plink2 | biallelic variants |
| BED | plink1 | biallelic variants |
| VCF / BCF | 4.x | biallelic variants with GT/GP fields, continuous dosage with DS field |
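
For a biallelic variant in a diploid sample, the GT, GP, and DS fields are related by a simple formula: a hard call (GT) gives the count of alternate alleles, and genotype probabilities (GP, ordered 0/0, 0/1, 1/1) give the expected alternate-allele dosage P(0/1) + 2·P(1/1). The sketch below shows that arithmetic. The function names are hypothetical, and whether GenomicDataStream performs exactly this conversion internally is an assumption, not something stated here.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// NOTE: hypothetical helpers for illustration; not part of GenomicDataStream.

// Expected ALT-allele dosage from genotype probabilities
// ordered as (P(0/0), P(0/1), P(1/1)). The result lies in [0, 2].
double dosageFromGP(const std::array<double, 3>& gp) {
    return gp[1] + 2.0 * gp[2];
}

// ALT-allele count from a biallelic hard call, e.g. 0|1 -> 1.
int dosageFromGT(int allele1, int allele2) {
    assert((allele1 == 0 || allele1 == 1) && (allele2 == 0 || allele2 == 1));
    return allele1 + allele2;
}
```

A confident hard call and its matching probabilities agree: `dosageFromGT(1, 1)` and `dosageFromGP({0.0, 0.0, 1.0})` both yield a dosage of 2.
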
Single cell data
<div align="justify"> Count matrices for single cell data are stored in the H5AD format. This format, based on <a href="https://en.wikipedia.org/wiki/Hierarchical_Data_Format">HDF5</a>, can store millions of cells since it is designed for sparse counts (i.e. many entries are 0) and uses built-in compression. H5AD enables file-backed random access for analyzing a subset of the data without reading the entire file into memory. </div>

Key Dependencies
| Package | Ref | Role |
| - | --- | --------- |
| vcfppR | Bioinformatics | C++ API for htslib |
| htslib | GigaScience | C API for VCF/BCF files |
| pgenlibr | GigaScience | R/C++ API for plink files |
| beachmat | PLoS Comp Biol | C++ API for accessing data owned by R |
| DelayedArray | | R interface for handling on-disk data formats |
| Rcpp | J Stat Software | API for R/C++ integration |
| RcppEigen | J Stat Software | API for Rcpp access to Eigen matrix library |
| RcppArmadillo | J Stat Software | API for Rcpp access to Armadillo matrix library |
| Eigen | | C++ library for linear algebra with advanced features |
| Armadillo | J Open Src Soft | User-friendly C++ library for linear algebra |
| RcppParallel | | oneAPI Threading Building Blocks for parallel analysis |
