GenomicDataStream

Read genomic data files (VCF, BCF, BGEN, PGEN, BED, H5AD, HDF5, DelayedArray) into R/Rcpp in chunks


<br>

A scalable interface between genomic data and analysis underneath R

<div align="justify"> GenomicDataStream reads genomic data files (<a href="https://www.ebi.ac.uk/training/online/courses/human-genetic-variation-introduction/variant-identification-and-analysis/understanding-vcf-format/">VCF</a>, <a href="https://samtools.github.io/bcftools/howtos/index.html">BCF</a>, <a href="https://www.chg.ox.ac.uk/~gav/bgen_format/index.html">BGEN</a>, <a href="https://www.cog-genomics.org/plink/2.0/input#pgen">PGEN</a>, <a href="https://www.cog-genomics.org/plink/2.0/input#bed">BED</a>, <a href="https://anndata.readthedocs.io/en/latest/index.html">H5AD</a>, <a href="https://en.wikipedia.org/wiki/Hierarchical_Data_Format">HDF5</a>, <a href="https://bioconductor.org/packages/DelayedArray">DelayedArray</a>) into R/Rcpp in chunks for analysis with the <nobr><a href="https://doi.org/10.21105/joss.00026">Armadillo</a></nobr> / <a href="https://eigen.tuxfamily.org/index.php?title=Main_Page">Eigen</a> / <a href="https://www.rcpp.org">Rcpp</a> libraries. Modern datasets are often too big to fit into memory, and many analyses <nobr>operate</nobr> on a small chunk of features at a time. Yet in practice, many implementations require the whole dataset to be stored in memory. Others couple an analysis to a specific data format so tightly that the two components can't be separated for use in other applications; for example, a regression analysis tied to genotype data from a VCF file.

The GenomicDataStream interface separates:

  1. data source
  2. streaming chunks of features into a data matrix
  3. downstream analysis

GenomicDataStream provides interfaces at both the C++ and R levels. The C++ interface prioritizes efficiency, while the R interface wraps the C++ backend for non-technical users.

</div>
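
The three-way separation listed above can be illustrated with a small, self-contained C++ sketch. Every name here (`Chunk`, `ChunkSource`, `InMemorySource`, `featureMeans`) is hypothetical and chosen for illustration; none of them is part of the GenomicDataStream API. The in-memory source stands in for a file reader, and the analysis sees only the abstract chunk interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not the GenomicDataStream API.

// A block of features (columns) for all samples, stored column-major:
// values[j * nSamples + i] is sample i of feature j within the chunk.
struct Chunk {
    std::size_t nSamples = 0;
    std::size_t nFeatures = 0;
    std::vector<double> values;
};

// 1. Data source: anything that can hand out features chunk by chunk.
class ChunkSource {
public:
    virtual ~ChunkSource() = default;
    // Fills `chunk` with the next block; returns false at end of stream.
    virtual bool nextChunk(Chunk &chunk) = 0;
};

// 2. Streaming: an in-memory stand-in for a file-backed reader.
class InMemorySource : public ChunkSource {
public:
    InMemorySource(std::vector<double> data, std::size_t nSamples,
                   std::size_t chunkSize)
        : data_(std::move(data)), nSamples_(nSamples), chunkSize_(chunkSize) {
        assert(nSamples_ > 0 && chunkSize_ > 0);
        assert(data_.size() % nSamples_ == 0);
    }

    bool nextChunk(Chunk &chunk) override {
        const std::size_t total = data_.size() / nSamples_;
        if (next_ >= total) return false;
        const std::size_t n = std::min(chunkSize_, total - next_);
        const auto first =
            data_.begin() + static_cast<std::ptrdiff_t>(next_ * nSamples_);
        chunk.nSamples = nSamples_;
        chunk.nFeatures = n;
        chunk.values.assign(first,
                            first + static_cast<std::ptrdiff_t>(n * nSamples_));
        next_ += n;
        return true;
    }

private:
    std::vector<double> data_;
    std::size_t nSamples_;
    std::size_t chunkSize_;
    std::size_t next_ = 0;
};

// 3. Downstream analysis: per-feature mean. It depends only on ChunkSource,
// so it never holds more than one chunk of features at a time.
std::vector<double> featureMeans(ChunkSource &source) {
    std::vector<double> means;
    Chunk chunk;
    while (source.nextChunk(chunk)) {
        for (std::size_t j = 0; j < chunk.nFeatures; ++j) {
            double sum = 0.0;
            for (std::size_t i = 0; i < chunk.nSamples; ++i)
                sum += chunk.values[j * chunk.nSamples + i];
            means.push_back(sum / static_cast<double>(chunk.nSamples));
        }
    }
    return means;
}
```

Because `featureMeans` takes a `ChunkSource&`, swapping the in-memory source for a reader of another format would not require touching the analysis code, which is the point of the separation.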

See header-only C++ library documentation

Install

# Install latest version of GenomicDataStream and dependencies
BiocManager::install("GabrielHoffman/GenomicDataStream")

Supported formats

Genetic data

| Format | Version | Support |
| -- | --- | --------- |
| BGEN | 1.1 | biallelic variants |
| BGEN | 1.2, 1.3 | phased or unphased biallelic variants |
| PGEN | plink2 | biallelic variants |
| BED | plink1 | biallelic variants |
| VCF / BCF | 4.x | biallelic variants with GT/GP fields, continuous dosage with DS field |
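
To make the GT, GP and DS fields in the table concrete, here is a minimal C++ sketch of how genotype fields for a biallelic variant are commonly mapped to a numeric alternate-allele dosage. The function names are hypothetical, and this README does not state which conversion GenomicDataStream applies internally, so treat this as an assumption about the usual convention: GT counts alternate alleles, and GP, ordered as P(0/0), P(0/1), P(1/1), gives the expected count P(0/1) + 2·P(1/1).

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <limits>
#include <string>

// Hypothetical helpers for illustration; not the GenomicDataStream API.

// Expected alternate-allele dosage from genotype probabilities
// ordered as P(0/0), P(0/1), P(1/1) for a biallelic variant.
double dosageFromGP(const std::array<double, 3> &gp) {
    return gp[1] + 2.0 * gp[2];
}

// Alternate-allele count from a diploid biallelic GT string such as
// "0/1" (unphased) or "1|0" (phased). Returns NaN if the string is
// malformed or any allele is missing (e.g. "./.").
double dosageFromGT(const std::string &gt) {
    const double missing = std::numeric_limits<double>::quiet_NaN();
    if (gt.size() != 3 || (gt[1] != '/' && gt[1] != '|')) return missing;
    const char alleles[2] = {gt[0], gt[2]};
    double dosage = 0.0;
    for (char a : alleles) {
        if (a == '1') dosage += 1.0;
        else if (a != '0') return missing;
    }
    return dosage;
}
```

A DS field already holds a continuous dosage, so under this convention it would need no conversion.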

Single cell data

<div align="justify"> Count matrices for single cell data are stored in the H5AD format. This format, based on <a href="https://en.wikipedia.org/wiki/Hierarchical_Data_Format">HDF5</a>, can store millions of cells since it is designed for sparse counts (i.e. many entries are 0) and uses built-in compression. H5AD enables file-backed random access for analyzing a subset of the data without reading the entire file into memory. </div>
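
The benefit of sparse storage for subset access can be sketched with a compressed sparse row (CSR) layout, one of the layouts AnnData uses for sparse matrices (arrays `data`, `indices`, `indptr`). The C++ below is an in-memory illustration with hypothetical names (`CsrMatrix`, `denseRows`); it does not read an H5AD file and is not the GenomicDataStream API. It shows that expanding a range of rows touches only the nonzero entries of those rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory CSR matrix for illustration only.
struct CsrMatrix {
    std::size_t nRows;
    std::size_t nCols;
    std::vector<std::size_t> indptr;   // size nRows + 1; row r spans [indptr[r], indptr[r+1])
    std::vector<std::size_t> indices;  // column index of each stored value
    std::vector<double> data;          // the stored (nonzero) values
};

// Expand rows [rowBegin, rowEnd) into a dense row-major block.
// Only the nonzeros belonging to the requested rows are visited.
std::vector<double> denseRows(const CsrMatrix &m, std::size_t rowBegin,
                              std::size_t rowEnd) {
    assert(rowBegin <= rowEnd && rowEnd <= m.nRows);
    assert(m.indptr.size() == m.nRows + 1);
    std::vector<double> out((rowEnd - rowBegin) * m.nCols, 0.0);
    for (std::size_t r = rowBegin; r < rowEnd; ++r) {
        for (std::size_t k = m.indptr[r]; k < m.indptr[r + 1]; ++k) {
            out[(r - rowBegin) * m.nCols + m.indices[k]] = m.data[k];
        }
    }
    return out;
}
```

For a file-backed matrix the same index arithmetic would decide which slices of `data` and `indices` to read from disk, which is why a subset can be analyzed without loading the whole file.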

Key Dependencies

| Package | Ref | Role |
| - | --- | --------- |
| vcfppR | Bioinformatics | C++ API for htslib |
| htslib | GigaScience | C API for VCF/BCF files |
| pgenlibr | GigaScience | R/C++ API for plink files |
| beachmat | PLoS Comp Biol | C++ API for accessing data owned by R |
| DelayedArray | | R interface for handling on-disk data formats |
| Rcpp | J Stat Software | API for R/C++ integration |
| RcppEigen | J Stat Software | API for Rcpp access to Eigen matrix library |
| RcppArmadillo | J Stat Software | API for Rcpp access to Armadillo matrix library |
| Eigen | | C++ library for linear algebra with advanced features |
| Armadillo | J Open Src Soft | User-friendly C++ library for linear algebra |
| RcppParallel | | oneAPI Threading Building Blocks for parallel analysis |
