# Rosalind
A deterministic genomics engine with a compact memory footprint. Run whole-genome workloads in as little as 100 MB RAM. Built in Rust.
Rosalind is a Rust engine for genome alignment, streaming variant calling, and custom bioinformatics analytics that runs on commodity or edge hardware. It achieves O(√t) working memory, deterministic replay, and drop-in extensibility for new pipelines (Rust plugins or Python bindings). Traditional pipelines often assume 50-100+ gigabytes of RAM, well-provisioned data centers, and uninterrupted connectivity; Rosalind is designed for the opposite: hospital workstations, clinic laptops, field kits, and classrooms.
## Quick Overview
- Core problem: standard tools such as BWA, GATK, or cloud-centric workflows frequently require >50 GB RAM, full copies of intermediate files, and high-bandwidth storage, placing them out of reach in many hospitals, public-health labs, and teaching environments.
- Rosalind’s answer: split workloads into √t blocks, reuse a rolling boundary between blocks, and evaluate a height-compressed tree so memory stays in L1/L2 cache while preserving deterministic results. The entire pipeline fits in well under 100 MB even for whole genomes.
- How you use it: run the CLI, embed the Rust APIs, or extend via plugins/Python to build bespoke genomics workflows—ideal for quick-turnaround clinical diagnostics, outbreak monitoring, or courses where students explore real data on laptops.
See At a Glance, How It Compares, and What O(√t) memory means for deeper context.
## At a Glance
- O(√t) working memory – whole-genome runs stay under ~100 MB without lossy approximations.
- End-to-end deterministic – outputs are bit-for-bit identical across runs and partition choices.
- Full-history equivalent – recomputation keeps results identical to unbounded-memory evaluations.
- Streaming SAM/BAM/VCF – standards-compliant outputs without materializing huge intermediates.
- Edge-ready deployment – runs on 8–16 GB laptops/desktops so PHI stays on-site.
- Composable extensions – plugins/Python bindings inherit the same memory and determinism guarantees.
## Why Rosalind Matters
- Clinical genomics on laptops – In many hospitals outside major research centers, the fastest available hardware is a shared desktop with 8–16 GB RAM. Rosalind lets clinicians align patient genomes and call variants in a single shift without renting cloud machines or shipping sensitive data off-site.
- Outbreak monitoring at the edge – During field responses (Ebola, Zika, SARS-CoV-2), portable sequencers and laptops are common, but high-memory servers are not. Rosalind streams reads, calls variants on the fly, and tolerates intermittent connectivity, enabling decisions while still on-site.
- Population-scale research – Universities or small labs may have access to mid-tier clusters but still need to process cohorts of thousands. Rosalind keeps per-sample memory flat, so more genomes can be processed in parallel on cost-efficient instances.
- Education and outreach – Students can experiment with alignment, variant calling, and plugin development on personal machines, making computational genomics coursework more hands-on and equitable.
- Custom analytics – Developers can plug in coverage dashboards, quality metrics, or single-cell style aggregations while benefiting from the same space guarantees, simplifying the path from prototype to production.
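The plugin model behind the last bullet can be pictured with a small sketch. Everything below is an illustrative assumption, not Rosalind's actual `GenomicPlugin` API: a plugin observes one block of pileup columns at a time and keeps only constant per-plugin state, so adding an analysis does not grow the working set.

```rust
// Hypothetical sketch: trait name and method shapes are illustrative
// assumptions, not Rosalind's actual `GenomicPlugin` API.

/// Per-block record a plugin observes: (position, depth).
type PileupColumn = (u64, u32);

/// Illustrative plugin interface: consume one block at a time,
/// emit a summary at the end. Working state stays O(1) per plugin.
trait BlockPlugin {
    fn observe_block(&mut self, block: &[PileupColumn]);
    fn finish(&self) -> f64;
}

/// Example analysis: mean coverage across all observed positions.
struct MeanCoverage { sum: u64, n: u64 }

impl BlockPlugin for MeanCoverage {
    fn observe_block(&mut self, block: &[PileupColumn]) {
        for &(_pos, depth) in block {
            self.sum += depth as u64;
            self.n += 1;
        }
    }
    fn finish(&self) -> f64 {
        if self.n == 0 { 0.0 } else { self.sum as f64 / self.n as f64 }
    }
}

pub fn mean_coverage(blocks: &[Vec<PileupColumn>]) -> f64 {
    let mut p = MeanCoverage { sum: 0, n: 0 };
    for b in blocks { p.observe_block(b); } // one block in memory at a time
    p.finish()
}

fn main() {
    let blocks = vec![vec![(0, 30), (1, 28)], vec![(2, 32), (3, 30)]];
    println!("mean coverage = {}", mean_coverage(&blocks));
}
```

Because the plugin only ever sees one block, the same dashboard or QC metric works unchanged whether the input is a small panel or a whole genome.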
## Key Guarantees
- Space bound: total memory `space_used ≤ block_size + num_blocks + O(log num_blocks) = O(√t)`; only the most recent block boundary is retained.
- Deterministic replay: every block is re-simulated from the previous boundary, producing the same results as a full history, even on resource-constrained devices.
- Composable design: block processors, plugins, and bindings use the same compressed evaluator, so new analyses inherit the guarantees—perfect for bespoke QC or epidemiology dashboards.
- Guardrails included: regression tests (`tests/space_bounds.rs`) and `scripts/run_scale_test.sh` fail if the O(√t) or sublinear scaling properties regress.
- Partition invariance: outputs are unchanged across valid choices of block size and chunking; merges are deterministic and order independent.
- Full-history equivalence: results match an unbounded-history evaluation; the space savings come from recomputation, not information loss.
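Partition invariance can be demonstrated with a toy stand-in for Rosalind's block summaries: reduce each block to a small summary, merge summaries associatively, and the merged result is bit-identical for any block size. The `(length, hash)` summary below is an illustrative assumption, not the engine's actual summary type.

```rust
// Illustrative sketch of partition invariance: per-block summaries merge
// associatively, so any valid partition yields a bit-identical result.
// The (len, hash) summary is a stand-in for Rosalind's block summaries.

const P: u64 = 1_000_003;

#[derive(Clone, Copy, PartialEq, Debug)]
struct BlockSummary { len: u32, hash: u64 }

fn summarize_block(block: &[u8]) -> BlockSummary {
    let mut h: u64 = 0;
    for &b in block {
        h = h.wrapping_mul(P).wrapping_add(b as u64);
    }
    BlockSummary { len: block.len() as u32, hash: h }
}

/// Associative merge: hash(a ++ b) = hash(a)·P^len(b) + hash(b) (mod 2^64).
fn merge(a: BlockSummary, b: BlockSummary) -> BlockSummary {
    BlockSummary {
        len: a.len + b.len,
        hash: a.hash.wrapping_mul(P.wrapping_pow(b.len)).wrapping_add(b.hash),
    }
}

pub fn summarize(data: &[u8], block_size: usize) -> u64 {
    data.chunks(block_size)            // one block resident at a time
        .map(summarize_block)
        .reduce(merge)
        .map_or(0, |s| s.hash)
}

fn main() {
    let reads: Vec<u8> = (0..10_000u32).map(|i| (i % 251) as u8).collect();
    // Different partitions, bit-identical result.
    assert_eq!(summarize(&reads, 97), summarize(&reads, 1024));
    println!("summary = {:#x}", summarize(&reads, 97));
}
```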
## What O(√t) memory means (and what `t` is)
- `t` ≈ total bases processed. For 30× human whole-genome sequencing: coverage `C ≈ 30`, genome size `G ≈ 3.1×10⁹`, so `t ≈ C × G ≈ 9.3×10¹⁰`. `√t ≈ 3.0×10⁵`, which sets the block buffer size. The height-compressed merge stack is `log₂(t) ≈ 36` levels—negligible compared with the block.
- Working set ≈ `(α + β) · √t + γ`. With `α ≈ 64–128 B` per active position, whole-genome runs sit around 30–80 MB; even conservative assumptions keep the bound <100 MB.
- The bound holds because only the current block, rolling boundary, and compact merge stack reside in memory; older state is recomputed on demand.
- The FM-index/reference can be memory-mapped, so the O(√t) claim concerns Rosalind’s dynamic working set relative to dataset size.
## Why This Is Different
- Partition-invariant determinism – bit-for-bit identical outputs across runs, regardless of block size or partitioning; ideal for clinical audits, SOP lock-down, and incident investigation.
- Strict, test-enforced O(√t) memory – whole-genome runs fit in well under 100 MB; CI gates fail if the bound regresses.
- Full-history equivalence – recomputation (not truncation) guarantees identical results to an unbounded-memory evaluation.
- True streaming reads → variants – no need to materialize large intermediates; reduces IO/storage pressure and time-to-first-result.
- Standards without heavyweight infra – interoperable SAM/BAM/VCF emission while retaining streaming and memory advantages.
- Cache-resident execution – keeping state in L1/L2 minimizes cache misses and paging on modest hardware, improving real-world throughput outside of large servers.
- Composable extensions that inherit guarantees – plugins and Python bindings share the same compressed evaluator and workspace pool, preserving memory bounds and determinism.
## Clinical Relevance
- Keeps PHI on-site—runs comfortably on 8–16 GB hospital desktops and field laptops without cloud transfer.
- Bit-for-bit reproducibility simplifies CAP/CLIA audits, SOP lock-down, and incident review.
- Predictable <100 MB working set avoids OOMs and keeps shared schedulers/workstations stable.
- Streaming-friendly operation tolerates intermittent connectivity and minimizes temp-file churn in the field.
## How It Compares
| Capability | Rosalind | Typical Stack |
| --- | --- | --- |
| Peak RAM (WGS) | <100 MB working set; no multi-GB temp files | Often >50 GB RAM plus large intermediates |
| Determinism | Bit-for-bit identical outputs; partition invariant | Often varies with thread ordering or sharding |
| Partition invariance | Validated in CI across block sizes | Repartitioning can alter outputs |
| Streaming outputs | Reads → SAM/BAM → VCF without materializing huge files | Batch stages typically require full intermediates |
| Standards | SAM/BAM/VCF with streaming-friendly pipeline | Standards supported, but streaming often limited |
| On-prem edge viability | Runs on 8–16 GB laptops; PHI stays on-site | Assumes high-RAM servers or cloud resources |
| Guardrails/tests | CI enforces O(√t), determinism, property tests | Unit tests common; resource/determinism guards rare |
## Core Capabilities
- FM-index Alignment – Blocked Burrows–Wheeler/FM-index search with per-block rank/select checkpoints uses only O(√t) working state, making it practical to align entire reference genomes on a mid-spec laptop.
- Streaming Variant Calling – On-the-fly pileups with Bayesian scoring keep memory bounded while emitting variants live; ideal for remote surveillance, bedside genomics, or interactive notebooks.
- Standards-compliant outputs – Interoperable SAM/BAM/VCF with streaming-friendly IO, minimizing large intermediate files and sort/index overhead.
- Plugin & Python Ecosystem – Implement the `GenomicPlugin` trait or call into the PyO3 bindings to add custom analyses without duplicating memory—e.g., RNA expression summaries or coverage drop-out checks for diagnostics.
- Rolling Boundary – During DFS evaluation only the latest block summary is retained; older summaries are discarded, guaranteeing `O(b + T + log T) = O(√t)`.
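The per-block rank/select checkpoints behind the FM-index alignment can be sketched in miniature. The example below simplifies to a single bitvector (an FM-index keeps one such counter per symbol of the alphabet) and is not Rosalind's actual index structure:

```rust
// Minimal sketch of blocked rank checkpoints (simplified to a bitvector;
// an FM-index keeps one counter per symbol). A checkpoint every BLOCK
// bits means a rank query scans at most one block.

const BLOCK: usize = 64;

struct RankIndex {
    bits: Vec<bool>,
    checkpoints: Vec<usize>, // ones seen before each block boundary
}

impl RankIndex {
    fn new(bits: Vec<bool>) -> Self {
        let mut checkpoints = Vec::with_capacity(bits.len() / BLOCK + 1);
        let mut ones = 0;
        for (i, &b) in bits.iter().enumerate() {
            if i % BLOCK == 0 { checkpoints.push(ones); }
            if b { ones += 1; }
        }
        RankIndex { bits, checkpoints }
    }

    /// rank1(i): number of set bits in bits[..i].
    fn rank1(&self, i: usize) -> usize {
        let block = i / BLOCK;
        let mut ones = self.checkpoints[block];
        for &b in &self.bits[block * BLOCK..i] { if b { ones += 1; } }
        ones
    }
}

fn main() {
    let bits: Vec<bool> = (0..1000).map(|i| i % 3 == 0).collect();
    let idx = RankIndex::new(bits.clone());
    // Cross-check against a naive full scan.
    for i in [0, 1, 64, 500, 1000] {
        assert_eq!(idx.rank1(i), bits[..i].iter().filter(|&&b| b).count());
    }
    println!("rank1(1000) = {}", idx.rank1(1000));
}
```

The trade is the usual one: checkpoints cost `n / BLOCK` counters, and each query costs at most `BLOCK` bit inspections; production indexes tune the block size so queries stay cache-resident.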
## How Rosalind Achieves O(√t) Space
- Block decomposition – Workloads are partitioned into √t blocks with deterministic summaries; only one block’s buffer is in memory at a time, avoiding whole-read caches.
- Height-compressed trees – Child summaries merge in an implicit tree of height O(log√t); pointerless DFS stores just 2 bits per level and recomputes endpoints on the fly.
- Streaming ledger – Two bits per block track completion, preventing duplicate merges or stored intermediate trees.
- Workspace pooling – A single reusable allocation is shared across components (alignment, pileups, plugins), avoiding allocator churn and preserving the bound.
- Execution flow – (reads) → block alignment → rolling boundary update → tree merge → streaming outputs (variants, metrics, analytics). This pipeline mirrors typical aligner + variant-caller workflows, but with drastically lower memory.
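The height-compressed merge can be sketched with a classic trick: combine block summaries the way a binary counter carries, so at most about `⌈log₂(num_blocks)⌉` partial summaries are ever held. The `(count, sum)` summary type below is an illustrative stand-in, not the engine's actual summary:

```rust
// Sketch of a height-compressed merge: block summaries are combined like
// binary-counter carries, so the stack never exceeds ~log2(n) entries.
// The (blocks, sum) summary is an illustrative stand-in.

#[derive(Clone, Copy, Debug, PartialEq)]
struct Summary { blocks: u64, sum: u64 }

fn merge(a: Summary, b: Summary) -> Summary {
    Summary { blocks: a.blocks + b.blocks, sum: a.sum + b.sum }
}

/// Fold block summaries with a stack bounded by ~log2(n) entries.
pub fn fold_blocks(block_sums: &[u64]) -> (Summary, usize) {
    let mut stack: Vec<Summary> = Vec::new();
    let mut max_depth = 0;
    for &s in block_sums {
        let mut top = Summary { blocks: 1, sum: s };
        // Carry: merge equal-sized summaries, like incrementing a binary counter.
        while stack.last().map_or(false, |t| t.blocks == top.blocks) {
            top = merge(stack.pop().unwrap(), top);
        }
        stack.push(top);
        max_depth = max_depth.max(stack.len());
    }
    // Final sweep merges whatever partial summaries remain.
    let total = stack.into_iter().reduce(merge).unwrap();
    (total, max_depth)
}

fn main() {
    let block_sums: Vec<u64> = (1..=1024).collect();
    let (total, depth) = fold_blocks(&block_sums);
    println!("total = {}, peak stack depth = {}", total.sum, depth);
    assert_eq!(total.sum, 1024 * 1025 / 2);
    assert!(depth <= 11); // ⌈log2(1024)⌉ + 1
}
```

With `√t` blocks, this stack term is the `O(log num_blocks)` piece of the space bound quoted under Key Guarantees; the block buffer itself remains the dominant cost.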
## Repository Layout
- `src/framework/` – Generic compressed evaluator and configuration helpers.
- `src/genomics/` – FM-index, alignment summaries, pileups, variant calling, and shared types.
- `src/plugin/` – Plugin trait, executor, registry, and example RNA-seq quantification plugin.
- `src/python_bindings/` – PyO3 bridge exp
