HPCell

Compose single-cell and spatial analyses with pipes (|>) and let HPCell scale them on HPC, in the cloud, or locally.

HPCell is a grammar and workflow composer for building pipe-friendly single-cell and spatial pipelines. These pipelines are converted into a fully integrated, dependency-based parallelised workflow that can be deployed to HPC with no setup, and just as easily to cloud computing.

<img src="man/figures/hpcell_cover.png" width="70%" />

The advent of advanced sequencing techniques, such as microfluidic, microwell, droplet-based methodologies, and spatial transcriptomics technologies, has significantly transformed the study of biological systems. These advancements have resulted in extensive human and mouse cell compendiums, necessitating scalable analysis pipelines for large-scale data. Single-cell RNA-seq pipelines, including Nextflow and Snakemake, provide robust solutions but require extensive customisation and scripting. Graphical workflow systems like Galaxy and KNIME, while user-friendly, often lack the scalability needed for large datasets. With tools like Bioconductor, Seurat, scater, and others, the R ecosystem offers comprehensive solutions for single-cell RNA sequencing analysis but faces limitations in parallelisation efficiency.

The targets R ecosystem addresses these challenges by offering an integrated, high-performance, cloud-compatible solution that optimises computational resources and enhances workflow robustness. We introduce HPCell, a modular grammar based on tidy principles, utilising targets to parallelise single-cell and spatial R workflows efficiently. HPCell simplifies workflow management, improves scalability and reproducibility, and generates visual reports for enhanced data interpretability, significantly accelerating scientific discovery in the single-cell RNA sequencing community.

The key features of HPCell include:

  • Native R Pipeline: HPCell is developed to work natively within the R environment, enhancing usability for R users without requiring them to learn new, workflow-specific languages.

  • High Performance Computing Support: HPCell supports scaling for large datasets and enables parallel processing of tasks on High Performance Computing platforms.

  • Reproducibility and Consistency: The framework ensures reproducibility and consistent execution environments through automatic dependency generation.

Installation

remotes::install_github("MangiolaLaboratory/HPCell")

The input

The pipeline accepts a vector of file paths. If this vector is named, those names will be used for the reports. HPCell accepts a variety of data containers, including on-disk containers (which we recommend):

  • seurat_rds
  • seurat_h5
  • sce_rds
  • anndata
  • sce_hdf5

library(Seurat)
library(SeuratData)

options(Seurat.object.assay.version = "v5")
input_seurat <- 
  LoadData("pbmc3k") |>
  _[,1:500] 

file_path = "~/temp_seurat.rds"

input_seurat |> saveRDS(file_path)

# Let's pretend we have three samples
input_hpc =  c(file_path, file_path, file_path) 
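Since names on the input vector are used to label the reports, you can also name the file paths; a minimal sketch (the sample names here are illustrative, not from the dataset):

```r
# Hypothetical sample names; any names on the vector will label the reports
file_path = "~/temp_seurat.rds"

input_hpc = c(
  sample_1 = file_path,
  sample_2 = file_path,
  sample_3 = file_path
)

names(input_hpc)
#> [1] "sample_1" "sample_2" "sample_3"
```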

Pipeline

The power of HPCell is that, although you compose your pipeline simply with the pipe (|>), it is evaluated as a dependency graph by the targets workflow manager. This has several advantages:

  • It renders a report automatically for each step
  • It can parallelise across samples and across steps, improving performance. Our provided modules were designed to minimise dependencies between analyses. For example, doublet inference and reference-based cell annotation run independently, while filtering happens downstream.
  • Our provided modules never save the whole object unless necessary, which optimises on-disk memory use. For example, the empty-droplet, doublet and cell-annotation labels are stored in a data frame and integrated downstream, when needed.
  • It can accept a growing sample set, and runs analyses only on the new samples. This is ideal for continuous integration.
  • It can accept new analysis steps, and runs only the needed dependencies.
  • When parameters of an analysis step are changed (e.g. a threshold), HPCell re-runs only the affected downstream dependencies. This is helpful because a parameter change can affect downstream analyses, which are re-run automatically.
  • The pipeline can be easily extended with community-contributed modules.

library(HPCell)
library(crew)

input_hpc |> 
  
  # Initialise pipeline characteristics
  initialise_hpc(
    gene_nomenclature = "symbol",
    data_container_type = "seurat_rds"
  ) |> 
  
  remove_empty_DropletUtils() |>          # Remove empty outliers
  remove_dead_scuttle() |>                # Remove dead cells
  score_cell_cycle_seurat() |>            # Score cell cycle
  remove_doublets_scDblFinder() |>        # Remove doublets
  annotate_cell_type() |>                 # Annotation across SingleR and Seurat Azimuth
  normalise_abundance_seurat_SCT(factors_to_regress = c(
    "subsets_Mito_percent", 
    "subsets_Ribo_percent", 
    "G2M.Score"
  )) |> 
  calculate_pseudobulk(group_by = "monaco_first.labels.fine")

Deployment

Local parallel computing

Your pipeline can be deployed locally using multiple cores. You specify the resource allocation as computing_resources in initialise_hpc().

computing_resources = crew_controller_local(workers = 10)
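As in the tiering example below, the controller is passed to initialise_hpc() via the computing_resources argument; a minimal sketch (downstream steps omitted, arguments as used elsewhere in this README):

```r
library(HPCell)
library(crew)

input_hpc |>
  initialise_hpc(
    gene_nomenclature = "symbol",
    data_container_type = "seurat_rds",
    # Up to 10 parallel workers on the local machine
    computing_resources = crew_controller_local(workers = 10)
  ) |>
  remove_empty_DropletUtils()
```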

SLURM HPC parallel computing

This resource interfaces with SLURM without the need for complex setup

library(crew.cluster)

computing_resources =
      crew.cluster::crew_controller_slurm(
        slurm_memory_gigabytes_per_cpu = 5,
        workers = 100, # at most 100 jobs are launched to the cluster at a time
        tasks_max = 5 # Shuts down a worker after 5 tasks
      )

Resource tiering

For large-scale analyses, where datasets vary in size, using a single resource set can be inefficient. For example, a few very large datasets might require 50 GB of memory, while the majority might only require 5 GB. Requesting 50 GB for all jobs would be wasteful (it would hit the per-user usage limit and restrict parallelisation), while requesting 5 GB would cause the large jobs to fail.

You can provide an array of tier labels (of the same length as the sample input vector), together with the resources linked to those labels

# For three samples
input_hpc |> 
  
  # Initialise pipeline characteristics
  initialise_hpc(
    gene_nomenclature = "symbol",
    
    # We have three samples, to which we give two different resources
    data_container_type = "seurat_rds", tier = c("tier_1", "tier_1", "tier_2"), 
    
    # We specify the two resources
    computing_resources = list(

      crew_controller_slurm(
        name = "tier_1",
        slurm_memory_gigabytes_per_cpu = 5,
        workers = 50,
        tasks_max = 5
      ),
      crew_controller_slurm(
        name = "tier_2",
        slurm_memory_gigabytes_per_cpu = 50,
        workers = 10,
        tasks_max = 5
      )
    )
  )

Extend HPCell and create new modules

HPCell offers a module constructor that allows users and developers to build new modules on the fly and/or contribute to the ecosystem.

Simple module

For example, let’s create a toy module that normalises the Seurat datasets

input_hpc |> 
  
  # Initialise pipeline characteristics
  initialise_hpc(
    gene_nomenclature = "symbol",
    data_container_type = "seurat_rds"
  ) |> 
  
  hpc_iterate(
    user_function = NormalizeData |> quote(), # The function, quoted to not be evaluated on the spot
    object = "data_object" |> is_target(), # The argument to the function. `is_target()` declares the dependency.
    # Other arguments that are not a dependency can also be used
    target_output = "seurat_normalised", # The name of the output dependency
    packages = "Seurat" # Software packages needed for the execution
  )

tar_read(seurat_normalised)

Complex module

Now let’s create a more complex module that accepts both SingleCellExperiment and Seurat objects

input_hpc |> 
  
  # Initialise pipeline characteristics
  initialise_hpc(
    gene_nomenclature = "symbol",
    data_container_type = "seurat_rds"
  ) |> 
  
  hpc_iterate(
    user_function = (function(x){
      
      if(x |> is("SingleCellExperiment"))
        x |> as.Seurat() |> NormalizeData()
      
      else if(x |> is("Seurat"))
        x |> NormalizeData()
      
      else warning("Data format not accepted")
      
    }) |> quote(), # The function, quoted to not be evaluated on the spot
    x = "data_object" |> is_target(), # The argument to the function. `is_target()` declares the dependency.
    target_output = "seurat_normalised", # The name of the output dependency
    packages = "Seurat" # Software packages needed for the execution
  )

tar_read(seurat_normalised)

Reports module

HPCell allows you to create reports from any combination of analysis results; each report is rendered once the analyses it depends on have completed.

Here we provide a toy example

input_hpc |> 
  
  # Initialise pipeline characteristics
  initialise_hpc(
    gene_nomenclature = "symbol",
    data_container_type = "seurat_rds"
  ) |> 
  
  hpc_report(
    "empty_report", # The name of the report output
    rmd_path = paste0(system.file(package = "HPCell"), "/rmd/test.Rmd"), # The path to the Rmd. In this case it is stored within the package
    empty_list = "empty_tbl" |> is_target(), # The results and targets needed for the report
    sample_names = "sample_names" |> is_target() # The results and targets needed for the report
  ) 

tar_read(empty_report)

Details on prebuilt steps for several popular methods

Filtering out empty droplets

Parameters:

  1. input_read_RNA_assay: SingleCellExperiment object containing RNA assay data.
  2. filter_empty_droplets: Logical value indicating whether to filter the input data.

We filter empty droplets because they don’t represent cells, but contain only ambient RNA.
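For intuition, empty-droplet detection of this kind is typically done with DropletUtils::emptyDrops(); a standalone sketch, assuming a SingleCellExperiment `sce` holding raw, unfiltered counts (HPCell’s actual module wraps this step with its own defaults):

```r
library(DropletUtils)

set.seed(42)                           # emptyDrops uses Monte Carlo p-values
e_out <- emptyDrops(counts(sce))       # compare each droplet to the ambient RNA profile
is_cell <- e_out$FDR <= 0.001          # keep droplets significant at 0.1% FDR
sce <- sce[, which(is_cell)]           # which() also drops NA (never-tested droplets)
```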
