CuratedAtlasQueryR

CuratedAtlasQuery is a query interface that allows programmatic exploration and retrieval of the harmonised, curated and reannotated CELLxGENE single-cell human cell atlas. Data can be retrieved at the cell, sample, or dataset level based on filtering criteria.

Harmonised data is stored in the ARDC Nectar Research Cloud, and most CuratedAtlasQuery functions interact with Nectar via web requests, so a network connection is required for most functionality.

Usage

The API delivered more than 15 TB of data to the community in its first year. Thanks!

Query interface

Installation

devtools::install_github("stemangiola/CuratedAtlasQueryR")

Load the package

library(CuratedAtlasQueryR)

Load and explore the metadata

Load the metadata

# Note: in real applications you should use the default value of remote_url 
metadata <- get_metadata(remote_url = METADATA_URL)
metadata
#> # Source:   table</vast/scratch/users/milton.m/cache/R/CuratedAtlasQueryR/metadata.0.2.3.parquet> [?? x 56]
#> # Database: DuckDB 0.7.1 [unknown@Linux 3.10.0-1160.88.1.el7.x86_64:R 4.2.1/:memory:]
#>    cell_ sample_ cell_…¹ cell_…² confi…³ cell_…⁴ cell_…⁵ cell_…⁶ sampl…⁷ _samp…⁸
#>    <chr> <chr>   <chr>   <chr>     <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
#>  1 8387… 7bd7b8… natura… immune…       5 cd8 tem gmp     natura… 842ce7… Q59___…
#>  2 1768… 7bd7b8… natura… immune…       5 cd8 tem cd8 tcm natura… 842ce7… Q59___…
#>  3 6329… 7bd7b8… natura… immune…       5 cd8 tem clp     termin… 842ce7… Q59___…
#>  4 5027… 7bd7b8… natura… immune…       5 cd8 tem clp     natura… 842ce7… Q59___…
#>  5 7956… 7bd7b8… natura… immune…       5 cd8 tem clp     natura… 842ce7… Q59___…
#>  6 4305… 7bd7b8… natura… immune…       5 cd8 tem clp     termin… 842ce7… Q59___…
#>  7 2126… 933f96… natura… ilc           1 nk      nk      natura… c250bf… AML3__…
#>  8 3114… 933f96… natura… immune…       5 mait    nk      natura… c250bf… AML3__…
#>  9 1407… 933f96… natura… immune…       5 mait    clp     natura… c250bf… AML3__…
#> 10 2911… 933f96… natura… nk            5 nk      clp     natura… c250bf… AML3__…
#> # … with more rows, 46 more variables: assay <chr>,
#> #   assay_ontology_term_id <chr>, file_id_db <chr>,
#> #   cell_type_ontology_term_id <chr>, development_stage <chr>,
#> #   development_stage_ontology_term_id <chr>, disease <chr>,
#> #   disease_ontology_term_id <chr>, ethnicity <chr>,
#> #   ethnicity_ontology_term_id <chr>, experiment___ <chr>, file_id <chr>,
#> #   is_primary_data_x <chr>, organism <chr>, organism_ontology_term_id <chr>, …

The metadata variable can then be re-used for all subsequent queries.

Explore the tissue

metadata |>
    dplyr::distinct(tissue, file_id) 
#> # Source:   SQL [10 x 2]
#> # Database: DuckDB 0.7.1 [unknown@Linux 3.10.0-1160.88.1.el7.x86_64:R 4.2.1/:memory:]
#>    tissue              file_id                             
#>    <chr>               <chr>                               
#>  1 bone marrow         1ff5cbda-4d41-4f50-8c7e-cbe4a90e38db
#>  2 lung parenchyma     6661ab3a-792a-4682-b58c-4afb98b2c016
#>  3 respiratory airway  6661ab3a-792a-4682-b58c-4afb98b2c016
#>  4 nose                6661ab3a-792a-4682-b58c-4afb98b2c016
#>  5 renal pelvis        dc9d8cdd-29ee-4c44-830c-6559cb3d0af6
#>  6 kidney              dc9d8cdd-29ee-4c44-830c-6559cb3d0af6
#>  7 renal medulla       dc9d8cdd-29ee-4c44-830c-6559cb3d0af6
#>  8 cortex of kidney    dc9d8cdd-29ee-4c44-830c-6559cb3d0af6
#>  9 kidney blood vessel dc9d8cdd-29ee-4c44-830c-6559cb3d0af6
#> 10 lung                a2796032-d015-40c4-b9db-835207e5bd5b
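
Before downloading counts, it can help to see how many cells and files each tissue contributes. The sketch below uses a small in-memory table as a stand-in for the real metadata, so `toy_metadata` and its values are hypothetical; only the column names `cell_`, `file_id` and `tissue` come from the output above. The same dplyr verbs are expected to work on the table returned by `get_metadata()`, but that is an assumption rather than something shown here.

```r
library(dplyr)

# Toy stand-in for the metadata table: one row per cell (hypothetical values).
toy_metadata <- tibble(
    cell_   = c("c1", "c2", "c3", "c4", "c5", "c6"),
    file_id = c("f1", "f1", "f2", "f2", "f3", "f3"),
    tissue  = c("lung", "lung", "lung", "kidney", "kidney", "kidney")
)

# Count cells and distinct files for each tissue.
tissue_summary <-
    toy_metadata |>
    group_by(tissue) |>
    summarise(
        n_cells = n(),
        n_files = n_distinct(file_id)
    ) |>
    arrange(tissue)

tissue_summary
```

A summary like this gives a rough idea of how much data a tissue filter would select before any counts are requested.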

Download single-cell RNA sequencing counts

Query raw counts

single_cell_counts = 
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
        stringr::str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment()
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.

single_cell_counts
#> # A SingleCellExperiment-tibble abstraction: 1,571 × 57
#> # Features=36229 | Cells=1571 | Assays=counts
#>    .cell sample_ cell_…¹ cell_…² confi…³ cell_…⁴ cell_…⁵ cell_…⁶ sampl…⁷ X_sam…⁸
#>    <chr> <chr>   <chr>   <chr>     <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
#>  1 AGCG… 11a7dc… CD4-po… cd4 th1       3 cd4 tcm cd8 t   th1     10b339… Donor_…
#>  2 TCAG… 11a7dc… CD4-po… cd4 th…       3 cd4 tcm cd4 tem th1/th… 10b339… Donor_…
#>  3 TTTA… 11a7dc… CD4-po… cd4 th…       3 cd4 tcm cd4 tcm th17    10b339… Donor_…
#>  4 ACAC… 11a7dc… CD4-po… immune…       5 cd4 tcm plasma  th1/th… 10b339… Donor_…
#>  5 CAAG… 11a7dc… CD4-po… immune…       1 cd4 tcm cd4 tcm mait    10b339… Donor_…
#>  6 CTGT… 14a078… CD4-po… cd4 th…       3 cd4 tcm cd4 tem th1/th… 8f71c5… VUHD85…
#>  7 ACGT… 14a078… CD4-po… treg          5 cd4 tcm tregs   t regu… 8f71c5… VUHD85…
#>  8 CATA… 14a078… CD4-po… immune…       5 nk      cd8 tem mait    8f71c5… VUHD85…
#>  9 ACTT… 14a078… CD4-po… mait          5 mait    cd8 tem mait    8f71c5… VUHD85…
#> 10 TGCG… 14a078… CD4-po… cd4 th1       3 cd4 tcm cd4 tem th1     8f71c5… VUHD85…
#> # … with 1,561 more rows, 47 more variables: assay <chr>,
#> #   assay_ontology_term_id <chr>, file_id_db <chr>,
#> #   cell_type_ontology_term_id <chr>, development_stage <chr>,
#> #   development_stage_ontology_term_id <chr>, disease <chr>,
#> #   disease_ontology_term_id <chr>, ethnicity <chr>,
#> #   ethnicity_ontology_term_id <chr>, experiment___ <chr>, file_id <chr>,
#> #   is_primary_data_x <chr>, organism <chr>, organism_ontology_term_id <chr>, …

Query counts scaled per million

This is helpful when only a few genes are of interest, because scaled counts can be compared across samples.

single_cell_counts = 
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
        stringr::str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment(assays = "cpm")
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.

single_cell_counts
#> # A SingleCellExperiment-tibble abstraction: 1,571 × 57
#> # Features=36229 | Cells=1571 | Assays=cpm
#>    .cell sample_ cell_…¹ cell_…² confi…³ cell_…⁴ cell_…⁵ cell_…⁶ sampl…⁷ X_sam…⁸
#>    <chr> <chr>   <chr>   <chr>     <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
#>  1 AGCG… 11a7dc… CD4-po… cd4 th1       3 cd4 tcm cd8 t   th1     10b339… Donor_…
#>  2 TCAG… 11a7dc… CD4-po… cd4 th…       3 cd4 tcm cd4 tem th1/th… 10b339… Donor_…
#>  3 TTTA… 11a7dc… CD4-po… cd4 th…       3 cd4 tcm cd4 tcm th17    10b339… Donor_…
#>  4 ACAC… 11a7dc… CD4-po… immune…       5 cd4 tcm plasma  th1/th… 10b339… Donor_…
#>  5 CAAG… 11a7dc… CD4-po… immune…       1 cd4 tcm cd4 tcm mait    10b339… Donor_…
#>  6 CTGT… 14a078… CD4-po… cd4 th…       3 cd4 tcm cd4 tem th1/th… 8f71c5… VUHD85…
#>  7 ACGT… 14a078… CD4-po… treg          5 cd4 tcm tregs   t regu… 8f71c5… VUHD85…
#>  8 CATA… 14a078… CD4-po… immune…       5 nk      cd8 tem mait    8f71c5… VUHD85…
#>  9 ACTT… 14a078… CD4-po… mait          5 mait    cd8 tem mait    8f71c5… VUHD85…
#> 10 TGCG… 14a078… CD4-po… cd4 th1       3 cd4 tcm cd4 tem th1     8f71c5… VUHD85…
#> # … with 1,561 more rows, 47 more variables: assay <chr>,
#> #   assay_ontology_term_id <chr>, file_id_db <chr>,
#> #   cell_type_ontology_term_id <chr>, development_stage <chr>,
#> #   development_stage_ontology_term_id <chr>, disease <chr>,
#> #   disease_ontology_term_id <chr>, ethnicity <chr>,
#> #   ethnicity_ontology_term_id <chr>, experiment___ <chr>, file_id <chr>,
#> #   is_primary_data_x <chr>, organism <chr>, organism_ontology_term_id <chr>, …
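
To illustrate why scaling per million makes a gene comparable across cells or samples, here is a minimal base R sketch of the usual counts-per-million calculation. The matrix `counts` and its values are hypothetical, and the exact scaling used for the package's `cpm` assay is not described here, so treat this as the standard definition rather than the package's implementation.

```r
# Toy count matrix: rows are genes, columns are cells (hypothetical values).
# cell1 has 100 counts in total, cell2 has 50.
counts <- matrix(
    c(10, 0, 90,
       5, 5, 40),
    nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

# Total counts per cell (the library size).
library_size <- colSums(counts)

# Divide each column by its library size, then scale to one million.
cpm <- sweep(counts, 2, library_size, "/") * 1e6

# geneA has different raw counts in the two cells (10 vs 5)
# but the same share of each cell's total.
cpm["geneA", ]
```

After scaling, every column sums to one million, so a gene's value reflects its share of the cell's total counts rather than the sequencing depth.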

Extract only a subset of genes

single_cell_counts = 
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
        stringr::str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment(assays = "cpm", features = "PUM1")
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.

single_cell_counts
#> # A SingleCellExperiment-tibble abstraction: 1,571 × 57
#> # Features=1 | Cells=1571 | Assays=cpm
#>    .cell sample_ cell_…¹ cell_…² confi…³ cell_…⁴ cell_…⁵ cell_…⁶ sampl…⁷ X_sam…⁸
#>    <chr> <chr>   <chr>   <chr>     <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
#>  1 AGCG… 11a7dc… CD4-po… cd4 th1       3 cd4 tcm cd8 t   th1     10b339… Donor_…
#>  2 TCAG… 11a7dc… CD4-po… cd4 th…       3 cd4 tcm cd4 tem th1/th… 10b339… Donor_…
#>  3 TTTA… 11a7dc… CD4-po… cd4 th…       3 cd4 tcm cd4 tcm th17    10b339… Donor_…
#>  4 ACAC… 11a7dc… CD4-po… immune…       5 cd4 tcm plasma  th1/th… 10b339… Donor_…
#>  5 CAAG… 11a7dc… CD4-po… immune…       1 cd4 tcm cd4 tcm mait    10b339… Donor_…
#>  6 CTGT… 14a078… CD4-po… cd4 th…       3 cd4 tcm cd4 tem th1/th… 8f71c5… VUHD85…
#>  7 ACGT… 14a078… CD4-po… treg          5 cd4 tcm tregs   t regu… 8f71c5… VUHD85…
#>  8 CATA… 14a078… CD4-po… immune…       5 nk      cd8 tem mait
