CuratedAtlasQueryR
Tidy R query API for the harmonised and curated CELLxGENE single-cell atlas.
CuratedAtlasQuery is a query interface that allows the programmatic
exploration and retrieval of the harmonised, curated and reannotated
CELLxGENE single-cell human cell atlas. Data can be retrieved at cell,
sample, or dataset levels based on filtering criteria.
Harmonised data is stored in the ARDC Nectar Research Cloud, and most
CuratedAtlasQuery functions interact with Nectar via web requests, so
a network connection is required for most functionality.
<img src="man/figures/svcf_logo.jpeg" width="155px" height="58px" /><img src="man/figures/czi_logo.png" width="129px" height="58px" /><img src="man/figures/bioconductor_logo.jpg" width="202px" height="58px" /><img src="man/figures/vca_logo.png" width="219px" height="58px" /><img src="man/figures/nectar_logo.png" width="180px" height="58px" />
Usage
The API delivered more than 15 TB of data to the community in its first year. Thanks!
<img src="man/figures/downloads.png" width="40%" />
Query interface
Installation
devtools::install_github("stemangiola/CuratedAtlasQueryR")
Load the package
library(CuratedAtlasQueryR)
Load and explore the metadata
Load the metadata
# Note: in real applications you should use the default value of remote_url
metadata <- get_metadata(remote_url = METADATA_URL)
metadata
#> # Source: table</vast/scratch/users/milton.m/cache/R/CuratedAtlasQueryR/metadata.0.2.3.parquet> [?? x 56]
#> # Database: DuckDB 0.7.1 [unknown@Linux 3.10.0-1160.88.1.el7.x86_64:R 4.2.1/:memory:]
#> cell_ sample_ cell_…¹ cell_…² confi…³ cell_…⁴ cell_…⁵ cell_…⁶ sampl…⁷ _samp…⁸
#> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 8387… 7bd7b8… natura… immune… 5 cd8 tem gmp natura… 842ce7… Q59___…
#> 2 1768… 7bd7b8… natura… immune… 5 cd8 tem cd8 tcm natura… 842ce7… Q59___…
#> 3 6329… 7bd7b8… natura… immune… 5 cd8 tem clp termin… 842ce7… Q59___…
#> 4 5027… 7bd7b8… natura… immune… 5 cd8 tem clp natura… 842ce7… Q59___…
#> 5 7956… 7bd7b8… natura… immune… 5 cd8 tem clp natura… 842ce7… Q59___…
#> 6 4305… 7bd7b8… natura… immune… 5 cd8 tem clp termin… 842ce7… Q59___…
#> 7 2126… 933f96… natura… ilc 1 nk nk natura… c250bf… AML3__…
#> 8 3114… 933f96… natura… immune… 5 mait nk natura… c250bf… AML3__…
#> 9 1407… 933f96… natura… immune… 5 mait clp natura… c250bf… AML3__…
#> 10 2911… 933f96… natura… nk 5 nk clp natura… c250bf… AML3__…
#> # … with more rows, 46 more variables: assay <chr>,
#> # assay_ontology_term_id <chr>, file_id_db <chr>,
#> # cell_type_ontology_term_id <chr>, development_stage <chr>,
#> # development_stage_ontology_term_id <chr>, disease <chr>,
#> # disease_ontology_term_id <chr>, ethnicity <chr>,
#> # ethnicity_ontology_term_id <chr>, experiment___ <chr>, file_id <chr>,
#> # is_primary_data_x <chr>, organism <chr>, organism_ontology_term_id <chr>, …
The metadata variable can then be re-used for all subsequent queries.
Explore the tissue
metadata |>
    dplyr::distinct(tissue, file_id)
#> # Source: SQL [10 x 2]
#> # Database: DuckDB 0.7.1 [unknown@Linux 3.10.0-1160.88.1.el7.x86_64:R 4.2.1/:memory:]
#> tissue file_id
#> <chr> <chr>
#> 1 bone marrow 1ff5cbda-4d41-4f50-8c7e-cbe4a90e38db
#> 2 lung parenchyma 6661ab3a-792a-4682-b58c-4afb98b2c016
#> 3 respiratory airway 6661ab3a-792a-4682-b58c-4afb98b2c016
#> 4 nose 6661ab3a-792a-4682-b58c-4afb98b2c016
#> 5 renal pelvis dc9d8cdd-29ee-4c44-830c-6559cb3d0af6
#> 6 kidney dc9d8cdd-29ee-4c44-830c-6559cb3d0af6
#> 7 renal medulla dc9d8cdd-29ee-4c44-830c-6559cb3d0af6
#> 8 cortex of kidney dc9d8cdd-29ee-4c44-830c-6559cb3d0af6
#> 9 kidney blood vessel dc9d8cdd-29ee-4c44-830c-6559cb3d0af6
#> 10 lung a2796032-d015-40c4-b9db-835207e5bd5b
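Other summaries follow the same pattern as the tissue query above, since the metadata table responds to the usual dplyr verbs. The sketch below is an illustration only: `toy_metadata` is a small hypothetical in-memory stand-in (column names mirror the real metadata, values are invented) so that it runs without a network connection. It counts the cells per tissue and the number of distinct files contributing to each tissue.

```r
library(dplyr)

# Hypothetical stand-in for the table returned by get_metadata();
# the values below are invented for illustration.
toy_metadata <- tibble::tibble(
    cell_   = c("c1", "c2", "c3", "c4", "c5"),
    tissue  = c("lung", "lung", "kidney", "nose", "lung"),
    file_id = c("f1", "f2", "f3", "f1", "f1")
)

# Cells and distinct files per tissue, largest tissue first
tissue_summary <- toy_metadata |>
    group_by(tissue) |>
    summarise(
        n_cells = n(),
        n_files = n_distinct(file_id),
        .groups = "drop"
    ) |>
    arrange(desc(n_cells))

tissue_summary
```

With the real metadata the same pipeline should apply unchanged; because that table is database-backed, adding `dplyr::collect()` at the end would bring the (small) summary into memory.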
Download single-cell RNA sequencing counts
Query raw counts
single_cell_counts <-
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
            stringr::str_like(assay, "%10x%") &
            tissue == "lung parenchyma" &
            stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment()
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts
#> # A SingleCellExperiment-tibble abstraction: 1,571 × 57
#> # Features=36229 | Cells=1571 | Assays=counts
#> .cell sample_ cell_…¹ cell_…² confi…³ cell_…⁴ cell_…⁵ cell_…⁶ sampl…⁷ X_sam…⁸
#> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 AGCG… 11a7dc… CD4-po… cd4 th1 3 cd4 tcm cd8 t th1 10b339… Donor_…
#> 2 TCAG… 11a7dc… CD4-po… cd4 th… 3 cd4 tcm cd4 tem th1/th… 10b339… Donor_…
#> 3 TTTA… 11a7dc… CD4-po… cd4 th… 3 cd4 tcm cd4 tcm th17 10b339… Donor_…
#> 4 ACAC… 11a7dc… CD4-po… immune… 5 cd4 tcm plasma th1/th… 10b339… Donor_…
#> 5 CAAG… 11a7dc… CD4-po… immune… 1 cd4 tcm cd4 tcm mait 10b339… Donor_…
#> 6 CTGT… 14a078… CD4-po… cd4 th… 3 cd4 tcm cd4 tem th1/th… 8f71c5… VUHD85…
#> 7 ACGT… 14a078… CD4-po… treg 5 cd4 tcm tregs t regu… 8f71c5… VUHD85…
#> 8 CATA… 14a078… CD4-po… immune… 5 nk cd8 tem mait 8f71c5… VUHD85…
#> 9 ACTT… 14a078… CD4-po… mait 5 mait cd8 tem mait 8f71c5… VUHD85…
#> 10 TGCG… 14a078… CD4-po… cd4 th1 3 cd4 tcm cd4 tem th1 8f71c5… VUHD85…
#> # … with 1,561 more rows, 47 more variables: assay <chr>,
#> # assay_ontology_term_id <chr>, file_id_db <chr>,
#> # cell_type_ontology_term_id <chr>, development_stage <chr>,
#> # development_stage_ontology_term_id <chr>, disease <chr>,
#> # disease_ontology_term_id <chr>, ethnicity <chr>,
#> # ethnicity_ontology_term_id <chr>, experiment___ <chr>, file_id <chr>,
#> # is_primary_data_x <chr>, organism <chr>, organism_ontology_term_id <chr>, …
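The returned object is a SingleCellExperiment, so the standard Bioconductor accessors apply to it. As a minimal sketch, the example below builds a tiny hypothetical object (`toy_sce`, with invented gene names and counts) rather than downloading data, and shows how the count matrix and the cell-level annotation can be pulled out. The real object may hold its counts in a disk-backed array rather than a plain matrix; that is an assumption to check on your own data.

```r
library(SingleCellExperiment)

# Invented 3-gene x 2-cell count matrix standing in for downloaded data
toy_counts <- matrix(
    c(0, 2, 5, 1, 0, 3),
    nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Pull the counts assay back out as a genes x cells matrix
count_matrix <- assay(toy_sce, "counts")
dim(count_matrix)

# Cell-level annotation lives in colData(); feature names are the row names
colData(toy_sce)
rownames(toy_sce)
```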
Query counts scaled per million
This is helpful if just a few genes are of interest, as their expression can then be compared across samples.
single_cell_counts <-
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
            stringr::str_like(assay, "%10x%") &
            tissue == "lung parenchyma" &
            stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment(assays = "cpm")
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts
#> # A SingleCellExperiment-tibble abstraction: 1,571 × 57
#> # Features=36229 | Cells=1571 | Assays=cpm
#> .cell sample_ cell_…¹ cell_…² confi…³ cell_…⁴ cell_…⁵ cell_…⁶ sampl…⁷ X_sam…⁸
#> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 AGCG… 11a7dc… CD4-po… cd4 th1 3 cd4 tcm cd8 t th1 10b339… Donor_…
#> 2 TCAG… 11a7dc… CD4-po… cd4 th… 3 cd4 tcm cd4 tem th1/th… 10b339… Donor_…
#> 3 TTTA… 11a7dc… CD4-po… cd4 th… 3 cd4 tcm cd4 tcm th17 10b339… Donor_…
#> 4 ACAC… 11a7dc… CD4-po… immune… 5 cd4 tcm plasma th1/th… 10b339… Donor_…
#> 5 CAAG… 11a7dc… CD4-po… immune… 1 cd4 tcm cd4 tcm mait 10b339… Donor_…
#> 6 CTGT… 14a078… CD4-po… cd4 th… 3 cd4 tcm cd4 tem th1/th… 8f71c5… VUHD85…
#> 7 ACGT… 14a078… CD4-po… treg 5 cd4 tcm tregs t regu… 8f71c5… VUHD85…
#> 8 CATA… 14a078… CD4-po… immune… 5 nk cd8 tem mait 8f71c5… VUHD85…
#> 9 ACTT… 14a078… CD4-po… mait 5 mait cd8 tem mait 8f71c5… VUHD85…
#> 10 TGCG… 14a078… CD4-po… cd4 th1 3 cd4 tcm cd4 tem th1 8f71c5… VUHD85…
#> # … with 1,561 more rows, 47 more variables: assay <chr>,
#> # assay_ontology_term_id <chr>, file_id_db <chr>,
#> # cell_type_ontology_term_id <chr>, development_stage <chr>,
#> # development_stage_ontology_term_id <chr>, disease <chr>,
#> # disease_ontology_term_id <chr>, ethnicity <chr>,
#> # ethnicity_ontology_term_id <chr>, experiment___ <chr>, file_id <chr>,
#> # is_primary_data_x <chr>, organism <chr>, organism_ontology_term_id <chr>, …
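To make the scaling concrete, here is a sketch of the conventional counts-per-million calculation in base R. It assumes that the `cpm` assay follows the usual definition (each cell's counts divided by that cell's total, times one million); this README does not spell out how the package computes it, and the matrix below is invented.

```r
# Invented counts: rows are genes, columns are cells
toy_counts <- matrix(
    c(10, 30, 60, 5, 0, 15),
    nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

# Total counts (library size) of each cell
library_sizes <- colSums(toy_counts)

# Divide each column by its library size, then scale to one million
toy_cpm <- sweep(toy_counts, 2, library_sizes, "/") * 1e6

# Every cell now sums to one million, so genes are comparable across cells
colSums(toy_cpm)
```

After scaling, a gene's value no longer depends on how deeply each cell was sequenced, which is why a single gene can be compared across samples.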
Extract only a subset of genes
single_cell_counts <-
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
            stringr::str_like(assay, "%10x%") &
            tissue == "lung parenchyma" &
            stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment(assays = "cpm", features = "PUM1")
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts
#> # A SingleCellExperiment-tibble abstraction: 1,571 × 57
#> # Features=1 | Cells=1571 | Assays=cpm
#> .cell sample_ cell_…¹ cell_…² confi…³ cell_…⁴ cell_…⁵ cell_…⁶ sampl…⁷ X_sam…⁸
#> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 AGCG… 11a7dc… CD4-po… cd4 th1 3 cd4 tcm cd8 t th1 10b339… Donor_…
#> 2 TCAG… 11a7dc… CD4-po… cd4 th… 3 cd4 tcm cd4 tem th1/th… 10b339… Donor_…
#> 3 TTTA… 11a7dc… CD4-po… cd4 th… 3 cd4 tcm cd4 tcm th17 10b339… Donor_…
#> 4 ACAC… 11a7dc… CD4-po… immune… 5 cd4 tcm plasma th1/th… 10b339… Donor_…
#> 5 CAAG… 11a7dc… CD4-po… immune… 1 cd4 tcm cd4 tcm mait 10b339… Donor_…
#> 6 CTGT… 14a078… CD4-po… cd4 th… 3 cd4 tcm cd4 tem th1/th… 8f71c5… VUHD85…
#> 7 ACGT… 14a078… CD4-po… treg 5 cd4 tcm tregs t regu… 8f71c5… VUHD85…
#> 8 CATA… 14a078… CD4-po… immune… 5 nk cd8 tem mait
