Lemur
Latent Embedding Multivariate Regression
Install / Use
/learn @const-ae/LemurREADME
Latent Embedding Multivariate Regression (LEMUR)
<!-- badges: start --> <!-- badges: end --> <img src="man/figures/lemur-art.jpg" data-fig-align="center" />The goal of
lemur is
to simplify the analysis of multi-condition single-cell data. If you
have collected a single-cell RNA-seq dataset with more than one
condition, lemur predicts for each cell and gene what the expression
data would be in all of the other conditions. Furthermore, lemur finds
neighborhoods of cells that show consistent differential expression. The
results are statistically validated using a pseudo-bulk differential
expression test on hold-out data using
glmGamPoi or
edgeR.
lemur implements a novel framework to disentangle the effects of known
covariates, latent cell states, and their interactions. At the core, is
a combination of matrix factorization and regression analysis
implemented as geodesic regression on Grassmann manifolds. We call this
latent embedding multivariate regression. For more details see our
preprint.
Installation
You can install lemur directly from Bioconductor. Just paste the
following snippet into your R console:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("lemur")
Alternatively, you can install the package from Github using devtools:
devtools::install_github("const-ae/lemur")
A note to users
We continue working on improvements to this package. We are delighted if you decide to try out the package. Please use the Bioconductor forum or open an issue in the GitHub repository if you think you found a bug, have an idea for a cool feature, or have any questions about how LEMUR works.
Overview
A basic lemur workflow is as easy as the following.
# ... sce is a SingleCellExperiment object with your data
fit <- lemur(sce, design = ~ patient_id + condition, n_embedding = 15)
fit <- align_harmony(fit) # This step is optional
fit <- test_de(fit, contrast = cond(condition = "ctrl") - cond(condition = "panobinostat"))
nei <- find_de_neighborhoods(fit, group_by = vars(patient_id, condition))
We will now go through these steps one by one.
A worked through example
We demonstrate lemur using data by Zhao et
al. (2021). The data
consist of tumor biopsies from five glioblastomas, each of which was
treated with the drug panobinostat and with a control. Accordingly, we
look at ten samples in a paired experimental design.
We start by loading required packages.
library("tidyverse")
library("SingleCellExperiment")
library("lemur")
set.seed(42)
The lemur package ships with a reduced-size version of the
glioblastoma data, which we use in the following.
data("glioblastoma_example_data", package = "lemur")
glioblastoma_example_data
#> class: SingleCellExperiment
#> dim: 300 5000
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468 ENSG00000139289
#> rowData names(6): gene_id symbol ... strand. source
#> colnames(5000): CGCCAGAGCGCA AGCTTTACTGCG ... TGAACAGTGCGT TGACCGGAATGC
#> colData names(10): patient_id treatment_id ... sample_id id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
As is, the data separated by the known covariates patient_id and
condition.
orig_umap <- uwot::umap(as.matrix(t(logcounts(glioblastoma_example_data))))
as_tibble(colData(glioblastoma_example_data)) |>
mutate(umap = orig_umap) |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = patient_id, shape = condition), size = 0.5) +
labs(title = "UMAP of logcounts") + coord_fixed()
<div class="figure" style="text-align: center">
<img src="man/figures/README-fig-raw_umap-1.png" alt="UMAP of the example dataset `glioblastoma_example_data`, prior to any between-condition alignment." width="80%" />
<p class="caption">
UMAP of the example dataset `glioblastoma_example_data`, prior to any
between-condition alignment.
</p>
</div>
We fit the LEMUR model by calling the function lemur. We provide the
experimental design using a formula. The elements of the formula can
refer to columns of the colData of the SingleCellExperiment object.
We also set the number of latent dimensions, n_embedding, which has a
similar interpretation as the number of dimensions in PCA. The
test_fraction argument sets the fraction of cells which are
exclusively used to test for differential expression and not for
inferring the LEMUR parameters. It balances the sensitivity to detect
subtle patterns in the latent space against the power to detect
differentially expressed genes.
fit <- lemur(glioblastoma_example_data, design = ~ patient_id + condition,
n_embedding = 15, test_fraction = 0.5)
fit
#> class: lemur_fit
#> dim: 300 5000
#> metadata(9): n_embedding design ... use_assay row_mask
#> assays(2): counts logcounts
#> rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468 ENSG00000139289
#> rowData names(6): gene_id symbol ... strand. source
#> colnames(5000): CGCCAGAGCGCA AGCTTTACTGCG ... TGAACAGTGCGT TGACCGGAATGC
#> colData names(10): patient_id treatment_id ... sample_id id
#> reducedDimNames(2): linearFit embedding
#> mainExpName: NULL
#> altExpNames(0):
The lemur function returns an object of class lemur_fit, which extends
the SingleCellExperiment class. It supports subsetting and all the
usual accessor methods (e.g., nrow, assay, colData, rowData). In
addition, lemur overloads the $ operator to allow easy access to
additional fields produced by the LEMUR model. For example, the
low-dimensional embedding can be accessed using fit$embedding:
fit$embedding |> str()
#> num [1:15, 1:5000] 8.7543 0.0759 -1.0401 4.8462 0.529 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr [1:5000] "CGCCAGAGCGCA" "AGCTTTACTGCG" "AGGATGACCGCA" "AGAACTATTTTT" ...
Optionally, we can further align corresponding cells using manually
annotated cell types (align_by_grouping) or an automated alignment
procedure (e.g., align_harmony). This ensures that corresponding cells
are close to each other in the fit$embedding.
fit <- align_harmony(fit)
#> Select cells that are considered close with 'harmony'
<!--FIXME "Select cells that are considered close with 'harmony'" -- should it not be 'Selecting' to avoid confusion with imperative verb form? Is the message even needed? -->
@fig-lemur_umap shows a UMAP of fit$embedding. This is similar to
working on the integrated PCA space in a traditional single-cell
analysis.
umap <- uwot::umap(t(fit$embedding))
as_tibble(fit$colData) |>
mutate(umap = umap) |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = patient_id), size = 0.5) +
facet_wrap(vars(condition)) + coord_fixed()
<div class="figure" style="text-align: center">
<img src="man/figures/README-fig-lemur_umap-1.png" alt="UMAPs of `fit$embedding`. The points are shown separately for the two conditions, but reside in the same latent space." width="100%" />
<p class="caption">
UMAPs of `fit$embedding`. The points are shown separately for the two
conditions, but reside in the same latent space.
</p>
</div>
Next, let’s predict the effect of the panobinostat treatment for
each cell and each gene – even for the cells that were observed in
the control condition. The test_de function takes a lemur_fit object
and returns the object with a new slot (in SummarizedExperiment
parlance: assay) called DE. This slot contains the predicted
logarithmic fold changes between the two conditions specified in
contrast. Note that lemur implements a special notation for
contrasts. Instead of providing a contrast vector or design matrix
column names, you provide for each condition the levels, and lemur
automatically forms the contrast vector. This is intended to make the
notation more readable.
fit <- test_de(fit, contrast = cond(condition = "panobinostat") - cond(condition = "ctrl"))
We can pick any gene, say GAP43, which in our data is represented by its Ensembl gene ID ENSG00000172020, and show its differential expression pattern on the UMAP plot:
df <- tibble(umap = umap) |>
mutate(de = assay(fit, "DE")["ENSG00000172020", ])
ggplot(df, aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de)) +
scale_color_gradient2(low = "#FFD800", high= "#0056B9") + coord_fixed()
ggplot(df, aes(x = de)) + geom_histogram(bins = 100)
<div class="figure" style="text-align: center">
<img src="man/figures/README-fig-umap_de-1.png" alt="Differential expression (log fold changes) of GAP43" width="100%" /><img src="man/figures/README-fig-umap_de-2.png" alt="Differential expression (log fold changes) of GAP43" width="100%" />
<p class="caption"> Differential expression (log fold changes) of GAP43 </p> </div>More systematically, we can now search thro
Related Skills
node-connect
343.1kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
90.0kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
343.1kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
343.1kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
