SkillAgentSearch skills...

Lemur

Latent Embedding Multivariate Regression

Install / Use

/learn @const-ae/Lemur
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

<!-- README.md is generated from README.Rmd. Please edit that file --> <!-- 'Stangle' analogon: knitr::purl("Introduction.qmd") -->

Latent Embedding Multivariate Regression (LEMUR)

<!-- badges: start --> <!-- badges: end --> <img src="man/figures/lemur-art.jpg" data-fig-align="center" />

The goal of lemur is to simplify the analysis of multi-condition single-cell data. If you have collected a single-cell RNA-seq dataset with more than one condition, lemur predicts for each cell and gene what the expression data would be in all of the other conditions. Furthermore, lemur finds neighborhoods of cells that show consistent differential expression. The results are statistically validated using a pseudo-bulk differential expression test on hold-out data using glmGamPoi or edgeR.

lemur implements a novel framework to disentangle the effects of known covariates, latent cell states, and their interactions. At the core, is a combination of matrix factorization and regression analysis implemented as geodesic regression on Grassmann manifolds. We call this latent embedding multivariate regression. For more details see our preprint.

<figure> <img src="man/figures/equation_schematic.png" alt="Schematic of the matrix decomposition at the core of LEMUR" /> <figcaption aria-hidden="true">Schematic of the matrix decomposition at the core of LEMUR</figcaption> </figure>

Installation

You can install lemur directly from Bioconductor. Just paste the following snippet into your R console:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("lemur")

Alternatively, you can install the package from Github using devtools:

devtools::install_github("const-ae/lemur")

A note to users

We continue working on improvements to this package. We are delighted if you decide to try out the package. Please use the Bioconductor forum or open an issue in the GitHub repository if you think you found a bug, have an idea for a cool feature, or have any questions about how LEMUR works.

Overview

A basic lemur workflow is as easy as the following.

# ... sce is a SingleCellExperiment object with your data 
fit <- lemur(sce, design = ~ patient_id + condition, n_embedding = 15)
fit <- align_harmony(fit)   # This step is optional
fit <- test_de(fit, contrast = cond(condition = "ctrl") - cond(condition = "panobinostat"))
nei <- find_de_neighborhoods(fit, group_by = vars(patient_id, condition))

We will now go through these steps one by one.

A worked through example

We demonstrate lemur using data by Zhao et al. (2021). The data consist of tumor biopsies from five glioblastomas, each of which was treated with the drug panobinostat and with a control. Accordingly, we look at ten samples in a paired experimental design.

We start by loading required packages.

library("tidyverse")
library("SingleCellExperiment")
library("lemur")
set.seed(42)

The lemur package ships with a reduced-size version of the glioblastoma data, which we use in the following.

data("glioblastoma_example_data", package = "lemur")
glioblastoma_example_data
#> class: SingleCellExperiment 
#> dim: 300 5000 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468 ENSG00000139289
#> rowData names(6): gene_id symbol ... strand. source
#> colnames(5000): CGCCAGAGCGCA AGCTTTACTGCG ... TGAACAGTGCGT TGACCGGAATGC
#> colData names(10): patient_id treatment_id ... sample_id id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

As is, the data separated by the known covariates patient_id and condition.

orig_umap <- uwot::umap(as.matrix(t(logcounts(glioblastoma_example_data))))

as_tibble(colData(glioblastoma_example_data)) |>
  mutate(umap = orig_umap) |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = patient_id, shape = condition), size = 0.5) +
    labs(title = "UMAP of logcounts") + coord_fixed()
<div class="figure" style="text-align: center"> <img src="man/figures/README-fig-raw_umap-1.png" alt="UMAP of the example dataset `glioblastoma_example_data`, prior to any between-condition alignment." width="80%" /> <p class="caption"> UMAP of the example dataset `glioblastoma_example_data`, prior to any between-condition alignment. </p> </div>

We fit the LEMUR model by calling the function lemur. We provide the experimental design using a formula. The elements of the formula can refer to columns of the colData of the SingleCellExperiment object.

We also set the number of latent dimensions, n_embedding, which has a similar interpretation as the number of dimensions in PCA. The test_fraction argument sets the fraction of cells which are exclusively used to test for differential expression and not for inferring the LEMUR parameters. It balances the sensitivity to detect subtle patterns in the latent space against the power to detect differentially expressed genes.

<!--FIXME Do we need to bother users with this parameter here? Just use a reasonable default, and later have a section on "variations", like in the [DESeq2 vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#variations-to-the-standard-workflow) where such things are mentioned?-->
fit <- lemur(glioblastoma_example_data, design = ~ patient_id + condition, 
             n_embedding = 15, test_fraction = 0.5)
fit
#> class: lemur_fit 
#> dim: 300 5000 
#> metadata(9): n_embedding design ... use_assay row_mask
#> assays(2): counts logcounts
#> rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468 ENSG00000139289
#> rowData names(6): gene_id symbol ... strand. source
#> colnames(5000): CGCCAGAGCGCA AGCTTTACTGCG ... TGAACAGTGCGT TGACCGGAATGC
#> colData names(10): patient_id treatment_id ... sample_id id
#> reducedDimNames(2): linearFit embedding
#> mainExpName: NULL
#> altExpNames(0):

The lemur function returns an object of class lemur_fit, which extends the SingleCellExperiment class. It supports subsetting and all the usual accessor methods (e.g., nrow, assay, colData, rowData). In addition, lemur overloads the $ operator to allow easy access to additional fields produced by the LEMUR model. For example, the low-dimensional embedding can be accessed using fit$embedding:

fit$embedding |> str()
#>  num [1:15, 1:5000] 8.7543 0.0759 -1.0401 4.8462 0.529 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:5000] "CGCCAGAGCGCA" "AGCTTTACTGCG" "AGGATGACCGCA" "AGAACTATTTTT" ...

Optionally, we can further align corresponding cells using manually annotated cell types (align_by_grouping) or an automated alignment procedure (e.g., align_harmony). This ensures that corresponding cells are close to each other in the fit$embedding.

fit <- align_harmony(fit)
#> Select cells that are considered close with 'harmony'
<!--FIXME "Select cells that are considered close with 'harmony'" -- should it not be 'Selecting' to avoid confusion with imperative verb form? Is the message even needed? -->

@fig-lemur_umap shows a UMAP of fit$embedding. This is similar to working on the integrated PCA space in a traditional single-cell analysis.

umap <- uwot::umap(t(fit$embedding))

as_tibble(fit$colData) |>
  mutate(umap = umap) |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = patient_id), size = 0.5) +
    facet_wrap(vars(condition)) + coord_fixed()
<div class="figure" style="text-align: center"> <img src="man/figures/README-fig-lemur_umap-1.png" alt="UMAPs of `fit$embedding`. The points are shown separately for the two conditions, but reside in the same latent space." width="100%" /> <p class="caption"> UMAPs of `fit$embedding`. The points are shown separately for the two conditions, but reside in the same latent space. </p> </div>

Next, let’s predict the effect of the panobinostat treatment for each cell and each gene – even for the cells that were observed in the control condition. The test_de function takes a lemur_fit object and returns the object with a new slot (in SummarizedExperiment parlance: assay) called DE. This slot contains the predicted logarithmic fold changes between the two conditions specified in contrast. Note that lemur implements a special notation for contrasts. Instead of providing a contrast vector or design matrix column names, you provide for each condition the levels, and lemur automatically forms the contrast vector. This is intended to make the notation more readable.

fit <- test_de(fit, contrast = cond(condition = "panobinostat") - cond(condition = "ctrl"))

We can pick any gene, say GAP43, which in our data is represented by its Ensembl gene ID ENSG00000172020, and show its differential expression pattern on the UMAP plot:

df <- tibble(umap = umap) |>
  mutate(de = assay(fit, "DE")["ENSG00000172020", ])
 
ggplot(df, aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = de)) +
    scale_color_gradient2(low = "#FFD800", high= "#0056B9") + coord_fixed()

ggplot(df, aes(x = de)) + geom_histogram(bins = 100)
<div class="figure" style="text-align: center">

<img src="man/figures/README-fig-umap_de-1.png" alt="Differential expression (log fold changes) of GAP43" width="100%" /><img src="man/figures/README-fig-umap_de-2.png" alt="Differential expression (log fold changes) of GAP43" width="100%" />

<p class="caption"> Differential expression (log fold changes) of GAP43 </p> </div>

More systematically, we can now search thro

Related Skills

View on GitHub
GitHub Stars100
CategoryDevelopment
Updated11d ago
Forks9

Languages

R

Security Score

80/100

Audited on Mar 20, 2026

No findings