Metacells 0.9.5 - Single-cell RNA Sequencing Analysis

.. image:: https://readthedocs.org/projects/metacells/badge/?version=latest
    :target: https://metacells.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

The metacells package implements the improved metacell algorithm [1]_ for single-cell RNA sequencing (scRNA-seq) data analysis within the `scipy <https://www.scipy.org/>`_ framework, and a projection algorithm based on it [2]_. The original metacell algorithm [3]_ was implemented in R. The Python package contains various algorithmic improvements and scales to larger data sets (millions of cells).

Metacell Analysis

Naively, scRNA-seq data is a set of cell profiles, where for each profile, for each gene, we get a count of the mRNA molecules that existed in the cell for that gene. This serves as an indicator of how "expressed" or "active" the gene is.

As in any real-world technology, the raw data may suffer from technical artifacts (counting the molecules of two cells in one profile, counting the molecules from a ruptured cell, counting only the molecules from the cell nucleus, etc.). This requires pruning the raw data to exclude such artifacts.

Data from current scRNA-seq technology is also very sparse (typically <<10% of the RNA molecules are counted). This introduces large sampling variance on top of the original signal, which itself contains significant inherent biological noise.

Analyzing scRNA-seq data therefore requires processing the profiles in bulk. Classically, this has been done by directly clustering the cells using various methods.

In contrast, the metacell approach groups together profiles of cells of the "same" biological state, using the minimal number of profiles needed for computing robust statistics (in particular, mean gene expression). Each such group is a single "metacell".

By summing profiles of cells of the "same" state together, each metacell greatly reduces the sampling variance, and provides a more robust estimation of the transcription state. Note a metacell is not a cell type (multiple metacells may belong to the same "type", or even have the "same" state, if the data sufficiently over-samples this state). Also, a metacell is not a parametric model of the cell state. It is merely a more robust description of some cell state.
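The variance reduction described above can be illustrated with a small simulation. This is a minimal sketch using only the Python standard library, not the package's algorithm: the gene fractions, the number of cells, and the UMIs per cell are made-up assumptions, and the "metacell" is simply the per-gene sum over cells that are known in advance to share one state.

```python
import random
import statistics


def sample_profile(fractions, n_umis, rng):
    """Sample n_umis molecules from the true gene fractions; return per-gene counts."""
    counts = [0] * len(fractions)
    for gene in rng.choices(range(len(fractions)), weights=fractions, k=n_umis):
        counts[gene] += 1
    return counts


def mean_abs_error(counts, fractions):
    """Mean absolute error between the observed and the true gene fractions."""
    total = sum(counts)
    return statistics.fmean(abs(c / total - f) for c, f in zip(counts, fractions))


rng = random.Random(0)
true_fractions = [0.5, 0.3, 0.15, 0.04, 0.01]  # hypothetical shared cell state
cells = [sample_profile(true_fractions, 500, rng) for _ in range(100)]

# Sum the profiles of all cells of the "same" state, gene by gene.
metacell = [sum(gene_counts) for gene_counts in zip(*cells)]

single_cell_error = statistics.fmean(mean_abs_error(c, true_fractions) for c in cells)
metacell_error = mean_abs_error(metacell, true_fractions)
print(f"single cell: {single_cell_error:.4f}, metacell: {metacell_error:.4f}")
```

The summed profile estimates the underlying fractions far more closely than a typical single cell does. This is the only point the sketch makes; deciding which cells share a state is the hard part and is not shown here.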

The metacells should therefore be further analyzed as if they were cells, using additional methods to classify cell types, detect cell trajectories and/or lineage, build parametric models for cell behavior, etc. Using metacells as input for such analysis techniques should benefit both from the more robust, less noisy input; and also from the (~100-fold) reduction in the number of cells to analyze when dealing with large data (e.g. analyzing millions of individual cells).

A common use case is taking a new data set and using an existing atlas with annotations (in particular, "type" annotations) to provide initial annotations for the new data set. As of version 0.9 this capability is provided by this package.

Metacell projection provides a quantitative "projected" genes profile for each query metacell in the atlas, together with a "corrected" one for the same subset of genes shared between the query and the atlas. Actual correction is optional, to be used only if there are technological differences between the data sets, e.g. 10X v2 vs. 10X v3. This allows performing a quantitative comparison between the projected and corrected gene expression profiles for determining whether the query metacell is a novel state that does not exist in the atlas, or, if it does match an atlas state, analyze any differences it may still have. This serves both for quality control and for quantitative analysis of perturbed systems (e.g. knockouts or disease models) in comparison to a baseline atlas.

Terminology and Results Format

NOTE: Version 0.9 breaks compatibility with version 0.8 when it comes to some APIs and the names and semantics of the result annotations. See below for the description of updated results (and how they differ from version 0.8). The new format is meant to improve the usability of the system in downstream analysis pipelines. For convenience we also list here the results of the new projection pipeline added in version 0.9.*. Versions 0.9.1 and 0.9.2 contain some bug fixes. Version 0.9.3 allows specifying target UMIs for the metacells, in addition to the target size in cells, and adaptively tries to satisfy both. This should produce better-sized metacells "out of the box" compared to the 0.9.[0-2] versions. The latest published version, 0.9.4, contains minor bug fixes and updates for newer versions of dependency packages.

If you have existing metacell data that was computed using version 0.8 (the current published version you will get from using ``pip install metacells``), you can use the provided `conversion script <https://github.com/tanaylab/metacells/blob/master/bin/convert_0.8_to_0.9.py>`_ to migrate your data to the format described below, while preserving any additional annotations you may have created for your data (e.g. metacells type annotations). The script will not modify your existing data files, so you can examine the results and tweak them if necessary.

In an upcoming version we will migrate from using ``AnnData`` to using ``daf`` to represent the data (``h5ad`` files will still be supported, either directly through an adapter or via a conversion process). This will again unavoidably break API compatibility, but will provide many advantages over the restricted ``AnnData`` APIs.

We apologize for the inconvenience.

Metacells Computation
.....................

In theory, the only inputs required for metacell analysis are cell gene profiles with a UMIs count per gene per cell. In practice, a key part of the analysis is specifying lists of genes for special treatment. We use the following terminology for these lists:

``excluded_gene``, ``excluded_cell`` masks
    Excluded genes (and/or cells) are totally ignored by the algorithm (e.g. mitochondrial genes, cells with too few
    UMIs).

    Deciding on the "right" list of excluded genes (and cells) is crucial for creating high-quality metacells. We rely
    on the analyst to provide this list based on prior biological knowledge. To support this supervised task, we provide
    the ``exclude_genes`` and ``exclude_cells`` functions which implement "reasonable" strategies for detecting some
    (not all) of the genes and cells to exclude. For example, these will exclude any genes found by
    ``find_bursty_lonely_genes`` (called ``find_noisy_lonely_genes`` in v0.8). Additional considerations might be to
    use ``relate_genes`` to (manually) exclude genes that are highly correlated with known-to-need-to-be-excluded genes,
    or to exclude any cells that are marked as doublets, etc.

    Currently the first step of the processing must be to create a "clean" data set which lacks the excluded genes and
    cells (e.g. using ``extract_clean_data``). When we switch to ``daf`` we'll just stay with the original data set and
    apply the exclusion masks to the rest of the algorithm.
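As a rough illustration of what exclusion masks and a "clean" data set mean, the following standard-library sketch builds both masks by hand and extracts the remaining data. It does not call the package's functions; the gene names, UMI counts, pattern list and threshold (``EXCLUDED_GENE_PATTERNS``, ``MIN_CLEAN_TOTAL_UMIS``) are hypothetical values chosen for the example.

```python
import re

gene_names = ["MT-CO1", "MT-ND1", "CD3E", "LYZ", "HBB"]
# Rows are cells, columns are genes (UMI counts); all values are made up.
umis = [
    [30, 20, 5, 0, 1],
    [2, 1, 40, 3, 0],
    [1, 0, 0, 2, 0],
    [3, 2, 1, 60, 4],
]

EXCLUDED_GENE_PATTERNS = ["MT-.*"]  # hypothetical: mitochondrial genes
MIN_CLEAN_TOTAL_UMIS = 10           # hypothetical: cells with too few UMIs

# Per-gene mask: True for genes that match any excluded pattern.
excluded_gene = [
    any(re.fullmatch(pattern, name) for pattern in EXCLUDED_GENE_PATTERNS)
    for name in gene_names
]


def clean_total(cell):
    """Total UMIs of a cell, counting only the non-excluded genes."""
    return sum(count for count, excluded in zip(cell, excluded_gene) if not excluded)


# Per-cell mask: True for cells with too few UMIs in the non-excluded genes.
excluded_cell = [clean_total(cell) < MIN_CLEAN_TOTAL_UMIS for cell in umis]

# The "clean" data keeps only the non-excluded genes and cells.
clean_gene_names = [name for name, excluded in zip(gene_names, excluded_gene) if not excluded]
clean_umis = [
    [count for count, excluded in zip(cell, excluded_gene) if not excluded]
    for cell, cell_excluded in zip(umis, excluded_cell)
    if not cell_excluded
]
print(clean_gene_names)
print(clean_umis)
```

Note that the first cell has plenty of UMIs overall, but almost all of them come from excluded genes, so it is dropped together with the nearly empty third cell.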

``lateral_gene`` mask
    Lateral genes are forbidden from being selected for computing cells similarity (e.g., cell cycle genes). In version
    0.8 these were called "forbidden" genes. Lateral genes are still counted towards the total UMIs count when computing
    gene expression levels for cells similarity. In addition, lateral genes are still used to compute deviant (outlier)
    cells. That is, each computed metacell should still have a consistent gene expression level even for lateral genes.

    The motivation is that we don't want the algorithm to even try to create metacells based on these genes. Since these
    genes may be very strong (again, cell cycle), they would overcome the cell-type genes we are interested in,
    resulting, for example, in an "M-state" metacell which combines cells from several (similar) cell types.

    Deciding on the "right" list of lateral genes is crucial for creating high-quality metacells. We rely on the analyst
    to provide this list based on prior biological knowledge. To support this supervised task, we provide the
    ``relate_genes`` pipeline for identifying genes closely related to known lateral genes, so they can be added to the
    list.
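The two roles of lateral genes (ignored when choosing similarity genes, yet still counted in each cell's total) can be sketched as follows. This is a toy standard-library example under stated assumptions, not the package's feature selection or similarity computation: here every non-lateral gene is used as a similarity gene, the distance is a plain sum of absolute fraction differences, and the gene names and counts are made up.

```python
gene_names = ["MKI67", "TOP2A", "CD3E", "LYZ"]
lateral_gene = [True, True, False, False]  # hypothetical cell-cycle genes

# Rows are cells, columns are genes (UMI counts); all values are made up.
cells = [
    [10, 8, 30, 2],  # T-like, cycling
    [0, 1, 28, 1],   # T-like, not cycling
    [9, 9, 1, 40],   # myeloid-like, cycling
    [1, 0, 2, 37],   # myeloid-like, not cycling
]


def to_fractions(cell):
    """Gene fractions of a cell; lateral genes still count towards the total."""
    total = sum(cell)
    return [count / total for count in cell]


cell_fractions = [to_fractions(cell) for cell in cells]

# Only non-lateral genes may be used for computing similarity.
feature_genes = [index for index, lateral in enumerate(lateral_gene) if not lateral]


def feature_distance(a, b):
    """Distance between two fraction profiles, restricted to the feature genes."""
    return sum(abs(a[index] - b[index]) for index in feature_genes)


same_type = feature_distance(cell_fractions[0], cell_fractions[1])
same_cycle = feature_distance(cell_fractions[0], cell_fractions[2])
print(f"same type: {same_type:.3f}, same cell-cycle state: {same_cycle:.3f}")
```

With the cell cycle genes masked out of the distance, the cycling T-like cell is closest to the other T-like cell rather than to the other cycling cell, which is the behavior the lateral mask is meant to encourage.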

``noisy_gene`` mask
    Noisy genes are given more freedom when computing deviant (outlier) cells. That is, we don't expect the expression
    level of such genes in the cells in the same metacell to be as consistent as we do for regular (non-noisy) genes.
    Note this isn't related to the question of whether the gene is lateral or not. That is, a gene may be lateral,
    noisy, both, or neither.

    The motivation is that some genes are inherently bursty and therefore cause many cells which are otherwise a good
    match for their metacell to be marked as deviant (outliers). An indication of this can be found by examining the
    ``deviant_fold`` matrix (see below).

    Deciding on the "right" list of noisy genes is again crucial for creating high-quality metacells (and minimizing the
    fraction of outlier cells). Again we rely on the analyst here.
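To make "more freedom" concrete, here is a minimal standard-library sketch of threshold-based outlier detection in which noisy genes get a looser threshold. It is an assumption-laden toy, not the package's deviant detection: the regularization, both thresholds (``MIN_FOLD``, ``MIN_NOISY_FOLD``), the gene names and the counts are all hypothetical.

```python
import math

gene_names = ["CD3E", "LYZ", "HBB"]
noisy_gene = [False, False, True]  # hypothetical: treat HBB as bursty

# Cells grouped into one candidate metacell (UMI counts, all made up).
cells = [
    [50, 50, 0],
    [48, 52, 0],
    [52, 48, 0],
    [47, 49, 4],  # a burst of the noisy gene, otherwise a good match
    [10, 90, 0],  # genuinely different in a regular gene
]

REGULARIZATION = 0.01  # hypothetical, added to fractions before the ratio
MIN_FOLD = 1.0         # hypothetical log2 threshold for regular genes
MIN_NOISY_FOLD = 3.0   # hypothetical, looser log2 threshold for noisy genes


def to_fractions(counts):
    total = sum(counts)
    return [count / total for count in counts]


# The metacell profile is the per-gene sum over its cells.
metacell_fractions = to_fractions([sum(column) for column in zip(*cells)])


def fold_factors(cell):
    """Log2 fold of the cell's gene fractions relative to the metacell's."""
    return [
        math.log2((cell_fraction + REGULARIZATION) / (metacell_fraction + REGULARIZATION))
        for cell_fraction, metacell_fraction in zip(to_fractions(cell), metacell_fractions)
    ]


def is_deviant(cell, noisy_mask):
    """A cell is deviant if any gene exceeds the threshold that applies to it."""
    return any(
        abs(fold) > (MIN_NOISY_FOLD if noisy else MIN_FOLD)
        for fold, noisy in zip(fold_factors(cell), noisy_mask)
    )


with_noisy_mask = [is_deviant(cell, noisy_gene) for cell in cells]
without_noisy_mask = [is_deviant(cell, [False] * len(gene_names)) for cell in cells]
print(with_noisy_mask)
print(without_noisy_mask)
```

Without the noisy mask, the cell with the burst is discarded along with the genuinely different cell; with the mask, only the genuinely different cell is marked as an outlier.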

Having determined the inputs and possibly tweaked the hyper-parameters (a favorite one is ``target_metacell_size``, which by default is 160K UMIs; this may be reduced for small data sets and increased for larger ones), one typically runs ``divide_and_conquer_pipeline`` to obtain the following:

``metacell`` (index) vs. ``metacell_name`` (string) per cell
    The result of computing metacells for a set of cells with the above assigns each cell a metacell index. We also give each metacell a name of the format ``M<index>.<checksum>`` where the checksum reflects the cells grouped into the me
