CellTypist is an automated cell type annotation tool for scRNA-seq datasets, based on logistic regression classifiers optimised by stochastic gradient descent. It predicts cell identities using either built-in models (currently focused on immune sub-populations) or custom models, to assist in the accurate classification of cell types and subtypes.

CellTypist website

More information about CellTypist can be found in our CellTypist portal.

Interactive tutorials

Using CellTypist for cell type classification
Using CellTypist for multi-label classification
Best practice in large-scale cross-dataset label transfer using CellTypist

Install CellTypist

Using pip

```
pip install celltypist
```

Using conda

```
conda install -c bioconda -c conda-forge celltypist
```

Usage (classification)

<details> <summary>1. Use in the Python environment</summary>

<details> <summary>1.1. Import the module</summary>
```
import celltypist
from celltypist import models
```
</details>

<details> <summary>1.2. Download available models</summary>

The models serve as the basis for cell type predictions. Information about the available models can also be found here.

```
#Show all available models that can be downloaded and used.
models.models_description()
#Download a specific model, for example, `Immune_All_Low.pkl`.
models.download_models(model = 'Immune_All_Low.pkl')
#Download a list of models, for example, `Immune_All_Low.pkl` and `Immune_All_High.pkl`.
models.download_models(model = ['Immune_All_Low.pkl', 'Immune_All_High.pkl'])
#Update the models by re-downloading the latest versions if you think they may be outdated.
models.download_models(model = ['Immune_All_Low.pkl', 'Immune_All_High.pkl'], force_update = True)
#Show the local directory storing these models.
models.models_path
```

The simplest approach is to download all available models. Since each model is about 1 megabyte (MB) on average, we encourage users to download all of them.

```
#Download all the available models.
models.download_models()
#Update all models by re-downloading the latest versions if you think they may be outdated.
models.download_models(force_update = True)
```

By default, a folder .celltypist/ will be created in the user's home directory to store model files. A different path/folder can be specified by exporting the environment variable CELLTYPIST_FOLDER in your configuration file (e.g. in ~/.bash_profile).

```
#In the shell configuration file.
export CELLTYPIST_FOLDER='/path/to/model/folder/'
```
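
The lookup described above (environment variable first, home-directory default otherwise) can be sketched in a few lines of Python. This is only an illustration of the pattern: `resolve_model_folder` is a hypothetical helper, not a function of the `celltypist` package, and the exact sub-folder layout CellTypist uses inside the folder is not shown.

```python
import os
from pathlib import Path

def resolve_model_folder(env=None):
    """Return the folder chosen by CELLTYPIST_FOLDER, else ~/.celltypist.

    Hypothetical helper for illustration; not part of the celltypist API.
    """
    env = os.environ if env is None else env
    default = Path.home() / ".celltypist"
    return Path(env.get("CELLTYPIST_FOLDER", default))

# Without the variable, the default folder in the home directory is used.
print(resolve_model_folder({}))
# With the variable exported, the custom folder takes precedence.
print(resolve_model_folder({"CELLTYPIST_FOLDER": "/path/to/model/folder/"}))
```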

</details>

<details> <summary>1.3. Overview of the models</summary>

All models are serialised in a binary format by pickle.

```
#Get an overview of the models that are downloaded in `1.2.`.
#By default (`on_the_fly = False`), all possible models (even those that are not downloaded) are shown.
models.models_description(on_the_fly = True)
```
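
Because the models are pickle files, the usual pickle caveat applies: unpickling can execute arbitrary code, so only load model files from sources you trust. The round trip below shows what serialisation by pickle means in practice. The dictionary is a toy stand-in invented for this sketch and does not reflect the actual contents of a CellTypist model file.

```python
import pickle

# Toy stand-in for a model object (hypothetical; not the real model layout).
toy_model = {"cell_types": ["B cells", "T cells"], "features": ["CD19", "CD3E"]}

# Serialise to a binary blob, as would be written to a .pkl file.
blob = pickle.dumps(toy_model)

# Deserialise. Only do this with data from a trusted source.
restored = pickle.loads(blob)
print(restored == toy_model)
```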

</details>

<details> <summary>1.4. Inspect the model of interest</summary>

To take a look at a given model, load the model as an instance of the Model class as defined in CellTypist.

```
#Select the model from the above list. If the `model` argument is not provided, will default to `Immune_All_Low.pkl`.
model = models.Model.load(model = 'Immune_All_Low.pkl')
#The model summary information.
model
#Examine cell types contained in the model.
model.cell_types
#Examine genes/features contained in the model.
model.features
```

</details>

<details> <summary>1.5. Celltyping based on the input of count table</summary>
CellTypist accepts input data as a count table (cell-by-gene or gene-by-cell) in .txt, .csv, .tsv, .tab, .mtx or .mtx.gz format. A raw count matrix (reads or UMIs) is required. If you are sure that certain genes are not expressed in your data, we suggest including them in the input table as well, because their absence of expression provides negative transcriptomic signatures when compared with the model used.
```
#Get a demo test dataset. This is a UMI count csv file with cells as rows and gene symbols as columns.
input_file = celltypist.samples.get_sample_csv()
```
Assign the cell type labels from the model to the input test cells using the celltypist.annotate function.
```
#Predict the identity of each input cell.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl')
#Alternatively, the model argument can be a previously loaded `Model` as in 1.4.
predictions = celltypist.annotate(input_file, model = model)
```
If your input file is in a gene-by-cell format (genes as rows and cells as columns), pass in the transpose_input = True argument. In addition, if the input is provided in the .mtx format, you will also need to specify the gene_file and cell_file arguments as the files containing names of genes and cells, respectively.
```
#In case your input file is a gene-by-cell table.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', transpose_input = True)
#In case your input file is a gene-by-cell mtx file.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', transpose_input = True, gene_file = '/path/to/gene/file.txt', cell_file = '/path/to/cell/file.txt')
```
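To make the two orientations concrete, the sketch below transposes a tiny gene-by-cell CSV into the cell-by-gene layout using only the standard library. The gene and cell names are made up, and `transpose_table` is a hypothetical helper. With CellTypist itself you would pass `transpose_input = True` rather than transposing the file yourself.

```python
import csv
import io

def transpose_table(text):
    """Swap rows and columns of a small CSV table given as a string."""
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    # zip(*rows) turns each original column into a row.
    csv.writer(out, lineterminator="\n").writerows(zip(*rows))
    return out.getvalue()

# Genes as rows, cells as columns (toy data).
gene_by_cell = ",cell1,cell2\nCD3E,5,0\nCD19,0,7\n"

# Cells as rows, genes as columns.
cell_by_gene = transpose_table(gene_by_cell)
print(cell_by_gene)
```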
Again, if the model argument is not specified, CellTypist will by default use the Immune_All_Low.pkl model.

The annotate function will return an instance of the AnnotationResult class as defined in CellTypist.
```
#Summary information for the prediction result.
predictions
#Examine the predicted cell type labels.
predictions.predicted_labels
#Examine the matrix representing the decision score of each cell belonging to a given cell type.
predictions.decision_matrix
#Examine the matrix representing the probability each cell belongs to a given cell type (transformed from decision matrix by the sigmoid function).
predictions.probability_matrix
```
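The relation between the two matrices can be illustrated with a small sketch that applies the sigmoid function to each decision score independently. The scores and cell type names are invented for the example, and the element-wise application is an assumption based on the description above. Because each value is transformed on its own, the probabilities of one cell need not sum to 1.

```python
import math

def sigmoid(score):
    """Numerically stable logistic function mapping a score to (0, 1)."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

# Toy decision scores of one cell for three cell types (made-up values).
decision_scores = {"B cells": 2.0, "T cells": -1.0, "NK cells": 0.0}

# A score of 0 maps to 0.5; positive scores map above it, negative below.
probabilities = {cell_type: sigmoid(s) for cell_type, s in decision_scores.items()}
print(probabilities)
```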
By default, the annotate function assigns each query cell to the cell type with the largest score/probability among all possible cell types (mode = 'best match'). This mode is straightforward and can be used to differentiate between highly homogeneous cell types.

However, a query cell may not belong to any cell type in the reference model (i.e., a novel cell type), or may belong to several cell types (i.e., multi-label classification). For these scenarios, a probability match mode can be turned on (mode = 'prob match') with a probability cutoff (p_thres, 0.5 by default) that decides which cell types (none, one, or multiple) are assigned to a given cell.
```
#A query cell is labelled 'Unassigned' if it fails the probability cutoff for every cell type.
#A query cell gets multiple labels (concatenated by '|') if more than one cell type passes the probability cutoff.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', mode = 'prob match', p_thres = 0.5)
```
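The difference between the two modes can be sketched for a single cell as follows. This is a simplified illustration rather than CellTypist's implementation: `best_match` and `prob_match` are hypothetical helpers, the probabilities are made up, and both the strict `>` comparison against the cutoff and the order of the concatenated labels are assumptions.

```python
def best_match(probabilities):
    """Return the single cell type with the largest probability."""
    return max(probabilities, key=probabilities.get)

def prob_match(probabilities, p_thres=0.5):
    """Return all cell types above the cutoff joined by '|', or 'Unassigned'."""
    passed = [cell_type for cell_type, p in probabilities.items() if p > p_thres]
    return "|".join(passed) if passed else "Unassigned"

# Toy per-cell probabilities (made-up values).
multi_label_cell = {"B cells": 0.7, "Plasma cells": 0.6, "T cells": 0.05}
novel_cell = {"B cells": 0.2, "T cells": 0.3}

print(best_match(multi_label_cell), "/", prob_match(multi_label_cell))
print(best_match(novel_cell), "/", prob_match(novel_cell))
```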
The three tables in the AnnotationResult (.predicted_labels, .decision_matrix and .probability_matrix) can be written to local files with the to_table function, which takes the target folder for storage and a prefix common to all tables.
```
#Export the three results to csv tables.
predictions.to_table(folder = '/path/to/a/folder', prefix = '')
#Alternatively, export the three results to a single Excel table (.xlsx).
predictions.to_table(folder = '/path/to/a/folder', prefix = '', xlsx = True)
```
The resulting AnnotationResult can be also transformed to an AnnData which stores the expression matrix in the log1p normalised format (to 10,000 counts per cell) by the function [to_adata](https://celltypist.readthedocs.io/en/latest/celltypist.classif
