Celltypist
A tool for semi-automatic cell type classification
CellTypist is an automated cell type annotation tool for scRNA-seq datasets on the basis of logistic regression classifiers optimised by the stochastic gradient descent algorithm. CellTypist allows for cell prediction using either built-in (with a current focus on immune sub-populations) or custom models, in order to assist in the accurate classification of different cell types and subtypes.
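To illustrate the idea of a logistic regression classifier optimised by stochastic gradient descent, here is a minimal, self-contained sketch on toy data. It is a binary, pure-Python illustration under simplifying assumptions, not CellTypist's own training code; the function names (`fit_logistic_sgd`, `predict_proba`) and the two made-up "gene" features are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Logistic function mapping a decision score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_sgd(X, y, lr=0.1, epochs=200, seed=0):
    """Fit binary logistic regression with plain stochastic gradient descent.

    X is a list of feature lists and y a list of 0/1 labels. A single sample
    is used per update, which is what makes the descent 'stochastic'.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    bias = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = bias + sum(w * x for w, x in zip(weights, X[i]))
            # Gradient of the log loss with respect to the decision score.
            error = sigmoid(score) - y[i]
            weights = [w - lr * error * x for w, x in zip(weights, X[i])]
            bias -= lr * error
    return weights, bias

def predict_proba(X, weights, bias):
    """Return the probability of class 1 for every row of X."""
    return [sigmoid(bias + sum(w * x for w, x in zip(weights, row))) for row in X]

# Toy data (hypothetical): two made-up features per cell,
# labelled 1 when the first feature dominates.
X = [[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]]
y = [1, 1, 0, 0]

weights, bias = fit_logistic_sgd(X, y)
probs = predict_proba(X, weights, bias)
print([round(p, 2) for p in probs])
```

Real models cover many cell types at once and are trained on expression data, so this sketch only conveys the optimisation principle.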
CellTypist website
Information about CellTypist can also be found on our CellTypist portal.
Interactive tutorials
Using CellTypist for cell type classification
Using CellTypist for multi-label classification
Best practice in large-scale cross-dataset label transfer using CellTypist
Install CellTypist
Using pip 
pip install celltypist
Using conda 
conda install -c bioconda -c conda-forge celltypist
Usage (classification)
<details> <summary><strong>1. Use in the Python environment</strong></summary>
<details>
<summary><strong>1.1. Import the module</strong></summary>

```python
import celltypist
from celltypist import models
```

</details>
<details>
<summary><strong>1.2. Download available models</strong></summary>

The models serve as the basis for cell type predictions. Information about the available models can also be found here.

```python
#Show all available models that can be downloaded and used.
models.models_description()
#Download a specific model, for example, `Immune_All_Low.pkl`.
models.download_models(model = 'Immune_All_Low.pkl')
#Download a list of models, for example, `Immune_All_Low.pkl` and `Immune_All_High.pkl`.
models.download_models(model = ['Immune_All_Low.pkl', 'Immune_All_High.pkl'])
#Update the models by re-downloading the latest versions if you think they may be outdated.
models.download_models(model = ['Immune_All_Low.pkl', 'Immune_All_High.pkl'], force_update = True)
#Show the local directory storing these models.
models.models_path
```

A simple way is to download all available models. Since each model is on average 1 megabyte (MB), we encourage users to download all of them.

```python
#Download all the available models.
models.download_models()
#Update all models by re-downloading the latest versions if you think they may be outdated.
models.download_models(force_update = True)
```

By default, a folder `.celltypist/` will be created in the user's home directory to store model files. A different path/folder can be specified by exporting the environment variable `CELLTYPIST_FOLDER` in your configuration file (e.g. in `~/.bash_profile`).

```bash
#In the shell configuration file.
export CELLTYPIST_FOLDER='/path/to/model/folder/'
```

</details>
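As a rough illustration of the lookup described in 1.2, the sketch below resolves a base folder from the `CELLTYPIST_FOLDER` environment variable and falls back to `~/.celltypist` otherwise. The helper name `resolve_model_folder` is hypothetical and this is not CellTypist's own implementation; any subfolder layout inside the base folder is not shown, and `models.models_path` remains the way to see the directory actually in use.

```python
import os
from pathlib import Path

def resolve_model_folder(environ=None):
    """Return the base folder a tool like CellTypist could use for model files.

    Hypothetical helper for illustration only: it prefers the
    CELLTYPIST_FOLDER environment variable and otherwise falls back to
    ~/.celltypist in the user's home directory.
    """
    environ = os.environ if environ is None else environ
    custom = environ.get("CELLTYPIST_FOLDER")
    if custom:
        return Path(custom).expanduser()
    return Path.home() / ".celltypist"

if __name__ == "__main__":
    # Prints the folder that would be chosen in the current shell session.
    print(resolve_model_folder())
```

Passing a plain dictionary instead of the real environment, as in `resolve_model_folder({"CELLTYPIST_FOLDER": "/tmp/models"})`, makes the behaviour easy to check without changing the shell configuration.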
<details>
<summary><strong>1.3. Overview of the models</strong></summary>

All models are serialised in a binary format by pickle.

```python
#Get an overview of the models that are downloaded in `1.2.`.
#By default (`on_the_fly = False`), all possible models (even those that are not downloaded) are shown.
models.models_description(on_the_fly = True)
```

</details>
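Since the model files are pickle-serialised, the general round trip looks like the sketch below. It uses a toy dictionary as a stand-in for a model; the keys and values are invented for illustration and do not reflect the real contents of a CellTypist model file.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for a model object; these keys and values are hypothetical.
toy_model = {"cell_types": ["B cells", "T cells"], "features": ["CD19", "CD3E"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_model.pkl"
    # Serialise the object to a binary file.
    with open(path, "wb") as handle:
        pickle.dump(toy_model, handle)
    # Read it back; the restored object equals the original.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored["cell_types"])
```

For actual CellTypist models, `models.Model.load` is the intended entry point rather than calling `pickle` directly.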
<details>
<summary><strong>1.4. Inspect the model of interest</strong></summary>

To take a look at a given model, load the model as an instance of the `Model` class as defined in CellTypist.

```python
#Select the model from the above list. If the `model` argument is not provided, will default to `Immune_All_Low.pkl`.
model = models.Model.load(model = 'Immune_All_Low.pkl')
#The model summary information.
model
#Examine cell types contained in the model.
model.cell_types
#Examine genes/features contained in the model.
model.features
```

</details>
<details>
<summary><strong>1.5. Celltyping based on the input of count table</strong></summary>
CellTypist accepts the input data as a count table (cell-by-gene or gene-by-cell) in the format of `.txt`, `.csv`, `.tsv`, `.tab`, `.mtx` or `.mtx.gz`. A raw count matrix (reads or UMIs) is required. Non-expressed genes (if you are sure they are not expressed in your data) should also be included in the input table, as they point to the negative transcriptomic signatures when compared with the model used.

```python
#Get a demo test data. This is a UMI count csv file with cells as rows and gene symbols as columns.
input_file = celltypist.samples.get_sample_csv()
```

Assign the cell type labels from the model to the input test cells using the `celltypist.annotate` function.
```python
#Predict the identity of each input cell.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl')
#Alternatively, the model argument can be a previously loaded `Model` as in 1.4.
predictions = celltypist.annotate(input_file, model = model)
```

If your input file is in a gene-by-cell format (genes as rows and cells as columns), pass in the `transpose_input = True` argument. In addition, if the input is provided in the `.mtx` format, you will also need to specify the `gene_file` and `cell_file` arguments as the files containing names of genes and cells, respectively.

```python
#In case your input file is a gene-by-cell table.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', transpose_input = True)
#In case your input file is a gene-by-cell mtx file.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', transpose_input = True, gene_file = '/path/to/gene/file.txt', cell_file = '/path/to/cell/file.txt')
```

Again, if the `model` argument is not specified, CellTypist will by default use the `Immune_All_Low.pkl` model.

The
`annotate` function will return an instance of the `AnnotationResult` class as defined in CellTypist.

```python
#Summary information for the prediction result.
predictions
#Examine the predicted cell type labels.
predictions.predicted_labels
#Examine the matrix representing the decision score of each cell belonging to a given cell type.
predictions.decision_matrix
#Examine the matrix representing the probability each cell belongs to a given cell type (transformed from decision matrix by the sigmoid function).
predictions.probability_matrix
```

By default, with the
`annotate` function, each query cell is predicted into the cell type with the largest score/probability among all possible cell types (`mode = 'best match'`). This mode is straightforward and can be used to differentiate between highly homogeneous cell types.

However, in some scenarios where a query cell cannot be assigned to any cell type in the reference model (i.e., a novel cell type) or can be assigned to multiple cell types (i.e., multi-label classification), a mode of probability match can be turned on (`mode = 'prob match'`) with a probability cutoff (defaulting to 0.5, `p_thres = 0.5`) to decide the cell types (none, 1, or multiple) assigned to a given cell.

```python
#Query cell will get the label of 'Unassigned' if it fails to pass the probability cutoff in each cell type.
#Query cell will get multiple label outputs (concatenated by '|') if more than one cell type passes the probability cutoff.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', mode = 'prob match', p_thres = 0.5)
```

The three tables in the
`AnnotationResult` (`.predicted_labels`, `.decision_matrix` and `.probability_matrix`) can be written out to local files (tables) by the function `to_table`, specifying the target `folder` for storage and the `prefix` common to each table.

```python
#Export the three results to csv tables.
predictions.to_table(folder = '/path/to/a/folder', prefix = '')
#Alternatively, export the three results to a single Excel table (.xlsx).
predictions.to_table(folder = '/path/to/a/folder', prefix = '', xlsx = True)
```

The resulting
`AnnotationResult` can also be transformed to an AnnData, which stores the expression matrix in the log1p normalised format (to 10,000 counts per cell), by the function [to_adata](https://celltypist.readthedocs.io/en/latest/celltypist.classif
