SCSA
SCSA: cell type annotation for single-cell RNA-seq data
Install / Use
/learn @bioinfo-ibms-pumc/SCSAREADME
SCSA: cell type annotation for single-cell RNA-seq data
Currently most methods take manual strategies to annotate cell types after clustering the single-cell RNA-seq data. Such methods are labor-intensive and heavily rely on user expertise, which may lead to inconsistent results. We present SCSA, an automatic tool to annotate cell types from single-cell RNA-seq data, based on a score annotation model combining differentially expressed genes and confidence levels of cell markers in databases. Evaluation on real scRNA-seq datasets that SCSA is able to assign the cells into the correct types at a fully automated mode with a desirable precision.<br><br> SCSA is maintained by Yinghao Cao [yhcao@ibms.pumc.edu.cn]. Any suggestion is welcome.
Update list (Date: 2023/4/12)
- CellMarker database v2 was integrated, the number of marker evidence increased from 48257 to 91969. User can use this version with cmd '-d whole_v2.db' instead.
Download and Installation
git clone https://github.com/bioinfo-ibms-pumc/SCSA.git
pip install pandas numpy scipy openpyxl
Command Lines
SCSA.py [-h] -i INPUT [-o OUTPUT] [-d DB] [-s SOURCE] [-c CLUSTER]
[-M MARKERDB] [-f FOLDCHANGE] [-p PVALUE] [-w WEIGHT]
[-g SPECIES] [-k TISSUE] [-m OUTFMT] [-T CELLTYPE]
[-t TARGET] [-E] [-N] [-b] [-l]
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input file for marker annotation(Only
CSV format supported).
-o OUTPUT, --output OUTPUT
Output file for marker annotation.
-d DB, --db DB Database for annotation. (whole.db)
-s SOURCE, --source SOURCE
Source of marker genes. (cellranger,[seurat],[scanpy],
[scran])
-c CLUSTER, --cluster CLUSTER
Only deal with one cluster of marker genes.
(all,[1],[1,2,3],[...])
-M MARKERDB, --MarkerDB MARKERDB
User-defined marker database in table format with two
columns.First column as Cellname, Second refers to
Genename.
-f FOLDCHANGE, --foldchange FOLDCHANGE
Fold change threshold for marker filtering. (2.0)
-p PVALUE, --pvalue PVALUE
P-value threshold for marker filtering. (0.05)
-w WEIGHT, --weight WEIGHT
Weight threshold for marker filtering from cellranger
v1.0 results. (100)
-g SPECIES, --species SPECIES
Species for annotation. Only used for cellmarker
database. ('Human',['Mouse'])
-k TISSUE, --tissue TISSUE
Tissue for annotation. Only used for cellmarker
database. Multiple tissues should be seperated
by commas.Run '-l' option to see all tissues.
In linux platform:('All',['Bone marrow'],['Bone marrow,Brain,Blood'][...])
In windows platform:("All",["Bone marrow"],["Bone marrow,Brain,Blood"][...])
-m OUTFMT, --outfmt OUTFMT
Output file format for marker annotation. (ms-
excel,[txt])
-T CELLTYPE, --celltype CELLTYPE
Cell type for annotation. (normal,[cancer])
-t TARGET, --target TARGET
Target to annotation class in Database.
(cellmarker,[cancersea])
-E, --Gensymbol Using gene symbol ID instead of ensembl ID in input
file for calculation.
-N, --norefdb Only using user-defined marker database for
annotation.
-b, --noprint Do not print any detail results.
-l, --list_tissue List tissue names in database.
Examples
- To annotate a human scRNA-seq sets generated by CellRanger, use the following code
python3 SCSA.py -d whole.db -i cellranger_pbmc_3k.csv -k All -g Human -p 0.01 -f 1.5 -m txt -o sc.txt
- To annotate a human scRNA-seq sets generated by 'FindAllMarkers' function of Seurat(Butler, A., et al. Nature Biotechnology. 2018) with ensemblIDs, use the following code
python3 SCSA.py -d whole.db -s seurat -i seurat_GSE72056.csv -k All -E -g Human -p 0.01 -f 1.5
- To annotate a human scRNA-seq sets generated by Scanpy, use the following code
##### scanpy_pbmc_3k.csv was genearted by following command from anndata object:
### result = adata.uns['rank_genes_groups']
### groups = result['names'].dtype.names
### dat = pd.DataFrame({group + '_' + key[:1]: result[key][group] for group in groups for key in ['names', 'logfoldchanges','scores','pvals']})
### dat.to_csv("scanpy_pbmc_3k.csv")
python3 SCSA.py -d whole.db -i scanpy_pbmc_3k.csv -s scanpy -E -f1.5 -p 0.01 -o result -m txt
- To annotate a human scRNA-seq sets generated by Scran, use the following code
###### scran_pbmc_3k.csv was generated by following command from sce object(due to its pairwise comparisons, we use the mean LFC instead):
### markers <- findMarkers(sce, sce$cluster, pval.type="all")
### res <- data.frame()
### for (i in names(markers)){
### predata <- subset(markers[[i]],select=c(p.value,FDR))
### meandata <- as.matrix(apply(subset(markers[[i]],select=-c(p.value,FDR)),1,mean))
### if (length(res) == 0){
### colnames(meandata) <- paste("LFC",i,sep="_")
### colnames(predata) <- paste(names(predata),i,sep="_")
### res <- cbind(predata,meandata)
### }else{
### predata <- predata[rownames(res),]
### meandata <- as.matrix(meandata[rownames(res),])
### colnames(meandata) <- paste("LFC",i,sep="_")
### colnames(predata) <- paste(names(predata),i,sep="_")
### res <- cbind(res,predata,meandata)
### }
### }
### write.csv(res,file="~/software/SCSA/new_scran_pbmc_3k.csv",quote=FALSE)
python SCSA.py -d whole.db -s scran -i scran_pbmc_3k.csv -k All -g Human -p 0.05 -f 1.1 -b
- To annotate a human scRNA-seq sets generated by 'FindAllMarkers' function of Seurat(Butler, A., et al. Nature Biotechnology. 2018) with both user-defined database and CellMarker database, use the following code
python3 SCSA.py -d whole.db -i seurat_GSE72056.csv -s seurat -E -f1.5 -p 0.01 -o result -m txt -M user.table
- To annotate a human scRNA-seq sets generated by CellRanger only with user-defined database without any detail print, use the following code
python3 SCSA.py -d whole.db -i cellranger_pbmc_3k.csv -f1.5 -p 0.01 -m txt -M user.table -N -b
- To annotate cluster1 of mouse scRNA-seq sets and To annotate cluster1 of mouse scRNA-seq sets generated by CellRanger, use the following code
python3 SCSA.py -d whole.db -s seurat -i seurat_mouse.csv -k All -E -g Mouse -p 0.01 -f 1 -m txt -o testout -c 1
- To list tissue names in the SCSA annotation database, use the following code
python3 SCSA.py -i none -d whole.db -l
Output explanation
The output information from stdout consists of five parts: "#Cluster","Type","Celltype","Score","Times"
“#Cluster” : The cluster id from input file.
“Type” : A subjective symbol for the prediction results.
“Good” means one of the following conditions:
1.Only one celltype found
2.The score of the first predicted celltype is more than twice as much
as the second predicted celltype.
3.The score of the second predicted celltype is a minus.
“?” means the score of the first predicted celltype is less than twice as much
as the second predicted celltype.
“E” means no celltype found.
“Celltype”: The predicted celltype name.
“Score” : The predicted score for a celltype normalized by Z-score method.
“nan” will be assigned if only one celltype found.
“Times” : The score of the first predicted celltype / The score of the second predicted celltype
If you use SCSA for your research, please kindly cite the following paper:
Cao Y, Wang X and Peng G (2020) SCSA: A Cell Type Annotation Tool for Single-Cell RNA-seq Data. Front. Genet. 11:490. doi: https://doi.org/10.3389/fgene.2020.00490
Related Skills
node-connect
351.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
110.6kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
351.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
351.2kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
