Introduction

The types, states, and interactions of cells in human tissues vary greatly. Single-cell transcriptome sequencing (scRNA-seq) is a technique for high-throughput sequencing of the transcriptome in individual cells. It complements conventional transcriptome sequencing (bulk RNA-seq, which measures the average expression of each gene across all cells in a population) by revealing genome-wide gene expression in each cell. This makes it possible to identify the cell types in a tissue and to characterize cell heterogeneity between samples and the tissue microenvironment, giving a more realistic and comprehensive picture of the state of each cell in a bulk tissue and of how the cells relate to one another. Currently, scRNA-seq is mostly used to study complex multicellular systems such as tumors, development, the nervous system, and immune microenvironments.

The purpose of this tool is to connect the analysis steps for single-cell data into one complete pipeline, in order to speed up analysis and contribute to progress in this field.

Tutorials (documents and videos): https://github.com/OpenGene/scrnapip/tree/main/tutorials

A. Environment setup

1. Download the Docker image

docker pull zhangjing12/scrnapip

2. Download the reference genome

#Human reference (GRCh38) dataset required for Cell Ranger.
wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
#Mouse reference dataset required for Cell Ranger.
wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz
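
The config file in section B points ref at an unpacked directory (for example /user/refdata-gex-GRCh38-2020-A), so the downloaded archive presumably has to be extracted first. A minimal Python sketch of that step follows; the function name extract_reference and the destination directory are illustrative and not part of the pipeline.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_reference(archive, dest="."):
    """Unpack a .tar.gz reference archive and return its top-level paths."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Collect the first path component of every member, e.g. the
        # reference directory name.
        top_levels = sorted(
            {
                PurePosixPath(name).parts[0]
                for name in tar.getnames()
                if PurePosixPath(name).parts
            }
        )
        tar.extractall(dest_path)
    return [dest_path / name for name in top_levels]


# Example call (assumes the archive is in the current directory):
# extract_reference("refdata-gex-GRCh38-2020-A.tar.gz", "/user")
```

The returned path is the value that would go into ref in the config file.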

3. Use docker

docker run -d -p 1921:8787 -p 1882:3838 -e PASSWORD=yourpassword -e USERID=youruserid -e GROUPID=yourgroupid -v /yourdatapath:/dockerpath zhangjing12/scrnapip

The image is based on Rocker (https://rocker-project.org/images/versioned/rstudio.html). RStudio listens on port 8787 inside the container; the command above maps it to port 1921 on the host, so RStudio can be opened in a browser on host port 1921, which makes the workflow more convenient to use. Container port 3838 is likewise mapped to host port 1882. The values for USERID and GROUPID can be found with the id command. Make sure the host ports you choose are open.
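
As a sketch of how the pieces of the command fit together, the Python snippet below fills in USERID and GROUPID from the current user (the same values the id command reports) and prints the resulting docker run command. The function name and its parameters are illustrative; the ports, environment variables, and image name are copied from the command above.

```python
import os
import shlex


def build_docker_command(password, data_path, mount_point,
                         rstudio_port=1921, extra_port=1882,
                         image="zhangjing12/scrnapip"):
    """Build the docker run argument list shown in this section."""
    uid = os.getuid()  # same value as `id -u`
    gid = os.getgid()  # same value as `id -g`
    return [
        "docker", "run", "-d",
        "-p", f"{rstudio_port}:8787",   # host port : container port
        "-p", f"{extra_port}:3838",
        "-e", f"PASSWORD={password}",
        "-e", f"USERID={uid}",
        "-e", f"GROUPID={gid}",
        "-v", f"{data_path}:{mount_point}",
        image,
    ]


cmd = build_docker_command("yourpassword", "/yourdatapath", "/dockerpath")
# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch side-effect free; the printed line can be pasted into a shell once the password and paths have been replaced.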

B. Start Workflow

1. Set up the config file

All input files and parameters are set in this configuration file. The main settings that need to be changed are the following:

#####[fastp_cellrange]: Raw data paths. Paired-end data must be split into two files (R1 and R2)
S1.R1=["/usr/data/SAMPLE1_S1_L001_R1_001.fastq.gz"]
S1.R2=["/usr/data/SAMPLE1_S1_L001_R2_001.fastq.gz"]
#If a sample has more than one raw data file, either merge the files beforehand or list the paths separated by ",":
S1.R1=["/usr/data/SAMPLE1.1_S1_L001_R1_001.fastq.gz","/usr/data/SAMPLE1.2_S1_L001_R1_001.fastq.gz"]
S1.R2=["/usr/data/SAMPLE1.1_S1_L001_R2_001.fastq.gz","/usr/data/SAMPLE1.2_S1_L001_R2_001.fastq.gz"]

#####[indata]: cellranger matrix file path
S1="/usr/workout/02.cellranger/S1/outs/filtered_feature_bc_matrix"

#####[outpath]: output path
outpath="/usr/workout"

#####[tempdata]: rds file output path
tempdata="workout"

#####[run]: Set each analysis that should be run to true, for example:
fastp=true #run fastp

#####[fastp]: Configure the fastp path and parameters
fastppath="/home/bin/fastp"
longr=26 #R1 length after trim
ncode=5 #The maximum number of N-bases

#####[cellrangle]: Configure the cellranger path and parameters
dockerusr="1025:1025" #user id
dir="/user/name" #The folder which docker mount
ref="/user/refdata-gex-GRCh38-2020-A" #Reference genome path
cellrangerpath="/home/bin/cellranger-7.1.0/cellranger" #software path of cellranger
expectcell=10000 #Expected number of cells
localcores=32 #Number of threads
localmem=64 #Memory size
include_introns="false" #Whether to analyze introns

#####[step1]:
filetype="10x" #Format of the input file, either "10x" or "csv"
csv_sep="" #Separator of the csv file
nFeature_RNA=[200,5000] #Filter cells by number of detected features, keeping cells with between 200 and 5000 features
percent_mt=[0,10] #Filter cells by mitochondrial percentage, keeping cells whose mitochondrial percentage is below 10%
mttype="MT" #Mitochondrial gene prefix, MT for human and mt for mouse

#####[step2]:
kfilter=200 #Minimum number of cells per sample
normethod="SCT" #The merge method: SCT by default; vst can also be used to simply merge the samples together
nFeature=3000 #Number of genes used for subsequent analysis

#####[step3]:
recluster=true #Whether to perform batch correction
mode="harmony" #Method of batch correction
heatmapnumber=9 #Number of heatmaps drawn for pca
elbowdims=100 #The number of PCs shown in the elbow plot
dims=30 #Select the top 30 PCs for dimensionality reduction
reduction="umap" #tSNE or UMAP
clustercell=true #Whether you need to cluster cells
resolution=0.6 #Set the resolution when clustering
algorithm=1 #Cluster modular optimization algorithm (1 = original Louvain algorithm; 2 = Louvain algorithm with multilevel refinement; 3 = SLM algorithm)
singler="/home/bin/singleRdata/singleRdata/test.rds" #Path to the singleR reference database

#####[step4]:
clustermarkers=true #Whether marker genes of each cluster need to be found
min_pct=0.25 #Minimum fraction of cells in which a marker gene must be detected, 0.25 by default
findmarkers_testuse="wilcox" #Statistical test used to find marker genes
difcluster.test.a=[0,1] #Find differentially expressed genes. To compare samples instead of clusters, change cluster to ident
difcluster.test.b=[5,6] #"test" is the comparison name; a is the case group and b is the control group
difcluster.test.testuse="wilcox" #Statistical test used for the comparison
ClusterProfiler=["true","Rscript","/home/bin/clusterProfiler.R","-a true -s org.Hs.eg.db,hsa,human -g 6 -t SYMBOL -d KEGG,BioCyc,PID,PANTHER,BIOCARTA -C 0.05"] #Enrichment analysis of difference analysis results. -a: Whether to use all background genes; -s: species; -g: The column of the gene in the file; -t: gene name type(SYMBOL,ENTREZID); -d: database name

#####[step5]:
meanexpression=0.5 #Mean expression cutoff for selecting the genes used to order cell states, default is 0.5
genenum=50 #Number of genes in the differential analysis heat map
numclusters=4 #Number of gene clusters in the heat map
pointid=1 #The branch point used in BEAM analysis
BEAMnumclusters=4 #Number of gene clusters in the BEAM heat map
BEAMgn=50 #Number of genes in the BEAM heat map
BEAMgenelist=["S100A12","ALOX5AP","PAD14","NRG1","MCEMP1","THBS1"] #Names of specific genes to show in BEAM analysis

#####[step6]:
circosbin="/home/bin/get_exp.r" #Extraction expression
circos_perl_bin="/home/bin/circos_plot.pl" #Plot circos

#####[step7]:
copykat_bin="/home/bin/copykat_v4.r" #Identify tumor cells

#####[step8]:
cytoTRACE_bin="/home/bin/cytotrace_230508.R" #Developmental potential analysis

#####[step9]:
genomicinstably_bin="/home/bin/genomicinstably.R" #Genomic instability analysis
org="human" #species

#####[step11]:
ClusterProfiler=["true","Rscript","/home/bin/clusterProfiler.R","-a true -s org.Hs.eg.db,hsa,human -g 1 -t SYMBOL -d KEGG,BioCyc,PID,PANTHER,BIOCARTA -C 0.05"] #Enrichment analysis of marker gene
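
To illustrate what the [step1] thresholds mean, here is a minimal sketch in plain Python. It is not the pipeline's own Seurat-based code. It assumes that mitochondrial genes are recognised by the mttype prefix followed by a hyphen (as in MT-CO1), and that the bounds are applied as 200 < features < 5000 and 0 <= percent_mt < 10; whether the pipeline treats the bounds as inclusive is not stated above. The function names and the toy cells are hypothetical.

```python
def percent_mito(counts, mttype="MT"):
    """Percentage of a cell's counts that come from genes named '<mttype>-...'."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mito = sum(n for gene, n in counts.items() if gene.startswith(mttype + "-"))
    return 100.0 * mito / total


def passes_qc(counts, nFeature_RNA=(200, 5000), percent_mt=(0, 10), mttype="MT"):
    """Apply the [step1] feature-count and mitochondrial filters to one cell."""
    n_features = sum(1 for n in counts.values() if n > 0)  # detected genes
    pct = percent_mito(counts, mttype)
    return (nFeature_RNA[0] < n_features < nFeature_RNA[1]
            and percent_mt[0] <= pct < percent_mt[1])


# Toy cells: gene name -> count (made-up values for illustration).
good_cell = {f"GENE{i}": 1 for i in range(300)}
good_cell["MT-CO1"] = 20       # 20 of 320 counts are mitochondrial
stressed_cell = {f"GENE{i}": 1 for i in range(300)}
stressed_cell["MT-CO1"] = 100  # 100 of 400 counts are mitochondrial
sparse_cell = {f"GENE{i}": 1 for i in range(100)}

for name, cell in [("good", good_cell), ("stressed", stressed_cell),
                   ("sparse", sparse_cell)]:
    print(name, passes_qc(cell))
```

For mouse data the same call would use mttype="mt", matching the comment on mttype in the config.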

2. Filter data with fastp and run cellranger

This R script performs read filtering, alignment, and quantification; the relevant parameters are set in the configuration file config_Example.ini.

Rscript /home/bin/fastp_cellranger.r -i config_Example.ini
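
Because this step reads the R1/R2 lists from [fastp_cellrange], a quick pre-flight check can catch mismatched pairs before a long run. The sketch below is not part of the pipeline. It assumes the Illumina-style file names used in the config example (ending in _R1_001.fastq.gz and _R2_001.fastq.gz) and that the two lists are given in matching order; check_pairs is a hypothetical helper name.

```python
import re


def check_pairs(r1_files, r2_files):
    """Return a list of problems found when pairing R1 and R2 FASTQ paths."""
    problems = []
    if len(r1_files) != len(r2_files):
        problems.append(
            f"{len(r1_files)} R1 file(s) but {len(r2_files)} R2 file(s)")
    for r1, r2 in zip(r1_files, r2_files):
        # Swapping the read tag in the R1 name should give the R2 name.
        expected_r2 = re.sub(r"_R1_(\d+\.fastq\.gz)$", r"_R2_\1", r1)
        if expected_r2 == r1:
            problems.append(f"no _R1_ tag in {r1}")
        elif expected_r2 != r2:
            problems.append(f"{r1} is paired with {r2}, expected {expected_r2}")
    return problems


r1 = ["/usr/data/SAMPLE1.1_S1_L001_R1_001.fastq.gz",
      "/usr/data/SAMPLE1.2_S1_L001_R1_001.fastq.gz"]
r2 = ["/usr/data/SAMPLE1.1_S1_L001_R2_001.fastq.gz",
      "/usr/data/SAMPLE1.2_S1_L001_R2_001.fastq.gz"]
print(check_pairs(r1, r2) or "all pairs look consistent")
```

An empty list means every R1 path has the matching R2 path at the same position.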

3. Seurat analysis

This R script is used for all advanced analyses, and relevant parameters are set in the configuration file config_Example.ini.

Rscript /home/bin/singlecell.r -i config_Example.ini
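
The two commands in this section run in order, and the [indata] example in the config points at the cellranger output under 02.cellranger, so the second step is only worth starting if the first one succeeded. A minimal sketch of such a wrapper in Python follows; run_workflow and its runner parameter are hypothetical names, and the script paths are the ones shown above.

```python
import subprocess

# The two workflow commands from this section, without the config argument.
STEPS = [
    ["Rscript", "/home/bin/fastp_cellranger.r", "-i"],
    ["Rscript", "/home/bin/singlecell.r", "-i"],
]


def run_workflow(config, runner=subprocess.run):
    """Run both workflow scripts in order, stopping at the first failure."""
    completed = []
    for step in STEPS:
        cmd = step + [config]
        # With check=True, subprocess.run raises CalledProcessError
        # when the command exits with a non-zero status.
        runner(cmd, check=True)
        completed.append(cmd)
    return completed


# Example call (inside the container, where Rscript and the scripts exist):
# run_workflow("config_Example.ini")
```

The runner parameter exists only so the function can be exercised without Rscript installed; in normal use the default subprocess.run is enough.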

C. Result

1. Fastp

Sequence statistics and read-filtering results for the raw data.

── 01.Fastp/
     └── <SampleName>/                                  <- config for report
            ├── <SampleName>_fastp.html                 <- Report generated by fastp
            ├── <SampleName>_fastp.json                 <- Statistical information generated by fastp
            ├── <SampleName>_S1_L001_R1_001.fastq.gz    <- R1 clean data
            └── <SampleName>_S1_L001_R2_001.fastq.gz    <- R2 clean data

Figure (fastp_summary): quality control summary statistics from fastp.
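
The same summary numbers can also be read from the <SampleName>_fastp.json file. The sketch below assumes the usual layout of a fastp JSON report, with a summary object holding before_filtering and after_filtering entries; the function names are illustrative and the numbers in the example are made up.

```python
import json


def load_fastp_report(path):
    """Read a fastp JSON report from disk."""
    with open(path) as handle:
        return json.load(handle)


def summarize_fastp(report):
    """Pull read counts and the Q30 rate out of a fastp report."""
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    reads_before = before["total_reads"]
    reads_after = after["total_reads"]
    return {
        "reads_before": reads_before,
        "reads_after": reads_after,
        "fraction_kept": reads_after / reads_before if reads_before else 0.0,
        "q30_rate_after": after["q30_rate"],
    }


# Made-up numbers in the shape of a fastp report, for illustration only.
example = {
    "summary": {
        "before_filtering": {"total_reads": 1000, "q30_rate": 0.90},
        "after_filtering": {"total_reads": 900, "q30_rate": 0.95},
    }
}
print(summarize_fastp(example))
```

For a real run, the call would look like summarize_fastp(load_fastp_report("01.Fastp/S1/S1_fastp.json")), with the path following the directory layout above for a sample named S1.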

2. Cellranger

Cellranger results after mapping and quantification.

── 02.Cellranger/
     └── <SampleName>/                       
              └── outs/                                 <- Cellranger analysis results 
                    ├── analysis/                       <- Cluster by cellranger
                    ├── raw_feature_bc_matrix/          <- Unfiltered feature-barcode matrices MEX (usually not used)
                    ├── raw_feature_bc_matrix.h5        <- Unfiltered feature-barcode matrices HDF5
                    ├── filtered_feature_bc_matrix/     <- Filtered feature-barcode matrices MEX
                    ├── filtered_feature_bc_matrix.h5   <- Filtered feature-barcode matrices HDF5
                    ├── molecule_info.h5                <- Per-molecule r
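
As a quick sanity check on the filtered matrix, the number of cells and features can be counted straight from the MEX directory. This sketch assumes the directory contains barcodes.tsv.gz and features.tsv.gz with one entry per line, as written by Cell Ranger 3 and later; matrix_shape is a hypothetical helper, not part of the pipeline.

```python
import gzip
from pathlib import Path


def count_lines(path):
    """Count the lines in a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)


def matrix_shape(matrix_dir):
    """Return the number of cells and features in a MEX matrix directory."""
    directory = Path(matrix_dir)
    return {
        "cells": count_lines(directory / "barcodes.tsv.gz"),
        "features": count_lines(directory / "features.tsv.gz"),
    }


# Example call, using the [indata] path from the config:
# matrix_shape("/usr/workout/02.cellranger/S1/outs/filtered_feature_bc_matrix")
```

Comparing the cell count with expectcell from the config gives a first impression of whether the run captured roughly the expected number of cells.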
