scNanoGPS: Single cell Nanopore sequencing analysis of Genotype and Phenotype Simultaneously

scNanoGPS is a computational toolkit for analyzing high-throughput single cell nanopore sequencing data to detect Genotype and Phenotype Simultaneously from the same cells. scNanoGPS includes 5 major steps: 1) NanoQC to perform quality control of the raw sequencing data; 2) Scanner to scan and filter out reads that do not have the expected adapter sequence patterns, i.e., TruSeq Read 1 adapter sequence, TSO adapter sequence, poly(A/T)n block sequence, cell barcode (CB) and unique molecular identifier (UMI) sequence blocks; 3) Assigner to detect the list of true cell barcodes, merge cell barcodes that contain sequencing errors, and assign raw reads to single cells; 4) Curator to detect reads with true UMIs and collapse them into consensus sequences of individual molecules, which corrects sequencing errors on gene bodies; 5) Reporter to detect single cell transcriptomes, single cell gene isoforms, and single cell mutations from the consensus single cell long-read data.
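
To make the idea of merging cell barcodes that contain sequencing errors more concrete, here is a minimal sketch. It is not the Assigner implementation: the function names, the greedy strategy, and the edit-distance threshold of 1 are all assumptions chosen for illustration. Barcodes are visited from most to least abundant, and each one is folded into the first more abundant barcode within the allowed edit distance.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def merge_barcodes(counts, max_distance=1):
    """Map every observed barcode to a representative barcode.

    Hypothetical helper: abundant barcodes become representatives first,
    and rarer barcodes within max_distance are folded into them.
    """
    representatives = []
    assignment = {}
    for barcode, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in representatives:
            if levenshtein(barcode, rep) <= max_distance:
                assignment[barcode] = rep
                break
        else:
            # No close representative found: this barcode starts its own group.
            representatives.append(barcode)
            assignment[barcode] = barcode
    return assignment


# Toy read counts per barcode (made-up data).
observed = Counter({"ACGTACGT": 120, "ACGTACGA": 3, "TTTTCCCC": 95, "TTTCCCC": 2})
merged = merge_barcodes(observed)
print(merged)
```

In this toy input the two low-count barcodes differ from an abundant barcode by one substitution and one deletion respectively, so they are assigned to it.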

Keywords

Single cell, Nanopore, RNA sequencing, long read, cell barcode demultiplex, UMI curation, gene expression, isoform, single nucleotide variation

Citing scNanoGPS

Shiau, CK., Lu, L., Kieser, R. et al. High throughput single cell long-read sequencing analyses of same-cell genotypes and phenotypes in human tumors. Nat Commun 14, 4124 (2023). https://doi.org/10.1038/s41467-023-39813-7

Update for scNanoGPS v2.0

Scanner now uses the Biopython Align module and no longer raises deprecation warnings.
Assigner now accepts a cell barcode whitelist and supports long-read spatial RNA-seq data.
Curator efficiency is doubled.
Isoform calling now uses IsoQuant.

Index

Installation
Step 1: NanoQC
Step 2: Scanner
Step 3: Assigner
Step 4: Curator
Step 5: Reporter

Installation

The scNanoGPS pipeline is built with Python 3. We recommend installing it in an Anaconda/Miniconda virtual environment. Refer to the Anaconda tutorial for building environments.

Build python3 virtual environment

Example code for creating and activating a Python 3 environment on a Linux-based OS:
```
conda create -n scNanoGPS python=3 numpy scipy
conda activate scNanoGPS
```

Install scNanoGPS and dependencies

scNanoGPS requires the following dependencies:

- biopython 1.80
- distance 0.1.3
- matplotlib 3.8.2
- pandas 2.1.4
- pysam 0.19.0
- seaborn 0.13.1

Example code for obtaining scNanoGPS from GitHub and installing the dependencies:
```
git clone https://github.com/gaolabtools/scNanoGPS/
cd scNanoGPS
pip3 install -r requirements.txt
```
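
After installation, it can be useful to compare the installed package versions with the ones listed above. The following is a small sketch using only the standard library; the helper name `check_dependencies` is hypothetical, and the listed versions are treated as the versions named in this README rather than strict requirements.

```python
from importlib import metadata

# Versions copied from the dependency list in this README.
REQUIRED = {
    "biopython": "1.80",
    "distance": "0.1.3",
    "matplotlib": "3.8.2",
    "pandas": "2.1.4",
    "pysam": "0.19.0",
    "seaborn": "0.13.1",
}


def check_dependencies(required):
    """Return {package: (listed_version, installed_version_or_None)}."""
    report = {}
    for name, listed in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (listed, installed)
    return report


if __name__ == "__main__":
    for name, (listed, installed) in check_dependencies(REQUIRED).items():
        status = "missing" if installed is None else installed
        print(f"{name:12s} listed: {listed:8s} installed: {status}")
```

A package reported as missing, or with a version far from the listed one, is a good first thing to check if a later step fails on import.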

Install other essential tools

scNanoGPS uses the following third-party tools for mapping against the reference genome, collapsing reads with the same UMIs, and summarizing single cell gene expression, isoform, and SNV profiles.

Example code for installing the third-party tools

minimap2 (GitHub, Anaconda)
```
conda install -c bioconda minimap2
```
Samtools (GitHub, Anaconda)
```
conda install -c bioconda samtools
```
tabix (Anaconda)
```
conda install -c bioconda tabix
```
SPOA (GitHub, Anaconda)
```
conda install -c bioconda spoa
```

SubRead featureCounts (SourceForge, Anaconda)
```
# You can download and unzip the pre-compiled binary file from https://sourceforge.net/projects/subread/files/subread-2.0.3/
tar -xzf subread-2.0.3-<platform>.tar.gz

# or install subread via anaconda
conda install -c bioconda subread
```

IsoQuant (GitHub)
```
conda install -c conda-forge -c bioconda isoquant
```

Longshot (GitHub, Anaconda)
```
conda install -c bioconda longshot
```
BCFtools (GitHub, Anaconda)
```
conda install -c bioconda bcftools
```

ANNOVAR
```
# You can download ANNOVAR from https://www.openbioinformatics.org/annovar/annovar_download_form.php
tar -xvf annovar.latest.tar.gz
```

(optional) gffread (GitHub)
```
git clone https://github.com/gpertea/gffread
cd gffread
make release
```

Qualimap (Anaconda)
```
# You can download Qualimap from http://qualimap.conesalab.org/
unzip qualimap_v2.2.1.zip

# or install qualimap via anaconda
conda install -c bioconda qualimap
```
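
Once the tools are installed, a quick way to confirm that they are reachable is to look them up on `PATH`. This is a sketch only: the executable names below are assumptions based on how each tool is usually installed, and `find_missing` is a hypothetical helper. IsoQuant and ANNOVAR are left out because scNanoGPS takes their locations through the `--isoquant` and `--annovar` options.

```python
import shutil

# Assumed executable names; adjust them if your installation differs.
TOOLS = [
    "minimap2",
    "samtools",
    "tabix",
    "spoa",
    "featureCounts",
    "longshot",
    "bcftools",
    "qualimap",
]


def find_missing(tools):
    """Return the tools that cannot be located on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Run it inside the activated conda environment, since tools installed with conda are only on `PATH` while that environment is active.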

Prepare reference genome and annotations

Reference genome: Users can obtain a reference genome from NCBI, Ensembl, or any other authority.
```
wget https://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
```

Gene annotation (GTF/GFF): Users can obtain reference gene annotations from NCBI, Ensembl, or any other authority.
```
wget https://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz
```
- Note: Many tools cannot read a compressed GTF file. Please decompress the GTF.gz file with gunzip beforehand.
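
The decompression mentioned in the note can also be done from Python, which keeps the original `.gz` file in place. A minimal sketch using the standard library follows; the helper name `gunzip` and the default output naming (dropping the `.gz` suffix) are illustrative choices.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dst=None):
    """Decompress a .gz file and return the path of the decompressed copy.

    The compressed input is left untouched. If dst is not given, the
    output name is the input name without its final ".gz" suffix.
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream in chunks, no full load into memory
    return dst


if __name__ == "__main__":
    gtf_gz = Path("Homo_sapiens.GRCh38.100.gtf.gz")
    if gtf_gz.exists():
        print("Wrote", gunzip(gtf_gz))
```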
Index reference genome for minimap2: Prepare an indexed genome for minimap2 to speed up mapping. Refer to the Minimap2 instructions.
- Example code:
```
minimap2 -x map-ont -d example/GRCh38_chr22.mmi example/GRCh38_chr22.fa.gz
```

Annotation tables for ANNOVAR: This version of scNanoGPS uses ANNOVAR to annotate single cell SNV results. Please refer to ANNOVAR's webpage for more information.

Example code:
```
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene hg38db/
perl annotate_variation.pl -buildver hg38 -downdb cytoBand hg38db/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar gnomad30_genome hg38db/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avsnp150 hg38db/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar dbnsfp42c hg38db/
```
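
If you prefer to script the downloads, the same commands can be generated from a short table. The sketch below only builds and prints the command lines listed above; it uses no flags beyond those already shown. The helper name `downdb_command` is hypothetical, and note that `cytoBand` is the one entry fetched without `-webfrom annovar`.

```python
import shlex

# (database, use "-webfrom annovar"?) pairs copied from the commands above.
DATABASES = [
    ("refGene", True),
    ("cytoBand", False),
    ("gnomad30_genome", True),
    ("avsnp150", True),
    ("dbnsfp42c", True),
]


def downdb_command(database, from_annovar, buildver="hg38", outdir="hg38db/"):
    """Build the argument list for one annotate_variation.pl download."""
    cmd = ["perl", "annotate_variation.pl", "-buildver", buildver, "-downdb"]
    if from_annovar:
        cmd += ["-webfrom", "annovar"]
    return cmd + [database, outdir]


for database, from_annovar in DATABASES:
    # Print only; pass the list to subprocess.run(...) to actually download.
    print(shlex.join(downdb_command(database, from_annovar)))
```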

Using scNanoGPS

Please use the wrapper script "run_scNanoGPS.py" for your convenience. Make sure to update the paths inside the master script with any text editor you like, then run the script with the following command. Because the wrapper is a Python script, launch it with the Python 3 interpreter rather than sh.

python3 run_scNanoGPS.py

Manual of run_scNanoGPS.py

```
Usage: run_scNanoGPS.py [options]

Options:
  -h, --help            show this help message and exit
  -i FQ_F_NAME          * Required ! Input FastQ/Fast5 file name, or directory
                        containing multiple input files. Support
                        fastq/fq/fastq.gz/fq.gz/fast5 format.
  -d O_DIR              Output directory name. Default: scNanoGPS_res
  --tmp_dir=TMP_DIR     Temporary folder name. Default: tmp
  -p PROTOCOL           10x barcoding protocol. (3p / 5p / spatial) Default:
                        3p
  -t NCORES             Number of cores for program running. Default: 1
  --gtf=GTF             * Required ! GTF file for expression calling.
  --ref_genome=REF_GENOME
                        * Required ! File for reference genome.
  --idx_genome=IDX_GENOME
                        Path to the Minimap2 genome index. Program will use
                        reference genome if no Minimap2 genome index given.
                        Default: None
  --whitelist=WHITELIST
                        Path to the cell barcode whitelist. Default: None
  --exc_bed=EXC_BED     Exclude specific regions (BED) in file. Default: None
  --isoquant=ISOQUANT   Provide path to IsoQuant to conduct isoform calling.
                        Default: None
  --annovar=ANNOVAR     Provide directory path to ANNOVAR to conduct SNP
                        calling. Default: None
  --annovardb=ANNOVARDB
                        Name of ANNOVAR database. Default: hg38db
  --annovargv=ANNOVARGV
                        Version of ANNOVAR genome version. Default: hg38
  --annovarprot=ANNOVARPROT
                        Analysis protocol of ANNOVAR. Default: refGene,cytoBan
                        d,gnomad30_genome,avsnp150,dbnsfp42c,cosmic96_coding,c
                        osmic96_noncoding
  --annovarop=ANNOVAROP
                        Analysis operation of ANNOVAR. Default: gx,r,f,f,f,f,f
  --annovar_xref=ANNOVAR_XREF
                        Path to cross-reference genome of ANNOVAR. Default:
```
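
The options above can be assembled into a command line programmatically, which helps when running many samples. The following is a sketch under stated assumptions: it assumes the wrapper is launched with `python3`, the helper names `build_command` and `annovar_lists_match` are hypothetical, and the input FASTQ and GTF file names are placeholders. The second helper checks that the ANNOVAR protocol and operation lists have the same number of entries, since ANNOVAR pairs them one to one.

```python
import shlex


def build_command(fastq, gtf, ref_genome, out_dir="scNanoGPS_res", ncores=1,
                  protocol="3p", idx_genome=None, whitelist=None):
    """Assemble an argument list for run_scNanoGPS.py from the documented options."""
    if protocol not in {"3p", "5p", "spatial"}:
        raise ValueError(f"unsupported protocol: {protocol}")
    cmd = [
        "python3", "run_scNanoGPS.py",
        "-i", fastq,                    # required
        "-d", out_dir,
        "-p", protocol,
        "-t", str(ncores),
        f"--gtf={gtf}",                 # required
        f"--ref_genome={ref_genome}",   # required
    ]
    if idx_genome is not None:
        cmd.append(f"--idx_genome={idx_genome}")
    if whitelist is not None:
        cmd.append(f"--whitelist={whitelist}")
    return cmd


def annovar_lists_match(protocols, operations):
    """True when the comma-separated protocol and operation lists pair up."""
    return len(protocols.split(",")) == len(operations.split(","))


# Placeholder inputs: reads.fastq.gz and GRCh38_chr22.gtf are hypothetical names.
cmd = build_command("example/reads.fastq.gz", "example/GRCh38_chr22.gtf",
                    "example/GRCh38_chr22.fa.gz", ncores=4,
                    idx_genome="example/GRCh38_chr22.mmi")
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.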
