Monopogen
SNV calling from single cell sequencing
Install / Use
/learn @KChen-lab/MonopogenREADME
Monopogen: SNV calling from single cell sequencing data
News
- 10/24/2024:
- Resolve the issue by incorporating duplicate reads in the SNV calling.
- Update the cellxSNV matrix so that each element reflects the sequencing depth for both wild-type and mutated alleles, rather than using a binary representation.
- 4/12/2024: Version 1.6.0 released.
- We added the guide on putative SNV filtering based on cell type information derived from single cell RNA/ATAC-seq modalites.
- 2/26/2024: Version 1.5.0 released.
- In the cell-scan step, we implemented a motif-based search on wild/mutated alleles for all cells from the bam file directly. The single-cell level bam file splitting and joint calling modules were removed. This new version achieves over 10-fold speed up than the old version due to avoid the bam splitting. It could take less than 60 mins to collect the wild/mutated allele profiles of 10K cells over 20K loci.
- Recommended hard-filterings on putative somatic SNVs from Monopogen were added.
Table of Contents
- Introduction
- Installation
- Quick Start
- Germline SNV calling from snRNA-seq
- Somatic SNV calling from scRNA-seq
- Putative SNV filtering based on cell type information
- FAQs
- Citation
Introduction
Monopogen is an analysis package for SNV calling from single-cell sequencing, developed and maintained by Ken chen's lab in MDACC. Monopogen works on sequencing datasets generated from single cell RNA 10x 5', 10x 3', single ATAC-seq technoloiges, scDNA-seq etc.
It is composed of three modules:
- Data preprocess. This module removes reads with high alignment mismatches from single cell sequencing and also makes data formats compatiable with Monopongen.
- Germline SNV calling. Given the sparsity of single cell sequencing data, we leverage linkage disequilibrium (LD) from external reference panel(such as 1KG3, TopMed) to improve both SNV calling accuracy and detection sensitivity.
- Putative somatic SNV calling. We extended the machinery of LD refinement from human population level to cell population level. We statistically phase the observed alleles with adjacent germline alleles to estimate the degree of LD, taking into consideration widespread sparseness and allelic dropout in single-cell sequencing data, and calculated a probabilistic score as an indicator of somatic SNVs.
The output of Monopogen will enable 1) ancestry identificaiton on single cell samples; 2) genome-wide association study on the celluar level if sample size is sufficient, and 3) putative somatic SNV investigation.
Installation
Dependencies
- python (version >= 3.73)
- java (open JDK>=1.8.0)
- R (version >= 4.0.0)
- pandas>=1.2.3
- pysam>=0.16.0.1
- NumPy>=1.19.5
- sciPy>=1.6.3
- pillow>=8.2.0
- data.table(R package; version >=1.14.8)
- e1071 (R package; 1.7-13)
- ggplot2
!Note We have put the binary compatibility tools including samtools, bcftools, beagle in the app folder. We fixed the version because the output formats vary a lot with different versions. If you are not able to run them, you can compile them in you system. We only test on these tools on following versions:
- samtools Version: 1.2 (using htslib 1.2.1)
- bcftools Version: 1.8 (using htslib 1.8)
- beagle.27Jul16.86a.jar (version 4.1)
- tabix Version: 1.9
- bgzip Version: 1.9
If you meet other errors when running Monopogen, go to FAQs section.
Installation
Right now Monopogen is avaiable on github, you can install it through github
git clone https://github.com/KChen-lab/Monopogen.git
cd Monopogen
pip install -e .
Quick Start
Note the quick start exmaple only works for germline module. If you want to test somatic module, please go the section
For quick start of Monopogen, we provide an example dataset provided the example/ folder, which includes:
A.bam (.bai)
The bam file storing read alignment for sample A.B.bam (.bai)
The bam file storing read alignment for sample B.CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
The reference panel with over 3,000 samples in 1000 Genome database. Only SNVs located in chr20: 0-2Mb were extracted in this vcf file.chr20_2Mb.hg38.fa (.fai)
The genome reference used for read aligments. Only seuqences in chr20:0-2Mb were extracted in this fasta file. There are three test scripts in thetest/foldertest/runPreprocess.sh,test/runGermline.sh,test/runSomatic.shfor quick start ofMonopogen
Data preprocess
You can type the following command to get the help information.
path="XXX/Monopogen" # where Monopogen is downloaded
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python ${path}/src/Monopogen.py preProcess --help`
Output is
usage: Monopogen.py preProcess [-h] -b BAMFILE [-o OUT] -a APP_PATH
[-m MAX_MISMATCH] [-t NTHREADS]
optional arguments:
-h, --help show this help message and exit
-b BAMFILE, --bamFile BAMFILE
The bam file for the study sample, the bam file should
be sorted. If there are multiple samples, each row
with each sample (default: None)
-o OUT, --out OUT The output director (default: None)
-a APP_PATH, --app-path APP_PATH
The app library paths used in the tool (default: None)
-m MAX_MISMATCH, --max-mismatch MAX_MISMATCH
The maximal alignment mismatch allowed in one reads
for variant calling (default: 3)
-t NTHREADS, --nthreads NTHREADS
Number of threads used for SNVs calling (default: 1)
You need to prepare the bam file list for option -b.
path="XXy/Monopogen"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python ${path}/src/Monopogen.py preProcess -b bam.lst -o out -a ${path}/apps
After running the preProcess module, there will be bam files after quality controls in the folder out/Bam/ used for downstream SNV calling.
Germline SNV calling
You can type the following command to get the help information.
python ${path}/src/Monopogen.py germline --help`
The output is
usage: Monopogen.py germline [-h] -r REGION -s
{varScan,varImpute,varPhasing,all} [-o OUT] -g
REFERENCE -p IMPUTATION_PANEL
[-m MAX_SOFTCLIPPED] -a APP_PATH [-t NTHREADS]
optional arguments:
-h, --help show this help message and exit
-r REGION, --region REGION
The genome regions for variant calling (default: None)
-s {varScan,varImpute,varPhasing,all}, --step {varScan,varImpute,varPhasing,all}
Run germline variant calling step by step (default:
all)
-o OUT, --out OUT The output director (default: None)
-g REFERENCE, --reference REFERENCE
The human genome reference used for alignment
(default: None)
-p IMPUTATION_PANEL, --imputation-panel IMPUTATION_PANEL
The population-level variant panel for variant
imputation refinement, such as 1000 Genome 3 (default:
None)
-a APP_PATH, --app-path APP_PATH
The app library paths used in the tool (default: None)
-t NTHREADS, --nthreads NTHREADS
Number of threads used for SNVs calling (default: 1)
You need to prepare the genome region file list for option -r with an example shown in test/region.lst. We also included an optimal genome region file in ${path}/resource/GRCh38.region.lst for the whole genome SNV calling. Each region is in one row. Run the test script test/runGermline.sh as following:
python ${path}/src/Monopogen.py germline \
-a ${path}/apps -t 1 -r region.lst \
-p ../example/ \
-g ../example/chr20_2Mb.hg38.fa -s all -o out
The germline module will generate the phased VCF files with name *.phased.vcf.gz in the folder out/germline. If there are multiple samples in the bam file list from -b option in preProcess module, the phased VCF files will contain genotypes from multiple samples. The output of phased genotypes are as following:
##fileformat=VCFv4.2
##filedate=20240227
##source="beagle.27Jul16.86a.jar (version 4.1)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORM
Related Skills
node-connect
339.3kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
83.9kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
339.3kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
83.9kCommit, push, and open a PR
