SkillAgentSearch skills...

Monopogen

SNV calling from single cell sequencing

Install / Use

/learn @KChen-lab/Monopogen
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Monopogen: SNV calling from single cell sequencing data

News

  • 10/24/2024:
    • Resolve the issue by incorporating duplicate reads in the SNV calling.
    • Update the cellxSNV matrix so that each element reflects the sequencing depth for both wild-type and mutated alleles, rather than using a binary representation.
  • 4/12/2024: Version 1.6.0 released.
    • We added the guide on putative SNV filtering based on cell type information derived from single cell RNA/ATAC-seq modalites.
  • 2/26/2024: Version 1.5.0 released.
    • In the cell-scan step, we implemented a motif-based search on wild/mutated alleles for all cells from the bam file directly. The single-cell level bam file splitting and joint calling modules were removed. This new version achieves over 10-fold speed up than the old version due to avoid the bam splitting. It could take less than 60 mins to collect the wild/mutated allele profiles of 10K cells over 20K loci.
    • Recommended hard-filterings on putative somatic SNVs from Monopogen were added.

Table of Contents

Introduction

Monopogen is an analysis package for SNV calling from single-cell sequencing, developed and maintained by Ken chen's lab in MDACC. Monopogen works on sequencing datasets generated from single cell RNA 10x 5', 10x 3', single ATAC-seq technoloiges, scDNA-seq etc.

<image src="./example/Fig1.png" width="600">

It is composed of three modules:

  • Data preprocess. This module removes reads with high alignment mismatches from single cell sequencing and also makes data formats compatiable with Monopongen.
  • Germline SNV calling. Given the sparsity of single cell sequencing data, we leverage linkage disequilibrium (LD) from external reference panel(such as 1KG3, TopMed) to improve both SNV calling accuracy and detection sensitivity.
  • Putative somatic SNV calling. We extended the machinery of LD refinement from human population level to cell population level. We statistically phase the observed alleles with adjacent germline alleles to estimate the degree of LD, taking into consideration widespread sparseness and allelic dropout in single-cell sequencing data, and calculated a probabilistic score as an indicator of somatic SNVs.

The output of Monopogen will enable 1) ancestry identificaiton on single cell samples; 2) genome-wide association study on the celluar level if sample size is sufficient, and 3) putative somatic SNV investigation.

Installation

Dependencies

  • python (version >= 3.73)
  • java (open JDK>=1.8.0)
  • R (version >= 4.0.0)
  • pandas>=1.2.3
  • pysam>=0.16.0.1
  • NumPy>=1.19.5
  • sciPy>=1.6.3
  • pillow>=8.2.0
  • data.table(R package; version >=1.14.8)
  • e1071 (R package; 1.7-13)
  • ggplot2

!Note We have put the binary compatibility tools including samtools, bcftools, beagle in the app folder. We fixed the version because the output formats vary a lot with different versions. If you are not able to run them, you can compile them in you system. We only test on these tools on following versions:

  • samtools Version: 1.2 (using htslib 1.2.1)
  • bcftools Version: 1.8 (using htslib 1.8)
  • beagle.27Jul16.86a.jar (version 4.1)
  • tabix Version: 1.9
  • bgzip Version: 1.9

If you meet other errors when running Monopogen, go to FAQs section.

Installation

Right now Monopogen is avaiable on github, you can install it through github

git clone https://github.com/KChen-lab/Monopogen.git
cd Monopogen
pip install -e .

Quick Start

Note the quick start exmaple only works for germline module. If you want to test somatic module, please go the section

For quick start of Monopogen, we provide an example dataset provided the example/ folder, which includes:

  • A.bam (.bai)
    The bam file storing read alignment for sample A.
  • B.bam (.bai)
    The bam file storing read alignment for sample B.
  • CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
    The reference panel with over 3,000 samples in 1000 Genome database. Only SNVs located in chr20: 0-2Mb were extracted in this vcf file.
  • chr20_2Mb.hg38.fa (.fai)
    The genome reference used for read aligments. Only seuqences in chr20:0-2Mb were extracted in this fasta file. There are three test scripts in the test/ folder test/runPreprocess.sh, test/runGermline.sh, test/runSomatic.sh for quick start of Monopogen

Data preprocess

You can type the following command to get the help information.

path="XXX/Monopogen"  # where Monopogen is downloaded
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python ${path}/src/Monopogen.py  preProcess --help`

Output is

usage: Monopogen.py preProcess [-h] -b BAMFILE [-o OUT] -a APP_PATH
                               [-m MAX_MISMATCH] [-t NTHREADS]

optional arguments:
  -h, --help            show this help message and exit
  -b BAMFILE, --bamFile BAMFILE
                        The bam file for the study sample, the bam file should
                        be sorted. If there are multiple samples, each row
                        with each sample (default: None)
  -o OUT, --out OUT     The output director (default: None)
  -a APP_PATH, --app-path APP_PATH
                        The app library paths used in the tool (default: None)
  -m MAX_MISMATCH, --max-mismatch MAX_MISMATCH
                        The maximal alignment mismatch allowed in one reads
                        for variant calling (default: 3)
  -t NTHREADS, --nthreads NTHREADS
                        Number of threads used for SNVs calling (default: 1)

You need to prepare the bam file list for option -b.

path="XXy/Monopogen"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python  ${path}/src/Monopogen.py  preProcess -b bam.lst -o out  -a ${path}/apps

After running the preProcess module, there will be bam files after quality controls in the folder out/Bam/ used for downstream SNV calling.

Germline SNV calling

You can type the following command to get the help information.

python ${path}/src/Monopogen.py  germline --help`

The output is

usage: Monopogen.py germline [-h] -r REGION -s
                             {varScan,varImpute,varPhasing,all} [-o OUT] -g
                             REFERENCE -p IMPUTATION_PANEL
                             [-m MAX_SOFTCLIPPED] -a APP_PATH [-t NTHREADS]

optional arguments:
  -h, --help            show this help message and exit
  -r REGION, --region REGION
                        The genome regions for variant calling (default: None)
  -s {varScan,varImpute,varPhasing,all}, --step {varScan,varImpute,varPhasing,all}
                        Run germline variant calling step by step (default:
                        all)
  -o OUT, --out OUT     The output director (default: None)
  -g REFERENCE, --reference REFERENCE
                        The human genome reference used for alignment
                        (default: None)
  -p IMPUTATION_PANEL, --imputation-panel IMPUTATION_PANEL
                        The population-level variant panel for variant
                        imputation refinement, such as 1000 Genome 3 (default:
                        None)
  -a APP_PATH, --app-path APP_PATH
                        The app library paths used in the tool (default: None)
  -t NTHREADS, --nthreads NTHREADS
                        Number of threads used for SNVs calling (default: 1)

You need to prepare the genome region file list for option -r with an example shown in test/region.lst. We also included an optimal genome region file in ${path}/resource/GRCh38.region.lst for the whole genome SNV calling. Each region is in one row. Run the test script test/runGermline.sh as following:

python  ${path}/src/Monopogen.py  germline  \
    -a   ${path}/apps -t 1   -r  region.lst \
    -p  ../example/  \
    -g  ../example/chr20_2Mb.hg38.fa   -s all  -o out

The germline module will generate the phased VCF files with name *.phased.vcf.gz in the folder out/germline. If there are multiple samples in the bam file list from -b option in preProcess module, the phased VCF files will contain genotypes from multiple samples. The output of phased genotypes are as following:

##fileformat=VCFv4.2
##filedate=20240227
##source="beagle.27Jul16.86a.jar (version 4.1)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORM

Related Skills

View on GitHub
GitHub Stars118
CategoryDevelopment
Updated12h ago
Forks24

Languages

HTML

Security Score

100/100

Audited on Mar 28, 2026

No findings