DeepMosaic
DeepMosaic is a deep-learning-based mosaic single nucleotide classification tool without the need of matched control information.
DeepMosaic <img src="https://user-images.githubusercontent.com/17311837/88461876-52d18f80-ce5c-11ea-9aed-534dfd07d351.png" alt="DeepMosaic_Logo" width=15%>
Visualization and control-independent classification tool of noncancer (somatic or germline) mosaic single nucleotide variants (SNVs) with deep convolutional neural networks. Originally written by Virginia (Xin) Xu and Xiaoxu Yang, maintained by Arzoo Patel and Sang Lee.
Overview
- <b>DeepMosaic Visualization Module:</b> Information of aligned sequences for any SNV represented with an RGB image:
An RGB image was used to represent the pileup results for all the reads aligned to a single genomic position. Reads supporting different alleles were grouped, in the order of the reference allele, the first, second, and third alternative alleles, respectively. Red channel was used to represent the bases, green channel for the base qualities, and blue channel for the strand orientations of the read. Note that the green channel is modified to show better contrast for human eyes.
- <b>DeepMosaic Classification Module:</b> Workflow from variant to result (10 models were compared, and EfficientNet-b4 was selected as the default because it performed best on a gold-standard benchmark dataset):
Workflow of DeepMosaic with the best-performing deep convolutional neural network model after benchmarking. Variants were first transformed into images based on the alignment information. A deep convolutional neural network then extracted the high-dimensional information from the image, and experimental, genomic, and population-related information was further incorporated into the classifier.
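The RGB encoding used by the Visualization Module can be illustrated with a toy Python sketch. This is not DeepMosaic's actual implementation: the `Read` tuple, the `BASE_TO_RED` intensities, and the `encode_pileup` helper are hypothetical, each read is reduced to a single pixel at the variant position, and ordering alternative alleles by read count is an assumption.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class Read(NamedTuple):
    base: str      # base observed at the variant position
    quality: int   # Phred base quality
    reverse: bool  # True if the read maps to the reverse strand


# Hypothetical red-channel intensity for each base.
BASE_TO_RED = {"A": 50, "C": 100, "G": 150, "T": 200}


def encode_pileup(reads: List[Read], ref: str) -> List[Tuple[int, int, int]]:
    """Return one (R, G, B) pixel per read, grouped by allele.

    Reads supporting the reference allele come first, followed by up to
    three alternative alleles (assumed here to be ordered by read count).
    """
    alt_counts = Counter(r.base for r in reads if r.base != ref)
    allele_order = [ref] + [base for base, _ in alt_counts.most_common(3)]
    pixels = []
    for allele in allele_order:
        for r in reads:
            if r.base != allele:
                continue
            red = BASE_TO_RED.get(r.base, 0)  # base identity
            green = min(r.quality, 255)       # base quality, capped to one byte
            blue = 255 if r.reverse else 0    # strand orientation
            pixels.append((red, green, blue))
    return pixels


reads = [Read("G", 30, False), Read("C", 35, True), Read("G", 20, True)]
pixels = encode_pileup(reads, ref="G")
print(pixels)
```

In this sketch the two reference-supporting reads are emitted before the single alternative-supporting read, mirroring the grouping described for the image rows.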
<details><summary>
Requirements before you start
</summary>- Python 3.7
- git-lfs for the system you work on
- BEDTools (command line)
- ANNOVAR (command line)
- PyTables
- Matplotlib
- pandas
- Pysam
- PyTorch version>=1.6.0
- EfficientNet PyTorch version>=0.7.1
- argparse
Some of the versions of packages are provided as an example in this list.
Alternatively, you can use singularity or docker container. See Singularity and Docker.
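As a quick sanity check before a manual installation, a short script along the following lines can report which requirements appear to be missing. It is a sketch under assumptions: the executable names and import names below are the ones these packages usually install under, the `missing_requirements` helper is hypothetical, and version constraints are not checked. ANNOVAR is left out because its path is passed to DeepMosaic explicitly.

```python
import importlib.util
import shutil
import sys

# Assumed executable and import names; adjust them to your installation.
COMMANDS = ["git-lfs", "bedtools"]
MODULES = ["tables", "matplotlib", "pandas", "pysam", "torch", "efficientnet_pytorch"]


def missing_requirements(commands=COMMANDS, modules=MODULES):
    """Return the names of commands and modules that cannot be found."""
    missing = [c for c in commands if shutil.which(c) is None]
    missing += [m for m in modules if importlib.util.find_spec(m) is None]
    return missing


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name in missing_requirements():
        print("missing:", name)
```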
</details><details><summary>
Installation
</summary>We now provide Singularity and Docker images for running DeepMosaic. If you want to install and run DeepMosaic manually, please read through and follow these steps. They can be performed in any command-line shell environment (Linux, macOS, Windows Subsystem for Linux, etc.) that has sufficient computational resources and more than 20 GB of storage.
Step 1. Install DeepMosaic
Make sure you have <b>git-lfs</b> installed in your environment to be able to download this repository correctly. Download git-lfs, unzip the tar.gz and put the binary file git-lfs in your bin folder/your $PATH, and run git lfs install to initialize git-lfs. You only need to do it once.
> git clone --recursive https://github.com/shishenyxx/DeepMosaic
Make sure you cloned the whole repository; the total folder size should be about 4 GB.
> cd DeepMosaic
Step 2. Install dependency: BEDTools (via conda)
> conda install -c bioconda bedtools
Step 3. Install dependency: ANNOVAR
a) Go to the ANNOVAR website and click "here" to register and download the annovar distribution.
b) Once you have successfully downloaded the ANNOVAR package, run
> cd [path to ANNOVAR]
> perl ./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar gnomad_genome humandb/
to install the hg19.gnomad_genome file needed for the feature extraction from the bam file
</details>Usage
<details><summary>Step 1. Feature extraction and visualization of the candidate mosaic variants (Visualization Module)
</summary>This step is used for the extraction of genomic features of the variant from raw bams as well as population information. It can serve as an independent tool for the visualization and evaluation of mosaic candidates.
Usage
> [DeepMosaic Path]/deepmosaic/deepmosaic-draw -i <input.txt> -o <output_dir> -a <path_to_ANNOVAR> -b <genome_build> -db <name_of_annovar_db>
Note:
The input.txt file should be in the following format.
Input format
|#sample_name|bam|vcf|depth|sex|
|---|---|---|---|---|
|sample_1|sample_1.bam|sample_1.vcf|200|M|
|sample_2|sample_2.bam|sample_2.vcf|200|F|
Each line of input.txt describes one sample: its aligned reads in BAM format (with the index in the same directory) and its candidate variants in VCF (or vcf.gz) format. Users should also provide the sequencing depth and the sex (M/F) of the corresponding sample. The sample name (#sample_name column) should be a unique identifier for each sample; duplicated names are not allowed.
Note that the sequencing depth is required to increase specificity. If you are unsure of the average depth, we recommend running a quick depth analysis with SAMtools mpileup on several hundred variants, or a complete depth-of-coverage analysis. Depth values must be integers.
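The constraints above (unique sample names, integer depth, sex given as M or F) can be checked before running DeepMosaic. The following is a minimal sketch, not part of DeepMosaic itself: the `validate_input_table` helper is hypothetical, and it assumes the columns of input.txt are tab-separated.

```python
import csv
import io

EXPECTED_HEADER = ["sample_name", "bam", "vcf", "depth", "sex"]


def validate_input_table(text):
    """Parse a tab-separated input table and return one dict per sample.

    Raises ValueError on a malformed row, a duplicated sample name,
    a non-integer depth, or a sex value other than M or F.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    if not rows:
        raise ValueError("input table is empty")
    header = [name.lstrip("#") for name in rows[0]]
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {rows[0]}")
    seen = set()
    samples = []
    for fields in rows[1:]:
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} columns: {fields}")
        sample = dict(zip(header, fields))
        if sample["sample_name"] in seen:
            raise ValueError(f"duplicated sample name: {sample['sample_name']}")
        seen.add(sample["sample_name"])
        sample["depth"] = int(sample["depth"])  # raises ValueError if not an integer
        if sample["sex"] not in ("M", "F"):
            raise ValueError(f"sex must be M or F: {sample['sex']}")
        samples.append(sample)
    return samples


example = (
    "#sample_name\tbam\tvcf\tdepth\tsex\n"
    "sample_1\tsample_1.bam\tsample_1.vcf\t200\tM\n"
    "sample_2\tsample_2.bam\tsample_2.vcf\t200\tF\n"
)
samples = validate_input_table(example)
print(len(samples), samples[0]["depth"])
```

In practice the text would come from reading input.txt; the check does not verify that the BAM, index, or VCF files exist.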
- DeepMosaic supports lossless image representation for sequencing depths up to 500x. Reads from deeper sequencing will be randomly down-sampled to 500x during image representation.
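Random down-sampling to a fixed depth can be sketched as follows. This only illustrates the idea; the `downsample` helper and its `seed` parameter are hypothetical and do not describe how DeepMosaic selects reads internally.

```python
import random

MAX_DEPTH = 500  # depth limit stated above


def downsample(reads, max_depth=MAX_DEPTH, seed=None):
    """Return at most max_depth reads, sampled uniformly without replacement."""
    reads = list(reads)
    if len(reads) <= max_depth:
        return reads  # nothing to drop, so the representation stays lossless
    rng = random.Random(seed)
    return rng.sample(reads, max_depth)


kept = downsample(range(800), seed=0)
print(len(kept))
```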
- sample.bam is a BAM file generated through alignment, sorting, duplicate marking, indel realignment, and base quality score recalibration. You can follow the BSMN common pipeline for both GRCh37 and GRCh38, or this pipeline for GRCh37 alignment specifically. Note that this used to be the best pipeline for GATK3 and earlier versions. From GATK4 onwards, however, indel realignment is integrated into HaplotypeCaller and MuTect2, so if you want to use any external tools you have to prepare the BAM with an earlier GATK version; the tutorials should be here.
- sample.vcf is the VCF file of the input variants you are interested in, or a prior file generated by GATK HaplotypeCaller with ploidy 50 as described in previous pipelines, or by MuTect2 single mode. One VCF should be provided for each input BAM, in the following format; gzipped VCFs are also accepted:
sample.vcf format
|#CHROM|POS|ID|REF|ALT|...|
|---|---|---|---|---|---|
|1|17697|.|G|C|.|.|
|1|19890|.|T|C|.|.|
"#CHROM", "POS", "REF", "ALT" are essential columns that will be parsed and utilized by DeepMosaic.
When using MuTect2, we recommend "PASS" variants as input for DeepMosaic. Instructions for running MuTect2 in single mode, generating the panel of normals, and downstream filtering can be found in the official GATK tutorials or in this example Snakemake pipeline.
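To preview which records of a VCF carry the essential columns, and optionally keep only "PASS" records, a small parser such as the one below may help. It is a sketch, not DeepMosaic's own parser: the `parse_candidates` helper is hypothetical, it assumes standard tab-separated VCF lines, and the FILTER values in the example are invented for illustration.

```python
def parse_candidates(lines, pass_only=False):
    """Return (chrom, pos, ref, alt) tuples from an iterable of VCF lines."""
    columns = None
    variants = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            columns = line.lstrip("#").split("\t")
            continue
        if columns is None:
            raise ValueError("no #CHROM header line before the first record")
        record = dict(zip(columns, line.split("\t")))
        if pass_only and record.get("FILTER") != "PASS":
            continue
        variants.append(
            (record["CHROM"], int(record["POS"]), record["REF"], record["ALT"])
        )
    return variants


vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t17697\t.\tG\tC\t.\tPASS\t.",
    "1\t19890\t.\tT\tC\t.\t.\t.",
]
print(parse_candidates(vcf_lines))
print(parse_candidates(vcf_lines, pass_only=True))
```

The lines can come from `open(path)` for a plain VCF or from `gzip.open(path, "rt")` for a gzipped one.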
- The output files, including the extracted features and encoded images, will be written to [output_dir]. DeepMosaic will create a new direct