Samovar
Somatic (mosaic) SNV caller for 10X Genomics data using random forest classification and feature-based filters
Install / Use
/learn @cdarby/SamovarREADME
Samovar is a mosaic SNV caller for 10x Genomics linked-read WGS data. Starting from the phased VCF and BAM files output by 10x Genomics Longranger pipeline, Samovar first selects candidate variant sites by filtering sites based on features of the aligned reads, such as the depth of coverage, that contribute to false positive variant calls (1). A random forest tree model is trained (3) with simulated mosaic variants as positive examples (2) and real homozygous/heterozygous variants (2) called in the sample by Longranger as negative examples. Candidate variant sites are scored with this model (4). Repeat regions and non-diploid copy-number regions are optionally filtered out (5) before a final filter removes false positives resulting from alignment errors (6) to produce scored variant calls (7).
<p align="center"> <img src="resources/samovar_wf.png"/> </p>Dependencies and Configuration
Requires: Python (v.3 recommended, v.2 compatible) with pyfaidx; scikit-learn; simplesam; fisher; and installations of samtools, bedtools. Some steps are compatible with pypy
MARCC The SLURM sbatch parameter --ntasks N should be set to accompany Python script parameter --nproc N
Installation
- Change the INSTALLDIR, PYTHONCMD, and PYPYCMD variables in the
./samovarfile to match the location of the Samovar code and the command-line invocations of Python and pypy you wish to use. If you do not have pypy, just set PYPYCMD to the Python command-line invocation. - The
./samovarfile can then be moved to your/bin/directory, or you can add the location to the PATH variable. - Download the region filter BED files here
- Install required packages:
pip install -r requirements.txt
Main Pipeline
./samovar
Options: generateVarfile, simulate, train, preFilter, classify, postFilter (run without options to see arguments)
samovar generateVarfile
Generates a tab-separated file with fields contig \t site \t VAF to specify where mosaic-like sites will be simulated
samovar generateVarfile --out out.varfile --vcf sample.vcf --fai genome.fa.fai
--out output varfile name [out.varfile]
--vcf sites to avoid in simulation, e.g. the VCF from your sample (takes comma-separated multiple file names)
--fai FASTA file index of the genome from samtools faidx
--variantspacing spacing along the genome of simulated sites [80000]
--vafspacing increment of variant allele fraction [0.05]
Notes:
- 25,000-30,000 mosaic-like sites are suggested. Running this step with --variantspacing 80000 usually results in enough sites output from the simulation step. Decrease --variantspacing to increase the number of mosaic-like sites.
- This step takes only a few minutes, mostly spent reading the VCF file. It is not parallelized and pypy is not necessary.
- If you do not want to use miscellaneous contigs and/or sex chromosomes, make a new file for the
--faiargument that contains only the contigs you want to use for mosaic-like sites.
samovar simulate
Simulates mosaic-like training examples at sites specified in varfile where there is at least one haplotype-discordant read
samovar simulate --bam sample.bam --varfile out.varfile --simulate --nproc 4 > mosaic.features.tsv
--simulate mosaic site simulation mode
--bam bam file to simulate sites from
--varfile varfile from 0.setup/generateVarfile.py of sites and VAF
--nproc uses multiprocessing.Pool (value <= 1 does not import the module and runs in serial)
Notes:
- Records are printed to stdout, the output order may differ from the input order if multiple processes are used
- With pypy, 4 processes, and 30,000 sites in the varfile, this step takes 30 minutes. (SLURM shared partition default memory 5G/task is sufficient)
Computes features at known homozygous/heterozygous sites from VCF that have at least one haplotype-discordant read
samovar simulate --bam sample.bam --varfile sample.vcf --het --max 50000 --nproc 4 > het.features.tsv
--het will extract features from sites with genotype 1|0\|0|1 that PASS filter and are SNP [i.e. len(ref) == 1 and len(alt) == 1]
--hom will extract features from sites with genotype 0|0\|1|1 that PASS filter and are SNP [i.e. len(ref) == 1 and len(alt) == 1]
--bam bam file to simulate sites from
--varfile sites to extract features from, e.g. the VCF from your sample
--max maximum number of sites to extract features from
--nproc uses multiprocessing.Pool (value <= 1 does not import the module and runs in serial)
Must catch output from stdout.
Notes:
- Records are printed to stdout, the output order may differ from the input order if multiple processes are used
- Based on the number of mosaic-like sites, you should set --max to half that number in each of --hom and --het mode.
- Because more sites must be examined to find enough examples with haplotype-discordant reads, with pypy, 4 processes, and --max 15000, this step takes 30-90 minutes for modes --het and --hom. (SLURM shared partition default memory 5G/task is sufficient)
samovar train
Uses feature vectors from mosaic-like, heterozygous, and homozygous sites to train a random forest classifier
samovar train --mosaic mosaic.features.tsv --het het.features.tsv --hom hom.features.tsv
--mosaic file from --simulate in 1.simulate/simulate.py
--het file from --het in 1.simulate/simulate.py
--hom file from --simulate in 1.simulate/simulate.py Equal number of het and hom examples are randomly selected to give equal number of mosaic and germline training examples.
--out classifier file in Python's pickle binary format [clf.pkl]
--mindepth read coverage required to use a site as a training example [16] ]
--nestimators [100] maxleafnodes [50] scikit-learn random forest hyperparameters
--modelsize maximum number of mosaic training examples [None: use all data]
Notes:
- Ensure that the python and scikit-learn versions used to train and dump the classifier are the same as used to load and use the classifier later on.
- This step should just take a few minutes, and is not parallelized or pypy-compatible.
samovar preFilter
Scans genome, calculates features for sites that pass all filters. Outputs "vectors" for passing sites which will later be classified and ranked by the random forest
samovar preFilter --bam sample.bam --nproc 48 > vectors.txt 2> intervalsComplete.txt
--bam bam file of your sample
--nproc uses multiprocessing.Pool (value <= 1 does not import the module and runs in serial)
Notes:
- Records are printed to stdout, the output order may not be in order along the genome if multiple processes are used
- Regions are printed to stderr when they are completed.
- To decrease memory usage at possible loss of speed, decrease the "performance parameters" in the script. WINDOW_SIZE is the number of bases each task scans at a time. READBATCH and SITEBATCH control how often the main data structure is pruned.
- To adjust filter thresholds, change the "filter parameters" in the script. To consider sites with more (less) evidence of mosaicism, increase (decrease) MIN_MAF and MIN_HAPDISCORD_READS. To consider sites with more (less) phased reads, increase (decrease) MIN_FRAC_PHASED. To consider sites with higher (lower) quality phasing, increase (decrease) MIN_FRAC_HAP1, the haplotype imbalance of all phased reads, and decrease (increase) MAX_HAPDISCORD_READS_HAP, the haplotype imbalance of haplotype-discordant reads.
- If you want to include any contigs besides the autosomes and chrX, add the names to the CONTIGS list. (if the sample is male, the reads on chrX will not have the HP tag from longranger and no calls will be made)
- Processes have been observed that use >100G of memory on a single interval, and this has not yet been investigated and solved. [MARCC] Use SLURM
--partition=lrgmem --nodes=1 --ntasks-per-node=48 --mem=800G --time=12:0:0which takes 3-5 hours to scan the whole genome, producing 50,000-100,000 candidate vectors (30-70MB output file).
samovar classify
Uses the random forest model to rank the feature vectors, outputting those with a higher-than-threshold probability in an output file with format: contig \t position \t position+1 \t Samovar score \t read depth \t Number of haplotype discordant read \t MAF \t minor allele base
samovar classify --clf clf.pkl --vectors vectors.txt > predictions.tsv
--clf classifier trained from 2.train/train.py
--vectors output file from filter step
Notes:
- Records are printed to stdout
- Ensure that the python and scikit-learn versions used to train and dump the classifier earlier the same as used to load and use the classifier here.
- This step should just take a few minutes, and is not parallelized or pypy-compatible.
- An invocation of
eval()in the python script may report an unexpected EOF in the middle of the file. Replacing this character with a newline solves the issue, which has not yet been investigated and solved. - If you do not want to use miscellaneous contigs and/or sex chr
