Mixemt
A tool for phylogenetic interpretation of human mtDNA sequence mixtures
Install / Use
/learn @svohr/MixemtREADME
mixemt - Deconvolving mtDNA Mixtures by Expectation Maximization
mixemt is a program for making sense of mixtures of human mitochondrial
sequences. It takes as input a BAM file containing reads mapped to the human
mitochondrial reference sequence and uses a representation of the
mitochondrial haplogroup phylogeny from Phylotree.org
to infer the haplogroups present in the mixture. mixemt scans each mapped
read for variants described in Phylotree and estimates the probability of the
read originating from each candidate haplotype (haplogroup). This information
is used to estimate the mixture proportions for the sample and to identify the
most likely haplogroups contributing to the sample. The program produces as
output a table, written to standard output, that describes the detected
haplogroups and their estimated percent contribution to the sample. The program
can also produce new BAM files of input fragments partitioned by the most
likely contributor of origin, tab-delimited files containing statistics for
each reference position, and verbose output detailing the results from each
step.
A detailed description and evaluation of the approach implemented in
mixemt can be found in the article:
Samuel H. Vohr, Rachel Gordon, Jordan M. Eizenga, Henry A. Erlich, Cassandra D. Calloway, Richard E. Green. A phylogenetic approach for haplotype analysis of sequence data from complex mitochondrial mixtures. Forensic Science International: Genetics. Vol. 30, 93-105. 2017. [link]
If you would like to cite mixemt in a publication, please refer to this
article.
Requirements
mixemt is written in Python and requires a few additional packages:
R and ggplot2 are required
to use the plotting scripts included in the directory plot/.
Installation
mixemt can be installed from this repository using pip.
git clone https://github.com/svohr/mixemt.git
cd mixemt
pip install .
Missing requirements will be automatically downloaded and installed.
Usage
mixemt [options] <reads.bam>
Preparing input sequences
mixemt takes as input a BAM file containing alignments of sequences to a
reference sequence. The Reconstructed Sapiens Reference Sequence (RSRS) is
typically used the reference sequence, although reads may be mapped to the
Revised Cambridge Reference Sequence (rCRS) as both share a common coordinate
system.
Aligned sequences are provided to mixemt as a BAM file. Alignments should be
filtered prior to input using samtools or a similar program. Internally,
mixemt treats alignments sharing the same query template name as a single
fragment and does not check read pairs for orientation, insert size, unaligned
segments, or secondary alignments. The options -q and -Q allow for basic
filtering of alignments and base calls based on map quality and base quality
scores.
Since the mitochondrial genome is circular, it may be advantageous to align reads across the junction of the start and end of the reference sequence, split the reads, and correct the mapping coordinates. A basic implementation of these steps can be found in the repository circ_aln.
Options
Basic options
-h, --help
Print help message describing usage and options.
-v, --verbose
Print detailed status with useful information while running.
--ref ref.fasta
Specify a FASTA file containing the reference sequence. The first sequence in
the file will be used. The reference sequence must match the variants
contained in the Phylotree input to avoid unexpected behavior. The
Reconstructed Sapiens Reference Sequence (RSRS) contained in
mixemt/ref/RSRS.mtDNA.fa is used by default.
--phy phylotree.csv
Specify a Phylotree CSV file to use. By default, the file for Phylotree
Build 17 (mixemt/phylotree/mtDNA_tree_Build_17.csv) is used.
Customization options
-H custom.tab, --haps custom.tab
Use the haplotypes described in the specified file along with the
haplogroups from Phylotree. A haplotype file consists of lines representing
haplotypes. Each line contains a distinct haplotype identifier, a tab
character, and a comma-separated list of variants in the form
[Ancestral/Reference Base][position][Derived/Variant Base].
custom_hap1 C152T,A2758G,C2885T,G7146A,T8468C
custom_hap2 C4670T,A5282G,C9007T,G12192A,T15604C
Positions that do not appear in the variant list are assumed to contain the reference base and positions that have been excluded from analysis are not used. Variants that appear at novel positions (i.e., no known variants in Phylotree) can be used, but it is assumed that all Phylotree haplogroups carry the reference base.
-e SITES, --exclude_pos SITES
Specify a comma-separated list of 1-based positions or 'start-end' ranges to be excluded from consideration.
-A, --anon-haps
Ignore Phylotree haplogroups without IDs. By default, a placeholder name is
generated for these haplogroups using the parent haplogroup's ID and an integer
number (e.g., H1[2] is a unnamed subgroup of haplogroup H1).
-U, --unstable
Ignore sites with variants listed as unstable in Phylotree.
Quality filters
These options provide basic filtering of mapped sequences and base observations.
-q INT, --min-MQ INT
Ignore alignments with mapping quality less than INT (default: 30).
-Q INT, --min-BQ INT
Ignore bases with base quality scores less than INT (default: 30).
Expectation-maximization
Options for setting the parameters used during the expectation-maximization step where the mixture proportions are inferred. The default values should work well most of the time, but adjusting these values may be useful in some situations.
-S N --seed N
Use value N to seed the random number generator that generates starting
mixture proportions. Allows mixemt to be run multiple times using the same
input and produce the same results each time. If unset, the starting mixture
proportions will differ with each run.
-i ALPHA, --init ALPHA
Use parameter ALPHA to initialize haplogroup contributions from Dirichlet
distribution. Set to 'inf' to give haplogroups equal priors. (default: 1.0)
-T TOLERANCE, --converge TOLERANCE
Stop EM iteration when absolute difference between current and previous
contribution estimates is less than TOLERANCE (default: 0.0001)
-m N, --max-em-iter N
Maximum of number of EM iterations to run (default: 10000)
-M N, --multi-em N
Run EM until convergence multiple times and report the results averaged over all runs (default: 1)
Contributor detection options
Once the mixture proportions have been inferred, two filtering steps are
applied to narrow the field of contributing to haplotypes to only the
strongest candidates. First, we examine the read/haplogroup probabilities
to find the haplogroup associated with the highest probability for each
read. Haplogroups that have no or little read support are removed from
consideration (see -r option). Second, using the haplogroups that pass
the previous filter, we verify that the variants that define each contributor
are in fact present in the mixture. This step removes haplotype signals
that are most likely driven by one or two private mutations.
-C HAP1,HAP2,...,HAPN, --contributors HAP1,HAP2,...,HAPN
Skip contributor detection step and use the specified comma-separated list of haplogroups instead (be careful).
-r N, --min-reads N
Haplogroup must have N reads to be considered a contributor (default: 10). This value should be adjusted when coverage is high to avoid signals caused by sequencing errors and when coverage is low and allelic dropout may be likely.
-R N, --var-min-reads N
Variant base must be found in N reads to be considered as present in sample
(default: 3). This value should be above the expected number of base errors
given the sequence coverage.
-F F, --var-fraction-min-reads F
Variant base must be found in fraction F of reads to be considered as present in sample. A minimum of reads can be set with the -R option for when this is low or can be low (default: 0.02)
-f F, --var-fraction F
Fraction of unique defining variants that must be observed to call a haplogroup present (default: 0.5). This value should be adjusted based on the likelihood of allelic dropout and the number of variant differences between contributors.
-n N, --var-count N
Call haplogroup a contributor if it has at least N unique variants observed in
the sample, regardless of total number of defining variants. Use when allelic
dropout is likely (default: None).
-V, --no-var-check
Disable requirement that the majority of contributors' unique defining variants are present in the sample. Use when coverage is very low and dropout is likely.
-E
Skip contribution estimate refinement and report proportions from initial EM run.
Assembly options
-a ODDS, --assign-odds ODDS
Minimum odds ratio (probability between best and second best haplogroup) required to assign read to a contributor (default: 2.0)
-x, --extend-assemblies
Attempt to extend haplotype assemblies iteratively by identifying novel variants from contributor consensus sequences assigning reads based off of them.
-c N, --cons-cov N
When extending assemblies with -x, sets the depth of coverage required to call
a base for a contributor (default: 2)
