Edyeet
base-accurate DNA sequence alignments using edlib and mashmap2
edyeet
edyeet is a fork of MashMap that implements base-level alignment using edlib, via the wflign tiled wavefront global alignment algorithm.
It completes MashMap with a high-performance alignment module capable of computing base-level alignments for very large sequences.
process
Each query sequence is broken into non-overlapping pieces defined by -s[N], --segment-length=[N].
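The segmentation can be pictured with a small shell sketch. The function name `segment_coords` is hypothetical and only illustrates the coordinate arithmetic; it assumes the trailing partial segment is reported as well, which may not match how edyeet treats the end of a query.

```shell
# Print 0-based, half-open [start, end) coordinates of consecutive,
# non-overlapping segments for a query of length $1 and segment length $2.
# Assumption: the final, shorter segment is printed too.
segment_coords() {
    awk -v len="$1" -v seg="$2" 'BEGIN {
        for (start = 0; start < len; start += seg) {
            end = start + seg
            if (end > len) end = len
            printf "%d\t%d\n", start, end
        }
    }'
}

# Example: a 120 kb query split with a segment length of 50000
segment_coords 120000 50000
```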
These segments are then mapped using MashMap's sliding minhash mapping algorithm and subsequent filtering steps.
To reduce memory usage, the initial mappings are stored in a temporary file.
Each mapping location is then used as a target for alignment using edlib.
The resulting alignments always contain extended CIGARs in the cg:Z:* tag.
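As a minimal sketch of how the extended CIGARs might be pulled out of the PAF output, the hypothetical helper `paf_cigars` below scans the optional tag columns for cg:Z:. It relies only on the standard PAF layout (query name in column 1, target name in column 6, tags from column 13 onward).

```shell
# Print query name, target name, and extended CIGAR for each PAF record
# that carries a cg:Z: tag. Records without the tag are skipped.
paf_cigars() {
    awk -F'\t' '{
        for (i = 13; i <= NF; i++) {
            if ($i ~ /^cg:Z:/) {
                # "cg:Z:" is 5 characters, so the CIGAR starts at position 6
                print $1 "\t" $6 "\t" substr($i, 6)
                break
            }
        }
    }' "$@"
}

# Usage: paf_cigars aln.paf
```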
Approximate mapping (equivalent to MashMap) can be obtained with -m, --approx-map.
Mapping merging is disabled by default, because aligning merged approximate mappings with edlib under reasonable identity bounds can lead to very long runtimes.
However, merging can be useful in some settings and is enabled with -M, --merge-mappings.
Sketching, mapping, and alignment are all run in parallel using a configurable number of threads.
The number of threads must be set manually, using -t, and defaults to 1.
usage
edyeet has been developed to accelerate the alignment step in variation graph induction (the first step in the seqwish / smoothxg pipeline).
Suitable default settings are provided for this purpose.
Five parameters shape the length, number, and identity of the resulting mappings:
- -s[N], --segment-length=[N] is the length of the mapped and aligned segment (when -N is not set)
- -N, --no-split avoids splitting queries into segments, and instead maps them in their full length
- -p[%], --map-pct-id=[%] is the minimum percentage identity in the mapping step
- -n[N], --n-secondary=[N] is the maximum number of mappings (and alignments) to report for each segment above segment-length (the number of mappings for sequences shorter than the segment length is defined by -S[N], --n-short-secondary=[N], and defaults to 1)
- -a[%], --align-pct-id=[%] defines the minimum percentage identity of an alignment to report from the alignment step
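To get a feel for what a percentage identity threshold keeps or discards, the reported alignments can be inspected after the fact. The hypothetical helper `paf_min_identity` below uses the standard PAF columns 10 (matching bases) and 11 (alignment block length). This is a rough post-hoc approximation and not necessarily the identity measure that edyeet applies internally.

```shell
# Keep PAF records whose matches / block length, as a percentage,
# is at least the given threshold.
# $1 = minimum percentage identity, remaining arguments = PAF files
paf_min_identity() {
    min_pct="$1"; shift
    awk -F'\t' -v min="$min_pct" '$11 > 0 && 100 * $10 / $11 >= min' "$@"
}

# Usage: paf_min_identity 90 aln.paf
```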
Together, these settings allow us to precisely define an alignment space to consider.
During all-to-all mapping, -X can additionally help us by removing self mappings from the reported set, and -Y extends this capability to prevent mapping between sequences with the same name prefix.
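The idea of prefix-based filtering can be sketched as a post-processing step on PAF output. The helper name `paf_drop_same_prefix` and the rule used here (the prefix is everything before the first occurrence of a chosen delimiter) are assumptions for illustration; they are not a description of how -Y determines prefixes.

```shell
# Drop PAF records whose query (column 1) and target (column 6) names
# share the same prefix. A name without the delimiter is its own prefix,
# so plain self mappings are dropped as well.
# $1 = delimiter (hypothetical choice), remaining arguments = PAF files
paf_drop_same_prefix() {
    delim="$1"; shift
    awk -F'\t' -v d="$delim" '
        function prefix(name,   i) {
            i = index(name, d)
            return (i > 0) ? substr(name, 1, i - 1) : name
        }
        prefix($1) != prefix($6)
    ' "$@"
}

# Usage: paf_drop_same_prefix '#' aln.paf
```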
examples
Map a set of query sequences against a reference genome:
edyeet reference.fa query.fa >aln.paf
Set a longer segment length to reduce spurious alignments:
edyeet -s 50000 reference.fa query.fa >aln.paf
Self-mapping of sequences:
edyeet -X query.fa query.fa >aln.paf
sequence indexing
edyeet provides a progress log that estimates time to completion.
This depends on determining the total query sequence length.
To prevent lags when starting a mapping process, users should run samtools faidx on the query and target FASTA files to index them.
The .fai indexes are then used to quickly compute the sum of query lengths.
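The same sum can be reproduced by hand once a .fai index exists. The sketch below uses a hypothetical helper, `total_fai_length`, and relies only on the .fai format, in which the second tab-separated column holds each sequence's length.

```shell
# Sum the sequence lengths (column 2) of one or more .fai index files.
# printf "%.0f" avoids scientific notation for very large totals.
total_fai_length() {
    awk -F'\t' '{ total += $2 } END { printf "%.0f\n", total }' "$@"
}

# Usage: total_fai_length query.fa.fai
```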
installation
The build is orchestrated with cmake:
cmake -H. -Bbuild && cmake --build build -- -j 16
The edyeet binary will be in build/bin.
To clean up, just remove the build directory.
publications
- Chirag Jain, Sergey Koren, Alexander Dilthey, Adam M. Phillippy, and Srinivas Aluru. "A Fast Adaptive Algorithm for Computing Whole-Genome Homology Maps." Bioinformatics (ECCB issue), 2018.
- Chirag Jain, Alexander Dilthey, Sergey Koren, Srinivas Aluru, and Adam M. Phillippy. "A fast approximate algorithm for mapping long reads to large reference databases." In International Conference on Research in Computational Molecular Biology, Springer, Cham, 2017.
- Martin Šošić and Mile Šikić. "Edlib: a C/C++ library for fast, exact sequence alignment using edit distance." Bioinformatics, 2017.
