Daligner: The Dazzler "Overlap" Module

Author: Gene Myers

First: April 10, 2016

Current: April 19, 2019

For typeset documentation, examples of use, and design philosophy please go to my blog.

Version Numbers

v1.0 has been released, but if you need to refer to a later revision from the stable master branch, please use v1.0.yyyymmdd where yyyy-mm-dd is the date of the commit used. This is important for method details in scientific papers, and for software packaging (e.g. Conda, HomeBrew, or Linux distribution packages).

The commands below permit one to find all significant local alignments between reads encoded in Dazzler database. The assumption is that the reads are from a PACBIO RS II long read sequencer. That is the reads are long and noisy, up to 15% on average.

Recall that a database has a current partition that divides it into blocks of a size that can conveniently be handled by calling the daligner overlapper on all the pairs of blocks producing a collection of .las local alignment files that can then be sorted and merged into an ordered sequence of sorted files containing all alignments between reads in the data set. The alignment records are parsimonious in that they do not record an alignment but simply a set of trace points, typically every 100bp or so, that allow the efficient reconstruction of alignments on demand.

All programs add suffixes (e.g. .db, .las) as needed. For the commands that take multiple .db or .las file blocks as arguments, i.e. daligner, LAsort, LAmerge, LAcat, and LAcheck, one can place a @-sign in the name, which is then interpreted as the sequence of files obtained by replacing the @-sign by 1, 2, 3, ... in sequence until a number is reached for which no file matches. One can also place a @-sign followed by an integer, say, i, in which case the sequence starts at i. Lastly, one can also use @i-j where i and j are integers, in which case the sequence is from i to j, inclusive.

The formal UNIX command line descriptions and options for the DALIGNER module commands are as follows:

1. daligner [-vaAI]
       [-k<int(16)>] [-%<int(28)>] [-h<int(50)>] [-w<int(6)>] [-t<int>] [-M<int>]
       [-e<double(.75)] [-l<int(1500)] [-s<int(100)>] [-H<int>]
       [-T<int(4)>] [-P<dir(/tmp)>] [-m<track>]+
       <subject:db|dam> <target:db|dam> ...

Compare sequences in the trimmed <subject> block against those in the list of <target> blocks searching for local alignments involving at least -l base pairs (default 1000) or more, that have an average correlation rate of -e (default 70%). The local alignments found will be output in a sparse encoding where a trace point on the alignment is recorded every -s base pairs of the a-read (default 100bp). Reads are compared in both orientations and local alignments meeting the criteria are output to one of several created files described below. The -v option turns on a verbose reporting mode that gives statistics on each major step of the computation. The program runs with 4 threads by default, but this may be set to any positive value with the -T option.

The options -k, -%, -h, and -w control the initial filtration search for possible matches between reads. Specifically, our search code looks for a pair of diagonal bands of width 2w (default 26 = 64) that contain a collection of matching k-mers (default 16) in the lowest %-percentifle between the two reads, such that the total number of bases covered by the k-mer hits is h (default 50). k cannot be larger than 32 in the current implementation. These parameters will shortly be superceded with a more intuitive interface.

If there are one or more interval tracks specified with the -m option, then the reads of the DB or DB's to which the mask applies are soft masked with the union of the intervals of all the interval tracks that apply, that is any k-mers that contain any bases in any of the masked intervals are ignored for the purposes of seeding a match. An interval track is a track, such as the "dust" track created by DBdust, that encodes a set of intervals over either the untrimmed or trimmed DB.

Invariably, some k-mers are significantly over-represented (e.g. homopolymer runs). These k-mers create an excessive number of matching k-mer pairs and left unaddressed would cause daligner to overflow the available physical memory. One way to deal with this is to explicitly set the -t parameter which suppresses the use of any k-mer that occurs more than t times in either the subject or target block. However, a better way to handle the situation is to let the program automatically select a value of t that meets a given memory usage limit specified (in Gb) by the -M parameter. By default daligner will use the amount of physical memory as the choice for -M. If you want to use less, say only 8Gb on a 24Gb HPC cluster node because you want to run 3 daligner jobs on the node, then specify -M8. Specifying -M0 basically indicates that you do not want daligner to self adjust k-mer suppression to fit within a given amount of memory.

Each found alignment is recorded as -- a[ab,ae] x bo[bb,be] -- where a and b are the indices (in the trimmed DB) of the reads that overlap, o indicates whether the b-read is from the same or opposite strand, and [ab,ae] and [bb,be] are the intervals of a and bo, respectively, that align. For each subject, target pair of blocks, say X and Y, the program reports alignments where the a-read is in X and the b-read is in Y, or vice versa. However, if the -A option is set ("A" for "asymmetric") then just overlaps where the a-read is in X and the b-read is in Y are reported, and if X = Y then it further reports only those overlaps where the a-read index is less than the b-read index. In either case, if the -I option is set ("I" for "identity") then when X = Y, overlaps between different portions of the same read will also be found and reported. In summary, the command daligner -A X Y produces a single file X.Y.las and daligner X Y produces 2 files X.Y.las and Y.X.las (unless X=Y in which case only a single file, X.X.las, is produced). The overlap records in one of these files are sorted as described for LAsort. The -a option to daligner is passed directly through to LAsort which is actually called as a sub-process to produce the sorted file. In order to produce the aforementioned .las file, several temporary .las files, two for each thread, are produce in the sub-directory /tmp by default. You can overide this location by specifying the directory you would like this activity to take place in with the -P option.

By default daligner compares all overlaps between reads in the database that are greater than the minimum cutoff set when the DB or DBs were split, typically 1 or 2 Kbp. However, the HGAP assembly pipeline only wants to correct large reads, say 8Kbp or over, and so needs only the overlaps where the a-read is one of the large reads. By setting the -H parameter to say N, one alters daligner so that it only reports overlaps where the a-read is over N base-pairs long.

While the default parameter settings are good for raw Pacbio data, daligner can be used for efficiently finding alignments in corrected reads or other less noisy reads. For example, for mapping applications against .dams we run daligner -k20 -h60 -e.85 and on corrected reads, we typically run daligner -k25 -w5 -h60 -e.95 -s500 and at these settings it is very fast.

2. LAsort [-va] <align:las> ...

Sort each .las alignment file specified on the command line. For each file it reads in all the overlaps in the file and sorts them in lexicographical order of (a,b,o,ab) assuming each alignment is recorded as a[ab,ae] x bo[bb,be]. It then writes them all to a file named <align>.S.las (assuming that the input file was <align>.las). With the -v option set then the program reports the number of records read and written. If the -a option is set then it sorts LAs in lexicographical order of (a,ab) alone, which is desired when sorting a mapping of reads to a reference.

If the .las file was produced by damapper the local alignments are organized into chains where the LA segments of a chain are consecutive and ordered in the file. LAsort can detects that it has been passed such a file and if so treats the chains as a unit and sorts them on the basis of the first LA in the chain.

3. LAmerge [-va] [-P<dir(/tmp)>] <merge:las> <parts:las> ...

Merge the .las files <parts> into a singled sorted file <merge>, where it is assumed that the input <parts> files are sorted. There are no limits to how many files can be merged, but if there are more than 252, a typical UNIX OS limit on the number of simultaneously open files, then the program recursively spawns sub-processes and creates temporary files in the directory specified by the -P option, /tmp by default. With the -v option set the program reports the number of records read and written. The -a option indicates the sort is as describe for LAsort above.

If the .las file was produced by damapper the local alignments are organized into chains where the LA segments of a chain are consecutive and ordered in the file. When merging such files, LAmerge treats the chains as a unit and orders them on the basis of the first LA in the chain.

Used correctly, LAmerge and LAsort together allow one to perform an "external" sort that produces a collection of sorted files containing in aggregate all the local alignments found by the daligner, such that their concatenation is sorted in order of (a,b,o,ab) (or (a,ab) if the -a option is set). In particular, this means that all the alignments for a given a-read will be found consecutively in one of the files. So computations that need to look at all the alignments for a given read can o

DALIGNER

Install / Use

README

Daligner: The Dazzler "Overlap" Module

Author: Gene Myers

First: April 10, 2016

Current: April 19, 2019

Version Numbers