minialign

Minialign is a little bit fast and moderately accurate nucleotide sequence alignment tool designed for PacBio and Nanopore long reads. It is built on three key algorithms, minimizer-based index of the minimap overlapper, array-based seed chaining, and SIMD-parallel Smith-Waterman-Gotoh extension.

Announcements

Latest: 0.6.0; stable: 0.4.4
2018/2/13: Version 0.6.0: Better alignment computation with 4x4 score matrix, piecewise affine-gap penalty function, and wider band. Added circular reference sequence support. Added universal binary target. All-versus-all mode is temporarily disabled (please use 0.5.3).
2017/6/8: Version 0.5.3: Fix a bug in FASTA/Q parser.
2017/4/3: Version 0.5.0 is released. New features: SA (supplementary alignment) and MD (mismatch position) tags are enabled with -TSA and -TMD flags.

Getting started

C99 compiler (gcc / clang / icc) is required to build the program.

$ make && make install	# PREFIX=/usr/local by default
$ make universal && make install        # universal binary for HPC clusters
$ minialign -xont.1dsq reference.fa reads.[fa,fq,bam] > read_to_ref.sam

Reference sequence index can be stored in separate. Using prebuilt index saves around a minute per run for a human haploid (~3G) genome.

$ minialign -d index.mai reference.fa	# build index
$ minialign index.mai reads.[fa,fq,bam] > out.sam	# mapping on prebuilt index

Frequently used options are: scoring parameters, minimum score cut-offs, and number of threads.

$ minialign -a2 -b5 -p5 -q1 -r3,3		# match, mismatch, (gap-open for gap-extend large gaps), and gap-extend for short gaps
$ minialign -eAG+2,GA+1                 # score matrix modifier: add 2 to A -> G substitution, and 1 to G -> A substitution
$ minialign -s1000	# set minimum score threshold to 1000
$ minialign -m0.8	# set report threshold at 0.8 of the highest score for every read
$ minialign -t10	# minialign is now 10x faster!!!

Benchmarks

All the following benchmarks were took on Intel i5-6260U (Skylake, 2C4T, 2.8GHz, 4MBL3) with 32GB (DDR4, 2133MHz) RAM.

Speed

| Time (sec.) | minialign | DALIGNER | BWA-MEM | |:----------------------------------------------------:|:-----------:|:-----------:|:-----------:| | E.coli (MG1655) x100 simulated read (460Mb) to ref. | 10.0 | 39.5 | 6272 | | S.cerevisiae (sacCer3) x100 sim. (1.2Gb) to ref. | 29.3 | 1134* | 10869 | | D.melanogaster (dm6) x20 sim. (2.75Gb) to ref. | 82.6 | - | 31924 | | Human (hg38) x3 sim. (9.2Gb) to ref. | 648 | - | - |

Notes: Execution time was measured with the unix time command, shown in seconds. Dashes denote untested conditions. Program version information: minialign-0.4.6, DALIGNER-ca167d3 (commit on 2016/9/27), and BWA-MEM-0.7.15-r1142-dirty. All the programs were compiled with gcc/g++-5.4.0 providing the optimization flag -O3. PBSIM (PacBio long-read simulator), modified version based on 1.0.3 (1.0.3-nfree) not to generate reads containing N's, was used to generate read sets. Simulation parameters (len-mean, len-SD, acc-mean, acc-SD) were fixed at (20k, 2k, 0.88, 0.07) in all the samples. Arguments passed to the programs; minialign: -t4 -xpacbio, DALIGNER: -T4, and BWA-MEM: -t4 -xpacbio -A1 -B2 -O2 -E1 (overriding scoring parameters based on the PacBio defaults). Index construction (minialign and BWA-MEM) and format conversion time (DALIGNER: fasta -> DB, las -> sam) are excluded from the measurements. Peak RAM usage of minialign was around 12GB in human read-to-ref mapping with four threads. Starred sample, S.cerevisiae on DALIGNER, was splitted into five subsamples since the whole concatenated fastq resulted in an out-of-memory error. Calculation time of the subsamples were 61.6, 914.2, 56.8, 50.3, and 51.5 seconds, where the second trial behaved a bit strangely with too long calculation on one (out of four) threads.

Recall-precision trend

Recall-precision trend (D.melanogaster)

(a). D.melanogaster (dm6) x10 (1.4Gb)

Recall-precision trend (Human)

(b). Human (hg38) x0.3 (1Gb)

Notes: Recall is defined as: a proportion of reads whose originating region, represented by (spos, epos) coordinate pair on the reference, has at least one non-zero-length intersection with its alignments. Precision is defined as: a proportion of alignment records which has a non-zero-length intersection with its originating region. The recall and precision pairs were calculated from the output sam files, filtered with different mapping quality thresholds between 0 and 60. Duplicate alignments were not filtered out from the output sam files. Program version information: minialign-0.4.6, GraphMap-0.4.0, BLASR-0014a57 (the last commit with the SAM output), and BWA-MEM-0.7.15-r1142-dirty. Read set was generated by the PBSIM-1.0.3-nfree with the parameters (len-mean, len-SD, acc-mean, acc-SD) set to (10k, 10k, 0.8, 0.2) without ALT / random contigs. Reads were mapped onto the corresponding reference genomes including ALT / random contigs. Arguments passed to the programs; minialign: -t4 -xpacbio, GraphMap: -t 4, BLASR: --nproc 4, and BWA-MEM: -t4 -xpacbio. Calculation time and peak RAM usages are shown in the table below.

| Time (sec.) / Peak RAM (GB) | minialign | GraphMap | BLASR | BWA-MEM | |:-------------------------------------:|:-----------:|:-----------:|:-----------:|:-----------:| | D.melanogaster (dm6) x10 sim. (1.4Gb) | 51.3 / 2.2 | 6482 / 4.3 | 30081 / 1.0 | 37292 / 0.5 | | Human (hg38) x0.3 sim. (1Gb) | 87.4 / 12 | - | - | 34529 / 5.5 |

Effect of read length and score threshold on recall

readlength-recall trend (Human)

Notes: The solid lines in the figure shows the proportions of mapped reads. The dashed lines shows the recalls, defined as above. The minimum alignment score threshold (-s) was differed among 50, 100, 200, and 400. Reads were generated from hg38 without ALT / random contigs using PBSIM-1.0.3-nfree with the (len-mean, len-SD, acc-mean, acc-SD) parameters set to (2000, 2000, 0.88, 0.07). Reads were mapped onto the reference with ALT / random contigs included. Minialign-0.4.4 was run with the -t4 -xpacbio flags and the additional -s parameters.

Algorithm overview

The algorithm is roughly based on the seed-and-extend strategy with additional seed filtering and chaining stages. The seed filtering and chaining stages are essential to filter out futile or redundant extension trials, where long and erroneous query sequences resulting in numerous false positives and a small amount of consecutive correct hits. The effectiveness of the seed chaining is first shown by Chaisson and Tessler in the BLASR algorithm [1], and also in the BWA-MEM [2]. The additional filtering stage is introduced by the GraphMap [3], improved overall computational performance combined with the chaining algorithm. The minimap algorithm [4] adopted the minimizer [5] and an invertible hash function [6] to reduce seeds to be enumerated at the indexing stage.

Minimizer-based index structure

Minialign also adopted the invetible hash function [6] and the minimizer-based seed filtering [4] of the minimap. The same indexing parameters, the k-mer length and the window size, are adopted as the default value, sustaining the equivalent information as the minimap program. Some modifications were applied to improve simplicity and support a reference subsequence retrieval query.

Array-based seed chaining

An array-based seed chaining algorithm was introduced to efficiently detect a stream of seeds without capturing large insertions and deletions. In each chain extension trial, the downstream seeds are filterd with a "chainable window", which is a 30-degree-angled parallelogram placed at the right-bottom direction of the tailmost seed, preventing it from introducing unrealistically large gaps in the chain.

The whole chaining algorithm with an incremental repetitive-seed filtering was implemented as follows;

0.  occ[] is a precalculated array of occurrences correspond to the top [5%, 1%, 0.1%]
    frequent minimizers on the reference.
1.  Collect minimizers on the query to a seed bin.
2.  for (i = 0; there remains elements in the seed bin; i++) {
3.    Clear a result bin.
4.    Move seeds which occurs less than a threshold, occ[i], to a chain array.
5.    Sort the elements in the chain array by their `rpos - 2*qpos` values.
      Mark them as unchained.
6.    while (there remains unchained elements in the chain array) {
7.      Mark the first unchained seed in the array as a root of a chain.
        Initialize a tail-of-chain pointer to point at the root.
8.      if (the next seed in the chain array is inside the chainable window of the tail) {
9.        Mark it as chained and move the tail pointer to it.
10.     } else if (there is no seed in the chainable window) {
11.       Terminate the extension and save the resulting chain in the result bin, continue to 6.
        }
      }
12.   if (there is no meaningful chain in the result bin) { Then continue to 2. }
13.   Return the result bin;
    }

The rectangular-inclusion test on the line 8 is implemented with comparisons of rpos - qpos/2 values of the two adjacent seeds, wh

Minialign

Install / Use

README