Wfa
Wavefront alignment algorithm (WFA) in Golang
Install / Use
/learn @shenwei356/WfaREADME
Wavefront alignment algorithm (WFA) in Golang
This golang package implements Wavefront alignment algorithm (WFA), not BiWFA (maybe in the future).
Implemented features:
- Distance metrics: gap-affine.
- Alignment types: global, semi-global.
- Heuristics: wf-adaptive.
Executable binaries for common operating systems are available to download, for benchmarking with lots of sequences, or for fast aligning two sequences in the command line.
$ wfa-go -g "Bioinformatics helps Biology" "We learn bioinformatics to help biologists"
query ---------Bioinformatics ---helps Biology---
|||||||||||||| |||| | |||||
target We learn bioinformatics to help- biologists
cigar 9I1X14M3I4M1D1M1X5M1X3I
align-score : 32
match-region: q[2, 27]/28 vs t[11, 38]/42
align-length: 29, matches: 24 (82.76%), gaps: 4, gap regions: 2
Table of Contents
History
- I need a fast DNA alignment package in Golang for my project.
- WFA "looks easy" to implement as it does not heavily reply on SIMD intrinsics, though there are some other algorithms performing well in a benchmark.
- I'm not familiar with C++, and I found it difficult to understand the official code. I only found one (in Rust) 3rd party implementation.
- After reading the WFA paper, I thought the algorithm is easy, so I implemented WFA from the scratch.
- Later I found it not easy, there were so many details.
After reading the
nextstep in the rust version, I realised it's because I don't know gap-affine penalties. - The
backtracestep is the most difficult part. In v0.1.0, I used 3 extra components to store the source offsets for each cell in I, D, M components. But it needs more memory. And the speed is also not ideal, 1/20 of official version.- Besides, I checked the bases again when tracing back matches. WFA did this too, but WFA2 did not.
- Next, aftering reading thofficial implementation, I rewrote the whole project, using similar backtrace workfow with WFA2. The speed increased to 1/10 of the official version.
- C++ is wild, it even support accessing list/array elements with negative indexes (the diagonal k).
Details
-
A WFA component is saved as a list of WaveFront
[]*WaveFront.- Score
sis the index. - If the value is
nil, it means the score does not exsit.
- Score
-
A WaveFront is saved as a list of offsets
[]uint32. We preset the list with a big length (2048) to avoid frequentappendoperations.-
Diagonal
kis the index. To support negativekvalues, we use ZigZag encoding:index: 0, 1, 2, 3, 4, 5, 6 k: 0, -1, 1, -2, 2, -3, 3Value
0means thekdoes not exist. -
Offsets are saved with
uint32integers, with the lower 3 bits for saving 6 possible paths which are used for backtrace.wfaInsertOpen uint32 = iota + 1 wfaInsertExt wfaDeleteOpen wfaDeleteExt wfaMismatch wfaMatch
-
-
Maximum sequence length: 512 Mb,
536,870,911(1<<(32-3) - 1). -
All objects are saved in object pools for computation efficiency. Just don't forget to recycle them.
Examples
Each WFA component can be visualized as a table. A table cell contains the alignment type symbol and the score.
⊕ Unknown,
⟼ Gap open (Insertion)
🠦 Gap extension (Insertion)
↧ Gap open (Deletion)
🠧 Gap extension (Deletion)
⬂ Mismatch
⬊ Match
Global alignment
| | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |:-:|:-:|:----:|:---:|:----:|:----:|:---:|:---:|:---:|:---:|:---:|:---:| | | | A | G | G | A | T | G | C | T | C | G | |1 |A |⬊ 0 |⟼ 8 |🠦10 |🠦12 | . | . | . | . | . | . | |2 |C |↧ 8 |⬂ 4 |⬂12 | . | . | . | . | . | . | . | |3 |C |🠧10 |⬂12 |⬂ 8 | . | . | . | . | . | . | . | |4 |A |🠧12 | . | . |⬊ 8 | . | . | . | . | . | . | |5 |T | . | . | . | . |⬊ 8 | . | . | . | . | . | |6 |A | . | . | . | . | . |⬂12 | . | . | . | . | |7 |C | . | . | . | . | . | . |⬊12 | . | . | . | |8 |T | . | . | . | . | . | . | . |⬊12 | . | . | |9 |C | . | . | . | . | . | . | . | . |⬊12 | . | |10 |G | . | . | . | . | . | . | . | . | . |⬊12 |
CIGAR: 1M2X2M1X4M
query ACCATACTCG
| || ||||
target AGGATGCTCG
align-score : 12
align-region: q[1, 10] vs t[1, 10]
align-length: 10, matches: 7 (70.00%), gaps: 0, gapRegions: 0
Semi-global alignment
| | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |:-:|:-:|:---:|:---:|:----:|:----:|:----:|:----:|:---:|:----:|:----:|:---:|:----:|:---:| | | | C | A | G | G | C | T | C | C | T | C | G | G | |1 |A |⬂ 4 |⬊ 0 |⬂ 4 |⬂ 4 |⬂ 4 |⬂ 4 |⬂ 4 |⬂ 4 |⬂ 4 |⬂ 4 |⬂ 4 |⬂ 4 | |2 |C |⬊ 0 |⬂ 8 |⬂ 4 |⬂ 8 |⬊ 4 |⬂ 8 |⬊ 4 |⬊ 4 |⬂ 8 |⬊ 4 |⬂ 8 |⬂ 8 | |3 |G |⬂ 4 |⬂ 4 |⬊ 8 |⬊ 4 |⬂12 |⬂ 8 |⬂12 |⬂ 8 |⬂ 8 | . |⬊ 4 |⬊ 8 | |4 |A |⬂ 4 |⬊ 4 |⬂ 8 |⬂12 |⬂ 8 |🠧14 |⬂12 |🠧14 |⬂12 |⬂12 |↧12 |⬂ 8 | |5 |T |⬂ 4 |⬂ 8 |⬂ 8 |⬂12 | . |⬊ 8 | . |⬂16 |⬊14 | . |🠧14 |⬂16 | |6 |C |⬊ 0 |⬂ 8 |🠦10 |⬂12 |⬊12 |🠦18 |⬊ 8 |⟼16 |🠦18 |⬊14 |🠧16 |⬂18 | |7 |T |⬂ 4 |⬂ 4 | . | . | . |⬊12 |↧16 |⬂12 |⬊16 | . |⬂18 |⬂20 | |8 |C |⬊ 0 |⬂ 8 |⬂ 8 |🠦12 |🠦14 |🠦16 |⬊12 |⬊16 |⬂16 |⬊16 |🠧20 | . | |9 |G |⬂ 4 |⬂ 4 |⬊ 8 |⬊ 8 |⬂16 |⬂18 |⬂20 |⬂16 |⬂20 |⬂20 |⬊16 |⬊20 |
CIGAR: 1I1M1X1M1X1M1I4M1I
query -ACGAT-CTCG-
| | | ||||
target CAGGCTCCTCGG
align-score : 16
align-region: q[1, 9] vs t[2, 11]
align-length: 10, matches: 7 (70.00%), gaps: 1, gapRegions: 1
Usages
import "github.com/shenwei356/wfa"
// ------------------[ initialization ]------------------
// aligner
algn := wfa.New(
&wfa.Penalties{
Mismatch: 4,
GapOpen: 6,
GapExt: 2,
},
&wfa.Options{
GlobalAlignment: false,
})
// set adaptive reduction parameters
algn.AdaptiveReduction(&wfa.AdaptiveReductionOption{
MinWFLen: 10,
MaxDistDiff: 50,
CutoffStep: 1,
})
// ------------------[ align one pair of seqs ]------------------
q := []byte("ACCATACTCG")
t := []byte("AGGATGCTCG")
// align
result, err := algn.Align(q, t)
checkErr(err)
// score table of M
algn.Plot(&q, &t, os.Stdout, algn.M, true)
if outputAlignment {
fmt.Println()
fmt.Printf("CIGAR: %s\n", result.CIGAR())
Q, A, T := result.AlignmentText(&q, &t)
fmt.Printf("query %s\n", *Q)
fmt.Printf(" %s\n", *A)
fmt.Printf("target %s\n", *T)
fmt.Println()
fmt.Printf("align-score : %d\n", result.Score)
fmt.Printf("align-region: q[%d, %d] vs t[%d, %d]\n",
result.QBegin, result.QEnd, result.TBegin, result.TEnd)
fmt.Printf("align-length: %d, matches: %d (%.2f%%), gaps: %d, gapRegions: %d\n",
result.AlignLen, result.Matches, float64(result.Matches)/float64(result.AlignLen)*100,
result.Gaps, result.GapRegions)
fmt.Println()
wfa.RecycleAlignmentText(Q, A, T) // !! important, recycle objects
}
wfa.RecycleAlignmentResult(result) // !! important, recycle objects
// ------------------[ clean ]------------------
wfa.RecycleAligner(algn) // !! important, recycle objects
CLI
A CLI (download binaries) is available to align two sequences from either positional arguments or an input file (format).
<details> <summary>Example</summary>Fast alignment.
$ wfa-go AGCTAGTGTCAATGGCTACTTTTCAGGTCCT AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
query AGCTA-GTGTCAATGGCTACT---TTTCAGGTCCT
| ||| ||||| |||||||| | |||||||||
target AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
cigar 1M1X3M1I5M2X8M3I1M1X9M
align-score : 36
match-region: q[1, 31]/31 vs t[1, 35]/35
align-length: 35, matches: 27 (77.14%), gaps: 4, gap regions: 2
From a input file, for benchmark.
$ wfa-go -i seqs.txt
query A-TTGGAAAATAGGATTGGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTTCGTCGTCCTTACGTTTCCGGAAGGGAGTGGTTAGCTCGAAGCCCA
|||||||||||||| ||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||| ||||||||||||
target GATTGGAAAATAGGAT-GGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTT-GTCGTCCTTACGTTTCCGGAAGGGAGTGGTT-GCTCGAAGCCCA
cigar 1X1I14M1D39M1D31M1D12M
align-score : 36
match-region: q[2, 100]/100 vs t[3, 98]/98
align-length: 99, matches: 96 (96.97%), gaps: 3, gap regions: 3
</details>
<details>
<summary>Usage</summary>
WFA alignment in Golang
Author: Wei Shen <shenwei356@gmail.com>
Code: https://github.com/shenwei356/wfa
Version: v0.2.0
Input file format:
see https://github.com/smarco/WFA-paper?tab=readme-ov-file#41-introduction-to-benchmarking-wfa-simple-tests
Example:
>ATTGGAAAATAGGATTGGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTTCGTCGTCCTTACGTTTCCGGAAGGGAGTGGTTAGCTCGAAGCCCA
<GATTGGAAAATAGGATGGGGTTTGTTTATATTTGGGTTGAGGGATGTCCCACCTTGTCGTCCTT
