Wfmash
base-accurate DNA sequence alignments using WFA and mashmap3
wfmash
a pangenome-scale aligner
wfmash is an aligner for pangenomes that combines efficient homology mapping with base-level alignment. It uses MashMap 3.5 to find approximate mappings between sequences, then applies WFA (Wave Front Alignment) to obtain base-level alignments. MashMap 3.5 employs minmers, a generalization of minimizers that provides unbiased Jaccard similarity estimation for improved mapping accuracy.
wfmash is designed to make whole genome alignment easy. On a modest compute node, whole genome alignments of gigabase-scale genomes should take minutes to hours, depending on sequence divergence. It can handle high sequence divergence, with average nucleotide identity between input sequences as low as 70%. By default, wfmash automatically determines an appropriate identity threshold based on the ANI (Average Nucleotide Identity) distribution of your input sequences, using the median (50th percentile) for optimal balance between coverage and alignment quality.
wfmash is the key algorithm in pggb (the PanGenome Graph Builder), where it is applied to make an all-to-all alignment of input genomes that defines the base structure of the pangenome graph. It can scale to support the all-to-all alignment of hundreds of human genomes.
Algorithm Overview
wfmash performs alignment in several stages:
- Mapping: Query sequences are broken into segments based on window size (default: 1kb) and mapped using MashMap with minmer sketches. Minmers are a generalization of minimizers that select multiple smallest k-mers per window, enabling unbiased Jaccard similarity estimation.
- Chaining: Consecutive mappings separated by less than the chain gap (default: 2kb) are merged into longer approximate mappings.
- Filtering: Various filters can be applied:
  - L1 filtering requires a minimum number of sketch hits (default: 3)
  - Plane-sweep filtering removes overlapping mappings
  - Hypergeometric filtering assesses mapping significance
- Scaffolding (optional): For large-scale alignments, scaffolding identifies syntenic regions:
  - Chains are merged with larger gaps (default: 100kb) to form scaffolds
  - Only chains with sufficient total length (default: 10kb) are considered
  - Mappings are retained if they fall within a maximum distance (default: 100kb) from scaffold anchors
  - This helps focus alignment on truly homologous regions while filtering out spurious matches
- Alignment: Filtered mappings are aligned at base-level using WFA. Mappings are limited to 50kb by default because WFA's complexity is quadratic in the number of differences.
For approximate mapping only, use -m/--approx-mapping to skip the alignment stage, which allows working with much larger segment and mapping lengths.
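The chaining stage can be illustrated with a small sketch. This is not wfmash's implementation: it assumes mappings are plain `(start, end)` intervals on a single sequence, whereas the real chaining works in both query and target coordinates. The function name `chain_mappings` is hypothetical.

```python
from typing import List, Tuple

Mapping = Tuple[int, int]  # (start, end) on one sequence, half-open


def chain_mappings(mappings: List[Mapping], chain_gap: int = 2000) -> List[Mapping]:
    """Merge consecutive mappings separated by less than chain_gap bases."""
    chains: List[Mapping] = []
    for start, end in sorted(mappings):
        if chains and start - chains[-1][1] < chain_gap:
            # Close enough to the previous chain: extend it.
            prev_start, prev_end = chains[-1]
            chains[-1] = (prev_start, max(prev_end, end))
        else:
            # Gap is too large: start a new chain.
            chains.append((start, end))
    return chains


segments = [(0, 1000), (1500, 2500), (6000, 7000)]
print(chain_mappings(segments))
```

With the default 2kb gap, the first two segments merge into one chain and the third stays separate.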
Usage
wfmash [target.fa] [query.fa] {OPTIONS}
Basic Examples
Map query sequences against a reference:
wfmash reference.fa query.fa >aln.paf
All-vs-all alignment (map a set of sequences to themselves):
wfmash sequences.fa >aln.paf
Output only approximate mappings without base-level alignment:
wfmash -m reference.fa query.fa >mappings.paf
For PanSN-formatted all-vs-all mapping, exclude mappings within the same genome:
wfmash -Y '#' pangenome.fa >aln.paf
Parameter Groups
Minmer Sketching
- -k[INT], --kmer-size=[INT] - k-mer size (default: 15)
- -s[INT], --sketch-size=[INT] - number of minmers per window (default: auto-calculated)
- -w[INT], --window-size=[INT] - window size for minmer selection (default: 1k)
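To make the sketching parameters more concrete, here is a simplified sketch of Jaccard estimation from the smallest k-mer hashes. It treats each segment as a single window and keeps its `s` smallest hashes (a bottom-s sketch); it is not MashMap's minmer implementation, and the hash function and all function names are assumptions chosen for illustration.

```python
import hashlib
from typing import Set


def kmer_hashes(seq: str, k: int = 15) -> Set[int]:
    """Hash every k-mer of seq to a 64-bit integer (stable across runs)."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def bottom_sketch(hashes: Set[int], s: int) -> Set[int]:
    """Keep the s smallest hash values."""
    return set(sorted(hashes)[:s])


def estimate_jaccard(a: str, b: str, k: int = 15, s: int = 50) -> float:
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    sketch_a = bottom_sketch(kmer_hashes(a, k), s)
    sketch_b = bottom_sketch(kmer_hashes(b, k), s)
    # The s smallest hashes of the union act as a sample of the union.
    union_sketch = bottom_sketch(sketch_a | sketch_b, s)
    if not union_sketch:
        return 0.0
    shared = union_sketch & sketch_a & sketch_b
    return len(shared) / len(union_sketch)
```

Identical segments give an estimate of 1.0, and segments with no shared k-mers give 0.0; a larger `s` reduces the variance of the estimate at the cost of more memory.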
Mapping Parameters
- -m, --approx-mapping - output mappings only, no alignment
- -p[FLOAT|aniXX[+/-N]], --map-pct-id=[FLOAT|aniXX[+/-N]] - minimum identity percentage or ANI preset (default: ani50)
  - Fixed percentage: -p 85 sets 85% identity threshold
  - ANI presets: -p ani25 uses 25th percentile, -p ani50 uses median (default)
  - Adjustments: -p ani50-10 uses median minus 10%, -p ani75+5 uses 75th percentile plus 5%
- -n[INT], --mappings=[INT] - number of mappings per segment (default: 1)
- -l[INT], --block-length=[INT] - minimum mapping block length (default: 0, no minimum)
- -c[INT], --chain-jump=[INT] - maximum gap to chain mappings (default: 2k)
- -P[INT], --max-length=[INT] - maximum mapping length for alignment (default: 50k)
- -N, --no-split - map each sequence as a single block
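The -p value forms can be read as "percentile of the ANI distribution, plus an optional adjustment". The sketch below shows one way such a value could be resolved to a threshold. How wfmash estimates the ANI distribution is not described here, so `ani_values` is a hypothetical input, and the linear-interpolation percentile and the function names are assumptions.

```python
import re
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def resolve_identity(spec: str, ani_values: List[float]) -> float:
    """Turn a -p style value into a percent-identity threshold."""
    match = re.fullmatch(r"ani(\d+)([+-]\d+(?:\.\d+)?)?", spec)
    if match is None:
        return float(spec)  # fixed percentage such as "85"
    base = percentile(ani_values, float(match.group(1)))
    adjustment = float(match.group(2)) if match.group(2) else 0.0
    return base + adjustment


ani = [80.0, 90.0, 100.0]  # made-up per-pair ANI estimates
for spec in ("85", "ani50", "ani50-10", "ani75+5"):
    print(spec, resolve_identity(spec, ani))
```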
Filtering Options
- -f, --no-filter - disable all filtering
- -M, --no-merge - keep fragment mappings separate
- -o, --one-to-one - report only best mapping per query/target pair
- -H[INT], --l1-hits=[INT] - minimum sketch hits for L1 filter (default: 3)
- -F[FLOAT], --filter-freq=[FLOAT] - filter high-frequency minimizers (default: 0.0002)
- --hg-filter=[n,Δ,conf] - hypergeometric filter parameters (default: 1.0,0.0,99.9)
Scaffolding Parameters (for synteny filtering)
- -S[INT], --scaffold-mass=[INT] - minimum scaffold length (default: 10k)
- -D[INT], --scaffold-dist=[INT] - maximum distance from scaffold anchors (default: 100k)
- -j[INT], --scaffold-jump=[INT] - maximum gap for scaffold chaining (default: 100k)
- --scaffold-out=[FILE] - output scaffold chains to FILE
- --scaffold-overlap=[FLOAT] - overlap threshold for scaffold chain filtering (default: 0.5)
Selection Filters
- -X, --self-maps - include self-mappings
- -Y[C], --group-prefix=[C] - exclude mappings within groups by prefix delimiter
- -L, --lower-triangular - only map seq_i to seq_j if i>j
- -T[pfx], --target-prefix=[pfx] - only map to targets with prefix
- -Q[pfxs], --query-prefix=[pfxs] - only map queries with prefix(es)
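As an illustration of grouping by prefix delimiter, the sketch below shows how PanSN-style names (sample#haplotype#contig) could be assigned to groups. It assumes the group is everything up to and including the last occurrence of the delimiter; that rule and the function names are assumptions for illustration, not a statement of wfmash's exact behavior.

```python
def group_prefix(name: str, delimiter: str = "#") -> str:
    """Return the part of name up to and including the last delimiter."""
    cut = name.rfind(delimiter)
    return name[:cut + 1] if cut >= 0 else name


def same_group(query: str, target: str, delimiter: str = "#") -> bool:
    """True if both sequence names fall into the same prefix group."""
    return group_prefix(query, delimiter) == group_prefix(target, delimiter)


print(same_group("HG002#1#chr1", "HG002#1#chr2"))  # same haplotype
print(same_group("HG002#1#chr1", "HG002#2#chr1"))  # different haplotypes
```

Under this assumption, two contigs of the same haplotype share a group and would be skipped, while contigs from different haplotypes or samples would still be mapped against each other.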
Alignment Parameters
- -g[m,go1,ge1,go2,ge2], --wfa-params=[m,go1,ge1,go2,ge2] - WFA gap costs (default: 5,8,2,24,1)
- -E[INT], --target-padding=[INT] - bases to extend target region
- -U[INT], --query-padding=[INT] - bases to extend query region
Output Options
- -a, --sam - output in SAM format (default: PAF)
- -d, --md-tag - include MD tag in output
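For downstream processing of the default PAF output, a minimal reader might look like the sketch below. It relies only on the twelve mandatory PAF columns; optional tags are ignored, and `PafRecord`, `parse_paf_line`, and `block_identity` are hypothetical names. The example line is made up.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_len: int
    target_start: int
    target_end: int
    matches: int
    block_len: int
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse the 12 mandatory tab-separated PAF columns; extra tags are ignored."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


def block_identity(rec: PafRecord) -> float:
    """Matches divided by alignment block length (columns 10 and 11)."""
    return rec.matches / rec.block_len if rec.block_len else 0.0


example = "q1\t1000\t0\t1000\t+\tt1\t5000\t100\t1100\t950\t1000\t60"
record = parse_paf_line(example)
print(record.query, record.target, block_identity(record))
```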
System Parameters
- -t[INT], --threads=[INT] - number of threads (default: 1)
- -I[FILE], --read-index=[FILE] - load pre-built index from FILE
- -W[FILE], --write-index=[FILE] - save index to FILE
- -b[SIZE], --batch=[SIZE] - target index batch size (default: 4G)
Input Indexing
wfmash requires a FASTA index (.fai) for its reference ("target"), and benefits if both reference and query are indexed.
We can build these indexes on BGZIP-indexed files, which we recommend due to their significantly smaller size.
To index your sequences, we suggest something like:
bgzip -@ 16 ref.fa
samtools faidx ref.fa.gz
Here, we apply bgzip (from htslib) to build a block-compressed gzip file that supports random access, and then use samtools to generate the FASTA index. The index is held in 2 files (.gzi and .fai) that sit next to the compressed FASTA:
$ ls -l ref.fa.gz*
ref.fa.gz
ref.fa.gz.gzi
ref.fa.gz.fai
Advanced Examples
Mapping longer sequences without alignment
For long sequences where you only need approximate mappings:
wfmash -m -w 50k -P 500k reference.fa query.fa >mappings.paf
Standard alignment with default parameters
For typical whole-genome alignment (default: ani50, -S 10k):
wfmash reference.fa query.fa >aln.paf
Higher identity threshold
For very similar sequences only (e.g., 95% identity):
wfmash -p 95 reference.fa query.fa >aln.paf
Using ANI presets
Automatically determine identity threshold from data:
# Use median ANI for balanced sensitivity/specificity
wfmash -p ani50 reference.fa query.fa >aln.paf
# Use 75th percentile minus 5% for higher sensitivity
wfmash -p ani75-5 reference.fa query.fa >aln.paf
Multiple mappings per segment
To explore alternative alignments:
wfmash -n 3 reference.fa query.fa >aln.paf
Pangenome all-vs-all with scaffolding
For large-scale pangenome construction with synteny filtering:
wfmash -Y '#' -S 20k -j 200k --scaffold-out scaffolds.paf pangenome.fa >aln.paf
One-to-one mapping
To get only the best mapping between each query-target pair:
wfmash -o reference.fa query.fa >aln.paf
Scaffolding for Large-Scale Alignments
Scaffolding is a powerful feature for filtering alignments to focus on syntenic regions. It's particularly useful for:
- Whole-genome alignments
- Pangenome construction
- Reducing noise in highly repetitive sequences
The scaffolding algorithm:
- Merges chains with large gaps (up to -j/--scaffold-jump, default 100kb)
- Filters for chains with sufficient total length (≥ -S/--scaffold-mass, default 10kb)
- Keeps only mappings within -D/--scaffold-dist (default 100kb) of scaffold anchors
This effectively identifies and preserves large-scale syntenic blocks while filtering out spurious matches.
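The scaffolding steps can be sketched roughly as follows. This is a one-dimensional simplification that assumes chains and mappings are `(start, end)` intervals on a single sequence and that scaffold mass is the summed length of the member chains, following the length-based default listed under Scaffolding Parameters. The real algorithm works in query and target coordinates, and every function name here is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open


def build_scaffolds(chains: List[Interval],
                    scaffold_jump: int = 100_000,
                    scaffold_mass: int = 10_000) -> List[Interval]:
    """Group chains separated by at most scaffold_jump; keep heavy groups."""
    groups: List[List[Interval]] = []
    for chain in sorted(chains):
        if groups and chain[0] - max(e for _, e in groups[-1]) <= scaffold_jump:
            groups[-1].append(chain)
        else:
            groups.append([chain])
    scaffolds = []
    for group in groups:
        mass = sum(end - start for start, end in group)
        if mass >= scaffold_mass:
            scaffolds.append((group[0][0], max(end for _, end in group)))
    return scaffolds


def interval_gap(a: Interval, b: Interval) -> int:
    """Distance between two intervals (0 if they overlap or touch)."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def filter_by_scaffolds(mappings: List[Interval],
                        scaffolds: List[Interval],
                        scaffold_dist: int = 100_000) -> List[Interval]:
    """Keep mappings that lie within scaffold_dist of any scaffold."""
    return [m for m in mappings
            if any(interval_gap(m, s) <= scaffold_dist for s in scaffolds)]


chains = [(0, 6_000), (50_000, 56_000), (1_000_000, 1_002_000)]
scaffolds = build_scaffolds(chains)
print(scaffolds)
print(filter_by_scaffolds([(60_000, 61_000), (400_000, 401_000)], scaffolds))
```

In this toy input, the first two chains form one scaffold, the isolated 2kb chain is dropped for insufficient mass, and only the mapping close to the surviving scaffold is retained.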
Sequence Indexing
wfmash provides a progress log that estimates time to completion.
This depends on determining the total query sequence length.
To prevent lags when starting a mapping process, users should apply samtools faidx to index query and target FASTA sequences.
The .fai indexes are then used to quickly compute the sum of query lengths.
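To show why the index makes this fast, the sketch below sums sequence lengths from .fai lines without reading the FASTA itself. It relies on the standard .fai layout, where the second tab-separated column is the sequence length; the function name and the example lines are made up for illustration.

```python
from typing import Iterable


def total_length_from_fai(lines: Iterable[str]) -> int:
    """Sum the sequence lengths (second column) of a .fai index."""
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            total += int(fields[1])
    return total


example_fai = [
    "chr1\t1000\t6\t60\t61\n",
    "chr2\t2500\t1030\t60\t61\n",
]
print(total_length_from_fai(example_fai))  # sums 1000 + 2500
```

For a real file, the same function can be given an open file handle, for example `with open("query.fa.gz.fai") as fh: total_length_from_fai(fh)`.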
Installation
Static binaries
We provide static builds of wfmash releases targeted at the x86-64-v3 instruction set.
Bioconda
wfmash recipes for Bioconda are available at https://anaconda.org/bioconda/wfmash.
To install the latest version using Conda execute:
conda install -c biocon
