AnchorWave

Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism and whole-genome duplication variation

Description

AnchorWave (Anchored Wavefront Alignment) identifies collinear regions via conserved anchors (full-length CDS and full-length exons are currently implemented) and breaks collinear regions into shorter fragments, i.e., anchor and inter-anchor intervals. By performing sensitive sequence alignment for each shorter interval with a 2-piece affine gap cost strategy and merging the results, AnchorWave generates a whole-genome alignment for each collinear block. AnchorWave implements commands to guide collinear block identification with or without chromosomal rearrangements and provides options to use known polyploidy levels or whole-genome duplications to inform alignment.

Principle of the AnchorWave process

AnchorWave takes the reference genome sequence and gene annotation in GFF3 format as input and extracts reference full-length coding sequences (CDS) to use as anchors. A splice-aware alignment program (minimap2 and GMAP have been tested) is used to lift over the start and end positions of the reference full-length CDS to the query genome (step 1). AnchorWave then identifies collinear anchors using one of three user-specified algorithm options (step 2) and uses the WFA and minimap2 algorithms to align each anchor and inter-anchor interval (step 4). Some anchor/inter-anchor regions cannot be aligned using the standard approach due to high memory and computational time costs. For these, AnchorWave either identifies novel anchors within long inter-anchor regions (step 3) or, for regions that cannot be split by novel anchors, aligns them using the ksw_extd2 function implemented in minimap2 or a reimplemented sliding window approach (step 4). AnchorWave concatenates the base-pair sequence alignments of all anchor and inter-anchor regions and outputs the alignment in MAF format (step 5).

Installation
Usage
Tips for following analysis
Guidelines
FAQ
Contact
Funding
Citation

Installation

Installation from source code

Dependencies

GNU GCC >= 7.0
CMake >= 3.0
minimap2 or GMAP
Operating system: Linux or macOS
Memory: > 20 GB

To take advantage of modern CPU instructions for faster alignment, please refer to the advanced installation document.
If you are working on a machine with an ARM CPU, for example a Mac with an M1/M2 CPU, please also refer to the advanced installation document.
If you are using an older x86_64 CPU without SSE4.1 but with SSE2, please also refer to the advanced installation document.

Compile

git clone https://github.com/baoxingsong/anchorwave.git
cd anchorwave
cmake ./
make

This produces an executable file named anchorwave. The code has been tested under Ubuntu 20.2 and CentOS 7 with Intel/AMD CPUs. It should work well on other Red Hat or Debian based Linux distributions.

Installation using conda

conda install -c bioconda -c conda-forge anchorwave

Installation using Docker

Compile using your local docker with the Dockerfile in this package:
docker build -f docker/Dockerfile -t anchorwave ./
Test the installation:
docker run -it anchorwave anchorwave
docker run -it anchorwave anchorwave gff2seq

Usage

In general, four commands are needed to run the whole pipeline:

1. extract CDS
2. align CDS to the reference genome
3. align CDS to the query genome
4. perform genome alignment

Note

AnchorWave uses prior information about whole-genome duplication, chromosome rearrangement, etc. to guide the genome alignment; it cannot infer those evolutionary events automatically. Users need to know this information before running AnchorWave and tune the parameters accordingly. Drawing some plots may help you decide whether to use genoAli or proali. If genoAli is appropriate, consider whether to set -IV. If proali is appropriate, consider how to set the values of -R, -Q and possibly -e. Refer to guideline.pdf or #16 for how to do that.
To align highly diverse genomes, command 4 might take a couple of CPU days. If you have a large amount of memory available, this step can be parallelized. Without heavy parameter tuning, for highly diverse genomes, AnchorWave uses ~20 GB of memory with a single thread. Each additional thread costs an extra ~10 GB of memory. If the two genomes have very similar sequences, the time and memory costs are significantly lower.
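
As a worked example of the memory figures in the note above, the following sketch turns them into a rough estimate for a given thread count. The function name estimate_mem_gb is hypothetical and not part of AnchorWave; the numbers are only the approximate values quoted for highly diverse genomes, so actual usage may differ.

```shell
# Rough estimate only: ~20 GB for the first thread plus ~10 GB for each
# additional thread, as quoted in the note for highly diverse genomes.
# estimate_mem_gb is a hypothetical helper, not an AnchorWave command.
estimate_mem_gb() {
    threads=$1
    if [ "$threads" -lt 1 ]; then
        echo "threads must be >= 1" >&2
        return 1
    fi
    echo $(( 20 + 10 * (threads - 1) ))
}

estimate_mem_gb 4   # prints 50
```

Comparing this estimate with the memory available on the machine gives a starting point for choosing the number of threads for command 4.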

Options:

Program anchorwave
Usage: anchorwave <command> [options]
Commands:
    gff2seq     get the longest full-length CDS for each gene
    genoAli     whole chromosome global alignment and variant calling
    proali      genome alignment with relocation variation, chromosome fusion or whole genome duplication
    ali         perform global alignment for a pair of sequences using the 2-piece affine gap cost strategy

Lift over the reference full-length CDS start/stop coordinates to the query genome (commands 1-3)

When extracting full-length CDS, if a gene has multiple transcript isoforms, only the transcript with the longest full-length CDS is used.
gff2seq outputs the concatenated full-length CDS sequences.
Options of the gff2seq command:

Usage: anchorwave gff2seq -i inputGffFile -r inputGenome -o outputSequences 
Options
 -h        produce help message
 -i FILE   reference genome annotation in GFF/GTF format
 -r FILE   reference genome sequence in fasta format
 -o FILE   output file of the longest CDS/exon for each gene
 -x        use exon records instead of CDS from the GFF file
 -m INT    minimum exon length to output (default: 20)
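
The longest-isoform rule described above can be illustrated with a small sketch. It assumes a hypothetical tab-separated table of gene ID, transcript ID and CDS length; this is not a file that gff2seq reads or writes, and the tie-breaking shown here (first transcript wins) is an assumption, not documented AnchorWave behaviour.

```shell
# Hypothetical input: tab-separated gene ID, transcript ID, CDS length.
# For each gene, keep the transcript with the longest CDS.
# Assumption: on ties, the transcript listed first is kept.
longest_isoform_per_gene() {
    awk -F '\t' '
        !($1 in best_len) || $3 > best_len[$1] {
            best_len[$1] = $3 + 0
            best_tx[$1] = $2
        }
        END {
            for (gene in best_tx)
                print gene "\t" best_tx[gene] "\t" best_len[gene]
        }
    ' "$1" | sort
}
```

Running the function on such a table prints one line per gene, which mirrors the one-sequence-per-gene output of gff2seq.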

Example

Data

Arabidopsis thaliana Col-0 reference genome and GFF3 annotation file from https://www.arabidopsis.org/
Arabidopsis thaliana Ler-0 accession assembly from http://www.pnas.org/content/113/28/E4052
We tested minimap2 and GMAP for the lift over; any other splice-aware sequence alignment program should work, as long as it can generate alignments in SAM format.

Using minimap2 for lift over

minimap2 does not handle short CDS well, which causes errors when lifting over anchors to the query genome. To minimize this side effect, AnchorWave ignores short CDS records (controlled by the -m parameter).

anchorwave gff2seq -i TAIR10_GFF3_genes.gff -r tair10.fa -o cds.fa
minimap2 -x splice -t 10 -k 12 -a -p 0.4 -N 20 tair10.fa cds.fa > ref.sam
minimap2 -x splice -t 10 -k 12 -a -p 0.4 -N 20 ler.fa cds.fa > ler.sam
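
Before moving on, it can be useful to check how many CDS sequences were actually placed on each genome. The sketch below counts distinct query names with at least one mapped record in a SAM file, using only the standard SAM layout (header lines start with @, field 2 is the FLAG, bit 0x4 means unmapped). The function name count_mapped_cds is hypothetical and this check is not part of AnchorWave.

```shell
# Count distinct CDS names that have at least one mapped record in a SAM file.
# count_mapped_cds is a hypothetical helper, not an AnchorWave command.
count_mapped_cds() {
    awk -F '\t' '
        /^@/ { next }                          # skip SAM header lines
        int($2 / 4) % 2 == 0 { seen[$1] = 1 }  # FLAG bit 0x4 unset: mapped
        END { n = 0; for (name in seen) n++; print n }
    ' "$1"
}
```

Calling count_mapped_cds on ref.sam and then on ler.sam and comparing the two counts with the number of sequences in cds.fa gives a quick impression of how many anchors survived the lift over.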

Alternatively using GMAP for lift over

anchorwave gff2seq -i TAIR10_GFF3_genes.gff -r tair10.fa -m 0 -o cds.fa

gmap_build --dir=./tair10 --genomedb=tair10 tair10.fa
gmap_build --dir=./ler --genomedb=ler ler.fa

gmap -t 10 -A -f samse -d tair10 -D tair10/ cds.fa > gmap_tair10.sam
gmap -t 10 -A -f samse -d ler -D ler/ cds.fa > gmap_ler.sam

Genome alignment without chromosomal rearrangement (an option of command 4)

This module performs base-pair resolution sequence alignment of two genomes. Each query chromosome sequence is aligned against the reference chromosome with the same name.
The output is an end-to-end sequence alignment for each whole chromosome in MAF format.
A variant calling result in VCF format, derived from the end-to-end alignment, can also be created.
Please make sure the chromosomes of the reference and query genomes are named in the same way. Only chromosomes with the same name are aligned.

grep ">" ler.fa
grep ">" Col.fa
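
Comparing the two grep listings by eye gets tedious for assemblies with many sequences. The following sketch automates the comparison and reports names that occur in only one of the two FASTA files. The function names are hypothetical, and treating the text between ">" and the first whitespace as the sequence name is an assumption about how names are matched.

```shell
# List sequence names, sorted and de-duplicated.
# Assumption: the name is the text after ">" up to the first whitespace.
fasta_names() {
    grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' | sort -u
}

# Print names present in only one file, prefixed with ref-only or query-only.
# No output means both files contain the same set of names.
compare_fasta_names() {
    ref_names=$(mktemp)
    query_names=$(mktemp)
    fasta_names "$1" > "$ref_names"
    fasta_names "$2" > "$query_names"
    comm -23 "$ref_names" "$query_names" | awk '{ print "ref-only\t" $0 }'
    comm -13 "$ref_names" "$query_names" | awk '{ print "query-only\t" $0 }'
    rm -f "$ref_names" "$query_names"
}
```

Call compare_fasta_names with the reference FASTA first and the query FASTA second. Any reported name points to a chromosome that would have no same-named partner in the other genome.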

Anchor lift over using GMAP or minimap2

Anchor lift over using minimap2

anchorwave gff2seq -i TAIR10_GFF3_genes.gff -r tair10.fa -o cds.fa
minimap2 -x splice -t 10 -k 12 -a -p 0.4 -N 20 tair10.fa cds.fa > ref.sam
minimap2 -x splice -t 10 -k 12 -a -p 0.4 -N 20 ler.fa cds.fa > ler.sam

Anchor lift over using GMAP

anchorwave gff2seq -i TAIR10_GFF3_genes.gff -r tair10.fa -m 0 -o cds.fa
gmap_build --dir=./tair10 --genomedb=tair10 tair10.fa
gmap_build --dir=./ler --genomedb=ler ler.fa
gmap -t 10 -A -f samse -d tair10 -D tair10/ cds.fa > ref.sam
gmap -t 10 -A -f samse -d ler -D ler/ cds.fa > ler.sam
