FastGA: A Fast Genome Aligner

Authors: Gene Myers & Chenxi Zhou
First: May 10, 2023
Last: December 30, 2025

  • FastGA: Compare two genomes or a genome against itself and output a .1aln, .paf, or .psl file of all alignments found.

  • Sub-Process Routines

    • FAtoGDB: Convert a FASTA or ONEcode sequence file into a genome database (GDB)
    • GIXmake: Build a genome index (GIX) for a given GDB
    • ALNtoPAF: Stream PAF formatted alignments for a given .1aln file
    • ALNtoPSL: Stream PSL formatted alignments for a given .1aln file
  • Viewing Utilities

    • GDBshow: Display select contigs or substrings thereof from a GDB
    • GDBstat: Display various statistics and histograms of the scaffolds & contigs in a GDB
    • ANOshow: Display annotation intervals of select contigs or subranges thereof from an ANO file
    • ANOstat: Display various statistics and histograms of the intervals in an ANO file
    • GIXshow: Display range of a GIX
    • ALNshow: Display selected alignments in a .1aln file in a variety of forms
    • ALNplot: Display alignments in a .1aln or .paf file in a static collinear plot
  • Additional Utilities

    • GDBtoFA: Convert a GDB back to the FASTA or ONEcode sequence file it was derived from
    • BEDtoANO: Convert a BED formatted file to a .1ano-file
    • ANOtoBED: Convert an ANO file to a BED file
    • PAFtoALN: Convert a PAF formatted file with X-CIGAR strings to a .1aln file
    • PAFtoPSL: Convert a PAF formatted file with X-CIGAR strings to a .psl file
    • GIXrm: Remove GDBs and GIXs including their hidden parts
    • GIXcp: Copy GDBs and GIXs including their hidden parts as an ensemble
    • GIXmv: Move GDBs and GIXs including their hidden parts as an ensemble
    • ALNchain: Alignment filtering by construction of local chains
    • ALNreset: Reset a .1aln file's internal references to the GDB(s) it was computed from
  • C-Library for Accessing .1aln Files

Version 1.5 (December 30, 2025) ONEcode ANO Files

The addition of soft masking in V1.3 has led to the development of a ONEcode version of a BED file, called an ANO- or .1ano-file, that records a collection of (oriented) intervals on a genome, along with a possible label and/or score for each annotation. There are new routines ANOshow, ANOstat, ANOtoBED, and BEDtoANO for showing the contents of a .1ano file, giving summary statistics about a .1ano file, and converting between BED files and .1ano files, respectively.

This change augments the interface to GIXmake, FastGA, GDBtoFA, and GDBshow, and has changed the operation of FAtoGDB. Previously, if FAtoGDB detected an "implicit" mask in the source FASTA file, indicated by masked regions being lower case and unmasked regions upper case, then this mask was recorded in the GDB. Now it is recorded in a separate .1ano file that has the same location and root name as the target GDB. Further, GIXmake has been upgraded so that you can specify multiple .1ano files on the command line, and the GIX will be masked with the union of these. Note carefully that the only way to change the mask encoded in a GIX is to rebuild the GIX. By default FastGA now does not use the soft mask in the GIX's of the genomes, but the -M flag instructs it to do so. You can also now follow each FastGA genome argument with a list of masks to apply (syntactically a primary argument beginning with #), in which case the GIX's will be rebuilt with the specified mask(s) and FastGA will soft mask accordingly. Lastly, GDBtoFA and GDBshow now take an optional #-sign mask argument that, if present, masks the result accordingly.
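To make "masked with the union of these" concrete, here is a minimal Python sketch of taking the union of several masks on one contig. It is an illustration only, not GIXmake's code: the function name, the half-open `(start, end)` interval convention, and the sample coordinates are all assumptions made for this example.

```python
def union_intervals(*masks):
    """Merge several lists of half-open (start, end) intervals into one
    sorted list of non-overlapping intervals (hypothetical helper)."""
    merged = []
    for start, end in sorted(iv for mask in masks for iv in mask):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Two made-up masks on the same contig.
repeats = [(10, 50), (200, 260)]
low_complexity = [(40, 80), (300, 310)]
print(union_intervals(repeats, low_complexity))
# prints [(10, 80), (200, 260), (300, 310)]
```

A base is masked in the result if it is masked in at least one of the inputs, which is why overlapping or touching intervals collapse into one.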

Version 1.4 (November 1, 2025) ONEaln C-Library

A C-library of routines designed to make it easy to read and access the contents of a .1aln file has been added. The interface is described here. This library of routines is in ONEaln.c with the interface declared in ONEaln.h. Caution: several of the modules used by FastGA must also be compiled in, namely, GDB.[ch], ONElib.[ch], alncode.[ch], align.[ch], and gene_core.[ch]. See the make command for ONEalnTEST in the Makefile.

Version 1.3 (July 23, 2025) Soft Masking and Log Files

Soft masking is now supported and taken advantage of by the FastGA suite. Soft masking is assumed to be specified in the input FASTA files by denoting masked sequence in lower-case and unmasked sequence in upper-case. Such masks are recorded in our GDB's and in a suitable form within our GIX indices. The latter required a slight modification to the GIX data structure. Old GIX's are still recognized and supported, but if you want masking you must rebuild any GDB's and GIX's that were produced previously.
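The lower-case convention can be illustrated with a short Python sketch that recovers mask intervals from a sequence string. This is not the FAtoGDB implementation; the function name and the half-open interval convention are assumptions chosen for the example.

```python
import re

def soft_mask_intervals(seq):
    """Return half-open (start, end) intervals, one per run of
    lower-case letters in seq (hypothetical helper)."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]

print(soft_mask_intervals("ACGTacgtnnACGTtt"))
# prints [(4, 10), (14, 16)]
```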

Additionally,

  • GIXmake has been substantially improved to use much less memory and make better use of threads.

  • The IO performance of ALNtoPAF and ALNtoPSL has also been substantially improved.

  • A -L log file option has been added to support HPC cluster usage.

We are seeking similar improvements in FastGA proper, better handling of satellitic repeats, and higher sensitivity for distant genomes without resorting to using LastZ as a subroutine.

Overview

FastGA searches for all local DNA alignments between two high quality genomes. The core assumption is that the genomes are nearly complete, involving at most several thousand contigs with a sequence quality of Q40 or better. Based on a novel adaptive seed finding algorithm and the first wave-based local aligner developed for daligner (2012), the tool can, for example, compare two 2Gbp bat genomes, finding almost all regions over 100bp that are 70% or more similar, in about 5.0 minutes of wall clock time on my MacPro with 8 cores (about 28 CPU minutes). Moreover, it uses a trace point concept to record all the found alignments in a compressed and indexable ONEcode file in a very space-efficient manner, e.g. just 44.5MB for over 635,000 local alignments in our running example. These trace point encodings of the alignments can then be swiftly translated into .psl or .paf format on demand with programs provided here.
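The text does not spell out the trace point concept, so the following Python sketch shows only the general idea as it is commonly described for daligner-style aligners. Instead of storing every edit, one stores, for each fixed-size panel of the first sequence, how many bases of the second sequence align to that panel and how many differences occur in it. The function name, the gapped-string input, and the spacing value are assumptions for illustration; this is not the .1aln encoding itself.

```python
def trace_points(a_row, b_row, a_start=0, spacing=100):
    """Summarise a gapped alignment as (b_bases, diffs) pairs, one per
    panel of A. a_row and b_row are equal-length strings with '-' for
    gaps; a panel closes when the A coordinate hits a multiple of spacing."""
    assert len(a_row) == len(b_row)
    points = []
    a_pos, b_len, diffs, cols = a_start, 0, 0, 0
    for a, b in zip(a_row, b_row):
        cols += 1
        if a != "-":
            a_pos += 1
        if b != "-":
            b_len += 1
        if a != b:
            diffs += 1          # mismatch, insertion, or deletion
        if a != "-" and a_pos % spacing == 0:
            points.append((b_len, diffs))
            b_len, diffs, cols = 0, 0, 0
    if cols:                    # final, partial panel
        points.append((b_len, diffs))
    return points

# Tiny example with an artificially small spacing of 4.
print(trace_points("ACGTAC-GTA", "ACCTACTG-A", spacing=4))
```

Because each panel is short, the exact alignment inside it can be recomputed quickly when needed, which is what lets a compact summary like this stand in for a full edit script.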

Using FastGA can be as simple as calling it with two FASTA files containing genome assemblies, where each entry is a scaffold with runs of N's separating, and potentially giving the estimated distance between, the contigs thereof. By default a PAF file encoding all the local alignments found between the two genomes is streamed to the standard output. In the subdirectory EXAMPLE you will find a pair of sample input files, an output file, and a text file, sample_session, capturing a session that serves to illustrate the use of FastGA. Try it for yourself.
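As a small illustration of the scaffold convention just described, the Python sketch below splits one scaffold sequence into contigs at runs of N and reports each run's length as the estimated gap. It is not FastGA's parser; the function name and return shape are assumptions for this example.

```python
import re

def split_scaffold(seq):
    """Return (contigs, gaps) for one scaffold: contigs are half-open
    (start, end) intervals of non-N sequence, gaps are the lengths of
    the N runs between consecutive contigs (hypothetical helper)."""
    contigs = [(m.start(), m.end()) for m in re.finditer(r"[^Nn]+", seq)]
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(contigs, contigs[1:])]
    return contigs, gaps

print(split_scaffold("ACGTNNNNNGGCCNNAT"))
# prints ([(0, 4), (9, 13), (15, 17)], [5, 2])
```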

Under the surface, a number of intermediate steps take place. First, the FASTA files are converted to genome databases with extension .1gdb, each consisting of a ONEcode binary file and an associated hidden file containing the DNA sequences in 2-bit compressed form. This allows FastGA to randomly access contigs, and to do so with four times less IO and no text parsing. Second, a genome index with extension .gidx, which is basically a truncated suffix array, is built for each genome. One of the things that makes FastGA fast is that it compares these two indices against each other directly rather than looking up sequences of one genome in the index of the other. Third, FastGA records all the alignments it finds in a ONEcode binary file, referred to here as an ALN-formatted file with extension .1aln, that uses a very space efficient trace point encoding of each alignment. Finally, in linear time, this trace point representation is converted into the desired PAF output. Note carefully that one has the option to keep the results in the very disk efficient ALN format, and then convert it to any of PAF, PSL, or other desired alignment format on demand. The diagram immediately below summarizes and details the data flow just described.
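The "four times less IO" figure follows from storing each base in 2 bits instead of an 8-bit ASCII character. The Python sketch below shows one way to pack four bases per byte and read them back. The A/C/G/T to 0/1/2/3 mapping and the high-bits-first order are assumptions for illustration and need not match the GDB's actual layout.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed mapping
BASES = "ACGT"

def pack(seq):
    """Pack an ACGT string into bytes, four bases per byte,
    with the first base in the two highest bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(out)

def unpack(data, n):
    """Recover the first n bases from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3]
                   for i in range(n))

packed = pack("ACGTACGTAC")
print(len(packed), unpack(packed, 10))
# prints 3 ACGTACGTAC
```

Base `i` always lives at byte `i // 4`, so any substring can be fetched by offset without scanning or parsing text. A 2-bit code has no room for case or for N, so soft masks and scaffold gaps have to be kept separately from the packed sequence.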

Fig. 1

While the entire set of blue shadowed processes can be fired off by simply calling FastGA, we provide routines to perform each step under direct control (labeled in blue along dataflow arrows). In addition, we provide utilities labeled in brown that allow one to examine the intermediate GDB, GIX, and ALN files. An invocation of FastGA with the -k option, or direct application of the sub-process routines, creates persistent GDB and GIX entities that can be reused, saving time if a given genome is to be compared repeatedly. The GDB and GIX items are each actually an ensemble, consisting of a proxy file and a number of hidden files, so we provide the utilities GIXmv, GIXcp, and GIXrm to manipulate these as an ensemble. Finally, we provide the utility GDBtoFA that inverts the process of converting a FASTA file into a GDB, giving you the option of removing all your FASTA files, compressed or not, in favor of the space efficient GDB representation.

FastGA uses the ONEcode data encoding framework for both its GDB files and the ALN files that encode all the alignments found. As such, FastGA also supports as input ONEcode sequence files that encode a genome, in addition to the usual FASTA format. So both FAtoGDB and GDBtoFA (despite their names) recognize and support ONEcode SEQ files as well as FASTA.

There are three convention
