GALBA User Guide


Contact for the GitHub repository of GALBA at https://github.com/Gaius-Augustus/GALBA:

Katharina J. Hoff, University of Greifswald, Germany, katharina.hoff@uni-greifswald.de, +49 3834 420 4624, @katharinahoff.bsky.social, @KatharinaHoff@fosstodon.org

Authors of GALBA

Tomas Bruna [e], Heng Li [c, d], Joseph Guhlin [f], Daniel Honsel [g], Steffen Herbold [h], Natalia Nenasheva [a, b], Matthis Ebel [a, b], Lars Gabriel [a, b], Mario Stanke [a, b], and Katharina J. Hoff [a, b]

[a] University of Greifswald, Institute for Mathematics and Computer Science, Walther-Rathenau-Str. 47, 17489 Greifswald, Germany

[b] University of Greifswald, Center for Functional Genomics of Microbes, Felix-Hausdorff-Str. 8, 17489 Greifswald, Germany

[c] Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA

[d] Harvard Medical School, 10 Shattuck St, Boston, MA 02215, USA

[e] Joint Genome Institute, Lawrence Berkeley National Laboratory, USA

[f] Genomics Aotearoa and Laboratory for Evolution and Development, Department of Biochemistry, University of Otago, Dunedin, 9016, New Zealand

[g] Institute of Computer Science, University of Göttingen, 37077, Göttingen, Germany

[h] Faculty for Computer Science and Mathematics, University of Passau, 94032, Passau, Germany

Acknowledgements

GALBA code was derived from the BRAKER code, where a similar pipeline for using GenomeThreader with BRAKER was once published [R9]. We hereby acknowledge the contributions of all BRAKER authors (in particular Simone Lange, Anica Hoppe, Alexandre Lomsadze, and Mark Borodovsky) to the code that GALBA was derived from, and we are grateful for funding for BRAKER development by the National Institutes of Health (NIH) grant GM128145, which indirectly also supported development of GALBA.

Ethan Tolman and Paul Frandsen first mentioned the idea of the DIAMOND filter for GALBA gene sets in their preprint (https://doi.org/10.1101/2023.12.11.569651).

Related Software

GALBA code was derived from BRAKER, a fully automated pipeline for predicting genes in the genomes of novel species with RNA-Seq data and a large-scale database of protein sequences (which need not be closely related to the target species) with GeneMark-ES/ET/EP/ETP and AUGUSTUS. BRAKER is available at https://github.com/Gaius-Augustus/BRAKER .
TSEBRA can be used to combine GALBA gene sets with BRAKER gene sets. TSEBRA is available at https://github.com/Gaius-Augustus/TSEBRA .

Authors
Acknowledgements
What is GALBA?
Keys to successful gene prediction
Singularity Image
Installation
- Supported software versions
- GALBA
Running GALBA
- GALBA pipeline modes
- Description of selected GALBA command line options
Output of GALBA
Example data
Accuracy
Bug reporting
- Reporting bugs on github
- Common problems
Citing GALBA and software called by GALBA
License

What is GALBA?

The rapidly growing number of sequenced genomes requires fully automated methods for accurate gene structure annotation. Here, we provide a fully automated gene prediction pipeline that trains AUGUSTUS [R3, R4] for a novel species and subsequently predicts genes with AUGUSTUS in the genome of that species. GALBA uses the protein sequences of one or a few closely related species to generate a training gene set for AUGUSTUS with either miniprot [R1] or GenomeThreader [R2]. After training, GALBA uses the evidence from the protein-to-genome alignments during gene prediction.

:warning: Please note that the popular BRAKER [R5, R6] pipeline will very likely produce more accurate results than GALBA in small and medium-sized genomes (such as C. elegans, A. thaliana, D. melanogaster, ...). Instead of using protein sequences of only one closely related species, BRAKER is capable of using proteins from a large sequence database in which the species need not be closely related to the target species. BRAKER can also incorporate RNA-Seq data. In contrast to GALBA, BRAKER achieves high gene prediction accuracy even in the absence of an annotation of a very closely related species (and in the absence of RNA-Seq data). However, GALBA has a clear advantage in large genomes (e.g. Mus musculus, Gallus gallus, ...) if you use input proteins from a close relative. Before deciding to use GALBA, please read the Accuracy section.

Not sure which pipeline to use, GALBA or BRAKER? If you have no RNA-Seq data and the genome is large, use GALBA. Otherwise, try BRAKER first.

GALBA is named after Servius Sulpicius Galba, who ruled the Roman Empire only for a short time before he was murdered. The name seems appropriate because both BRAKER2 and the soon-to-be-published BRAKER3 in some cases achieve higher accuracy than GALBA ever will, and AI is on the rise.

Keys to successful gene prediction

Use a high-quality genome assembly. If you have a huge number of very short scaffolds in your genome assembly, those short scaffolds will likely increase runtime dramatically but will not increase prediction accuracy.
Use simple scaffold names in the genome file (e.g. >contig1 will work better than >contig1my custom species namesome putative function /more/information/ and lots of special characters %&!*(){}). Make the scaffold names in all your fasta files simple before running any alignment program.
In order to predict genes accurately in a novel genome, the genome should be masked for repeats. This avoids the prediction of false positive gene structures in repetitive and low-complexity regions. In the case of AUGUSTUS, softmasking (i.e., putting repeat regions into lower case letters and all other regions into upper case letters) leads to better results than hardmasking (i.e., replacing letters in repetitive regions by the letter N for unknown nucleotide). GALBA always treats genomes as softmasked for repeats!
Always check gene prediction results before further usage! You can, for example, use a genome browser for visual inspection of gene models in context with extrinsic evidence data. GALBA supports the generation of track data hubs for the UCSC Genome Browser with MakeHub for this purpose.
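
The advice on scaffold names and softmasking above can be checked before starting a run. The following is a minimal sketch, not part of GALBA: the function names `simplify_header` and `clean_fasta` are hypothetical, and the renaming rule (keep the first whitespace-delimited token, replace every character outside letters, digits, `_`, `.` and `-` with `_`) is just one reasonable assumption about what counts as a "simple" name. It rewrites the headers of a FASTA stream and reports the fraction of lower-case (softmasked) bases.

```python
import io
import re


def simplify_header(header: str) -> str:
    """Reduce a FASTA header line to a simple name such as '>contig1'."""
    tokens = header[1:].split()
    token = tokens[0] if tokens else "seq"  # fallback for an empty header
    # Replace anything outside a conservative character set with '_'.
    return ">" + re.sub(r"[^A-Za-z0-9_.-]", "_", token)


def clean_fasta(src, dst) -> float:
    """Copy FASTA records from src to dst with simplified headers.

    Returns the fraction of sequence characters that are lower case,
    i.e. softmasked. Hardmasked positions (upper-case N) are not counted.
    """
    seen = set()
    lower = total = 0
    for line in src:
        line = line.rstrip("\n")
        if line.startswith(">"):
            name = simplify_header(line)
            if name in seen:
                raise ValueError(f"duplicate name after simplification: {name}")
            seen.add(name)
            dst.write(name + "\n")
        elif line:
            dst.write(line + "\n")
            total += len(line)
            lower += sum(1 for c in line if c.islower())
    return lower / total if total else 0.0


if __name__ == "__main__":
    # Tiny in-memory example; real input would be opened with open(path).
    raw = io.StringIO(">contig1 some putative function %&!\nACGTacgt\n")
    out = io.StringIO()
    fraction = clean_fasta(raw, out)
    print(out.getvalue(), end="")
    print(f"softmasked fraction: {fraction:.2f}")
```

A softmasked fraction of zero suggests that the genome is unmasked or hardmasked. If scaffolds are renamed, apply the same renaming to every file that refers to them.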

Overview of running GALBA

GALBA mainly features semi-unsupervised training of AUGUSTUS supported by protein sequence evidence, with integration of extrinsic evidence in the final gene prediction step. GALBA can be used with either miniprot or GenomeThreader as the protein spliced aligner. Miniprot is our preferred aligner because it continues to undergo development, because we have put a lot of work into improving the integration of miniprot evidence (e.g. miniprothint), and because it is faster than GenomeThreader. We highly recommend using miniprot with GALBA. GenomeThreader is only included in GALBA for internal benchmarking purposes; we stopped testing GenomeThreader functionality a while ago and do not include GenomeThreader in our containers. The GALBA pipeline with miniprot works like this (Figure 1 from the GALBA publication at https://link.springer.com/article/10.1186/s12859-023-05449-z):

galba-miniprot[fig1]

Figure 1: training AUGUSTUS on the basis of spliced alignment information from proteins of a closely related species against the target genome
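
As a rough illustration of how such a run might be launched from a script, the sketch below only assembles a command line and prints it. It assumes that `galba.pl` is on the `PATH` and accepts the options `--genome`, `--prot_seq` and `--threads`; please verify these against the command line options documented for your GALBA version before use. The helper `build_galba_command` and the file names are hypothetical.

```python
import shlex


def build_galba_command(genome: str, proteins: str, threads: int = 1) -> list:
    """Assemble an argument list for a GALBA run with the default aligner.

    Assumption: galba.pl accepts --genome, --prot_seq and --threads.
    Check the option descriptions of your installed version.
    """
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return [
        "galba.pl",
        f"--genome={genome}",      # softmasked genome FASTA
        f"--prot_seq={proteins}",  # proteins of one or a few close relatives
        f"--threads={threads}",
    ]


if __name__ == "__main__":
    # Hypothetical file names; nothing is executed here.
    cmd = build_galba_command("genome.fa", "proteins.fa", threads=8)
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.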
