BRAKER
BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes
BRAKER User Guide
News
A recording of the first BGA23 workshop session on BRAKER is available. If you learn well from videos, consider watching it: https://www.youtube.com/watch?v=UXTkJ4mUkyg
BRAKER3 is now in https://usegalaxy.eu/
Contacts for Repository
TSEBRA & BRAKER3 related:
- Lars Gabriel, University of Greifswald, Germany, lars.gabriel@uni-greifswald.de
BRAKER & AUGUSTUS related:
- Katharina J. Hoff, University of Greifswald, Germany, katharina.hoff@uni-greifswald.de, +49 3834 420 4624
GeneMark related:
- Mark Borodovsky, Georgia Tech, U.S.A., borodovsky@gatech.edu
- Tomas Bruna, Joint Genome Institute, U.S.A., bruna.tomas@gmail.com
- Alexandre Lomsadze, Georgia Tech, U.S.A., alexandre.lomsadze@bme.gatech.edu
Core Authors of BRAKER
<b id="aff1">[a]</b> University of Greifswald, Institute for Mathematics and Computer Science, Walther-Rathenau-Str. 47, 17489 Greifswald, Germany
<b id="aff2">[b]</b> University of Greifswald, Center for Functional Genomics of Microbes, Felix-Hausdorff-Str. 8, 17489 Greifswald, Germany
<b id="aff3">[c]</b> Joint Georgia Tech and Emory University Wallace H Coulter Department of Biomedical Engineering, 30332 Atlanta, USA
<b id="aff4">[d]</b> School of Computational Science and Engineering, 30332 Atlanta, USA
<b id="aff5">[e]</b> Moscow Institute of Physics and Technology, Moscow Region 141701, Dolgoprudny, Russia
![braker2-team-2[fig10]](docs/figs/mario.png)
![braker2-team-1[fig11]](docs/figs/alex-katharina-tomas.png)
![braker2-team-3[fig12]](docs/figs/lars.jpg)
![braker2-team-4[fig13]](docs/figs/mark.png)
Figure 1: Current BRAKER authors, from left to right: Mario Stanke, Alexandre Lomsadze, Katharina J. Hoff, Tomas Bruna, Lars Gabriel, and Mark Borodovsky. We acknowledge that a larger community of scientists contributed to the BRAKER code (e.g. via pull requests).
Funding
The development of BRAKER1, BRAKER2, and BRAKER3 was supported by the National Institutes of Health (NIH) [GM128145 to M.B. and M.S.]. Development of BRAKER3 was partially funded by Project Data Competency granted to K.J.H. and M.S. by the government of Mecklenburg-Vorpommern, Germany.
Related Software
The Transcript Selector for BRAKER (TSEBRA) is available at https://github.com/Gaius-Augustus/TSEBRA .
GeneMark-ETP, one of the gene finders at the core of BRAKER, is available at https://github.com/gatech-genemark/GeneMark-ETP .
AUGUSTUS, the second gene finder at the core of BRAKER, is available at https://github.com/Gaius-Augustus/Augustus .
GALBA, a BRAKER pipeline spin-off for using Miniprot or GenomeThreader to generate training genes, is available at https://github.com/Gaius-Augustus/GALBA .
Contents
- Authors
- Funding
- What is BRAKER?
- Keys to successful gene prediction
- Overview of modes for running BRAKER
- Container
- Installation
- Running BRAKER
- Output of BRAKER
- Example data
- Starting BRAKER on the basis of previously existing BRAKER runs
- Bug reporting
- Citing BRAKER and software called by BRAKER
- License
What is BRAKER?
The rapidly growing number of sequenced genomes requires fully automated methods for accurate gene structure annotation. With this goal in mind, we have developed BRAKER1<sup name="a1">R1</sup><sup name="a0">R0</sup>, a combination of GeneMark-ET <sup name="a2">R2</sup> and AUGUSTUS <sup name="a3">R3, </sup><sup name="a4">R4</sup>, that uses genomic and RNA-Seq data to automatically generate full gene structure annotations in novel genomes.
However, the quality of the RNA-Seq data available for annotating a novel genome is variable, and in some cases no RNA-Seq data is available at all.
BRAKER2 is an extension of BRAKER1 which allows for fully automated training of the gene prediction tools GeneMark-ES/ET/EP/ETP <sup name="a14">R14, </sup><sup name="a15">R15, <sup name="a17">R17, </sup></sup><sup name="g1">F1</sup> and AUGUSTUS from RNA-Seq and/or protein homology information, and that integrates the extrinsic evidence from RNA-Seq and protein homology information into the prediction.
In contrast to other available methods that rely on protein homology information, BRAKER2 reaches high gene prediction accuracy even in the absence of the annotation of very closely related species and in the absence of RNA-Seq data.
BRAKER3 is the latest pipeline in the BRAKER suite. It uses RNA-Seq and protein data in a fully automated pipeline to train GeneMark-ETP and AUGUSTUS and to predict highly reliable genes with them. The result of the pipeline is the combined gene set of both gene prediction tools, which contains only genes with very high support from extrinsic evidence.
In this user guide, we will refer to BRAKER1, BRAKER2, and BRAKER3 simply as BRAKER because they are executed by the same script (braker.pl).
Keys to successful gene prediction
- Use a high-quality genome assembly. If you have a huge number of very short scaffolds in your genome assembly, those short scaffolds will likely increase runtime dramatically but will not increase prediction accuracy.
- Use simple scaffold names in the genome file (e.g. `>contig1` will work better than `>contig1my custom species namesome putative function /more/information/ and lots of special characters %&!*(){}`). Make the scaffold names in all your fasta files simple before running any alignment program.
- In order to predict genes accurately in a novel genome, the genome should be masked for repeats. This will avoid the prediction of false positive gene structures in repetitive and low-complexity regions. Repeat masking is also essential for mapping RNA-Seq data to a genome with some tools (other RNA-Seq mappers, such as HISAT2, ignore masking information). In case of GeneMark-ES/ET/EP/ETP and AUGUSTUS, softmasking (i.e. putting repeat regions into lower case letters and all other regions into upper case letters) leads to better results than hardmasking (i.e. replacing letters in repetitive regions by the letter `N` for unknown nucleotide).
- Many genomes have gene structures that will be predicted accurately with standard parameters of GeneMark-ES/ET/EP/ETP and AUGUSTUS within BRAKER. However, some genomes have clade-specific features, e.g. a special branch point model in fungi, or non-standard splice-site patterns. Please read the options section in order to determine whether any of the custom options may improve gene prediction accuracy in the genome of your target species.
- Always check gene prediction results before further usage! For example, you can use a genome browser to inspect gene models visually in the context of the extrinsic evidence data. BRAKER supports the generation of track data hubs for the UCSC Genome Browser with MakeHub for this purpose.
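The advice above on scaffold names and masking can be checked before starting a run. The following is a minimal Python sketch and is not part of BRAKER. The function names `simplify_fasta_headers` and `softmasked_fraction`, the `contig` prefix, and the sample records are hypothetical and chosen only for illustration. The sketch assumes the FASTA content fits in memory as a list of lines. It does not perform repeat masking itself; it only reports how much of the sequence is already in lower case.

```python
from typing import Dict, Iterable, List, Tuple


def simplify_fasta_headers(
    lines: Iterable[str], prefix: str = "contig"
) -> Tuple[List[str], Dict[str, str]]:
    """Rename each FASTA record to <prefix><n> (hypothetical helper).

    Returns the rewritten lines and a map from new name to the original
    header text, so the original names can be restored later.
    """
    out: List[str] = []
    name_map: Dict[str, str] = {}
    count = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            count += 1
            new_name = f"{prefix}{count}"
            name_map[new_name] = line[1:]  # original header without '>'
            out.append(">" + new_name)
        elif line:
            out.append(line)  # sequence lines are kept unchanged
    return out, name_map


def softmasked_fraction(lines: Iterable[str]) -> float:
    """Fraction of sequence characters that are lower case (softmasked)."""
    masked = 0
    total = 0
    for raw in lines:
        if raw.startswith(">"):
            continue  # skip header lines
        seq = raw.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


if __name__ == "__main__":
    # Made-up example records, not real data.
    records = [
        ">contig1 my custom species name %&!*(){}",
        "ACGTacgtNN",
        ">scaffold_2|putative",
        "acgtACGT",
    ]
    new_lines, name_map = simplify_fasta_headers(records)
    print("\n".join(new_lines))
    print(name_map)
    print(f"softmasked fraction: {softmasked_fraction(new_lines):.2f}")
```

If the headers are renamed this way, the same renaming has to be applied to every FASTA file that refers to those scaffolds, and the returned map should be saved (for example as a tab-separated file) so the original names can be recovered afterwards. A softmasked fraction of zero suggests that the genome has not been softmasked yet, or that it was hardmasked with `N` instead.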
Overview of modes for running BRAKER
BRAKER mainly features semi-unsupervised, extrinsic evidence data (RNA-Seq and/or protein spliced alignment information) supported training of GeneMark-ES/ET/EP/ETP<sup name="g1">[F1]</sup> and subsequent training of AUGUSTUS with integration of extrinsic evidence in the final gene prediction step.
