Bakta: rapid & standardized annotation of bacterial genomes, MAGs & plasmids

Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs. It provides dbxref-rich, sORF-including and taxon-independent annotations in machine-readable JSON & bioinformatics standard file formats for automated downstream analysis.

Description
Installation
Examples
Input & Output
Usage
Annotation Workflow
Database
Genome Submission
Protein bulk annotation
Genome plots
Auxiliary scripts
Web version
Citation
FAQ
Issues & Feature Requests

Description

Comprehensive & taxonomy-independent database: Bakta provides a large and taxonomy-independent database using UniProt's entire UniRef protein sequence cluster universe. Thus, it achieves favourable annotations in terms of sensitivity and specificity along the broad continuum ranging from well-studied species to unknown genomes from MAGs.
Protein sequence identification: Bakta exactly identifies known identical protein sequences (IPS) from RefSeq and UniProt, allowing the fine-grained annotation of gene alleles (AMR) or closely related but distinct protein families. This is achieved via an alignment-free sequence identification (AFSI) approach using full-length MD5 protein sequence hash digests.
Fast: This AFSI approach substantially accelerates the annotation process by avoiding computationally expensive homology searches for identified genes. Thus, Bakta can annotate a typical bacterial genome in 10 ±5 min on a laptop, and plasmids in a couple of seconds/minutes.
Database cross-references: Fostering the FAIR principles, Bakta exploits its AFSI approach to annotate CDS with database cross-references (dbxref) to RefSeq (WP_*), UniRef100 (UniRef100_*) and UniParc (UPI*). By doing so, IPS allow the surveillance of distinct gene alleles and streamline comparative analysis as well as posterior (external) annotations of putative & hypothetical protein sequences, which can be mapped back to existing CDS via these exact & stable identifiers (E. coli gene ymiA ...more). Currently, Bakta identifies ~350 million, ~330 million and ~290 million distinct protein sequences from UniParc, UniRef100 and RefSeq, respectively. Hence, for certain genomes, up to 99 % of all CDS can be identified this way, skipping computationally expensive sequence alignments.
FAIR annotations: To provide standardized annotations adhering to FAIR principles, Bakta utilizes a versioned custom annotation database comprising UniProt's UniRef100 & UniRef90 protein clusters (FAIR -> DOI/DOI) enriched with dbxrefs (GO, COG, EC) and annotated by specialized niche databases. For each DB version we provide a comprehensive log file of all imported sequences and annotations.
Small proteins / short open reading frames: Bakta detects and annotates small proteins/short open reading frames (sORF) which are not predicted by tools like Prodigal.
Expert annotation systems: To provide high-quality annotations for certain proteins of higher interest, e.g. AMR & VF genes, Bakta includes & merges different expert annotation systems. Currently, Bakta uses NCBI's AMRFinderPlus for AMR gene annotations as well as a generalized protein sequence expert system with distinct coverage, identity and priority values for each sequence, currently comprising the VFDB as well as NCBI's BlastRules.
Comprehensive workflow: Bakta annotates ncRNA cis-regulatory regions, oriC/oriV/oriT and assembly gaps as well as standard feature types: tRNA, tmRNA, rRNA, ncRNA genes, CRISPR, CDS and pseudogenes.
GFF3 & INSDC compliant annotations: Bakta writes GFF3 and INSDC-compliant (Genbank & EMBL) annotation files ready for submission (checked via GenomeTools GFF3Validator, table2asn_GFF and ENA Webin-CLI for GFF3 and EMBL file formats, respectively, for representative genomes of all ESKAPE species).
Bacteria & plasmids only: Bakta was designed to annotate only bacteria (isolates & MAGs) and plasmids. This design decision was made in order to tune the annotation process regarding tools, preferences & databases and to streamline further development & maintenance of the software.
Reasoning: By annotating bacterial genomes in a standardized, taxonomy-independent, high-throughput and local manner, Bakta aims at a well-balanced tradeoff between fully featured but computationally demanding pipelines like PGAP and rapid, highly customizable offline tools like Prokka. Indeed, Bakta is heavily inspired by Prokka (kudos to Torsten Seemann) and many command line options are compatible for the sake of interoperability and user convenience. Hence, if Bakta does not fit your needs, please consider trying Prokka.
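
The alignment-free sequence identification (AFSI) idea described above can be illustrated with a minimal Python sketch. This is not Bakta's actual code: the normalization step, the function names and all identifiers in the toy lookup table are hypothetical and only show how an exact MD5 digest lookup can replace a homology search and return stable dbxrefs.

```python
import hashlib


def md5_digest(protein_seq: str) -> str:
    """Return the hex MD5 digest of a full-length protein sequence.

    Assumption: sequences are normalized to upper case without a
    trailing stop symbol before hashing.
    """
    normalized = protein_seq.strip().upper().rstrip("*")
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def build_index(known: dict) -> dict:
    """Map the digest of each known sequence to its dbxrefs."""
    return {md5_digest(seq): dbxrefs for seq, dbxrefs in known.items()}


def identify(protein_seq: str, index: dict):
    """Exact lookup: return the dbxrefs on an identical match, else None."""
    return index.get(md5_digest(protein_seq))


# Toy lookup table; the sequence and all identifiers are made up.
KNOWN = {
    "MKTAYIAKQR": [
        "RefSeq:WP_000000001.1",
        "UniRef:UniRef100_EXAMPLE",
        "UniParc:UPI0000000001",
    ],
}

index = build_index(KNOWN)
hit = identify("mktayiakqr", index)   # identical sequence, different case
miss = identify("MKTAYIAKQK", index)  # one residue differs, so no match
print(hit)
print(miss)
```

Because the lookup is exact, a single differing residue yields no hit; such sequences would then fall through to a conventional homology search.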

Installation

Bakta can be installed via BioConda, Docker, Singularity and Pip. However, we encourage using Conda or Docker/Singularity, which automatically install all required 3rd party dependencies.

In all cases a mandatory database must be downloaded.

BioConda

conda install -c conda-forge -c bioconda bakta

Podman (Docker)

We maintain a Docker image oschwengers/bakta providing an entrypoint, so that containers can be used like an executable:

podman pull oschwengers/bakta
podman run oschwengers/bakta --help

Installation instructions and get-started guides: Podman docs. For further convenience, we provide a shell script (bakta-podman.sh) handling Podman related parameters (volume mounting, user IDs, etc):

bakta-podman.sh --db <db-path> --output <output-path> <input>

For experienced users and full functionality (bakta_db & bakta_proteins), an image without entrypoint might be a better option. For these cases, please use one of the Biocontainer images:

export CONTAINER="quay.io/biocontainers/bakta:1.8.2--pyhdfd78af_0"
podman run -it --rm $CONTAINER bakta --help
podman run -it --rm $CONTAINER bakta_db --help

Pip

python3 -m pip install --user bakta

Bakta requires the following 3rd party software tools which must be installed and executable to use the full set of features:

tRNAscan-SE (2.0.12) https://doi.org/10.1101/614032 http://lowelab.ucsc.edu/tRNAscan-SE
Aragorn (1.2.41) http://dx.doi.org/10.1093/nar/gkh152 http://130.235.244.92/ARAGORN
INFERNAL (1.1.5) https://dx.doi.org/10.1093%2Fbioinformatics%2Fbtt509 http://eddylab.org/infernal
PILER-CR (1.06) https://doi.org/10.1186/1471-2105-8-18 http://www.drive5.com/pilercr
Pyrodigal (3.7.0) https://doi.org/10.21105/joss.04296 https://github.com/althonos/pyrodigal
PyHMMER (0.12.0) https://doi.org/10.1093/bioinformatics/btad214 https://github.com/althonos/pyhmmer
Diamond (2.1.22) https://doi.org/10.1038/nmeth.3176 https://github.com/bbuchfink/diamond
Blast+ (2.17.0) https://www.ncbi.nlm.nih.gov/pubmed/2231712 https://blast.ncbi.nlm.nih.gov
AMRFinderPlus (4.2.7) https://github.com/ncbi/amr
pyCirclize (1.7.0) https://github.com/moshi4/pyCirclize
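
After a Pip installation it can help to check which of the command line tools listed above are actually reachable. The following is a small sketch using only the Python standard library. The executable names are assumptions based on how these tools are commonly installed and are not taken from this document; Pyrodigal, PyHMMER and pyCirclize are Python packages rather than executables and are therefore not covered.

```python
import shutil

# Assumed executable names for the 3rd party tools (adjust as needed).
ASSUMED_EXECUTABLES = {
    "tRNAscan-SE": "tRNAscan-SE",
    "Aragorn": "aragorn",
    "INFERNAL": "cmscan",
    "PILER-CR": "pilercr",
    "Diamond": "diamond",
    "Blast+": "blastn",
    "AMRFinderPlus": "amrfinder",
}


def find_missing(executables: dict) -> list:
    """Return the sorted names of tools whose executable is not on PATH."""
    return sorted(
        tool for tool, exe in executables.items() if shutil.which(exe) is None
    )


missing = find_missing(ASSUMED_EXECUTABLES)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All assumed executables were found on PATH.")
```

This only checks that an executable exists; it does not verify the versions given in the list.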

Database download

Bakta requires a mandatory database which is publicly hosted at Zenodo. We provide 2 types: full and light. To get the best annotation results and to use all features, we recommend using the full type (default). If you seek maximum runtime performance or if download time/storage re
