BITACORA

A comprehensive tool for the identification and annotation of gene families in genome assemblies

Genome annotation is a critical bottleneck in genomic research, especially for the comprehensive study of gene families in the genomes of non-model organisms. Despite the recent progress in automatic annotation, state-of-the-art tools used for this task often produce inaccurate annotations, such as fused, chimeric, partial or even completely absent gene models for many family copies, errors that require considerable extra efforts to be corrected. Here we present BITACORA, a bioinformatics tool that integrates popular sequence similarity-based search algorithms and Perl scripts to facilitate the curation of these inaccurate annotations and the identification of previously undetected gene family copies directly from genomic DNA sequences. The program creates general feature format (GFF) files, with both curated and newly identified gene models, and FASTA files with all predicted proteins. The output of BITACORA can be easily integrated in genomic annotation editors, greatly facilitating subsequent manual annotation and downstream analyses.

BITACORA documentation can be also be found in: http://www.ub.edu/softevol/bitacora/

Version history

BITACORA 1.4:

BITACORA can now handle the latest GeMoMa versions v1.8 and v1.9. Note that these newer versions have changed the heuristic for identifying multiple transcripts predictions, which improves the number of annotations, but I noticed that it also reports some very fragmented annotations that require additional filtering and validation. Therefore, consider using “-r T” and “-l” adjusted to the minimum expected length of the gene family if using the latest GeMoMa version 1.9.
Added new parameter "-z" that allows to retain all gene annotations without clustering identical copies. Current genome assemblies are mostly based in long-read sequencing which allows to assemble large genomic regions containing gene family tandem duplications, and finding highly identical genes in those regions is interesting to explore gene family evolution. Therefore, this option allows BITACORA to report all annotated genes, including identical copies. Note that it will also report partial genes. In addition, a new script is added in Tools “identify_similar_sequence_clusters.pl” that allows to identify highly identical clusters of genes.
Minor bug fixes and code optimization.

BITACORA 1.3:

BITACORA is able to use the latest GeMoMa version v1.7.1. However, v1.7.0 contains a bug that prevents our pipeline from working, please use v1.7.1 instead.
Added a new script to call BITACORA with command line options (run first '$ chmod +x runBITACORA_command_line.sh' to make it executable, and '$ ./runBITACORA_command_line.sh -h' to see the available options).

BITACORA 1.2.1:

Implementation of GeMoMa algorithm to reconstruct new gene models (set as default). The latest version tested in our pipeline is GeMoMa v1.6.4.
New parameter that allows retaining novel proteins based on HMMER or BLASTP positive hits.
Additional step to conduct a more strict filtering of the output annotations in order to obtain a confident estimation on the number of gene members for a specific gene family.

0. Contents

Installation
Prerequisites
Computational Requirements
Usage modes
- 4.1 Full mode
- 4.2 Protein mode
- 4.3 Genome mode
Parameters
Running BITACORA
Output
Example
Citation
Troubleshooting

1. Installation

BITACORA is distributed as a multiplatform shell script (runBITACORA.sh) that calls several other Perl scripts, which include all functions responsible of performing all pipeline tasks. Hence, it does not require any installation or compilation step.

To run the pipeline edit the master script runBITACORA.sh variables described in Prerequisites, Data, and Parameters.

2. Prerequisites

Perl: Perl is installed by default in most operating systems. See https://learn.perl.org/installing/ for installation instructions.
BLAST: Download blast executables from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
HMMER: The easiest way to install HMMER in your system is to type one of the following commands in your terminal:

  % brew install hmmer               # OS/X, HomeBrew
  % port install hmmer               # OS/X, MacPorts
  % apt install hmmer                # Linux (Ubuntu, Debian...)
  % dnf install hmmer                # Linux (Fedora)
  % yum install hmmer                # Linux (older Fedora)
  % conda install -c bioconda hmmer  # Anaconda

Or compile HMMER binaries from the source code: http://hmmer.org/

HMMER and BLAST binaries require to be added to the PATH environment variable. Specify the correct path to bin folders in the master script runBITACORA.sh, if necessary.

$ export PATH=$PATH:/path/to/blast/bin
$ export PATH=$PATH:/path/to/hmmer/bin

GeMoMa: By default, BITACORA reconstructs new gene models using the GeMoMa algorithm (Keilwagen et al., 2016; 2018). The GeMoMa jar file (i.e. GeMoMa-1.6.4.jar) must be specified in GEMOMAP variable in runBITACORA.sh. GeMoMa is implemented in Java using Jstacs and can be downloaded from: http://www.jstacs.de/index.php/GeMoMa. The latest version of GeMoMa tested in our pipeline is v1.9.

GEMOMAP=/path/to/GeMoMa.jar  (within runBITACORA.sh script)

3. Computational requirements

BITACORA have been tested in UNIX-based platforms (both in Mac OS and Linux operating systems). Multiple threading can be set in blast searches, which is the most time-consuming step, by editing the option THREADS in runBITACORA.sh.

For a typical good quality genome (~2Gb in size and ~10,000 scaffolds) and a standard modern PC (16Gb RAM), a full run of BITACORA is completed in less than 24h. This running time, however, will depend on the size of the gene family or the group of genes surveyed in a particular analysis. For gene families of 10 to 100 members, BITACORA spends from minutes to a couple of hours.

In case of larger or very fragmented genomes, BITACORA should be used in a computer cluster or workstation given the increase of RAM memory and time required.

4. Usage modes

4.1. Full mode

BITACORA has been initially designed to work with genome sequences and protein annotations (full mode). However, the pipeline can also be used either with only protein or only genomic sequences (protein and genome modes, respectively). These last modes are explained in next subsections.

Preparing the data: The input files (in plain text) required by BITACORA to run a full analysis are (update the complete path to these files in the master script runBITACORA.sh):

I. File with genomic sequences in FASTA format

II. File with structural annotations in GFF3 format

[NOTE: mRNA or transcript, and CDS are mandatory fields]

GFF3 example

lg1_ord1_scaf1770       AUGUSTUS        gene    13591   13902   0.57    +       .       ID=g1;
lg1_ord1_scaf1770       AUGUSTUS        mRNA    13591   13902   0.57    +       .       ID=g1.t1;Parent=g1;
lg1_ord1_scaf1770       AUGUSTUS        start_codon     13591   13593   .       +       0       Parent=g1.t1;
lg1_ord1_scaf1770       AUGUSTUS        CDS     13591   13902   0.57    +       0       ID=g1.t1.CDS1;Parent=g1.t1
lg1_ord1_scaf1770       AUGUSTUS        exon    13591   13902   .       +       .       ID=g1.t1.exon1;Parent=g1.t1;
lg1_ord1_scaf1770       AUGUSTUS        stop_codon      13900   13902   .       +       0       Parent=g1.t1;

BITACORA also accepts other GFF formats, such as Ensembl GFF3 or GTF.

[NOTE: GFF formatted files from NCBI can cause errors when processing the data, use the supplied script “reformat_ncbi_gff.pl” (located in the folder /Scripts/Tools) to make the file parsable by BITACORA]. See Troubleshooting in case of getting errors while parsing your GFF.

Ensembl FF3 example

AFFK01002511    EnsemblGenomes  gene    761     1018    .       -       .       ID=gene:SMAR013822;assembly_name=Smar1;biotype=protein_coding;logic_name=ensemblgenomes;version=1
AFFK01002511    EnsemblGenomes  transcript      761     1018    .       -       .       ID=transcript:SMAR013822-RA;Parent=gene:SMAR013822;assembly_name=Smar1;biotype=protein_coding;logic_name=ensemblgenomes;version=1
AFFK01002511    EnsemblGenomes  CDS     761     811     .       -       0       Parent=transcript:SMAR013822-RA;assembly_name=Smar1
AFFK01002511    EnsemblGenomes  exon    761     811     .       -       .       Parent=transcript:SMAR013822-RA;Name=SMAR013822-RA-E2;assembly_name=Smar1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;rank=2;version=1

III. Files with predicted proteins in FASTA format.

BITACORA requires identical IDs for proteins and their corresponding mRNAs or transcripts IDs in the GFF3.

[NOTE: we recommend using genes but not isoforms in BITACORA; We provide the following script Scripts/Tools/exclude_isoforms_fromfasta.pl to cluster isoforms, keeping the longest sequence as representative. Isoforms could also be removed or properly annotated after BITACORA analysis]

IV. Specific folder with files containing the query protein databases (YOURFPDB_db.fasta) and HMM profiles (YOURFPDB_db.hmm) in FASTA and hmm format, respectively, where the “YOURFPDB” label is your specific data file name. The addition of ”_db” to the database name with its proper extension, fasta or hmm, is mandatory.

BITACORA requires one protein database and profile per surveyed gene family (or gene group). See Example/DB files for an example of searching for two different gene families in BITACORA: OR, Odorant Receptors; and CD36-SNMP.

[NOTE: profiles covering only partially the proteins of interest are not recommended]

Notes on HMM profiles:

HMM profiles are found i

Bitacora

Install / Use

README