TOGA

⚠️ TOGA1 is no longer maintained. We strongly encourage users to switch to TOGA2, which is faster, much more memory efficient, more accurate and offers numerous improvements over TOGA1.

TOGA is a method that integrates gene annotation, ortholog inference, and the classification of genes as intact or lost.

TOGA implements a novel machine learning based paradigm to infer orthologous genes between related species and to accurately distinguish orthologs from paralogs or processed pseudogenes.

This tutorial explains how to get started using TOGA. It shows how to install and execute TOGA, and how to handle possible issues that may occur.

For more details, please check out the TOGA wiki.

GitHub discussions section

Interested in contributing, have questions, or want to discuss the science behind TOGA? Head over to our Discussions section. It's a new, experimental space (the authors have not yet had a chance to try this GitHub feature) where we can talk about anything that doesn't quite fit into the Issues framework.

Changelog

For a detailed history of changes made to the TOGA project, please refer to the Changelog. This document provides version-specific updates, including new features, bug fixes, and other modifications.

Installation

TOGA is compatible with Linux and macOS, including M1-based systems. Python 3.11 is recommended.

Access to a computational cluster is highly recommended, but for small or partial genomes with short genes a desktop PC will be enough.

TOGA requires Nextflow, which in turn requires Java >= 8. Check your Java version and install Nextflow using one of the following commands:

curl -fsSL https://get.nextflow.io | bash
# OR
conda install -c bioconda nextflow

If you've downloaded nextflow using curl, move the nextflow executable to a directory listed in your $PATH variable.
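To confirm that both prerequisites are visible on your $PATH before continuing, a small check along these lines can help. This is a minimal sketch, not part of TOGA; the function name `find_tools` is hypothetical.

```python
import shutil


def find_tools(names=("java", "nextflow")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report which prerequisites can be found from the current environment.
for tool, path in find_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, fix your $PATH (or install the tool) before running configure.sh.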

To get TOGA do the following:

# clone the repository
git clone https://github.com/hillerlab/TOGA.git
cd TOGA

Install necessary python packages using pip:

python3 -m pip install -r requirements.txt --user

Alternatively, if you use poetry, just run poetry install.

Call configure.sh to:

train xgboost models
download CESAR2.0
compile C code

Run a test, which takes a couple of minutes: ./run_test.sh micro

If you see something like this at the very end, then TOGA is almost ready to go:

Orthology class sizes:
one2one: 3
Done! Estimated time: 0:01:02.800084
Program finished with exit code 0
JH567521 299723 336583 ENST00000618101.1169 879 + 299723 336583 0,0,200 7 28,923,130,173,200,179,248, 0,1256,6085,6677,19146,21311,36612,
JH567521 463144 506100 ENST00000262455.1169 711 - 463144 506100 0,200,255 8 102,103,142,112,117,58,116,185, 0,1982,30295,31351,36911,38566,41322,42771,
JH567521 395878 449234 ENST00000259400.1169 942 + 395878 449234 0,0,200 7 123,66,226,116,51,87,240, 0,11871,38544,45802,45994,52305,53116,
Success!

If you experience any problems installing TOGA, please visit the troubleshooting section.

Configuring TOGA for cluster

TOGA uses nextflow to run cluster-dependent steps. To run a pipeline on a cluster, nextflow requires a configuration file that defines the "executors" component. This repository contains configuration files for a slurm cluster; you can find them in the nextflow_config_files directory.

To create configuration files for a non-slurm cluster, do the following:

Find here what parameters are available for your cluster. Most likely, you can use slurm configuration files as a reference.
Create a separate directory for configuration files, or re-use nextflow_config_files dir.
Create an "extract_chain_features_config.nf" file. This file contains the configuration for the chain feature extraction step. These jobs are expected to be short and not memory-intensive, so a runtime limit of 1 hour and 10 GB of memory should be enough.
Create a "call_cesar_config_template.nf" file. This configuration file is for CESAR jobs. These jobs usually take much longer than chain feature extraction, so it is recommended to request 24 hours for them. You don't have to provide an exact amount of memory for these jobs, because TOGA computes it itself. Write a placeholder instead, as follows: process.memory = "${_MEMORY_}G".
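To illustrate why the template needs the literal placeholder, here is a minimal sketch of how such a placeholder could be filled in with plain string substitution. This is an assumption for illustration only: the function name `fill_memory_placeholder` is hypothetical, and TOGA's actual substitution code may work differently.

```python
def fill_memory_placeholder(template_text: str, memory_gb: int) -> str:
    """Replace the ${_MEMORY_} placeholder with a concrete number of gigabytes."""
    placeholder = "${_MEMORY_}"
    if placeholder not in template_text:
        # Without the placeholder there is nothing to substitute per job.
        raise ValueError("template does not contain the ${_MEMORY_} placeholder")
    return template_text.replace(placeholder, str(memory_gb))


# Hypothetical template line, as recommended for call_cesar_config_template.nf.
template = 'process.memory = "${_MEMORY_}G"'
print(fill_memory_placeholder(template, 16))
```

The point is that a hard-coded value in the template would leave nothing to replace, so every CESAR job would request the same amount of memory.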

Final test

This repository also contains sample data to perform a wide-scale test. To do so, please download genome sequences for human (GRCh38/hg38) and mouse (GRCm38) in the 2bit format. You can download these 2bit files using the following links:

Human 2bit: wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit

Mouse 2bit: wget https://hgdownload.cse.ucsc.edu/goldenpath/mm10/bigZips/mm10.2bit

Then call the following:

./toga.py test_input/hg38.mm10.chr11.chain test_input/hg38.genCode27.chr11.bed ${path_to_human_2bit} ${path_to_mouse_2bit} --kt --pn test -i supply/hg38.wgEncodeGencodeCompV34.isoforms.txt --nc ${path_to_nextflow_config_dir} --cb 3,5 --cjn 500 --u12 supply/hg38.U12sites.tsv --ms
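If you prefer to launch the test from a script, the same command can be assembled as an argument list. This is a sketch: the helper name `build_toga_test_command` and the example paths are placeholders, while the flags and input files are copied from the command above.

```python
import shlex


def build_toga_test_command(human_2bit, mouse_2bit, nextflow_config_dir):
    """Assemble the test command from the README as a list of arguments."""
    return [
        "./toga.py",
        "test_input/hg38.mm10.chr11.chain",
        "test_input/hg38.genCode27.chr11.bed",
        human_2bit,
        mouse_2bit,
        "--kt",
        "--pn", "test",
        "-i", "supply/hg38.wgEncodeGencodeCompV34.isoforms.txt",
        "--nc", nextflow_config_dir,
        "--cb", "3,5",
        "--cjn", "500",
        "--u12", "supply/hg38.U12sites.tsv",
        "--ms",
    ]


# Placeholder paths; replace them with the locations of your own files.
cmd = build_toga_test_command("hg38.2bit", "mm10.2bit", "nextflow_config_files")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the TOGA directory.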

This will take about 20 minutes on a 500-core cluster.

Troubleshooting

Please see here.

Usage

This section explains how to use TOGA, in particular the toga.py arguments and the input file formats.

Input files

TOGA is a reference-based genome annotation tool, which means that it needs the following data as input:

Gene annotation of the reference genome
Genome alignment between the reference and query genome(s)
Reference and query genome sequences

Gene annotation of reference genome

Bed-12 file

TOGA accepts a bed12-formatted file as a reference genome annotation. This file is mandatory for running TOGA.

The bed12 format specification is available at: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

Example for the human gene MAP1S, which has two transcripts:

user@user$ grep ENST00000544059 supply/hg38.wgEncodeGencodeCompV34.bed
chr19 17720155 17734490 ENST00000544059 0 + 17720390 17734428 0 7 275,102,83,141,2344,236,218, 0,780,3970,4893,5673,13037,14117,
user@user$ grep ENST00000324096 supply/hg38.wgEncodeGencodeCompV34.bed
chr19 17719479 17734513 ENST00000324096 0 + 17719502 17734428 0 7 141,102,83,141,2344,236,241, 0,1456,4646,5569,6349,13713,14793,

This repository contains example bed12 files for human and mouse:

Human genome annotation: supply/hg38.wgEncodeGencodeCompV34.bed
Mouse genome annotation: supply/mm10.wgEncodeGencodeCompVM25.bed

Some advice about your reference annotation:

Please make sure that the CDS length of each annotated transcript is divisible by 3. TOGA will skip transcripts that do not satisfy this criterion.
It is highly recommended that the CDS of each transcript starts with ATG and ends with a canonical stop codon.
Make sure your transcripts are coding, meaning that thickStart and thickEnd are not equal. TOGA skips non-coding transcripts.
Avoid any pseudogenes in the reference annotations.
Also, try to avoid merged and incomplete transcripts.
Make sure that transcript identifiers are unique, e.g. avoid cases where two or more transcripts have the same identifier.
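Several of these recommendations can be checked automatically before running TOGA. The following is a minimal sketch of such a pre-check, not TOGA's own filtering code; the function names `cds_length` and `check_bed12` are hypothetical. It covers only the divisibility, coding, and uniqueness rules, and splits fields on any whitespace.

```python
from collections import Counter


def cds_length(fields):
    """Sum the exon bases that fall inside [thickStart, thickEnd)."""
    chrom_start = int(fields[1])
    thick_start, thick_end = int(fields[6]), int(fields[7])
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    total = 0
    for size, rel_start in zip(sizes, starts):
        exon_start = chrom_start + rel_start
        exon_end = exon_start + size
        # Overlap between this exon and the coding region (may be empty).
        total += max(0, min(exon_end, thick_end) - max(exon_start, thick_start))
    return total


def check_bed12(lines):
    """Return (transcript_id, problem) pairs for the rules listed above."""
    problems, records = [], []
    for line in lines:
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 12:
            problems.append((fields[3] if len(fields) > 3 else "?", "not 12 columns"))
            continue
        records.append(fields)
    id_counts = Counter(f[3] for f in records)
    for f in records:
        tid = f[3]
        if int(f[6]) == int(f[7]):
            problems.append((tid, "non-coding (thickStart == thickEnd)"))
        elif cds_length(f) % 3 != 0:
            problems.append((tid, "CDS length not divisible by 3"))
        if id_counts[tid] > 1:
            problems.append((tid, "duplicate transcript identifier"))
    return problems


# The two MAP1S transcripts shown earlier in this section.
example = [
    "chr19 17720155 17734490 ENST00000544059 0 + 17720390 17734428 0 7 "
    "275,102,83,141,2344,236,218, 0,780,3970,4893,5673,13037,14117,",
    "chr19 17719479 17734513 ENST00000324096 0 + 17719502 17734428 0 7 "
    "141,102,83,141,2344,236,241, 0,1456,4646,5569,6349,13713,14793,",
]
print(check_bed12(example))  # an empty list means no problems were found
```

Checks for ATG start codons, canonical stop codons, pseudogenes, and merged or incomplete transcripts require the genome sequence or curation and are not covered by this sketch.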

Optional but highly recommended: Isoform data

One gene can have multiple isoforms. TOGA can handle more than one isoform per gene, so it is not necessary to reduce the transcript data to the isoform with the longest CDS. Isoform data is optional, but if available, it increases annotation completeness and the accuracy of gene loss determination. If you do not provide isoform data, TOGA will treat each transcript in the bed12 file as a separate gene.

Isoforms can be provided to TOGA in a single two-column tab-separated file in the following format:

GeneIdentifier {tab} TranscriptIdentifier

The first line can be a header.

Example: The human gene MAP1S (ENSG00000130479) has 2 isoforms: ENST00000544059 and ENST00000324096.

user@user$ grep ENSG00000130479 supply/hg38.wgEncodeGencodeCompV34.isoforms.txt
ENSG00000130479 ENST00000544059
ENSG00000130479 ENST00000324096

Importantly, all transcripts listed in the bed12 file have to occur in this isoform file; otherwise TOGA throws an error. Examples for human and mouse Gencode annotations:

For human: hg38.wgEncodeGencodeCompV34.isoforms.txt
For mouse: mm10.wgEncodeGencodeCompVM25.isoforms.txt
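Because a missing transcript causes an error, it can be worth comparing the two files up front. This is a minimal sketch under the assumption that identifiers contain no whitespace; the function name `missing_from_isoforms` is hypothetical and not part of TOGA.

```python
def missing_from_isoforms(bed_lines, isoform_lines):
    """Return bed12 transcript IDs that do not appear in the isoform file."""
    # Column 4 of a bed12 line holds the transcript identifier.
    bed_ids = {line.split()[3] for line in bed_lines if line.strip()}
    iso_ids = set()
    for line in isoform_lines:
        parts = line.split()
        if len(parts) >= 2:
            # Second column is the transcript ID; a header row is harmless here.
            iso_ids.add(parts[1])
    return sorted(bed_ids - iso_ids)


# Tiny made-up inputs: T2 is annotated but has no isoform entry.
bed = [
    "chr1 100 400 T1 0 + 100 400 0 1 300, 0,",
    "chr1 500 800 T2 0 + 500 800 0 1 300, 0,",
]
isoforms = ["GeneID\tTranscriptID", "G1\tT1"]
print(missing_from_isoforms(bed, isoforms))
```

With real data, pass the lines of your bed12 file and isoform file, for example via open(path).read().splitlines().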

The simplest way to obtain an isoforms file is:

Visit https://www.ensembl.org/biomart/martview
Choose Ensembl Genes N dataset and then [species of interest] genes
Go to Filters tab, select "gene type" - protein coding
Go to Attributes tab, select:
- Gene stable ID
- Transcript stable ID
- Uncheck all other marks!
Download the results as a tsv file

U12 introns data

You can also provide data on U12 introns in the reference genome, which facilitates gene loss detection. However, this is not mandatory.

There are exampl
