ToxCodAn

ToxCodAn is a computational tool designed to detect and annotate toxin genes in transcriptome assemblies.

The guide for venom gland transcriptomics is available here

Getting Started

Installation

Download the master folder and follow the steps below:

unzip ToxCodAn-master.zip
export PATH=$PATH:path/to/ToxCodAn-master/bin/

OR git clone the ToxCodAn repository and add the bin folder to your PATH:

git clone https://github.com/pedronachtigall/ToxCodAn.git
export PATH=$PATH:path/to/ToxCodAn/bin/

Requirements

Python3 and Biopython
- apt-get install python3-biopython
Perl, Bioperl and MCE (libmce-perl)
- apt-get install bioperl libmce-perl
CodAn
NCBI-BLAST (v2.9.0 or above)
SignalP-4.1
HMMER (used in NonToxin annotation step - which is optional)
DIAMOND (v2.0.6 or higher) - Optional tool (to increase speed in NonToxin annotation step)

Ensure that all requirements are working properly.
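
One way to check the requirements is to confirm that each executable can be found on your PATH. The sketch below is only an illustration: the executable names (blastp, hmmsearch and so on) are assumptions based on the tools listed above, not names taken from the ToxCodAn code, so adjust them to your installation.

```python
import shutil

# Executable names are assumptions based on the tools listed above;
# adjust them to match how each tool is installed on your system.
REQUIRED_TOOLS = ["python3", "perl", "codan.py", "blastp", "signalp"]
OPTIONAL_TOOLS = ["hmmsearch", "diamond"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    for label, tools in (("required", REQUIRED_TOOLS), ("optional", OPTIONAL_TOOLS)):
        missing = missing_tools(tools)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label} tools: {status}")
```

Finding an executable on PATH does not prove that it runs correctly, so it is still worth running each tool once (for example with its help option) before starting ToxCodAn.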

:warning: To install ToxCodAn and all dependencies in a Conda environment, follow the steps below:

Create the environment:
- conda create -n toxcodan_env -c bioconda python=3.6 biopython=1.69 codan blast hmmer
Git clone the ToxCodAn repository and add to your PATH:
- git clone https://github.com/pedronachtigall/ToxCodAn.git
- export PATH=$PATH:path/to/ToxCodAn/bin/
Download SignalP-4.1, decompress it, and add it to your PATH:
- tar -xzf signalp-4.1g.Linux.tar.gz
- export PATH=$PATH:path/to/signalp-4.1/
- Change line 13 of "signalp" (path/to/signalp-4.1/signalp) to:
  - $ENV{SIGNALP} = 'path/to/signalp-4.1/';
You may need to grant execute permission to all executables in "CodAn/bin" and "ToxCodAn/bin/":
- chmod 777 path/to/ToxCodAn/bin/*
Then, run ToxCodAn as described in the "Usage" section.
To activate the environment before running ToxCodAn, use the command: conda activate toxcodan_env
To deactivate the environment, use the command: conda deactivate
:warning: Tip: Ensure that you have added all conda channels properly:
- conda config --add channels defaults && conda config --add channels bioconda && conda config --add channels conda-forge

Models

The model folder contains specific gHMM models and the toxinDB used in the ToxCodAn pipeline.

Download the models.zip file, uncompress it (unzip models.zip), and pass the resulting folder to the -m option of the ToxCodAn command line (-m path/to/models/).

Usage

Usage: toxcodan.py [options]

Options:
  -h, --help            show this help message and exit
  -s string, --sample=string
                        Optional - sample ID to be used in the output files
                        [default=toxcodan]
  -t fasta, --transcripts=fasta
                        Mandatory - transcripts in FASTA format,
                        /path/to/transcripts.fasta
  -o folder, --output=folder
                        Optional - output folder, /path/to/output_folder; if
                        not defined, the output folder will be set in the
                        current directory [ToxCodAn_output]
  -m path, --model=path
                        Mandatory - path to model folder, /path/to/models
  -p boolean value, --signalp=boolean value
                        Optional - turn on/off the signalP filtering step, use
                        True to turn on or False to turn off [default=True]
  -P boolean value, --partial=boolean value
                        Optional - turn on/off the partial filtering step, use
                        True to turn on or False to turn off [default=False]
  -n path, --nontoxinannotation=path
                        Optional - path to folder containing the protein DB
                        and CodAn model to be used in the NonToxin Annotation
                        pipeline [default=None]
  -c int, --cpu=int     Optional - number of threads to be used in each step
                        [default=1]
  -f int, --covprefilter=int
                        Optional - threshold value used as the minimum
                        coverage in the pre-filter step [default=90]
  -F int, --covtoxinfilter=int
                        Optional - threshold value used as the minimum
                        coverage in the toxin filter step [default=80]

Basic usage:

toxcodan.py -t transcripts.fa -m path/to/models
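
When ToxCodAn is run for many samples, it can help to assemble the command line programmatically. The following is a minimal sketch, assuming toxcodan.py is on your PATH; the helper names build_toxcodan_command and run_toxcodan are hypothetical and not part of ToxCodAn. Only options listed in the usage message above are used.

```python
import subprocess


def build_toxcodan_command(transcripts, models, sample=None, output=None, cpu=None):
    """Assemble a toxcodan.py argument list from the documented options."""
    # -t and -m are the two mandatory options.
    command = ["toxcodan.py", "-t", transcripts, "-m", models]
    if sample is not None:
        command += ["-s", sample]      # sample ID used in the output files
    if output is not None:
        command += ["-o", output]      # output folder
    if cpu is not None:
        command += ["-c", str(cpu)]    # number of threads
    return command


def run_toxcodan(command):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    cmd = build_toxcodan_command("transcripts.fa", "path/to/models", sample="sample1", cpu=4)
    print(" ".join(cmd))
    # run_toxcodan(cmd)  # uncomment once toxcodan.py and its dependencies are installed
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.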

Check our tutorial to learn how to use ToxCodAn.

Inputs

ToxCodAn requires the following inputs:

Transcripts in FASTA format, through the -t option.
The uncompressed models folder, through the -m option.

Outputs

ToxCodAn outputs the following files:

SampleID_Toxins_cds.fasta
SampleID_Toxins_pep.fasta
SampleID_Toxins_annotation.gtf
SampleID_Toxins_contigs.fasta
SampleID_PutativeToxins_cds.fasta
SampleID_PutativeToxins_contigs.fasta
SampleID_NonToxins_contigs.fasta

SampleID_Toxins_cds_SPfiltered.fasta (optional step)
SampleID_Toxins_pep_SPfiltered.fasta (optional step)
SampleID_Toxins_contigs_SPfiltered.fasta (optional step)
SampleID_Toxins_cds_SPfiltered_RedundancyFiltered.fasta (optional step)

signalp_annotation.gff (optional step)
RemoveRedundancy.log

Description of the output files:

cds -> coding sequence of the predicted toxins
pep -> protein sequence of the predicted toxins
contigs -> whole contigs containing the predicted CDSs
Toxins -> sequences with very high probability of being toxins
PutativeToxins -> sequences with medium/high probability of being toxins
NonToxins -> sequences with very low probability of being toxins
RedundancyFiltered -> CDSs with 100% identity filtered
SPfiltered -> signalP filtered sequences (optional step)
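
After a run, the file names above can be used to check that the expected outputs are present. This is a rough sketch under several assumptions: that "SampleID" is replaced by the value passed to -s (default toxcodan), that the files sit directly in the output folder, and that the files marked "(optional step)" are all tied to the signalP step. The function names are hypothetical.

```python
from pathlib import Path

# Suffixes copied from the output list above.
CORE_SUFFIXES = [
    "Toxins_cds.fasta",
    "Toxins_pep.fasta",
    "Toxins_annotation.gtf",
    "Toxins_contigs.fasta",
    "PutativeToxins_cds.fasta",
    "PutativeToxins_contigs.fasta",
    "NonToxins_contigs.fasta",
]
# Assumption: every "(optional step)" file depends on the signalP filtering step.
SIGNALP_SUFFIXES = [
    "Toxins_cds_SPfiltered.fasta",
    "Toxins_pep_SPfiltered.fasta",
    "Toxins_contigs_SPfiltered.fasta",
    "Toxins_cds_SPfiltered_RedundancyFiltered.fasta",
]


def expected_outputs(sample_id="toxcodan", signalp=True):
    """List the output file names expected for a given sample ID."""
    suffixes = CORE_SUFFIXES + (SIGNALP_SUFFIXES if signalp else [])
    return [f"{sample_id}_{suffix}" for suffix in suffixes]


def missing_outputs(output_dir, sample_id="toxcodan", signalp=True):
    """Return the expected file names that are not present in output_dir."""
    folder = Path(output_dir)
    return [name for name in expected_outputs(sample_id, signalp)
            if not (folder / name).is_file()]
```

For example, missing_outputs("ToxCodAn_output") returns an empty list when every expected file was written under these assumptions.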

Annotation of Non Toxin transcripts

The user can take advantage of a simple script, NonToxinAnnotation.py, designed to annotate Non Toxin transcripts. Follow the steps below:

First, perform the CDS prediction with the "VERT_full" model using CodAn (Nachtigall et al., 2020)
- codan.py -t path/to/NonToxins_contigs.fasta -m path/to/VERT_full/ -o path/to/output/NonToxins_codan/ -c N
- We have a copy of the "VERT_full" in the "non_toxin_models" folder: cd path/to/non_toxin_models/ and gzip -d VERT_full
Then, use the NonToxinAnnotation.py on the predicted CDSs.
The script performs a BLAST search (mandatory) and HMM searches using BUSCO and Pfam models (optional).
Use the -d option to set a protein DB, either pre-compiled or built with makeblastdb.
- The user can use a DB such as Swissprot and/or the protein DB available in the "non_toxin_models" folder (uncompress it with tar xjf pepDB.tar.bz2).
- One or more DBs can be set by separating their paths with commas ","; any number of DBs (from 1 to N) is accepted.
Optionally, the user can set any of the BUSCO models to perform the HMM search by using the -b option.
Optionally, the user can set the Pfam models to perform the HMM search by using the -p option (download Pfam-A.hmm.gz from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz and decompress it with gunzip Pfam-A.hmm.gz).
- Note that you may need to build the auxiliary index files for the Pfam models before first use: hmmpress Pfam-A.hmm
The script supports multithreading through the -c option.
Usage: NonToxinAnnotation.py -t path/to/output/NonToxins_codan/ORF_sequences.fasta -d path/to/db1,...,path/to/dbN -b path/to/busco/odb -p path/to/pfam.hmm -c N
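
The two steps above (CodAn prediction, then NonToxinAnnotation.py) can be scripted together. The sketch below only builds the two argument lists, using the options shown in this section; the function names are hypothetical, and all paths are placeholders. It illustrates in particular how several DBs are joined with commas for the -d option.

```python
def codan_command(contigs, model, output, cpu=1):
    """Step 1: CDS prediction on the NonToxins contigs with CodAn."""
    return ["codan.py", "-t", contigs, "-m", model, "-o", output, "-c", str(cpu)]


def nontoxin_annotation_command(orfs, databases, busco=None, pfam=None, cpu=1):
    """Step 2: annotate the predicted CDSs with NonToxinAnnotation.py."""
    if not databases:
        # The BLAST search is mandatory, so at least one DB is needed.
        raise ValueError("at least one protein DB is required")
    # Multiple DBs are passed as a single comma-separated value.
    command = ["NonToxinAnnotation.py", "-t", orfs, "-d", ",".join(databases)]
    if busco is not None:
        command += ["-b", busco]   # optional BUSCO models
    if pfam is not None:
        command += ["-p", pfam]    # optional Pfam models
    command += ["-c", str(cpu)]
    return command


if __name__ == "__main__":
    out = "path/to/output/NonToxins_codan/"
    step1 = codan_command("path/to/NonToxins_contigs.fasta", "path/to/VERT_full/", out, cpu=4)
    step2 = nontoxin_annotation_command(out + "ORF_sequences.fasta",
                                        ["path/to/db1", "path/to/db2"], cpu=4)
    print(" ".join(step1))
    print(" ".join(step2))
```

Each list can be passed to subprocess.run(..., check=True) once the tools are installed, so that the second step only starts if the first one succeeds.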

:warning: [Attention 1] To speed up the NonToxin annotation by using the DIAMOND tool, follow the steps below:

DIAMOND can be installed with the command: conda install -c bioconda diamond
Build the DIAMOND DB from a set of protein sequences: diamond makedb --in proteins.fasta -d diamondDB
Then, run NonToxinAnnotation.py on the predicted CDSs, setting the -s diamond option and passing the DIAMOND DB to the -d option.
- NonToxinAnnotation.py -s diamond -t path/to/output/NonToxins_codan/ORF_sequences.fasta -d path/to/diamondDB -c N
- Keep the -b path/to/busco/odb -p path/to/pfam.hmm options to also perform the HMM search using BUSCO and Pfam models, as described above.

:warning: [Attention 2] Alternatively, if the user wants to directly perform the NonToxins annota