ToxCodAn
Toxin genes annotation in venom gland transcriptome assembly
ToxCodAn is a computational tool designed to detect and annotate toxin genes in transcriptome assemblies.
The guide for venom gland transcriptomics is available here
Getting Started
Installation
Download the master folder and follow the steps below:
unzip ToxCodAn-master.zip
export PATH=$PATH:path/to/ToxCodAn-master/bin/
Or clone the ToxCodAn repository with git and add the bin folder to your PATH:
git clone https://github.com/pedronachtigall/ToxCodAn.git
export PATH=$PATH:path/to/ToxCodAn/bin/
Requirements
- Python3 and Biopython
apt-get install python3-biopython
- Perl, Bioperl and MCE (libmce-perl)
apt-get install bioperl libmce-perl
- CodAn
- NCBI-BLAST (v2.9.0 or above)
- SignalP-4.1
- HMMER (used in NonToxin annotation step - which is optional)
- DIAMOND (v2.0.6 or higher) - Optional tool (to increase speed in NonToxin annotation step)
Ensure that all requirements are working properly.
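As a quick sanity check, the sketch below looks for the listed tools on your PATH and reports whether Biopython can be imported. The executable names are assumptions about what a typical installation provides (for example `blastp` for NCBI-BLAST and `hmmsearch` for HMMER), not a list taken from the ToxCodAn source. It only confirms that the tools can be found, not that they run correctly.

```python
import importlib.util
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ["python3", "perl", "codan.py", "blastp", "signalp"]
OPTIONAL_TOOLS = ["hmmsearch", "diamond"]


def find_missing(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be located on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    for label, tools in (("required", REQUIRED_TOOLS), ("optional", OPTIONAL_TOOLS)):
        missing = find_missing(tools)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label} tools -> {status}")
    # Biopython is imported as the `Bio` package.
    print("Biopython importable:", importlib.util.find_spec("Bio") is not None)
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.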
:warning: If the user wants to install ToxCodAn and all dependencies using a Conda environment, follow the steps below:
- Create the environment:
conda create -n toxcodan_env -c bioconda python=3.6 biopython=1.69 codan blast hmmer
- Git clone the ToxCodAn repository and add it to your PATH:
git clone https://github.com/pedronachtigall/ToxCodAn.git
export PATH=$PATH:path/to/ToxCodAn/bin/
- Download SignalP-4.1, decompress it and add it to your PATH:
tar -xzf signalp-4.1g.Linux.tar.gz
export PATH=$PATH:path/to/signalp-4.1/
  - Change line 13 of the "signalp" script (path/to/signalp-4.1/signalp) to:
  $ENV{SIGNALP} = 'path/to/signalp-4.1/';
- You may need to grant execution permission to all executables in "CodAn/bin" and "ToxCodAn/bin/":
chmod 777 path/to/ToxCodAn/bin/*
- Then, run ToxCodAn as described in the "Usage" section.
- To activate the environment to run ToxCodAn, use the command:
conda activate toxcodan_env
- To deactivate the environment, use the command:
conda deactivate
:warning:Tip:warning: Ensure that you have added all conda channels properly:
conda config --add channels defaults && conda config --add channels bioconda && conda config --add channels conda-forge
Models
The model folder contains specific gHMM models and the toxinDB used in the ToxCodAn pipeline.
Download the models.zip file, uncompress it (unzip models.zip) and pass the resulting folder to the -m option of the ToxCodAn command line (-m path/to/models/).
Usage
Usage: toxcodan.py [options]

Options:
  -h, --help            show this help message and exit
  -s string, --sample=string
                        Optional - sample ID to be used in the output files
                        [default=toxcodan]
  -t fasta, --transcripts=fasta
                        Mandatory - transcripts in FASTA format,
                        /path/to/transcripts.fasta
  -o folder, --output=folder
                        Optional - output folder, /path/to/output_folder; if
                        not defined, the output folder will be set in the
                        current directory [ToxCodAn_output]
  -m path, --model=path
                        Mandatory - path to model folder, /path/to/models
  -p boolean value, --signalp=boolean value
                        Optional - turn on/off the signalP filtering step, use
                        True to turn on or False to turn off [default=True]
  -P boolean value, --partial=boolean value
                        Optional - turn on/off the partial filtering step, use
                        True to turn on or False to turn off [default=False]
  -n path, --nontoxinannotation=path
                        Optional - path to folder containing the protein DB
                        and CodAn model to be used in the NonToxin Annotation
                        pipeline [default=None]
  -c int, --cpu=int     Optional - number of threads to be used in each step
                        [default=1]
  -f int, --covprefilter=int
                        Optional - threshold value used as the minimum
                        coverage in the pre-filter step [default=90]
  -F int, --covtoxinfilter=int
                        Optional - threshold value used as the minimum
                        coverage in the toxin filter step [default=80]
Basic usage:
toxcodan.py -t transcripts.fa -m path/to/models
Check our tutorial to learn how to use ToxCodAn.
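When ToxCodAn is run on several samples from a script, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only the options listed in the help text above. The helper name `build_toxcodan_command` and the sample ID `snake1` are hypothetical and are not part of ToxCodAn.

```python
import subprocess


def build_toxcodan_command(transcripts, models, sample=None, output=None,
                           cpu=None, signalp=None, partial=None):
    """Assemble a toxcodan.py argument list from the documented options."""
    # -t and -m are the two mandatory options.
    cmd = ["toxcodan.py", "-t", transcripts, "-m", models]
    if sample is not None:
        cmd += ["-s", sample]
    if output is not None:
        cmd += ["-o", output]
    if cpu is not None:
        cmd += ["-c", str(cpu)]
    if signalp is not None:
        cmd += ["-p", str(bool(signalp))]  # the help text expects True or False
    if partial is not None:
        cmd += ["-P", str(bool(partial))]
    return cmd


if __name__ == "__main__":
    cmd = build_toxcodan_command("transcripts.fa", "path/to/models",
                                 sample="snake1", cpu=4)
    print(" ".join(cmd))
    # Uncomment to launch the run once toxcodan.py is on your PATH:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.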
Inputs
ToxCodAn has the following mandatory inputs:
- Transcripts in FASTA format, through the -t option.
- The uncompressed models folder, through the -m option.
Outputs
ToxCodAn outputs the following files:
SampleID_Toxins_cds.fasta
SampleID_Toxins_pep.fasta
SampleID_Toxins_annotation.gtf
SampleID_Toxins_contigs.fasta
SampleID_PutativeToxins_cds.fasta
SampleID_PutativeToxins_contigs.fasta
SampleID_NonToxins_contigs.fasta
SampleID_Toxins_cds_SPfiltered.fasta (optional step)
SampleID_Toxins_pep_SPfiltered.fasta (optional step)
SampleID_Toxins_contigs_SPfiltered.fasta (optional step)
SampleID_Toxins_cds_SPfiltered_RedundancyFiltered.fasta (optional step)
signalp_annotation.gff (optional step)
RemoveRedundancy.log
Description of the output files:
cds -> coding sequence of the predicted toxins
pep -> protein sequence of the predicted toxins
contigs -> whole contigs containing the predicted CDSs
Toxins -> sequences with very high probability of being toxins
PutativeToxins -> sequences with medium/high probability of being toxins
NonToxins -> sequences with very low probability of being toxins
RedundancyFiltered -> CDSs with 100% identity filtered
SPfiltered -> signalP filtered sequences (optional step)
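To check a finished run against the file list above, the sketch below builds the expected file names for a sample and reports which ones are absent from an output folder. It assumes that "SampleID" in the names is replaced by the value given to -s and that all files sit directly in the output folder. Both are assumptions worth verifying against your own run. The helper names are hypothetical.

```python
from pathlib import Path

# File name templates copied from the output list; {sid} stands for the sample ID.
CORE_OUTPUTS = [
    "{sid}_Toxins_cds.fasta",
    "{sid}_Toxins_pep.fasta",
    "{sid}_Toxins_annotation.gtf",
    "{sid}_Toxins_contigs.fasta",
    "{sid}_PutativeToxins_cds.fasta",
    "{sid}_PutativeToxins_contigs.fasta",
    "{sid}_NonToxins_contigs.fasta",
    "RemoveRedundancy.log",
]
# Files marked "(optional step)" in the list.
OPTIONAL_OUTPUTS = [
    "{sid}_Toxins_cds_SPfiltered.fasta",
    "{sid}_Toxins_pep_SPfiltered.fasta",
    "{sid}_Toxins_contigs_SPfiltered.fasta",
    "{sid}_Toxins_cds_SPfiltered_RedundancyFiltered.fasta",
    "signalp_annotation.gff",
]


def expected_outputs(sample_id="toxcodan", include_optional=True):
    """Return the expected output file names for one sample."""
    templates = CORE_OUTPUTS + (OPTIONAL_OUTPUTS if include_optional else [])
    return [template.format(sid=sample_id) for template in templates]


def missing_outputs(folder, sample_id="toxcodan", include_optional=False):
    """Return the expected files that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected_outputs(sample_id, include_optional)
            if not (folder / name).is_file()]


if __name__ == "__main__":
    for name in missing_outputs("ToxCodAn_output"):
        print("missing:", name)
```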
Annotation of Non Toxin transcripts
The user can take advantage of a simple script, NonToxinAnnotation.py, designed to annotate Non Toxin transcripts. Follow the steps below:
- First, perform the CDS prediction with the "VERT_full" model using CodAn (reference Nachtigall et al. (2020)):
codan.py -t path/to/NonToxins_contigs.fasta -m path/to/VERT_full/ -o path/to/output/NonToxins_codan/ -c N
  - We have a copy of the "VERT_full" model in the "non_toxin_models" folder:
  cd path/to/non_toxin_models/ and gzip -d VERT_full
- Then, use NonToxinAnnotation.py on the predicted CDSs.
  - This script performs a blast search (mandatory) and an hmm search using BUSCO and Pfam models (optional).
  - A protein DB, either pre-compiled or built with makeblastdb, can be set with the -d option.
    - The user can use a DB such as Swissprot and/or the designed protein DB available in the "non_toxin_models" folder (just uncompress the DB: tar xjf pepDB.tar.bz2).
    - The user can set one or more DBs (from 1 to N), separated by commas ",".
  - Optionally, the user can set any of the BUSCO models to perform the hmm search by using the option -b.
  - Optionally, the user can set the Pfam models to perform the hmm search by using the option -p (download link for pfam.hmm: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz; decompress the model with gunzip Pfam-A.hmm.gz).
    - Please note that you may need to build the auxiliary files for the Pfam models before first use: hmmpress Pfam-A.hmm
  - This script takes advantage of multithreading through the option -c.
  - Usage:
NonToxinAnnotation.py -t path/to/output/NonToxins_codan/ORF_sequences.fasta -d path/to/db1,...,path/to/dbN -b path/to/busco/odb -p path/to/pfam.hmm -c N
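The comma-separated -d value is easy to get wrong when several databases are involved, so the sketch below builds the NonToxinAnnotation.py command from a Python list of DB paths. It relies only on the options described above. The helper name and the example paths are hypothetical. Rejecting an empty DB list is an assumption based on the blast search being described as mandatory.

```python
def build_nontoxin_command(orfs, dbs, busco=None, pfam=None, cpu=None):
    """Assemble a NonToxinAnnotation.py argument list."""
    if not dbs:
        # Assumption: the mandatory blast search needs at least one protein DB.
        raise ValueError("at least one protein DB is required")
    # Multiple DBs are passed to -d as a single comma-separated value.
    cmd = ["NonToxinAnnotation.py", "-t", orfs, "-d", ",".join(dbs)]
    if busco is not None:
        cmd += ["-b", busco]   # optional hmm search with BUSCO models
    if pfam is not None:
        cmd += ["-p", pfam]    # optional hmm search with Pfam models
    if cpu is not None:
        cmd += ["-c", str(cpu)]
    return cmd


if __name__ == "__main__":
    cmd = build_nontoxin_command(
        "path/to/output/NonToxins_codan/ORF_sequences.fasta",
        ["path/to/swissprot", "path/to/pepDB"],
        pfam="path/to/pfam.hmm",
        cpu=4,
    )
    print(" ".join(cmd))
```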
:warning: [Attention 1] If the user wants to speed up the process by using the DIAMOND tool in the NonToxin annotation, follow the steps below:
- The diamond tool can be installed through the command: conda install -c bioconda diamond
- Build the diamond DB from a set of protein sequences: diamond makedb --in proteins.fasta -d diamondDB
- Then, use NonToxinAnnotation.py on the predicted CDSs by setting the option -s diamond and the diamond DB in the -d option:
NonToxinAnnotation.py -s diamond -t path/to/output/NonToxins_codan/ORF_sequences.fasta -d path/to/diamondDB -c N
  - Keep the -b path/to/busco/odb -p path/to/pfam.hmm options to perform the hmm search using BUSCO and Pfam models as described above.
:warning: [Attention 2] Alternatively, if the user wants to directly perform the NonToxins annota
