MultiPhATE

Throughput PhATE processing of draft or finished phage genomes

multiPhATE v.1.0

/MultiPhate/ - multiPhATE

This code was developed by Carol L. Ecale Zhou and Jeffrey Kimbrel at Lawrence Livermore National Laboratory.

THIS CODE IS COVERED BY THE BSD LICENSE. SEE INCLUDED FILE BSD-3.pdf FOR DETAILS.

ANNOUNCEMENTS

This code will soon be deprecated, as the new version (multiPhATE2) is now released. You are encouraged to use the code at https://github.com/carolzhou/multiPhATE2/. The multiPhATE2 code has additional features not found in multiPhATE version 1.0.

The getDBs.py script (see below) is provided as a convenience for users to download and format databases used by multiPhATE. However, this script can break when the source URLs change. Therefore, it might be necessary to exert some manual effort in setting up the data for running multiPhATE. Thank you for your patience, and thank you for your interest in using multiPhATE. The codes and scripts at multiPhATE2 are being updated regularly. It is recommended that you now use the new version (multiPhATE2).

ABOUT THE MULTI-PHATE PIPELINE DRIVER

MultiPhATE is a throughput version of PhATE, which is described below. The multiPhate.py code is a command-line program that takes a single argument, a configuration file (hereafter referred to as "multiPhate.config"; use sample.multiPhate.config as a starting point), and uses it to generate a suitably named phate.config file for each genome being annotated. Then, multiPhate.py invokes the PhATE pipeline (via phate_runPipeline.py) for each genome.
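The driver pattern described above (one config per genome, then one pipeline run per config) can be sketched roughly as follows. This is only an illustration, not the code in multiPhate.py: the function name plan_runs, the per-genome config naming scheme, the placeholder config content, and the assumption that phate_runPipeline.py takes the config path as its argument are all hypothetical.

```python
from pathlib import Path


def plan_runs(genome_names, work_dir):
    """Write one placeholder config per genome; return the commands that would be run.

    Hypothetical sketch: the real phate.config format and the real
    phate_runPipeline.py command line are not reproduced here.
    """
    commands = []
    for name in genome_names:
        # Assumed naming scheme for the per-genome config file.
        config_path = Path(work_dir) / f"{name}_phate.config"
        # Placeholder content only; not the actual config syntax.
        config_path.write_text(f"genome_name='{name}'\n")
        # Assumed invocation: the pipeline script followed by the config path.
        commands.append(["python3", "phate_runPipeline.py", str(config_path)])
    return commands
```

Each returned command could then be handed to subprocess.run() in turn, so that a failure on one genome can be reported without hiding which genome caused it.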

ABOUT THE PHATE PIPELINE

PhATE is a fully automated computational pipeline for identifying and annotating phage genes in genome sequence. PhATE is written in Python 3.7, and runs on Linux and Mac operating systems. Code execution is controlled by a configuration file, which can be tailored to run specific gene finders and to blast sequences against specific phage- and virus-centric data sets, in addition to more generic (genome, protein) data sets. PhATE runs at least one gene finding algorithm, then annotates the genome, gene, and protein sequences using blast and a set of fasta sequence databases, and uses an hmm search against the pVOG database. If more than one gene finder is run, PhATE will also provide a side-by-side comparison of the genes called by each gene caller. The user specifies the preferred gene caller, and the genes and proteins predicted by that caller are annotated using blast against the supporting databases. Classification of each protein sequence into a pVOG group is followed by generation of an alignment-ready fasta file. By convention, genome sequence files end with the extension ".fasta", gene nucleotide fasta files end with ".fnt", and cds amino-acid fasta files end with ".faa".

HOW TO SET UP MULTI-PHATE ON YOUR LOCAL MACHINE

First, create a working directory on your computer for running multiPhATE. Then, acquire the multiPhATE package from github. This can be done either by downloading a zip file directly from the multiPhATE repository, or by cloning the repository. The first method is recommended, but the second is certainly an option:

*) To download the zip file: Use a browser and navigate to https://github.com/carolzhou/multiPhATE. Press the green button "Clone or download", and download the zip file. Then, unzip the package in your working (main execution "multiPhate") directory.

$ cd myMultiphateDir

$ unzip multiPhate-master.zip

*) To clone from github: Acquire git from https://git-scm.com/downloads. Navigate to your working (main execution "multiPhATE") directory, and clone multiPhATE from the command line:

$ git init

$ git clone https://github.com/carolzhou/multiPhATE

(Complete instructions for using git and github can be found at http://help.github.com.)

Now, be sure that multiPhate.py and phate_runPipeline.py and associated files and directories are in your main execution "multiPhATE" directory. Check that the two subdirectories: PipelineInput/ and PipelineOutput/ are present (should already exist in the downloaded distribution). Place your phage genome fasta files (genome1.fasta, genome2.fasta, etc.) into the PipelineInput/ subdirectory. Place your configuration file (ie, your copy of sample.multiPhate.config) in the main execution directory (same level as multiPhate.py). A word of caution here: it is always best to name your files and fasta contigs as strings lacking any spaces or special characters, as third-party codes over which we have no control may balk when encountering odd characters or spaces.
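As a pre-flight check for the naming caution above, a small script along these lines can flag file names and fasta contig headers that contain spaces or special characters. It is a sketch under assumptions: the set of "safe" characters (letters, digits, dot, underscore, hyphen) and the function names are choices made for this example, not rules taken from multiPhATE.

```python
import re

# Assumed definition of a "safe" name: letters, digits, dot, underscore, hyphen.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_name(name):
    """Return True if the whole name consists only of the assumed safe characters."""
    return SAFE_NAME.fullmatch(name) is not None


def unsafe_fasta_headers(lines):
    """Return the fasta header strings (without '>') that are not safe names."""
    bad = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if not is_safe_name(header):
                bad.append(header)
    return bad
```

For example, is_safe_name("genome1.fasta") passes, while a name such as "my genome.fasta" is flagged because of the space.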

You will need to acquire one or more of the databases listed below under SUPPORTING DATABASES (Phantome and pVOGs are included in the multiPhATE distribution, so it is possible to begin with just those), and the 3rd party codes listed under SUPPORTING 3rd PARTY CODES. You will need to acquire at least one of the supported gene finders, but it is recommended to run as many of the four gene finders as is feasible so that the results can be more meaningfully compared. You will need to specify the locations of the supporting data sets and codes in the multiPhATE config file (see multiPhate.config), and you will need to place your genome file(s) in the PipelineInput/ subdirectory. Once you have acquired the third-party codes and databases, you will be ready to configure the multiPhate.config file.

HOW TO WRITE A CONFIGURATION FILE

Summary: Availability and locations of supporting databases and codes are to be specified in a configuration file. A sample configuration file is provided, called "sample.multiPhate.config". Make a copy of this file and rename it accordingly (e.g., myGenomeSet_multiPhate.config). Hereafter we refer to this file as multiPhate.config. The multiPhate.config file is configured according to established default parameters (just about everything turned off initially). Any of the parameters may be modified (switches turned on or off) by assigning 'true' or 'false'. It is suggested that you turn switches off, then install each supporting gene finder and database in turn and test the pipeline.
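To make the 'true'/'false' switch convention concrete, the sketch below reads settings written as name='value' and converts a switch value to a boolean. The name='value' line syntax, the treatment of lines starting with "#" as comments, and the function names are assumptions made for this example; consult sample.multiPhate.config for the actual layout.

```python
def read_settings(lines):
    """Collect name='value' pairs from config-like lines into a dict of strings.

    Assumed syntax: one setting per line; blank lines and lines starting
    with '#' are skipped.
    """
    settings = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes from the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def parse_switch(value):
    """Convert a 'true'/'false' string (quotes optional) to a bool."""
    text = value.strip().strip("'\"").lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {value!r}")
```

Raising an error on anything other than 'true' or 'false' is a deliberate choice here: a misspelled switch is easier to fix when it fails loudly than when it is silently treated as off.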

Procedure:

  1. At the command line, make a copy of the file, sample.multiPhate.config, and name it appropriately (hereafter referred to as 'multiPhate.config'): $ cp sample.multiPhate.config multiPhate.config. Then, edit your config file as described below.

  2. List of Genomes: For each genome to be processed, provide six lines under "Genome List:" and before "END of list": for each genome, you need to list the genome number, the name of the genome fasta file, the genome type (typically 'phage', but could be 'bacteria'), the species, if known (no spaces), the name of the genome, and a name for the output directory to hold this genome's output files (again, no spaces), in that order. You can simply copy/paste the six lines provided as many times as needed, and fill in the information appropriate for each genome.

  3. Processing Information: You may configure the pipeline to perform gene finding only, or gene finding plus functional annotation. For example, you may want to examine the results of multiple gene finders before going forward with functional annotation. In order to configure phate to run gene finding only, set translate_only to 'true'; in this way, only gene-calling and translation (to peptide sequence) will be performed. If you set translate_only to 'false', then the pipeline will not stop at the translation step, but will proceed with functional annotation of the predicted genes (ie, blast and/or hmm). Normally the genetic_code should be set to '11', for prokaryotic.

  4. Gene Callers: The gene_caller option specifies which gene caller's results (ie, gene calls) will be used for subsequent functional annotation. The choices are: 'phanotate', 'genemarks', 'prodigal', or 'glimmer'. To run a gene caller, you must have acquired that third-party code and installed it locally for use with multiPhATE. For each gene caller you wish to have run, set that caller's parameter to 'true'. In the usual case, you will want to specify gene_caller='phanotate' for annotation of phage genomes.

  5. Annotation: Set to 'true' each blast or hmm process that you want to be run. Note that you must have acquired the associated database, and in the next section (Databases) you must configure the location of each database. You may also set the desired blast parameters. The blast_identity sets the minimum identity that will be considered; any blast result below that threshold will be ignored. The hit_count parameters will determine how many top hits will be reported. Currently the only hmm_program that is supported by multiPhate is 'jackhmmer', and it is only run with the pVOGs database (future releases of multiPhate are expected to support additional hmm analyses).

  6. Databases: For each database that you have in-house, specify the full path/filename. Note that you may need to prepare in advance all blast databases by running the "makeblastdb" utility (see instructions with blast+ code for how to do that). MultiPhate will only run with blast+; it does not support legacy blast. For instructions where to download the databases, see the SUPPORTING DATABASES section below. Note that KEGG is available by license. Note also that in some cases additional files are required. In this case, place the additional file(s) in the same directory as the associated blast database. For example, place the NCBI accession2taxid file in the same directory as your NCBI virus genome file (see below). If you are downloading datasets that you anticipate using specifically with multiPhATE, then it is suggested, for convenience, that you save them in the Databases/ folder in the multiPhATE distribution, but any database can be located anywhere on your local system; you need only indicate in the multiPhate.config file the full path/filename for each database. Remember, the pVOGs and Phantome data sets are included in the multiPhATE distribution in the Databases/ folder, but you will need to run makeblastdb to render the
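The effect of the blast_identity and hit_count parameters described in step 5 can be illustrated with a short sketch. It is not multiPhATE's implementation: the hit records are plain dictionaries invented for the example, and ranking the surviving hits by identity is an assumption (a real pipeline might rank by e-value or bit score instead).

```python
def select_hits(hits, blast_identity, hit_count):
    """Drop hits below the identity threshold, then keep the top hit_count.

    hits: list of dicts with a numeric "identity" key (hypothetical record shape).
    Ranking by identity is an assumption made for this illustration.
    """
    kept = [hit for hit in hits if hit["identity"] >= blast_identity]
    kept.sort(key=lambda hit: hit["identity"], reverse=True)
    return kept[:hit_count]


example_hits = [
    {"subject": "a", "identity": 95.0},
    {"subject": "b", "identity": 40.0},
    {"subject": "c", "identity": 72.5},
    {"subject": "d", "identity": 88.0},
]
top = select_hits(example_hits, blast_identity=60, hit_count=2)
```

With a threshold of 60 and a hit count of 2, hit "b" is ignored for falling below the threshold, and of the remaining three only the two highest-identity hits ("a" and "d") are reported.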
