inphared.pl

Providing up-to-date bacteriophage genome databases, metrics and useful input files for a number of bioinformatic pipelines including vConTACT2 and MASH. The aim is to produce a useful starting point for viral genomics and meta-omics.

Citation:

If you find our database useful, please see our paper in PHAGE:

Cook R, Brown N, Redgwell T, Rihtman B, Barnes M, Clokie M, Stekel DJ, Hobman JL, Jones MA, Millard A. INfrastructure for a PHAge REference Database: Identification of Large-Scale Biases in the Current Collection of Cultured Phage Genomes. PHAGE. 2021. Available from: http://doi.org/10.1089/phage.2021.0007.

Shortcuts:

Let me skip running the script and just give me this month's data!
Description
Updates
Dependencies
Usage
Output Files
Supplementing and Annotating vConTACT2 Clusters
Annotating Phylogenetic Trees in IToL
Rapid Genome Comparisons using MASH
get_closest_relatives.pl
Contact

Let me skip running the script and just give me this month's data!

14Apr2025_data_excluding_refseq.tsv
14Apr2025_data.tsv
14Apr2025_genomes.db
14Apr2025_genomes_excluding_refseq.fa
14Apr2025_genomes.fa
14Apr2025_genomes.fa.msh
14Apr2025_itol_family_annotations.txt
14Apr2025_itol_genus_annotations.txt
14Apr2025_itol_host_annotations.txt
14Apr2025_itol_length_annotations.txt
14Apr2025_itol_lowest_taxa_annotations.txt
14Apr2025_itol_node_label_annotations.txt
14Apr2025_itol_subfamily_annotations.txt
14Apr2025_phages_downloaded_from_genbank.gb
14Apr2025_refseq_genomes.fa
14Apr2025_vConTACT2_family_annotations.tsv
14Apr2025_vConTACT2_gene_to_genome.csv
14Apr2025_vConTACT2_genus_annotations.tsv
14Apr2025_vConTACT2_host_annotations.tsv
14Apr2025_vConTACT2_lowest_taxa_annotations.tsv
14Apr2025_vConTACT2_proteins.faa
14Apr2025_vConTACT2_subfamily_annotations.tsv
PHROGs HMMs for consistent annotation of genomes (see the new --PHROG optional flag)
GenomesDB directory (please note that this is not updated each month; it is provided as a time-saver if you run the script yourself. This version is from 13/Dec/2021)

Description

inphared.pl (INfrastructure for a PHAge REference Database) is a perl script which downloads and filters phage genomes from Genbank to provide the most complete phage genome database possible.

Useful information, including viral taxonomy and bacterial host data, is extracted from the Genbank files and provided in a summary table. Genes are called on the genomes using Prokka, and this output is used both to gather the metrics summarised in the output files and to generate input files for vConTACT2.

Updates

v1.7 (03-Mar-2022):

Added an additional column to the .tsv files that captures any tags annotated as "host" or "lab_host", which retrieves the bacterial host at species level for a number of genomes. However, many of these values are inconsistent, or relate to the environmental sample (e.g. wastewater) rather than to the isolation host.
When searching for MASH, the script now automatically looks for either mash or mash.2, which should make it easier to find MASH on most users' systems.

v1.6 (02-Feb-2022):

Added lines to correct spelling of certain hosts in table output file (e.g. Klebsiella, where original Genbank record has Kelbsiella).

v1.5 (15-Dec-2021):

Added an optional flag to annotate genomes with HMMs produced from the PHROGs database. Download the HMMs for yourself HERE. Read about what we did HERE.

v1.4 (11-Nov-2021):

tsv files now include expanded taxa fields including genus, sub-family, family, order, class, phylum, kingdom and realm.

v1.3 (02-Aug-2021):

tsv files now include realm, Baltimore group, a warning flag for genomes with <50% coding capacity (may be issues with the assembly), and the Genbank designation (i.e. PHG, ENV)
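As an illustration of the <50% coding capacity warning, the sketch below computes the fraction of a genome covered by coding sequences and flags genomes under the threshold. How inphared.pl itself derives this value is not described here, so the interval-merging approach, the function names and the coordinate convention (1-based, inclusive) are all assumptions.

```python
def coding_fraction(genome_length, cds_intervals):
    """Fraction of genome positions covered by at least one CDS.

    cds_intervals holds 1-based inclusive (start, end) pairs. Overlapping
    or adjacent genes are merged so shared positions are counted once.
    This is an assumed definition, not taken from inphared.pl.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(cds_intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / genome_length


def low_coding_capacity(genome_length, cds_intervals, threshold=0.5):
    """True when less than `threshold` of the genome is coding."""
    return coding_fraction(genome_length, cds_intervals) < threshold
```

A genome flagged this way is not necessarily wrong; as the note above says, it may simply point to an assembly worth checking.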

v1.2 (18-Feb-2021):

Output files now written to directory, name of which can be specified (see usage).
List of excluded genomes now a separate file which can be edited and specified as a commandline argument (see usage).

v1.1 (09-Feb-2021):

Improved host data, particularly for Cyanophages.
Fixed an issue where some Prokka versions output .gbf and others output .gbk; both are now read by this script.
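For anyone post-processing Prokka output themselves, the same tolerance for both extensions can be sketched as follows. This is not the code used in inphared.pl; the function names are hypothetical, and only the two extensions come from the note above.

```python
import os

# Extensions named in the v1.1 note; which one Prokka writes depends on its version.
GENBANK_SUFFIXES = (".gbk", ".gbf")


def genbank_outputs(filenames):
    """Return the names that look like Prokka GenBank output, sorted."""
    return sorted(name for name in filenames if name.endswith(GENBANK_SUFFIXES))


def genbank_outputs_in(directory):
    """Same filter applied to the contents of a directory on disk."""
    return [
        os.path.join(directory, name)
        for name in genbank_outputs(os.listdir(directory))
    ]
```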

Dependencies

inphared.pl is a Perl script which makes calls to commandline utilities which must be installed and available in the PATH for the script to run. If it doesn't find one of these, it will print which dependency could not be found.

Prokka (available HERE)
MASH (available HERE)
efetch, esearch and efilter (available together as part of Entrez Direct: E-utilities HERE)
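Before a long run, it can be useful to confirm that these tools are visible in the PATH. The sketch below is a stand-alone check, not part of inphared.pl: the function name is hypothetical, and the executable names are taken from the list above, with MASH accepted as either mash or mash.2 as described in the v1.7 update.

```python
import shutil

# Each tuple lists acceptable executable names for one dependency.
REQUIRED_TOOLS = [
    ("prokka",),
    ("mash", "mash.2"),
    ("efetch",),
    ("esearch",),
    ("efilter",),
]


def missing_dependencies(which=shutil.which):
    """Return the dependencies for which no executable was found in PATH.

    `which` is injectable so the check can be exercised without the
    tools being installed.
    """
    missing = []
    for alternatives in REQUIRED_TOOLS:
        if not any(which(name) for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing


if __name__ == "__main__":
    for tool in missing_dependencies():
        print(f"Dependency not found in PATH: {tool}")
```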

Usage

Note before running for the first time: on first use, it will take a long time to call genes on all of the genomes. This time can be reduced by downloading the existing GenomesDB/ directory from HERE. Download and unpack this tar archive in the directory in which you wish to run inphared.pl, so that GenomesDB is a sub-directory of the desired working directory.
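The unpacking step can be sketched as below. The archive file name is not given here, so it is left as a parameter. The sketch also assumes that the archive contains GenomesDB/ at its top level, and the function name is hypothetical.

```python
import os
import tarfile


def unpack_genomes_db(archive_path, work_dir="."):
    """Extract the archive into work_dir and return the GenomesDB path."""
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
    with tarfile.open(archive_path, mode="r:*") as archive:
        # Use the safer "data" extraction filter where this Python supports it.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(work_dir, **extra)
    genomes_db = os.path.join(work_dir, "GenomesDB")
    if not os.path.isdir(genomes_db):
        raise FileNotFoundError(f"{genomes_db} not found after extraction")
    return genomes_db
```

The final check simply confirms that GenomesDB ended up as a sub-directory of the working directory, which is the layout the note above asks for.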

To run this script, use inphared.pl with the following command:

perl inphared.pl [options]

--exclusion <exclusion_list.txt> (-e): This flag allows the user to specify the location of a pipe-delimited file of accessions to be excluded from the analysis. We provide the file exclusion_list.txt which is continually updated but can be edited by the user. We recommend using this flag. If you find any incomplete genomes, please report these in the erroneous genomes discussion page.
--cpus <8> (-c): This flag allows users to specify the number of CPUs to be used in the Prokka step. This is a numeric argument and the default number is 8.
--outdir <directory> (-o): This flag allows users to specify the name of the output directory. If it doesn't already exist, the script will produce it. The default is inphared_date.
--help (-h): This flag will print a help menu to the screen without performing any analyses.
--PHROG (-P): This optional flag allows users to specify the path to HMMs made from the PHROGs database, for consistent annotation of genomes (download the HMMs for yourself HERE and read about them HERE).
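When the script is driven from another program, the options above can be assembled into an argument list. The following is a minimal sketch that uses only the long flags listed here. The function names and the example values are hypothetical, and it assumes inphared.pl sits in the current working directory.

```python
import subprocess


def build_inphared_command(exclusion=None, cpus=None, outdir=None, phrog=None):
    """Assemble the argument list for a call to inphared.pl."""
    command = ["perl", "inphared.pl"]
    if exclusion is not None:
        command += ["--exclusion", exclusion]
    if cpus is not None:
        command += ["--cpus", str(cpus)]
    if outdir is not None:
        command += ["--outdir", outdir]
    if phrog is not None:
        command += ["--PHROG", phrog]
    return command


def run_inphared(**options):
    """Run inphared.pl; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_inphared_command(**options), check=True)
```

For example, `build_inphared_command(exclusion="exclusion_list.txt", cpus=4, outdir="my_run")` yields the equivalent of `perl inphared.pl --exclusion exclusion_list.txt --cpus 4 --outdir my_run`. Any option left unset falls back to the script's own default.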

Output Files

Output files will be written to a new directory named inphared_date unless a different name is specified. All output files will have the date of usage as a prefix. The summary of output files below uses 15th January 2021 as an example (although this prefix will obviously change).
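To locate the files from a given run in a downstream script, the date prefix can be reconstructed as sketched below. The format is inferred from the examples in this README (15Jan2021, 14Apr2025). Zero-padding of single-digit days is an assumption, and the function names are hypothetical.

```python
from datetime import date

# Fixed English abbreviations, so the result does not depend on the locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def date_prefix(day):
    """Format a date like the prefixes in this README, e.g. 15Jan2021."""
    return f"{day.day:02d}{MONTHS[day.month - 1]}{day.year}"


def output_name(day, suffix):
    """Join the date prefix and a file suffix such as 'data.tsv'."""
    return f"{date_prefix(day)}_{suffix}"
```

For instance, `output_name(date(2025, 4, 14), "genomes.fa.msh")` gives `14Apr2025_genomes.fa.msh`, which matches the file names listed near the top of this README.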

| Output File | Description |
| --- | --- |
| 15Jan2021_phages_downloaded_from_genbank.gb | The raw Genbank files downloaded from NCBI. These are unfiltered and may contain poor or incomplete phage genomes. |
| GenomesDB/
