Inphared

Providing up-to-date phage genome databases, metrics and useful input files for a number of bioinformatic pipelines.

inphared.pl

Providing up-to-date bacteriophage genome databases, metrics and useful input files for a number of bioinformatic pipelines including vConTACT2 and MASH. The aim is to produce a useful starting point for viral genomics and meta-omics.

Citation:

If you find our database useful, please cite our paper published in PHAGE:

Cook R, Brown N, Redgwell T, Rihtman B, Barnes M, Clokie M, Stekel DJ, Hobman JL, Jones MA, Millard A. INfrastructure for a PHAge REference Database: Identification of Large-Scale Biases in the Current Collection of Cultured Phage Genomes. PHAGE. 2021. Available from: http://doi.org/10.1089/phage.2021.0007.

Shortcuts:

Let me skip running the script and just give me this month's data!

Description

inphared.pl (INfrastructure for a PHAge REference Database) is a Perl script that downloads and filters phage genomes from GenBank to provide the most complete phage genome database possible.

Useful information, including viral taxonomy and bacterial host data, is extracted from the GenBank files and provided in a summary table. Genes are called on the genomes using Prokka; the Prokka output is used to compute the metrics summarised in the output files and to generate input files for vConTACT2.

Updates

v1.7 (03-Mar-2022):

  • Added a column to the .tsv files that captures any tags annotated as "host" or "lab_host". This retrieves the bacterial host at species level for a number of genomes. However, many of these values are inconsistent, or refer to the environmental sample (e.g. wastewater) rather than the isolation host.
  • When searching for MASH, the script now automatically looks for both mash and mash.2. This should make it easier to find MASH on most users' systems.

v1.6 (02-Feb-2022):

  • Added lines to correct the spelling of certain hosts in the table output file (e.g. Klebsiella, where the original GenBank record has Kelbsiella).

v1.5 (15-Dec-2021):

  • Added an optional flag to annotate genomes with HMMs produced from the PHROGs database. Download the HMMs for yourself HERE. Read about what we did HERE.

v1.4 (11-Nov-2021):

  • .tsv files now include expanded taxa fields: genus, sub-family, family, order, class, phylum, kingdom and realm.

v1.3 (02-Aug-2021):

  • .tsv files now include realm, Baltimore group, a warning flag for genomes with <50% coding capacity (which may indicate issues with the assembly), and the GenBank designation (i.e. PHG, ENV).

v1.2 (18-Feb-2021):

  • Output files now written to directory, name of which can be specified (see usage).
  • List of excluded genomes now a separate file which can be edited and specified as a commandline argument (see usage).

v1.1 (09-Feb-2021):

  • Improved host data, particularly for Cyanophages.
  • Fixed an issue where some Prokka versions output .gbf files and others output .gbk files; both are now read by this script.

Dependencies

inphared.pl is a Perl script that calls command-line utilities, which must be installed and available in the PATH for the script to run. If it cannot find one of them, it prints the name of the missing dependency.

  • Prokka (available HERE)
  • MASH (available HERE)
  • efetch, esearch and efilter (available together as part of Entrez Direct: E-utilities HERE)
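Before a long run, it can be handy to confirm that these tools are reachable from the PATH. The following is a minimal Python sketch of such a check, not part of inphared.pl itself. It assumes the executables are named prokka, mash (or mash.2, as mentioned in the v1.7 update), efetch, esearch and efilter; the function name find_missing_tools is hypothetical.

```python
import shutil

# Executable names are assumptions based on the dependency list above.
# "mash.2" is the alternative MASH name mentioned in the v1.7 update note.
REQUIRED_TOOLS = {
    "prokka": ["prokka"],
    "mash": ["mash", "mash.2"],
    "efetch": ["efetch"],
    "esearch": ["esearch"],
    "efilter": ["efilter"],
}


def find_missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools for which no candidate executable is found on PATH."""
    missing = []
    for name, candidates in required.items():
        # A tool counts as present if any one of its candidate names resolves.
        if not any(which(candidate) for candidate in candidates):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found on PATH.")
```

The `which` parameter is only there so the lookup can be swapped out when testing; in normal use the default `shutil.which` is what searches the PATH.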

Usage

Note before running for the first time: on first use, calling genes on all of the genomes takes a long time. This time can be reduced by downloading the existing GenomesDB/ directory from HERE. Download and unpack this tar archive in the directory in which you wish to run inphared.pl, so that GenomesDB is a sub-directory of the working directory.

To run this script, use inphared.pl with the following command:

perl inphared.pl [options]

  • --exclusion <exclusion_list.txt> (-e): This flag allows the user to specify the location of a pipe-delimited file of accessions to be excluded from the analysis. We provide the file exclusion_list.txt, which is continually updated but can also be edited by the user. We recommend using this flag. If you find any incomplete genomes, please report them on the erroneous genomes discussion page.
  • --cpus <8> (-c): This flag allows users to specify the number of CPUs to be used in the Prokka step. This is a numeric argument and the default is 8.
  • --outdir <directory> (-o): This flag allows users to specify the name of the output directory. If the directory doesn't already exist, the script will create it. The default is inphared_date.
  • --help (-h): This flag will print a help menu to the screen without performing any analyses.
  • --PHROG (-P): This optional flag allows users to specify the path to HMMs made from the PHROGs database, for consistent annotation of genomes (download the HMMs for yourself HERE and read about them HERE).
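As an illustration of how the options above combine, here is a minimal Python sketch that assembles the command line for a run. The flag names come from the list above; the helper name build_inphared_command and the example values (4 CPUs, an output directory called my_run) are hypothetical.

```python
import shlex


def build_inphared_command(exclusion=None, cpus=None, outdir=None,
                           phrog_hmms=None, script="inphared.pl"):
    """Assemble an argument list for `perl inphared.pl [options]`.

    Only options that are explicitly given are added, so the script's own
    defaults (8 CPUs, inphared_date output directory) apply otherwise.
    """
    cmd = ["perl", script]
    if exclusion is not None:
        cmd += ["--exclusion", exclusion]
    if cpus is not None:
        cmd += ["--cpus", str(int(cpus))]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    if phrog_hmms is not None:
        # Path to the PHROGs HMMs; the location is up to the user.
        cmd += ["--PHROG", phrog_hmms]
    return cmd


if __name__ == "__main__":
    cmd = build_inphared_command(exclusion="exclusion_list.txt",
                                 cpus=4, outdir="my_run")
    # Print the command in a copy-pasteable, shell-quoted form.
    print(shlex.join(cmd))
    # To launch the run from Python instead:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems if a path contains spaces.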

Output Files

Output files are written to a new directory named inphared_date unless a different name is specified. Every output file name is prefixed with the date of the run. The summary of output files below uses 15th January 2021 as an example; the prefix will differ for your run.
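If you need to locate output files from a script, the date prefix can be rebuilt from the run date. The sketch below assumes the prefix always follows the pattern of the 15Jan2021 example (day, three-letter English month, four-digit year) and that single-digit days are zero-padded; the README does not confirm the padding, so treat that as a guess. The function names are hypothetical.

```python
import datetime

# English month abbreviations are spelled out so the result does not depend
# on the system locale (unlike strftime("%b")).
_MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def date_prefix(day):
    """Format a date like the example prefix in this README, e.g. 15Jan2021.

    Assumption: single-digit days are zero-padded (not confirmed by the README).
    """
    return f"{day.day:02d}{_MONTHS[day.month - 1]}{day.year}"


def genbank_download_name(day):
    """File name of the raw GenBank download for a run on the given date."""
    return f"{date_prefix(day)}_phages_downloaded_from_genbank.gb"


if __name__ == "__main__":
    print(genbank_download_name(datetime.date(2021, 1, 15)))
```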

| Output File | Description |
| --- | --- |
| 15Jan2021_phages_downloaded_from_genbank.gb | The raw GenBank files downloaded from NCBI. These are unfiltered and may contain poor or incomplete phage genomes. |
| GenomesDB/
