PhageHosts

The code used to identify the hosts of different phages. This is the complete code base used in: Robert A. Edwards, Katelyn McNair, Karoline Faust, Jeroen Raes, and Bas E. Dutilh (2015) Computational approaches to predict bacteriophage–host relationships. FEMS Microbiology Reviews doi: 10.1093/femsre/fuv048

Almost all of the code is written in Python and should work with version 2.7. Some parts of the code require matplotlib, numpy, and/or scipy. Some parts were written in Perl 5 and should run with the standard Perl libraries. The code was written on CentOS 6 machines and should run out of the box on those machines, but it will also run on other Linux distributions, Mac OS X, or Windows. Parts of the code refer to running jobs on our cluster, where we use SGE as the job scheduler. You can run those parts without the cluster, but it will probably be slower!

The data/ directory includes some of the data sets that we used. We have not uploaded all genome sequences: you can get those from RefSeq.

Websites

See also the GitHub.io page and the Edwards lab

Datasets

Phages

The phage genomes were downloaded from RefSeq and parsed to get the information in the "host" field. These data were converted to coding and protein sequences:

For example, to convert from GBFF to FNA format for all of the open reading frames:

Download viral RefSeq data in GenBank Flat File and Nucleotide Fasta. Extract both those files. Then run this command:

    python PhageHosts/code/combine_gbff_fna.py viral.1.genomic.gbff viral.1.1.genomic.fna > viral.1.cds.fna

There are 971 phages in RefSeq that have a host annotation:

    perl -ne '@a=split /\t/; print if ($a[2])' refseq.txt | grep YES$ | grep -i 'complete genome' | wc -l

Use perl PhageHosts/code/get_viral_dna.pl to split the viruses into phages and eukaryotic viruses. For this work we use only the phage data sets. It is left up to the reader to try some of these challenges on eukaryotic viral data sets.

Extracting the DNA and protein sequences

To get the phage coding sequences, we pull them out of the fasta file, and then translate them:

    for i in $(cut -f 1 phage_with_host.tsv); do
        grep -A 1 $i 2014_07_21/refseq/viral.1.cds.fna
    done > data/phage_with_host.cds.fna
    perl PhageHosts/code/translate.pl data/phage_with_host.cds.fna > data/phage_with_host.cds.faa
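
For readers who prefer to stay in Python, the extract-and-translate step above can be sketched as follows. This is a minimal illustration, not the code used in the paper: the function names are hypothetical, the standard genetic code (NCBI translation table 1) is assumed, and translate.pl may handle ambiguous bases or stop codons differently.

```python
import itertools

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_and_translate(lines, wanted_ids):
    """Keep records whose header contains a wanted ID (like grep), then translate."""
    for header, seq in read_fasta(lines):
        if any(wanted in header for wanted in wanted_ids):
            yield header, translate(seq)
```

Unlike `grep -A 1`, this sketch does not require each sequence to sit on a single line, which may matter if the FASTA file is line-wrapped.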

Bacteria

All bacterial sequences were downloaded from RefSeq:

    ncftpget ftp://ftp.ncbi.nih.gov/refseq/release/bacteria/bacteria.*.fna.gz

These were extracted and a single bacterial database was made for blastn searches

    cat bacteria.*fna > bacteria.genomic.fna

We extract the NC_ ids from the complete bacteria, above:

    grep \> bacteria.genomic.fna > ids.txt
    perl -i -npe 's/\:/\t/;s/^(.*)\|\s+/$1\|\t/' ids.txt

Then trim these down to complete genomes:

    egrep 'complete genome|complete sequence' ids.txt | grep -v plasmid | grep -v 'whole genome shotgun sequence' | grep -v NR_ > complete_genome_ids.txt

We edited this last file manually to ensure that we only have complete bacterial genomes.
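
The same header filter can be expressed in Python, which may be easier to extend when the manual clean-up turns up new patterns to exclude. This is only a sketch of the egrep/grep chain above; the function name and the example header lines are made up for illustration.

```python
import re

# Patterns mirror the egrep / grep -v chain above.
KEEP = re.compile(r"complete genome|complete sequence")
DROP = ("plasmid", "whole genome shotgun sequence", "NR_")


def is_complete_genome(header_line):
    """True if the header looks like a complete, non-plasmid genome."""
    if not KEEP.search(header_line):
        return False
    return not any(term in header_line for term in DROP)


def filter_complete(lines):
    """Yield only the header lines that pass the filter."""
    for line in lines:
        if is_complete_genome(line):
            yield line
```

For example, `filter_complete(open("ids.txt"))` would yield the lines that the shell pipeline writes to complete_genome_ids.txt, before any manual editing.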

We also downloaded the gbff files and orf files from RefSeq, and then used those to create a tbl file with both protein and CDS sequences. We created a file called protein_list that lists all of the gbff files that we extracted, and processed one file per SGE task:

    F=$(head -n $SGE_TASK_ID protein_list | tail -n 1)
    UF=$(echo $F | sed -e 's/.gz//')
    gunzip $F
    perl PhageHosts/code/genbank2flatfile.pl $UF
    gzip $UF

Based on this output we create two files, refseq_proteins.faa (with proteins) and refseq_orfs.faa (with DNA) from just the complete genomes.

    python PhageHosts/code/tbl2protdna.py .

Extracting taxonomy information

We wrote code to automatically get the taxonomy for most of the hosts from the RefSeq files, but there were a few that we could not map, so we added those manually. To wit:

	'Acinetobacter genomosp.' : '471',
	'Actinobacillus actinomycetemcomitans' : '714',
	'alpha proteobacterium' : '34025',
	'Bacillus clarkii' : '79879',
	'Brevibacterium flavum' : '92706',
	'Celeribacter sp.' : '875171',
	'Escherichia sp.' : '237777',
	'Geobacillus sp.' : '340407',
	'Gordonia rubropertincta' : '36822',
	'Iodobacter sp.' : '641420',
	'Listeria sp.' : '592375',
	'Marinomonas sp.' : '127794',
	'Methanobacterium thermoautotrophicum' : '145262',
	'methicillin-resistant Staphylococcus' : '1280',
	'Nitrincola sp.' : '459834',
	'Persicivirga sp.' : '859306',
	'Salisaeta sp.' : '1392396',
	'Sulfitobacter sp.' : '191468'

Then we generate the tsv file:

    python PhageHosts/code/phage_host_taxonomy.py > phage_host_taxonomy.tsv

We also just made two files with tuples of genome NC id and taxonomy id.

NOTE: When we score the connection between phage and host, we need to know the taxonomy ID of the host, not that of the phage, to see if there is a match. Therefore, in this file the taxonomy ID is that of the host (not the phage!)

    python PhageHosts/code/phage2taxonomy.py  > phage_host_taxid.txt

We also wrote refseq2taxonomy.py, which adds the tax id to the list of complete bacterial genomes, and then we made a list of bacteria and tax ids:

    cut -f 2,4 /lustre/usr/data/NCBI/RefSeq/bacteria/complete_genome_ids_taxid.txt | perl -pe 's/^.*\|N/N/; s/\.\d+\|//' > bacteria_taxid.txt

We trimmed out any phages that we cannot match at the species level:

    python2.7 PhageHosts/code/comparePhageToHosts.py

Then we combined those into a single file for all tax ids

    cat phage_taxid.txt bacteria_taxid.txt > data/all_host_taxid.txt

And add the taxonomy to those files:

    python PhageHosts/code/phage_host_add_taxonomy.py all_host_taxid.txt > data/all_host_taxid_taxonomy.txt

Note that in this process we deleted two phages whose hosts were not really known: APSE-1 (NC_000935), whose host was listed as Endosymbiont, and Zamilon virophage (NC_022990), whose host is Mont1 megavirus.

Resulting files

This resulted in one key file that we use in this work, data/all_host_taxid.txt, which has just two columns: a RefSeq ID (which we sometimes call NC id) and a taxonomy ID. When the RefSeq ID refers to a bacterial sequence, the taxonomy ID is that of the bacterium the sequence came from. When the RefSeq ID refers to a phage sequence, the taxonomy ID is that of the phage's host.
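
As an illustration of how this two-column file can be used, the sketch below loads it into a dictionary. It assumes the columns are separated by whitespace; the function name and the IDs in the usage example are hypothetical.

```python
def load_taxid_map(lines):
    """Map RefSeq ID -> taxonomy ID from two-column lines.

    For bacterial IDs the value is the bacterium's taxonomy ID;
    for phage IDs it is the taxonomy ID of the phage's host.
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mapping[fields[0]] = fields[1]
    return mapping
```

A typical call would be `load_taxid_map(open("data/all_host_taxid.txt"))`. A phage prediction is then correct at a given level when the phage's entry and the predicted bacterium's entry share the same taxon at that level.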

Scoring predictions

In general, we are going to print out tab separated files containing the phage ID in the first column, and the ID of any potential hosts in subsequent columns. We do not necessarily know how many columns there will be. This is a flexible format that allows us to convert those IDs to taxonomy IDs, and then use the NCBI hierarchy to move through the phylogenetic tree to score how good our matches are.

We have a few key pieces of code for this work:

NC2taxid.py is Python code that converts any RefSeq ID (NC_\d+) into its associated taxonomy ID based on the association above: the bacterial RefSeq IDs map to bacterial taxonomy IDs and the phage RefSeq IDs map to host taxonomy IDs.
scoreTaxId.py is Python code that takes a set of taxonomy IDs and scores all subsequent elements in the set against the first member of the set, using the 'species', 'genus', 'family', 'order', 'class', 'phylum', and 'superkingdom' taxonomic levels from NCBI.
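
The scoring idea can be sketched as follows. This is not the scoreTaxId.py implementation: the `lineage` dictionary is a hypothetical stand-in for a lookup into the NCBI taxonomy, and the return format (a count of matches per level) is an assumption made for illustration.

```python
LEVELS = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom']


def score_taxids(taxids, lineage):
    """Score every taxonomy ID after the first against the first one.

    taxids:  list whose first element is the known host taxonomy ID,
             followed by the predicted host taxonomy IDs.
    lineage: hypothetical dict, taxonomy ID -> {level: taxon at that level}.
    Returns a dict, level -> number of predictions matching at that level.
    """
    target = lineage.get(taxids[0], {})
    scores = {}
    for level in LEVELS:
        wanted = target.get(level)
        scores[level] = sum(
            1
            for taxid in taxids[1:]
            if wanted is not None and lineage.get(taxid, {}).get(level) == wanted
        )
    return scores
```

Because the input rows have a variable number of columns, the function takes a plain list and makes no assumption about how many predictions follow the first ID.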

1. Genetic Homology

Comparing the phages against the complete bacterial genomes (blastx)

Start by making a database of just the protein sequences from the complete genomes:

    PhageHosts/code/refseq2complete.py $HOME/phage/host_analysis/bacteria_taxid.txt  refseq_proteins.faa complete_genome_proteins.faa

and then blast those:

    PhageHosts/code/split_blast_queries_edwards_blastplus.pl -f ../../phage_with_host.fna -n 100 -p blastx -d phage.blastx -db RefSeq/bacteria/complete_genome_proteins.faa -evalue 0.01
    cat phage.blastx/*blastx > phage.complete.blastx

NOTE: There are multiple proteins with the same ID that come from different genomes; however, these are identical proteins (at the amino acid level). Therefore the blastx searches all return the same results for each protein. I use this one line of Perl code to print out unique lines from the blastx output.

    perl -ne 'print unless ($s{$_}); $s{$_}=1' phage.complete.blastx > phage.complete.unique.blastx
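
The Perl one-liner keeps the first occurrence of every line and preserves the original order. An equivalent sketch in Python (the function name is hypothetical):

```python
def unique_lines(lines):
    """Yield each distinct line once, in order of first appearance."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line
```

Like the Perl version, this holds every distinct line in memory, so very large blastx outputs may need a sort-based approach instead.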

We convert the output so we just have NC identifiers:

    python PhageHosts/code/blastx2genome.py phage.complete.unique.blastx phage.genomes.blastx

Now we just count the hosts with the greatest number of hits to each of the phages, and score those hits:

    python PhageHosts/code/mostBlastHits.py phage.genomes.blastx > most.tax
    python2.7 PhageHosts/code/scoreTaxId.py most.tax > score.tax
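
The counting step can be sketched as below. This is an illustration rather than the mostBlastHits.py code: it assumes tab-separated input with the phage ID in the first column and the bacterial genome ID in the second, and it reports all genomes tied for the most hits.

```python
from collections import Counter, defaultdict


def most_hits(lines):
    """Return phage ID -> sorted list of genome IDs with the most hits."""
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        phage, genome = fields[0], fields[1]
        counts[phage][genome] += 1

    best = {}
    for phage, genome_counts in counts.items():
        top = max(genome_counts.values())
        best[phage] = sorted(g for g, n in genome_counts.items() if n == top)
    return best
```

Writing each phage ID followed by its top genome IDs, separated by tabs, gives the variable-width format described under Scoring predictions.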

Comparing phages to the non-redundant (nr) database.

To see what would happen, we also compared the phages to all the proteins in the non-redundant database. Without going into the results in any detail, they weren't as good as using the complete genomes because there are more proteins in the nr database and thus more confusion from the similarity searches.

We used blast to compare the complete phage genomes against all bacterial proteins in the GenBank nr database, and then used the gi_taxid table of the NCBI Taxonomy Site (here is an alternate link as there is an issue linking to FTP sites from github) to get the taxonomy id of the top hits. We com
