# Sphae

Phage annotations and predictions. A spae is a prediction or foretelling: we'll foretell what your phage is doing!
Phage toolkit to detect phage candidates for phage therapy
<p align="center"> <img src="logo/sphae.png#gh-light-mode-only" width="300"> <img src="logo/sphaedark.png#gh-dark-mode-only" width="300"> </p>

## Overview
The steps that sphae takes are shown here:
<p align="center"> <img src="logo/sphae_steps.png#gh-light-mode-only" width="300"> </p>

This Snakemake workflow was built using Snaketool (https://doi.org/10.1371/journal.pcbi.1010705) to assemble and annotate phage sequences. Currently, this tool is being developed for phage genomes. The steps include:
- Quality control that removes adaptor sequences, low-quality reads, and host contamination (optional)
- Assembly
- Contig quality checks: read coverage, viral classification, completeness, and assembly graph components
- Phage genome annotation
Cite Sphae: https://doi.org/10.1093/bioadv/vbaf004
If you are new to bioinformatics or running command line tools, here is a great tutorial to follow: https://github.com/AnitaTarasenko/sphae/wiki/Sphae-tutorial
## Install

### Pip install

```bash
# create a new environment
conda create -y -n sphae python=3.13
conda activate sphae

# install sphae
pip install sphae
```
### Conda install

Set up a new conda environment:

```bash
conda create -n sphae python=3.13
conda activate sphae
```
### Container install

There are two versions of the container:

- Sphae v1.5.2: includes the databases, so the container is about 32 GB. Steps to download and run this container:

  ```bash
  TMPDIR=<where your tmpdir lives>
  IMAGEDIR=<where you want the image to live>
  singularity pull --tmpdir $TMPDIR --dir $IMAGEDIR docker://npbhavya/sphae:latest
  singularity exec sphae_latest.sif sphae --help
  singularity exec sphae_latest.sif sphae run --help
  singularity exec -B <path/to/inputfiles>:/input,<path/to/output>:/output sphae_latest.sif sphae run --input /input --output /output
  ```

- Sphae v1.5.2-noDB: this version doesn't come with databases, so the first step is to download the databases locally and save them to one directory, `<path/to/databases>`. Commands to download and run this container:

  ```bash
  TMPDIR=<where your tmpdir lives>
  IMAGEDIR=<where you want the image to live>
  singularity pull --tmpdir $TMPDIR --dir $IMAGEDIR docker://npbhavya/sphae:v1.5.2-noDB
  singularity exec sphae_v1.5.2-noDB.sif sphae --help
  singularity exec sphae_v1.5.2-noDB.sif sphae run --help
  # <path/to/databases> is sphae/workflow/databases if `sphae install` was run
  singularity exec -B <path/to/databases>:/database,<path/to/inputfiles>:/input,<path/to/output>:/output sphae_v1.5.2-noDB.sif sphae run --input /input --output /output
  ```
### Source install

```bash
# clone the sphae repository
git clone https://github.com/linsalrob/sphae.git

# move to the sphae folder
cd sphae

# install sphae
pip install -e .

# confirm the workflow is installed
sphae --help
```
### Installing databases

Run one of the commands below:

```bash
# install the databases to the default directory, sphae/workflow/databases
sphae install

# install the databases to a specific directory
sphae install --db_dir <directory>
```
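The databases require roughly 23 GB in total, so it is worth checking free space on the target filesystem before installing. A small sketch (`check_space` is a helper name for this example, not part of sphae):

```bash
# check_space DIR GB: report whether DIR's filesystem has at least GB
# gigabytes free. Sketch only; not part of sphae. The ~23 GB figure is
# the approximate total size of the sphae databases.
check_space() {
  local dir="$1" need="$2" avail
  # df -Pk prints the available space in KB in column 4 of the second line
  avail=$(df -Pk "$dir" | awk 'NR==2 {print int($4/1024/1024)}')
  if [ "$avail" -ge "$need" ]; then
    echo "OK: ${avail} GB free in $dir"
  else
    echo "LOW: only ${avail} GB free in $dir (need ~${need} GB)"
  fi
}

check_space . 23
```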
This workflow requires the following databases:

- Pfam 35.0 database, used by viral_verify for contig classification
- CheckV database, to test for phage completeness
- Pharokka databases
- Phynteny models
- Phold databases
- Medaka models
- PhageTermVirome 4.3
This step requires ~23 GB of storage. If these databases are already installed, skip this step and instead set the environment variables pointing to where the databases are installed:

```bash
# Note: change the file paths to match your database location.
# For instance, if sphae was installed using conda, the databases are saved by default to
# /home/username/miniforge3/envs/sphae/lib/python3.11/site-packages/sphae/workflow/databases
export VVDB=sphae/workflow/databases/Pfam35.0/Pfam-A.hmm.gz
export CHECKVDB=sphae/workflow/databases/checkv-db-v1.5
export PHAROKKADB=sphae/workflow/databases/pharokka_db
export PHYNTENYDB=sphae/workflow/databases/models
export PHOLDDB=sphae/workflow/databases/phold
```
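A mistyped database path is a common cause of failed runs, so it can help to confirm each variable points at a real path before launching. A sketch (`check_db_path` is not a sphae command; the variable names are the ones from the exports above):

```bash
# check_db_path VAR: report whether the named environment variable is set
# and points to an existing file or directory. Sketch only; not part of sphae.
check_db_path() {
  local var="$1" path
  eval "path=\${$var}"
  if [ -z "$path" ]; then
    echo "WARN: $var is not set"
  elif [ ! -e "$path" ]; then
    echo "WARN: $var points to a missing path: $path"
  else
    echo "OK: $var -> $path"
  fi
}

# the database variables sphae reads
for v in VVDB CHECKVDB PHAROKKADB PHYNTENYDB PHOLDDB; do
  check_db_path "$v"
done
```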
## Running the workflow

Sphae is developed to be modular:

- `sphae run` will run QC, assembly and annotation
- `sphae annotate` will run only the annotation steps
### Commands to run

Only one command needs to be submitted to run all the above steps (QC, assembly and annotation):

```bash
# For Illumina reads, place both the forward and reverse reads in one directory.
# Make sure the fastq reads are saved as {sample_name}_R1.fastq and {sample_name}_R2.fastq,
# or with extensions {sample_name}_R1.fastq.gz and {sample_name}_R2.fastq.gz
sphae run --input tests/data/illumina-subset --output example -k

# For Nanopore reads, place the reads, one file per sample, in a directory
sphae run --input tests/data/nanopore-subset --sequencing longread --output example -k

# For newer ONT sequencing data where polishing is not required, run
sphae run --input tests/data/nanopore-subset --sequencing longread --output example -k --no_medaka
```
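The paired-read naming convention is easy to get wrong, so here is a small sketch (not part of sphae; `check_pairs` is a hypothetical helper) that lists the sample names sphae will see in an input directory and flags any R1 file whose R2 mate is missing:

```bash
# check_pairs DIR: print the sample name for each {sample}_R1.fastq[.gz]
# that has a matching _R2 file, and flag any R1 file whose mate is missing.
# Sketch only; not part of sphae.
check_pairs() {
  local dir="$1" f mate sample
  for f in "$dir"/*_R1.fastq "$dir"/*_R1.fastq.gz; do
    [ -e "$f" ] || continue
    # derive the expected mate and the sample name from the R1 filename
    mate=$(echo "$f" | sed 's/_R1\./_R2./')
    sample="$(basename "$f")"
    sample="${sample%%_R1.*}"
    if [ -e "$mate" ]; then
      echo "sample: $sample"
    else
      echo "MISSING mate for $f"
    fi
  done
}
```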
To run any of these commands on a cluster, add `--profile slurm`. There is a little setup to do first: create a `~/.config/snakemake/slurm/config.yaml` file, following https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/slurm.html#advanced-resource-specifications. (The workflow is currently set up for slurm only; it will be made more generic soon.)

```bash
sphae run --input tests/data/nanopore-subset --sequencing longread --output example --profile slurm -k --threads 16
```
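A minimal `~/.config/snakemake/slurm/config.yaml` might look like the sketch below. The account and partition names are placeholders you must replace, and the exact keys should be checked against the Snakemake slurm executor plugin documentation linked above:

```yaml
# ~/.config/snakemake/slurm/config.yaml (sketch; replace the placeholder values)
executor: slurm
jobs: 50
default-resources:
  slurm_account: "<your_account>"
  slurm_partition: "<your_partition>"
```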
### Command to run only the annotation steps and phylogenetic trees

This step reruns:

- Pharokka, Phold and Phynteny
- Phylogenetic trees built from the terminase large subunit and portal protein

```bash
# the genomes directory contains the already assembled complete genomes
# run the export commands above to set the database paths
sphae annotate --genome <genomes directory> --output example -k
```
## Output

Output for `sphae run` is saved to the example/RESULTS directory. This directory contains:

- Genome annotations in GenBank format (Phynteny output)
- Genome in fasta format (either the genome reoriented to start at the terminase, from Pharokka, or the assembled viral contigs)
- Circular genome visualization in `png` format (Pharokka output)
- Genome summary file
- trees folder (note: this folder is likely meaningful only if you have tailed phages)
  - all_portal.nwk: tree using all proteins annotated as "portal protein"
  - all_terL.nwk: tree using all proteins annotated as "terminase large subunit"
- PhageTerm results saved to a directory, <sample name>_phageterm (only for paired-end sequencing)
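The `.nwk` files are plain Newick text, so the proteins included in a tree can be listed without a tree viewer. A quick sketch (assumes leaf names contain no parentheses or commas):

```bash
# newick_leaves FILE: print one leaf name per line from a Newick tree by
# splitting on the structural characters ( ) , and dropping branch lengths
# (the :0.1 suffixes) and the trailing semicolon. Sketch only.
newick_leaves() {
  tr ',()' '\n\n\n' < "$1" | sed 's/:.*//; s/;.*//' | grep -v '^$'
}
```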
The genome summary file includes the following information:

- Sample name
- Length of the genome
- Coding density
- Whether the assembled contig is circular (from CheckV)
- Completeness (calculated from CheckV)
- Contamination (calculated from CheckV)
- Taxonomy accession ID (Pharokka output; the genome is searched against the INPHARED database using mash)
- Taxa mash: the number of matching hashes between the assembled genome and the accession ID/taxa name. The higher the number of matching hashes, the more likely the genome is related to the predicted taxa.
- Gene searches:
  - Whether an integrase was found (search for integrase genes in the annotations)
  - Whether antimicrobial resistance genes were found (Phold and Pharokka search against an AMR database)
  - Whether any virulence factors were found (Pharokka search against a virulence gene database)
  - Whether any CRISPR spacers were found (Pharokka search using MinCED)
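For phage therapy screening, the integrase, AMR and virulence flags are the ones to check first. A triage sketch over summary files (the grep patterns are assumptions about the summary wording, which may differ between sphae versions; adjust them to match yours):

```bash
# flag_summary FILE: print REVIEW if the genome summary mentions an
# integrase, AMR or virulence hit, otherwise CLEAN. The patterns are
# assumptions about the summary wording; not part of sphae.
flag_summary() {
  if grep -qiE 'integrase|amr|virulence' "$1"; then
    echo "REVIEW: $1"
  else
    echo "CLEAN: $1"
  fi
}
```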
Output for `sphae annotate` is saved to the example/final-annotate directory. This directory contains:

- Genome annotations in GenBank format (Phynteny output)
- Genome in fasta format (either the genome reoriented to start at the terminase, from Pharokka, or the assembled viral contigs)
- Circular genome visualization in `png` format (Pharokka output)
- Genome summary file
- trees folder (note: this folder is likely meaningful only if you have tailed phages)
  - all_portal.nwk: tree using all proteins annotated as "portal protein"
  - all_terL.nwk: tree using all proteins annotated as "terminase large subunit"
The genome summary file includes the following information:

- Sample name
- Taxa mash: the number of matching hashes between the assembled genome and the accession ID/taxa name
