# Sphae

Phage annotations and predictions. A spae is a prediction or foretelling: we'll foretell what your phage is doing!
Phage toolkit to detect phage candidates for phage therapy
<p align="center"> <img src="logo/sphae.png#gh-light-mode-only" width="300"> <img src="logo/sphaedark.png#gh-dark-mode-only" width="300"> </p>

## Overview
The steps that sphae takes are shown here:
<p align="center"> <img src="logo/sphae_steps.png#gh-light-mode-only" width="300"> </p>

This Snakemake workflow was built using Snaketool (https://doi.org/10.1371/journal.pcbi.1010705) to assemble and annotate phage sequences. Currently, this tool is being developed for phage genomes. The steps include:
- Quality control that removes adaptor sequences, low-quality reads, and host contamination (optional)
- Assembly
- Contig quality checks: read coverage, viral classification, completeness, and assembly graph components
- Phage genome annotation
Cite Sphae: https://doi.org/10.1093/bioadv/vbaf004
If you are new to bioinformatics or running command line tools, here is a great tutorial to follow: https://github.com/AnitaTarasenko/sphae/wiki/Sphae-tutorial
## Install

### Pip install

```bash
# create a new environment
conda create -y -n sphae python=3.13
conda activate sphae

# install sphae
pip install sphae
```
### Conda install

Set up a new conda environment:

```bash
conda create -n sphae python=3.13
conda activate sphae
```
### Container install

There are two versions of the container:

- Sphae v1.5.2: includes the databases, so the container is about 32 GB. Steps to download and run this container:

  ```bash
  TMPDIR=<where your tmpdir lives>
  IMAGEDIR=<where you want the image to live>
  singularity pull --tmpdir $TMPDIR --dir $IMAGEDIR docker://npbhavya/sphae:latest
  singularity exec sphae_latest.sif sphae --help
  singularity exec sphae_latest.sif sphae run --help
  singularity exec -B <path/to/inputfiles>:/input,<path/to/output>:/output sphae_latest.sif sphae run --input /input --output /output
  ```

- Sphae v1.5.2-noDB: this version doesn't come with databases, so the first step is to download the databases locally and save them to one directory, `<path/to/databases>`. Commands to download and run this container:

  ```bash
  TMPDIR=<where your tmpdir lives>
  IMAGEDIR=<where you want the image to live>
  singularity pull --tmpdir $TMPDIR --dir $IMAGEDIR docker://npbhavya/sphae:v1.5.2-noDB
  singularity exec sphae_v1.5.2-noDB.sif sphae --help
  singularity exec sphae_v1.5.2-noDB.sif sphae run --help
  # <path/to/databases> is sphae/workflow/databases if `sphae install` was run
  singularity exec -B <path/to/databases>:/database,<path/to/inputfiles>:/input,<path/to/output>:/output sphae_v1.5.2-noDB.sif sphae run --input /input --output /output
  ```
### Source install

```bash
# clone the sphae repository
git clone https://github.com/linsalrob/sphae.git

# move to the sphae folder
cd sphae

# install sphae
pip install -e .

# confirm the workflow is installed
sphae --help
```
### Installing databases

Run one of the commands below:

```bash
# install the databases to the default directory, sphae/workflow/databases
sphae install

# install the databases to a specific directory
sphae install --db_dir <directory>
```
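The databases require roughly 23 GB in total, so it is worth checking free space on the target filesystem before installing. A small sketch (`check_space` is a helper name for this example, not part of sphae):

```bash
# check_space DIR GB: report whether DIR's filesystem has at least GB
# gigabytes free. Sketch only; not part of sphae. The ~23 GB figure is
# the approximate total size of the sphae databases.
check_space() {
  local dir="$1" need="$2" avail
  # df -Pk prints the available space in KB in column 4 of the second line
  avail=$(df -Pk "$dir" | awk 'NR==2 {print int($4/1024/1024)}')
  if [ "$avail" -ge "$need" ]; then
    echo "OK: ${avail} GB free in $dir"
  else
    echo "LOW: only ${avail} GB free in $dir (need ~${need} GB)"
  fi
}

check_space . 23
```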
This workflow requires the following databases:

- Pfam 35.0 database, used by viral_verify for contig classification
- CheckV database, to test for phage completeness
- Pharokka databases
- Phynteny models
- Phold databases
- Medaka models
- PhageTermVirome 4.3
This step requires ~23 GB of storage. If these databases are already installed, skip this step and instead set the environment variables pointing to where the databases are installed:

```bash
# Note: change the file paths to match your database location.
# For instance, if sphae was installed using conda, the databases are saved by default to
# /home/username/miniforge3/envs/sphae/lib/python3.11/site-packages/sphae/workflow/databases
export VVDB=sphae/workflow/databases/Pfam35.0/Pfam-A.hmm.gz
export CHECKVDB=sphae/workflow/databases/checkv-db-v1.5
export PHAROKKADB=sphae/workflow/databases/pharokka_db
export PHYNTENYDB=sphae/workflow/databases/models
export PHOLDDB=sphae/workflow/databases/phold
```
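A mistyped database path is a common cause of failed runs, so it can help to confirm each variable points at a real path before launching. A sketch (`check_db_path` is not a sphae command; the variable names are the ones from the exports above):

```bash
# check_db_path VAR: report whether the named environment variable is set
# and points to an existing file or directory. Sketch only; not part of sphae.
check_db_path() {
  local var="$1" path
  eval "path=\${$var}"
  if [ -z "$path" ]; then
    echo "WARN: $var is not set"
  elif [ ! -e "$path" ]; then
    echo "WARN: $var points to a missing path: $path"
  else
    echo "OK: $var -> $path"
  fi
}

# the database variables sphae reads
for v in VVDB CHECKVDB PHAROKKADB PHYNTENYDB PHOLDDB; do
  check_db_path "$v"
done
```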
## Running the workflow

Sphae is developed to be modular:

- `sphae run` will run QC, assembly and annotation
- `sphae annotate` will run only the annotation steps
### Commands to run

Only one command needs to be submitted to run all the above steps (QC, assembly and annotation):

```bash
# For Illumina reads, place both the forward and reverse reads in one directory.
# Make sure the fastq reads are saved as {sample_name}_R1.fastq and {sample_name}_R2.fastq,
# or with extensions {sample_name}_R1.fastq.gz and {sample_name}_R2.fastq.gz
sphae run --input tests/data/illumina-subset --output example -k

# For Nanopore reads, place the reads, one file per sample, in a directory
sphae run --input tests/data/nanopore-subset --sequencing longread --output example -k

# For newer ONT sequencing data where polishing is not required, run
sphae run --input tests/data/nanopore-subset --sequencing longread --output example -k --no_medaka
```
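The paired-read naming convention is easy to get wrong, so here is a small sketch (not part of sphae; `check_pairs` is a hypothetical helper) that lists the sample names sphae will see in an input directory and flags any R1 file whose R2 mate is missing:

```bash
# check_pairs DIR: print the sample name for each {sample}_R1.fastq[.gz]
# that has a matching _R2 file, and flag any R1 file whose mate is missing.
# Sketch only; not part of sphae.
check_pairs() {
  local dir="$1" f mate sample
  for f in "$dir"/*_R1.fastq "$dir"/*_R1.fastq.gz; do
    [ -e "$f" ] || continue
    # derive the expected mate and the sample name from the R1 filename
    mate=$(echo "$f" | sed 's/_R1\./_R2./')
    sample="$(basename "$f")"
    sample="${sample%%_R1.*}"
    if [ -e "$mate" ]; then
      echo "sample: $sample"
    else
      echo "MISSING mate for $f"
    fi
  done
}
```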
To run any of these commands on a cluster, add `--profile slurm`. There is a little setup to do first: create a `~/.config/snakemake/slurm/config.yaml` file, following https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/slurm.html#advanced-resource-specifications. (The workflow is currently set up for slurm only; it will be made more generic soon.)

```bash
sphae run --input tests/data/nanopore-subset --sequencing longread --output example --profile slurm -k --threads 16
```
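A minimal `~/.config/snakemake/slurm/config.yaml` might look like the sketch below. The account and partition names are placeholders you must replace, and the exact keys should be checked against the Snakemake slurm executor plugin documentation linked above:

```yaml
# ~/.config/snakemake/slurm/config.yaml (sketch; replace the placeholder values)
executor: slurm
jobs: 50
default-resources:
  slurm_account: "<your_account>"
  slurm_partition: "<your_partition>"
```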
### Command to run only the annotation steps and phylogenetic trees

This step reruns:

- Pharokka, Phold and Phynteny
- Phylogenetic trees built from the terminase large subunit and portal protein

```bash
# the genomes directory contains the already assembled complete genomes
# run the export commands above to set the database paths
sphae annotate --genome <genomes directory> --output example -k
```
## Output

Output for `sphae run` is saved to the example/RESULTS directory. This directory contains:

- Genome annotations in GenBank format (Phynteny output)
- Genome in fasta format (either the genome reoriented to start at the terminase, from Pharokka, or the assembled viral contigs)
- Circular genome visualization in `png` format (Pharokka output)
- Genome summary file
- trees folder (note: this folder is likely meaningful only if you have tailed phages)
  - all_portal.nwk: tree using all proteins annotated as "portal protein"
  - all_terL.nwk: tree using all proteins annotated as "terminase large subunit"
- PhageTerm results saved to a directory, <sample name>_phageterm (only for paired-end sequencing)
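The `.nwk` files are plain Newick text, so the proteins included in a tree can be listed without a tree viewer. A quick sketch (assumes leaf names contain no parentheses or commas):

```bash
# newick_leaves FILE: print one leaf name per line from a Newick tree by
# splitting on the structural characters ( ) , and dropping branch lengths
# (the :0.1 suffixes) and the trailing semicolon. Sketch only.
newick_leaves() {
  tr ',()' '\n\n\n' < "$1" | sed 's/:.*//; s/;.*//' | grep -v '^$'
}
```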
The genome summary file includes the following information:

- Sample name
- Length of the genome
- Coding density
- Whether the assembled contig is circular (from CheckV)
- Completeness (calculated from CheckV)
- Contamination (calculated from CheckV)
- Taxonomy accession ID (Pharokka output; the genome is searched against the INPHARED database using mash)
- Taxa mash: the number of matching hashes between the assembled genome and the accession ID/taxa name. The higher the number of matching hashes, the more likely the genome is related to the predicted taxa.
- Gene searches:
  - Whether an integrase was found (search for integrase genes in the annotations)
  - Whether antimicrobial resistance genes were found (Phold and Pharokka search against an AMR database)
  - Whether any virulence factors were found (Pharokka search against a virulence gene database)
  - Whether any CRISPR spacers were found (Pharokka search using MinCED)
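For phage therapy screening, the integrase, AMR and virulence flags are the ones to check first. A triage sketch over summary files (the grep patterns are assumptions about the summary wording, which may differ between sphae versions; adjust them to match yours):

```bash
# flag_summary FILE: print REVIEW if the genome summary mentions an
# integrase, AMR or virulence hit, otherwise CLEAN. The patterns are
# assumptions about the summary wording; not part of sphae.
flag_summary() {
  if grep -qiE 'integrase|amr|virulence' "$1"; then
    echo "REVIEW: $1"
  else
    echo "CLEAN: $1"
  fi
}
```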
Output for `sphae annotate` is saved to the example/final-annotate directory. This directory contains:

- Genome annotations in GenBank format (Phynteny output)
- Genome in fasta format (either the genome reoriented to start at the terminase, from Pharokka, or the assembled viral contigs)
- Circular genome visualization in `png` format (Pharokka output)
- Genome summary file
- trees folder (note: this folder is likely meaningful only if you have tailed phages)
  - all_portal.nwk: tree using all proteins annotated as "portal protein"
  - all_terL.nwk: tree using all proteins annotated as "terminase large subunit"
The genome summary file includes the following information:

- Sample name
- Taxa mash: the number of matching hashes between the assembled genome and the accession ID/taxa name
