PhastSim
fast sequence evolution simulation for SARS-CoV-2 and similar
Install / Use
/learn @NicolaDM/PhastSimREADME
phastSim
Fast sequence evolution simulation for SARS-CoV-2 phylogenies and other genomic epidemiological datasets.
Table of contents
Installation and dependencies
Installation
phastSim is available either through PyPi or by cloning this repository directly:
pip install phastSim
or
git clone https://github.com/NicolaDM/phastSim
After cloning the repository, local installation with pip might still be require, for example executing
pip install -e .
within the phastSim directory.
Dependencies
phastSim requires the Python packages numpy, importlib_resources, ete3, biopython, protobuf, and google. These packages are all available through both PyPi:
pip install numpy importlib-resources six ete3 biopython protobuf google
and conda:
conda install -c defaults -c etetoolkit -c conda-forge numpy importlib_resources six ete3 biopython protobuf google
Versions used for testing were: python=3.9.1, numpy=1.19.2, importlib_resources=5.1.0, ete3=3.1.2, biopython=1.78, protobuf=3.15.7 and google=3.0.0.
Usage
If installation was performed using pip, you can run phastSim using:
mkdir simulation_output
phastSim --outpath simulation_output/ --seed 7 --createFasta --createInfo \
--createNewick --createPhylip --treeFile [tree_name.newick] \
--scale 0.333333333 --invariable 0.1 --alpha 1.0 --omegaAlpha 1.0 \
--hyperMutProbs 0.01 0.01 --hyperMutRates 20.0 200.0 --codon \
--reference [ref_genome.fasta]
where --treeFile [tree_name.newick] specifies a tree file in Newick format, and --reference [ref_genome.fasta] specifies a genome sequence provided as a single-record FASTA file.
Alternatively, if you have cloned this repository, an example command to run the main phastSim script from the terminal is:
mkdir simulation_output
python bin/phastSim --outpath simulation_output/ --seed 7 --createFasta --createInfo \
--createNewick --createPhylip --treeFile phastSim/example/example_sarscov2_tree.newick \
--scale 0.333333333 --invariable 0.1 --alpha 1.0 --omegaAlpha 1.0 \
--hyperMutProbs 0.01 0.01 --hyperMutRates 20.0 200.0 --codon \
--reference phastSim/example/MN908947.3.fasta
which uses an example SARS-CoV-2 tree included here, specified by --treeFile phastSim/example/example_sarscov2_tree.newick, and a reference SARS-CoV-2 genome in FASTA format, provided here and specified by --reference phastSim/example/MN908947.3.fasta.
As input, phastSim requires a tree in Newick format (specified using --treeFile, as shown in the examples above). This script will then simulate sequence evolution along the tree using a Gillespie algorithm, and will output efficiently represented genome sequences.
This approach is more efficient than alternatives when branch lengths are short, such as in SARS-CoV-2 and other genomic epidemiological datasets.
When a large proportion of the genome is expected to mutate on each branch, however, this approach might be slower than traditional methods.
Given a large phylogeny (say, >10,000 tips) and divergence levels typical for genomic epidemiology, this approach will be faster than other methods.
Even when simulating under a codon model, whole genome, and 100,000 sequences, simulations should only take a few seconds.
For small phylogenies (<1,000 tips) or long branches (many mutation events per branch on average), the software seq-gen by Rambaut and Grassly will typically be more efficient.
PhastSim offers a broad choice of sequence evolution models, including indels (similarly to INDELible by Fletcher and Yang), non-reversible non-stationary substitution models, and codon models.
Output can be created in FASTA format (--createFasta) and/or PHYLIP format (--createPhylip); in the case of simulations with indels, the FASTA and PHYLIP formats will contain unaligned sequences.
The standard output is given as a list of modified bases w.r.t. the reference for each sample.
The history of mutation events can also be created and stored in an annotated Newick format (--createNewick).
A .info text file, describing the evolutionoary rates for each site of the genome, can also be created ('--createInfo').
Nucleotide or codon evolution models are allowed. A nuclotide model can defined by a set of 12 rates, each for a non-diagonal entry of an UNREST substitution matrix https://doi.org/10.1089/1066527041410472; this allows the inderct specification of any nucleotide substitution model, but the user can alternatively also directly select a GTR, HKY, oor JC model. By default, neutral mutation rates as inferred from SARS-CoV-2 https://doi.org/10.1101/2021.01.14.426705 are assumed.
We allow users to specify mutation rate variation in multiple ways:
- With a finite number of discrete mutation rate categories, or a continuous gamma distribution.
- Jointly with 1., a proportion of non-mutable sites can also be specified.
- Jointly with both 1. and 2., a new model of rate variation describing hypermutability, as observed for example in SARS-CoV-2, can also be specified. This model is triggered by options
--hyperMutProbsand--hyperMutRates. In this model, small proportions of sites are given a (possibly much higher) mutation rate. For any of these sites, only one specific mutation rate is enhanced. For example, one site might have only the G→T mutation rate increased 100-fold, while all other rates at that site remain the same. This is to model the effects observed in SARS-CoV-2 which are possibly attributable to APOBEC, ROS activity, or other mechanisms which remain unclear (see https://doi.org/10.1101/2021.01.14.426705).
A codon model is built on top of a model of nucleotide mutation and mutation rate variation, as explained in the two paragraphs above. See also https://doi.org/10.1093/molbev/mss266. In addition, a codon model has a codon-specific omega parameter that increases or decreases the rate of nonsynonymous changes at the given codon. Variation in omega can be defined with a finite number of omega classes, or with a continuous gamma distribution.
Thanks to the new algorithm in phastSim, the computational demand is almost unaffected by the complexity of chosen evolution model.
Rates in phastSim are normalized so that the expected number of substitutions per nucleotide site per unit branch length at the root is 1.
This is true even when simulating under a codon model. If instead one wants branch lengths to be interpreted as number of expected substitutions per codon per unit time, then one needs to rescale the input tree by a factor of 3 using, for example, option --scale 0.3333333. Finally, since the substitution model will usually not be assumed to be at equilibrium, the total evolution rate might typically decrease a bit moving downstream from the root to the tips of the phylogenetic tree.
Indels can also be simulated, under a similar model as INDELible, using option --indels. Insertion rate and deletion rate can separately bespecified with options --insertionRate and --deletionRate. Insertion length distribution and deletion length distribution can be specified with options --insertionLength and --deletionLength, which offer a number of possible types of distributions.
Other included scripts
Other scripts (in the "scripts" subdirectory) which are not required to run phastSim, but are useful for support, are also included in this repository. These are:
-
random_tree.py, which efficiently generates a random birth-death tree with many tips, and
-
compareSimulators.py, which is used to compare the running time of phastSim with other simulators (Seq-Gen, INDELible and pyvolve, you will need to install these yourself and provide some command line arguments telling the script where these are installed).
-
some_experiment.sh, a few bash scripts which automate the experiments used to produce the figures in the paper.
Tests
Tests are not required to run phastSim but can be helpful for development, and are also included in this repository. Tests make use of the pytest package. During development, it can be useful to install phastSim in editable mode; to do this run
pip install -e .
from the base directory of the project.
To actually run the tests, do pytest <name of test>, and use the -s flag if you want the print statements to get printed to stdout.
Tutorial
This tutorial covers the most common use cases for phastSim, as well as some edge cases which might otherwise be quite subtle or ambiguous.
Basic usage
After installation (perhaps simplest through PyPi: pip install phastSim), all of the options available in phastSim are accessed as command-line
arguments.
phastSim --outpath mydirectory/
Run phastSim using all the default values, and output a succinct txt output file in the output directory. The default tree is found in phastSim/example/exampleTree.tree, while the default reference is the SARS-CoV-2 Wuhan reference genome, in the same folder. The default model is an UNREST nucleotide substitution model, with substitution rates tuned for the SARS-CoV-2 genome.
phastSim --outpath mydirectory/ --reference myreference.fasta --mutationRates HK
Related Skills
node-connect
349.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
109.5kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
349.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
349.2kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
