RWTY: R We There Yet?

This package implements various tests, visualizations, and metrics for diagnosing convergence and mixing of MCMC chains in phylogenetics. It implements and automates many of the functions of the AWTY package in the R environment. It also adds a number of new functions not available in AWTY.

Citation

Warren D.L., A.J. Geneva, and R. Lanfear. 2017. RWTY (R We There Yet): An R package for examining convergence of Bayesian phylogenetic analyses. Molecular Biology and Evolution 34:1016-1020. doi: 10.1093/molbev/msw279

Support

For help with RWTY problems, please check out the Google Group for RWTY.

Installation

At present, RWTY is downloadable from https://github.com/danlwarren/RWTY. There are multiple ways to download it. The easiest is to use devtools and install from GitHub.

Installing from GitHub using devtools

Run the following code from your R console:

install.packages("devtools")
library(devtools)
install_github("danlwarren/RWTY")
library(rwty)

Install from zip file

A zipped version of the package is available at https://github.com/danlwarren/RWTY/archive/master.zip. To install from the zip file, download a copy of it to your system. Once it's finished downloading, type the following (where PATH is the path to the zip file):

install.packages("devtools")
library(devtools)
install_local("PATH")
library(rwty)

Setting the number of cores

Some of the calculations RWTY performs benefit from the availability of multiple processor cores. To set the number of cores used by RWTY, all you need to do is assign a global variable called "rwty.processors".

rwty.processors <<- 8

If no rwty.processors object exists in the global environment, RWTY will default to using N-1 cores, where N is the total number available on your system. Due to the limits of R's internal parallelization, this functionality is not available on Windows computers.

Interacting with RWTY

RWTY works with lists of rwty.chain objects. Each rwty.chain object contains outputs from a single Bayesian phylogenetic MCMC analysis, e.g. from BEAST or MrBayes. A rwty.chain object must contain the tree topologies from the chain (e.g. from the .t file from MrBayes), and can optionally contain the continuous parameters as well (e.g. from the .p file from MrBayes). A list of rwty.chain objects should all come from analyses of the same taxa. Most often, a list of rwty.chain objects will contain multiple independent replicates of a single MCMC analysis, but it could also contain independent MCMC analyses of different loci from the same set of taxa, or a combination of both (e.g. 4 replicate analyses of 3 different loci).

A lot of analyses in RWTY can be done using just three RWTY functions: load.trees(), load.multi(), and analyze.rwty(). Other functions you might use are any of the functions that start with makeplot., and the topological.approx.ess() function.

`load.trees()`

This function is used to load in a single MCMC chain, where PATH representing the path to the tree file. It will load the tree file and attempt find an associated log file, with the type determined by the "format" command. The default format is for MrBayes, where tree files have a ".t" extension and log files have a ".p" extension. For example, if your tree file is named "mytrees.t", RWTY will try to find "mytrees.p" in the same directory, and will insert the file into the trees object automatically if it's found. RWTY can also automatically find files for BEAST and *BEAST runs, so long as you provide the appropriate format command.

If you have a large number of trees and RWTY is taking too long to produce plots, you can subsample your trees with the "trim" command. For instance, "trim = 10" means RWTY will only keep every tenth tree for analysis

my.trees <- load.trees("PATH")

my.beast.trees <- load.trees("PATH", format = "beast", trim = 5)

my.starbeast.trees <- load.trees("PATH", format = "*beast")

If your trees or log files are named or formatted in a nonstandard way, you can still load your data into an RWTY object. The "logfile" argument takes a path to your logfile, the "type" argument tells RWTY whether your trees are Nexus or Newick format, the "skip" argument tells RWTY how many lines to skip in the log file before it reaches the header line, and the "gens.per.tree" argument tells RWTY how many generations there were in the MCMC inbetween each tree that was sampled.

my.trees <- load.trees("path_to_tree_file", type = "nexus", logfile = "path_to_log_file", skip = 1, gens.per.tree = 1)

For example, if you have a collection of trees in Newick format like this:

((A, B),(C, D));
(((A, B), C), D);
((A, C),(B, D));

and no logfile, you could load them into RWTY like this (note you'd need to change the file path to suit your machine):

my.trees <- load.trees("~/Desktop/trees.phy", type = "newick", gens.per.tree = 1)

`load.multi()`

This function is used to load more than one chain into a list of rwty.chains objects. It allows you to specify the format of your MCMC analysis, so that it can automatically find tree and log files. You provide a path to the folder containing the tree files, as well as the file format (default is MrBayes). As with load.trees, load.multi will attempt to automatically find log files. For example, to search a directory named PATH for output of a MrBayes run, you would input the following:

my.trees <- load.multi("PATH", format = "mb")

`analyze.rwty()`

This function conducts a suite of analyses that are applicable to the data that are passed to it. The output from this function will depend on whether you have loaded data from one or more than one chain, and on whether your data comprises just phylogenetic trees, or whether it also includes continuous parameters from log files. Since RWTY runs can take a while, and since a single RWTY run produces a number of plots and metrics, it's a good idea to always store them in an object for later use.

analyze.rwty() will set sensible defaults for most parameters, but you should set the burnin and the fill.color to values that are appropriate for your data. The burnin should be set to a value that removes trees and parameters not sampled from the stationary distribution, and the fill.color should be set to the name of the parameter from your log file with which you would like to colour the points in tree space.

analyze.rwty() returns a long list of plots.

data(salamanders)
salamanders.rwty <- analyze.rwty(salamanders, burnin=50, fill.color = 'LnL')

# to see which plots you have
names(salamanders.rwty)

`makeplot.x()`

Each of the functions that starts with makeplot. has a similar set of inputs. Each plot takes a list of one or more rwty.chain objects and the burnin as input, with other parameters that are specific to certain plots. These functions are useful to tune the parameters of particular plots (e.g. the burnin, or the number of points to plot in tree space plots), which can then be passed to a more thorough analysis through the additional arguments in analyze.rwty().

For example, you might choose your burnin based on looking at traces of parameters and trees.

makeplot.all.params(salamanders, burnin=0) # the LnL trace suggests burnin should be >0

makeplot.all.params(salamanders, burnin=50) # this looks OK

`topological.approx.ess()`

This function calculates the approximate ESS of tree topologies from all of the chains. It returns a data frame with approximate ESS values, which can be used to decide if the chain has been run for long enough for sufficient samples to be collected. The approximate ESS is mathematically similar to the ESS used for continuous parameters, but in certain cases an upper bound will be provided rather than a point estimate. These are represented by '<', and '=' in the 'operator' column of the data frame respectively.

approx.ess <- topological.approx.ess(salamanders, burnin = 50)

Which gives you a data frame that looks like this:

  operator approx.ess       chain
1        =   243.0683 AMOTL2.run1
2        =   134.2827 AMOTL2.run2
3        =   222.0624   LHX2.run1
4        =   251.0000   LHX2.run2
5        =   251.0000  TRMT5.run1
6        =   251.0000  TRMT5.run2

RWTY outputs

RWTY outputs a number of useful plots for exploring the performance of MCMC chains.

Parameter trace plots

If a logfile is provided, RWTY will produce a trace plot of the value of each column in the log file as a function of position in the chain. If you have multiple chains, the plots will be faceted by chain (you can turn this off using facet = FALSE when calling analyze.rwty(). The title of each plot shows the Effective Sample Size (ESS) of the parameter. In general, you should aim to have ESS > 200 for all parameters.

salamanders.rwty$pi.C

parameter plot parameter
plot

Parameter correlation plots

If a logfile is provided, RWTY will produce a scatter plot matrix from the columns of the log file, along with the topological distance from the starting tree. Because the log file output of many MCMC programs can contain a large number of columns, the default behavior of makeplot.pairs is to only plot the topological distance along with the first two data columns. In the density plots on the diagonal, red values indicate values outside the 95% CI. In the scat

RWTY

Install / Use

README