AdmixtureBayes

Purpose

AdmixtureBayes is a program to generate, analyze, and plot posterior samples of admixture graphs (phylogenies incorporating admixture events) given an allele count file. AdmixtureBayes is currently maintained by Andrew Vaughn. Please report any strange results, errors, or code suggestions to him at ahv36@berkeley.edu. Please see https://doi.org/10.1371/journal.pgen.1010410 for our paper describing AdmixtureBayes.

Installation

AdmixtureBayes can be downloaded from this GitHub repo by running the following command:

$ git clone https://github.com/avaughn271/AdmixtureBayes

AdmixtureBayes is written in Python and requires the following Python packages:

"numpy", "scipy", "pandas", "pathos", "graphviz"

See the following links for installation help:

https://numpy.org/install/

https://scipy.org/install/

https://pandas.pydata.org/docs/getting_started/install.html

https://pypi.org/project/pathos/

https://pypi.org/project/graphviz/

(Note: in recent versions of pandas, a warning about a deprecated feature can show up, although the behavior of AdmixtureBayes should remain unchanged. AdmixtureBayes was tested on pandas version 1.4.1.) Furthermore, if you wish to use the given R script to evaluate convergence, then you also need to have R installed with the coda package, which can be installed by running:

install.packages("coda")

in any R session.
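
As a quick check before running anything, the following minimal sketch reports which of the Python packages listed above cannot be found in the current environment. It is not part of AdmixtureBayes; the function name missing_packages is hypothetical, and the sketch assumes each package is imported under the same name it is listed with.

```python
import importlib.util

# Package names as listed in the installation section above.
# Assumption: each package's import name equals its listed name.
REQUIRED = ["numpy", "scipy", "pandas", "pathos", "graphviz"]


def missing_packages(names=REQUIRED):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required Python packages were found.")
```

This only checks that the packages can be located; it does not check their versions.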

Example commands

A script containing example commands is found in the example folder together with a test dataset.

Input file

The input for AdmixtureBayes is an allele count file in exactly the same format as used by TreeMix:

s1 s2 s3 s4 out
9,11 13,7 11,9 14,6 14,6
4,16 4,16 0,20 1,19 2,18
...
1,19 2,18 3,17 2,18 2,18

where the first line lists the populations and each subsequent line gives the bi-allelic counts in each population for one SNP. Which allele is counted first and which second has no meaning and can be chosen arbitrarily. The population names should only include letters and numbers (no spaces, dashes, underscores, etc.). See the R script "ConvertFromVCF.R" in the "example" folder for a template for converting from VCF files to this input. At minimum, you will need to change the name of the input VCF file and the individual-to-population mapping in this script. Keep in mind that VCF files can be quite complex, and therefore this script may not work for all possible input VCF files. Always perform a sanity check between the input and output of this step rather than taking the output at face value.
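
One way to perform such a sanity check is to parse the file and verify the constraints stated above. The following is a minimal sketch, not part of AdmixtureBayes: the function name read_allele_counts is hypothetical, and it assumes whitespace-separated columns with one "count,count" pair per population.

```python
import re


def read_allele_counts(path):
    """Parse an allele count file into (populations, rows).

    rows is a list with one entry per SNP; each entry is a list of
    (first_allele_count, second_allele_count) tuples, one per population.
    """
    with open(path) as handle:
        lines = [line.split() for line in handle if line.strip()]

    populations = lines[0]
    for name in populations:
        # Population names should contain only letters and numbers.
        if not re.fullmatch(r"[A-Za-z0-9]+", name):
            raise ValueError(f"invalid population name: {name!r}")

    rows = []
    for index, fields in enumerate(lines[1:], start=1):
        if len(fields) != len(populations):
            raise ValueError(f"SNP row {index}: expected {len(populations)} columns")
        counts = [tuple(int(c) for c in field.split(",")) for field in fields]
        if any(len(pair) != 2 for pair in counts):
            raise ValueError(f"SNP row {index}: each entry needs two counts")
        rows.append(counts)
    return populations, rows

# Usage (file name is a placeholder):
# populations, rows = read_allele_counts("allele_counts.txt")
```

Comparing the number of parsed rows and a few spot-checked counts against the source VCF is a reasonable first test of the conversion.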

Notes on Missing Data: Missing data for a population at a particular site should be encoded as "0,0". AdmixtureBayes estimates allelic covariance matrices, one large matrix consisting of all SNPs and one for each bootstrap sample of adjacent SNPs to be used in the estimation of the Wishart degrees of freedom. Entry (i,j) in these matrices is the covariance in allele frequencies between populations i and j using all relevant SNPs (either all SNPs or the SNPs in the corresponding bootstrap block). If there is no missing data in either of these populations, then this is the dot product of the allele frequency vectors at the relevant SNPs for populations i and j divided by the total number of relevant SNPs. If a site has missing data for either i or j, then it is not included in the dot product, and we instead divide by the total number of relevant SNPs that are missing in neither i nor j. Missing data is not a problem for AdmixtureBayes to handle, but it does violate the assumption of even sampling imposed by the Wishart distribution. Relevant warnings for missing data will be printed to the console, although the algorithm will still run the MCMC properly.
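
The pairwise handling of missing data described above can be sketched as follows. This is only an illustration of the averaging rule (skip a site if either population is "0,0", then divide by the number of sites actually used). It works on raw allele frequencies and deliberately omits the scaling and bias correction that AdmixtureBayes applies, so its output is not the matrix AdmixtureBayes uses. The function name pairwise_products is hypothetical.

```python
def pairwise_products(rows):
    """Average product of allele frequencies for every pair of populations.

    rows: list of SNPs, each a list of (count1, count2) tuples per population.
    A population with counts (0, 0) at a SNP is treated as missing there.
    """
    n_pops = len(rows[0])
    matrix = [[float("nan")] * n_pops for _ in range(n_pops)]
    for i in range(n_pops):
        for j in range(n_pops):
            total, used = 0.0, 0
            for snp in rows:
                (a_i, b_i), (a_j, b_j) = snp[i], snp[j]
                if a_i + b_i == 0 or a_j + b_j == 0:
                    continue  # missing in i or j: leave this site out
                total += (a_i / (a_i + b_i)) * (a_j / (a_j + b_j))
                used += 1
            if used:
                # Divide by the number of SNPs missing in neither i nor j.
                matrix[i][j] = total / used
    return matrix
```

Because each entry may be averaged over a different set of SNPs, different entries of the matrix rest on different amounts of data, which is the uneven sampling referred to above.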

Running AdmixtureBayes

AdmixtureBayes has 3 steps:

(1) runMCMC - this takes the input of allele counts described above, runs the MCMC chain, and generates a set of samples of admixture graphs

(2) analyzeSamples - this takes the output of the previous step and performs a burn-in and thinning step to generate independent samples of admixture graphs

(3) makePlots - this takes the output of the previous step and generates different plots that are useful for interpretation

Notes on Runtime: Steps 2 and 3 should be very fast, regardless of the input. The runtime of step 1 is determined by how many iterations are run and the number of populations being considered. Increasing the number of populations will result in more time per iteration. Increasing the number of populations also increases the size of the state space of admixture graphs, resulting in more iterations being necessary to achieve convergence. Runtime is invariant with respect to the number of individuals in each population. Increasing the number of SNPs will keep the time per iteration the same, but will result in more time for the initial step of calculating the allele covariance matrix. Asymptotically, as the number of iterations increases, this will be a negligible fraction of the total runtime. Keep in mind that the runtime necessary to achieve convergence increases exponentially with the number of populations due to the dramatic increase in the state space and the number of possible proposals per state. While AdmixtureBayes should be able to easily converge on 4 or 5 non-outgroup populations within 1 hour on a desktop, one can expect a thorough analysis on 10 populations to take dozens of hours. Usage of a computing cluster is recommended for datasets with many populations.

(1) runMCMC

In this step, we run the MCMC chain that explores the space of admixture graphs. The script to run is

$ python PATH/AdmixtureBayes/admixturebayes/runMCMC.py
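
For scripted runs, the command line can be assembled programmatically. The following is a minimal sketch that uses only flags documented in this section. The helper name run_mcmc_command, the file names, and the value of --n are hypothetical placeholders, not recommendations, and PATH must be replaced with the location of your clone.

```python
import shlex

# Replace PATH with the directory that contains your AdmixtureBayes clone.
SCRIPT = "PATH/AdmixtureBayes/admixturebayes/runMCMC.py"


def run_mcmc_command(input_file, outgroup, n, result_file, continue_samples=None):
    """Build the argument list for one runMCMC call."""
    if continue_samples is not None and continue_samples == result_file:
        # A continued run should write to a different file than the run it continues.
        raise ValueError("--result_file must differ from --continue_samples")
    cmd = ["python", SCRIPT,
           "--input_file", input_file,
           "--outgroup", outgroup,
           "--n", str(n),
           "--result_file", result_file]
    if continue_samples is not None:
        cmd += ["--continue_samples", continue_samples]
    return cmd


# A fresh run, followed by a continuation that writes to a new file.
first = run_mcmc_command("allele_counts.txt", "out", 10000, "run1.csv")
second = run_mcmc_command("allele_counts.txt", "out", 10000, "run2.csv",
                          continue_samples="run1.csv")

if __name__ == "__main__":
    print(shlex.join(first))
    print(shlex.join(second))
```

Each list can be passed to subprocess.run(cmd, check=True) to launch the run; printing the commands first makes it easy to confirm that the two output file names differ.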

This step takes as input:

--input_file The input file of allele counts as described above.

--outgroup The name of the population that will serve as the outgroup. For example, in the above file, "out" could be the outgroup.

--n (optional) The number of iterations the MCMC sampler should make. (Technically, this is the number of MC3 flips the chain should make, which is directly proportional to the number of iterations. The exact number of iterations is 50*n.) Default value is 200. This number should be increased in all practical applications.

--result_file (optional) The name of the MCMC output file of this step. No file extension is added (meaning that entering "example" will produce "example" as an output file, not "example.txt" or "example.csv"). Default value is "mcmc_samples.csv".

--continue_samples (optional) This is used if you want to continue a previous AdmixtureBayes run, for example if convergence was not yet reached. The argument passed to --continue_samples should be the file name of the MCMC sample file produced by a previous AdmixtureBayes run (which is "mcmc_samples.csv" by default). A new file will be produced (whose name will be whatever the input to --result_file is in this call to AdmixtureBayes) that will contain all of the samples of the previous run in addition to all of the samples from this run. This call to AdmixtureBayes will start the chain in the last state of the previous AdmixtureBayes run. The previous output file will not be overwritten. The name of the --result_file argument used in this call to AdmixtureBayes should be different than the one produced by the previous call to avoid unwanted behavior. For example, if a user does not specify --result_file for either call, then both will be "mcmc_samples.csv" by default and unwanted behavior will occur. If this argument is not specified, then the algorithm will start at a randomly constructed graph, which may be useful for monitoring mixing and convergence of the chain, for example by Gelman-Rubin statistics.

--verbose_level (optional) Either "normal" or "silent". If "normal", then the total number of SNPs will be printed to the console along with the progress of the MCMC sampler. Every 1000th iteration, the progress towards the total number of iterations is printed. If "silent", then nothing will be printed to the console. Default value is "normal".

--save_covariance (optional) If this flag is specified, then the allelic covariance matrix produced by considering all SNPs will be saved to the file "covariance_matrix.txt" in the current working directory. Note that this will be the covariance matrix described in the AdmixtureBayes paper, which is to say a scaled, bias-corrected transformation of the naive covariance matrix that would be suggested by Equation 4 of the main text. The user may use the input data to compute the naive covariance matrix using Equation 4 of the main text should they choose, but bear in mind that this is different than the covariance matrix AdmixtureBayes is actually using.

--MCMC_chains (optional) A tuning parameter of the MCMC algorithm. The number of chains to run the MC3 with (See Matthew Darlington's great explanation of MC3 here). More chains should result in better mixing at the cost of increased computational time. AdmixtureBayes supports multiprocessing, so ideally this would be the number of cores. Default value is 8. Must be at least 2. See the section below on MC3 Mixing for more details.

--maxtemp (optional) A tuning parameter of the MCMC algorithm. The temperature of the hottest chain in the MC3 algorithm. Must be a positive number, though not necessarily an integer. The temperature of the $i$'th chain will be $maxtemp^{[(i-1)/(MCMC\textunderscore chains-1)]^{spacing}}$. We propose swaps between chains of adjacent temperature,
