pyrho
Fast demography-aware inference of fine-scale recombination rates based on fused-LASSO
NB: in version 0.1.0 we switched everything from coalescent units to natural scale -- that is, inputs should now be in terms of Ne, generations, and per-base, per-generation mutation rates. The output is now also automatically scaled to be the per-generation recombination rate.
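As a worked example of the change (assuming the standard population scaling ρ = 4 · Ne · r for diploids): with Ne = 10,000 and a per-base, per-generation recombination rate r = 1.25e-8, the old coalescent-scaled value would have been ρ = 4 × 10,000 × 1.25e-8 = 5e-4 per base, whereas pyrho now takes Ne and mu on their natural scales and reports r = 1.25e-8 directly.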
Human Recombination Maps
We have inferred recombination maps for each of the 26 populations in phase 3 of the 1000 Genomes Project. Those maps are available at this link, including maps inferred for the two most recent genome builds, hg19/GRCh37 and hg38/GRCh38.
The recombination maps for hg38/GRCh38 are now also available for simulations using the wonderful stdpopsim package.
Human Population Sizes
When making the recombination maps for the 1000 Genomes Project populations, we used smc++ to infer a population size history for each population. Those size histories are plotted in Figure 2 of the pyrho paper, and the data used to make those plots are available in smcpp_popsizes_1kg.csv. The x column is time in years (assuming a generation time of 29 years), and the y column contains the population size in Ne.
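For a quick command-line look at the data (a sketch only: the comma delimiter, header row, and column order are assumptions based on the description above, not guaranteed by the file):
# Peek at the header and first rows.
head -n 5 smcpp_popsizes_1kg.csv
# Convert the time column from years to generations using the stated
# 29-year generation time (assumes x is column 1 and y is column 2).
awk -F, 'NR > 1 { print $1 / 29, $2 }' smcpp_popsizes_1kg.csv | head -n 5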
Installation
pyrho is compatible with both python 2 and python 3. pyrho makes use of a number of external packages, including the excellent numba, msprime, and cyvcf2. As such, it is recommended to install pyrho in a virtual environment.
If using conda, this can be accomplished by running
conda create -n my-pyrho-env
conda activate my-pyrho-env
or using
virtualenv my-pyrho-env
source my-pyrho-env/bin/activate
Once you have set up and activated your virtual environment, you must first install ldpop. Change to a directory where you want to store ldpop and then run
git clone https://github.com/popgenmethods/ldpop.git ldpop
pip install ldpop/
Note that this will create a directory ldpop.
pyrho makes use of msprime, which requires gsl and hdf5. pyrho also has a dependency on openssl. If you do not have these installed, they can be installed using apt-get, yum, conda, brew, etc.
For example, to install openssl on Ubuntu run:
sudo apt-get install libssl-dev
You will also need to have cython and pytables. If you do not yet have them installed, run
pip install cython pytables
You should then be able to clone and install pyrho by running
git clone https://github.com/popgenmethods/pyrho.git pyrho
pip install pyrho/
You can check that everything is running smoothly by running
python -m pytest pyrho/tests/tests.py
NB: the first time you run pyrho, numba will compile and cache a number of functions, which can take up to ~30 seconds.
Usage
pyrho has a command line interface and consists of a number of separate commands. A typical workflow is to first use make_table to build a lookup table and then use hyperparam to find reasonable hyperparameter settings for your data. Finally, use optimize to infer a fine-scale recombination map from data. There is also a command, compute_r2, that computes statistics of the theoretical distribution of r².
make_table
Before performing any inference, you must first compute a lookup table. A standard use case would be
pyrho make_table --samplesize <n> --approx --moran_pop_size <N> \
--numthreads <par> --mu <mu> --outfile <outfile> \
--popsizes <size1>,<size2>,<size3> --epochtimes <breakpoint1>,<breakpoint2>
which indicates that we should compute a lookup table for a sample of size <n>, from a population where the present-day effective size is <size1> diploids, the size was <size2> diploids starting <breakpoint1> generations in the past, and so on, with a per-generation mutation rate <mu>. The --numthreads option tells pyrho to use <par> processors when computing the lookup table. Finally, --approx with --moran_pop_size tells pyrho to compute an approximate lookup table for a larger sample size <N> and then downsample to a table for size <n>. In general, we have found that while setting <N> about 25-50% larger than <n> results in accuracy indistinguishable from using the exact model, this added accuracy does not generally impact the inferred recombination rates. As such, we recommend setting <N> equal to <n>.
Without the --approx flag, pyrho can compute lookup tables for <n> up to about 50, whereas with the --approx flag, pyrho can handle sample sizes in the hundreds (e.g., <n> = 200, <N> = 200) with little loss in accuracy in the inferred recombination rates. If the two-locus likelihoods need to be extremely accurate, then it might be preferable to set, e.g., <n> = 200 and <N> = 256.
In v0.2.0 we altered the default behavior when computing likelihoods, using a slightly laxer convergence criterion in the numerics. This results in significantly faster runtimes at essentially no loss in accuracy, but if highly accurate two-locus likelihoods are required, the old behavior can be recovered using --high_accuracy_stationary.
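For concreteness, a hypothetical invocation might look like the following, where all values (sample size, demography, mutation rate, file name) are purely illustrative:
# Lookup table for 50 haploid samples under a three-epoch demography:
# 20,000 diploids at present, 10,000 from 1,000 generations ago, and
# 40,000 before 5,000 generations ago, with mu = 1.25e-8.
pyrho make_table --samplesize 50 --approx --moran_pop_size 50 \
--numthreads 4 --mu 1.25e-8 --outfile lookup_n50.hdf \
--popsizes 20000,10000,40000 --epochtimes 1000,5000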
The output is an hdf format table containing all of the pre-computed likelihoods needed to run hyperparam, optimize, and compute_r2.
Note that make_table used to consume significant amounts of memory, but this has been improved; e.g., <N> = 210 requires about 20GB of RAM using the --approx flag.
To see a full list of options and their meaning, run
pyrho make_table --help.
hyperparam
After computing a lookup table with make_table, it is a good idea to find reasonable settings for the main hyperparameters of pyrho: the window size and the smoothness penalty. This command simulates data, performs optimization under a number of different hyperparameter settings, and reports accuracy in terms of several metrics. A typical usage is
pyrho hyperparam --samplesize <n> --tablefile <make_table_output> \
--mu <mu> --ploidy <ploidy> \
--popsizes <size1>,<size2>,<size3> --epochtimes <breakpoint1>,<breakpoint2> \
--outfile <output_file>
where <n> is your haploid sample size, <make_table_output> is the output of running make_table, and <mu>, <size1>, ..., and <breakpoint1>, ... are as above. <ploidy> should be set to 1 if using phased data and 2 for unphased genotype data. Ploidies other than 1 or 2 are not currently supported.
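Continuing the hypothetical make_table example above (all file names and values remain illustrative):
pyrho hyperparam --samplesize 50 --tablefile lookup_n50.hdf \
--mu 1.25e-8 --ploidy 2 \
--popsizes 20000,10000,40000 --epochtimes 1000,5000 \
--outfile hyperparam_results.txt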
The output is a table containing various measures of accuracy of the inferred maps compared to the recombination maps used in the simulations (drawn from the HapMap recombination map). The optimal hyperparameters will depend on your application -- it may be important to maximize a particular measure of correlation or to minimize the L2 norm between the true and inferred maps. In any case, you can use the results in the output table to choose the hyperparameters for running optimize.
To see a full list of options and their meaning, run
pyrho hyperparam --help.
optimize
After computing a lookup table using make_table and choosing reasonable hyperparameters (optionally using hyperparam), you are ready to infer recombination maps from real data.
pyrho supports data in fasta format and LDhat's sites and locs format, but it is easiest to directly use VCF-formatted data. pyrho supports VCF, bgzipped VCF, and BCF files. If using LDhat's formats, see the note about using sites and locs below.
A typical usage is
pyrho optimize --vcffile <data> --windowsize <w> --blockpenalty <bpen> \
--tablefile <make_table_output> --ploidy <ploidy> --outfile <output_file> \
--numthreads <par>
with <data> being a VCF, bgzipped VCF, or BCF file containing a single chromosome, <w> and <bpen> being the hyperparameters chosen using hyperparam, and <ploidy> being 1 for phased data and 2 for unphased data.
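Continuing the same hypothetical example (the window size and block penalty below are placeholders -- use whatever hyperparam suggests for your data, and the file names are illustrative):
pyrho optimize --vcffile chr20.vcf.gz --windowsize 50 --blockpenalty 15 \
--tablefile lookup_n50.hdf --ploidy 2 --outfile chr20_rmap.txt \
--numthreads 4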
The output file has three columns -- the first column is the zero-indexed start of an interval, the second column is the end of the interval (non-inclusive), and the third column is r, which is the per-base per-generation recombination rate in that interval.
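As a quick sanity check on an inferred map, the implied total map length in Morgans can be computed from that three-column output (a sketch, assuming whitespace-delimited columns and the hypothetical output file from the example above):
# Sum (end - start) * r over all intervals: expected crossovers per
# generation, i.e., the map length in Morgans.
awk '{ M += ($2 - $1) * $3 } END { print M, "Morgans" }' chr20_rmap.txt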
To see a full list of options and their meaning, run
pyrho optimize --help.
pyrho optimize can be slow if there is a lot of missing data. Consider removing SNPs with more than a few missing genotypes to increase the speed. Another option is the --fast_missing flag, which uses a bit more memory and throws away some data, but can be substantially faster for datasets with lots of missing data and does not require pre-filtering SNPs for missingness.
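One way to do that pre-filtering, as a sketch (bcftools is not a pyrho dependency, and the 5% missingness threshold is an arbitrary illustration):
# Exclude sites where more than 5% of genotypes are missing.
bcftools view -e 'F_MISSING > 0.05' -O z -o chr20.filtered.vcf.gz chr20.vcf.gz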
A note about LDhat format input
The preferred input format for pyrho is a VCF.