pyrho
Fast demography-aware inference of fine-scale recombination rates based on fused-LASSO
NB: in version 0.1.0 we switched everything from coalescent units to natural scale -- that is, inputs should now be in terms of Ne, generations, and per-base, per-generation mutation rates. The output is now also automatically scaled to be the per-generation recombination rate.
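As a worked example of the change (assuming the standard population scaling ρ = 4 · Ne · r for diploids): with Ne = 10,000 and a per-base, per-generation recombination rate r = 1.25e-8, the old coalescent-scaled value would have been ρ = 4 × 10,000 × 1.25e-8 = 5e-4 per base, whereas pyrho now takes Ne and mu on their natural scales and reports r = 1.25e-8 directly.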
Human Recombination Maps
We have inferred recombination maps for each of the 26 populations in phase 3 of the 1000 Genomes Project. Those maps are available at this link, including maps inferred for the two most recent genome builds, hg19/GRCh37 and hg38/GRCh38.
The recombination maps for hg38/GRCh38 are now also available for simulations using the wonderful stdpopsim package.
Human Population Sizes
When making the recombination maps for the 1000 Genomes Project populations, we used smc++ to infer a population size history for each population. Those size histories are plotted in Figure 2 of the pyrho paper, and the data used to make those plots are available in smcpp_popsizes_1kg.csv. The x column is time in years (assuming a generation time of 29 years), and the y column contains the population size in Ne.
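For a quick command-line look at the data (a sketch only: the comma delimiter, header row, and column order are assumptions based on the description above, not guaranteed by the file):
# Peek at the header and first rows.
head -n 5 smcpp_popsizes_1kg.csv
# Convert the time column from years to generations using the stated
# 29-year generation time (assumes x is column 1 and y is column 2).
awk -F, 'NR > 1 { print $1 / 29, $2 }' smcpp_popsizes_1kg.csv | head -n 5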
Installation
pyrho is compatible with both python 2 and python 3. pyrho makes use of a number of external packages, including the excellent numba, msprime, and cyvcf2. As such, it is recommended to install pyrho in a virtual environment.
If using conda, this can be accomplished by running
conda create -n my-pyrho-env
conda activate my-pyrho-env
or using
virtualenv my-pyrho-env
source my-pyrho-env/bin/activate
Once you have set up and activated your virtual environment, you must first install ldpop. Change to a directory where you want to store ldpop and then run
git clone https://github.com/popgenmethods/ldpop.git ldpop
pip install ldpop/
Note that this will create a directory ldpop.
pyrho makes use of msprime, which requires gsl and hdf5. pyrho also has a dependency on openssl. If you do not have these installed, they can be installed using apt-get, yum, conda, brew, etc.
For example, to install openssl on Ubuntu run:
sudo apt-get install libssl-dev
You will also need to have cython and pytables. If you do not yet have them installed, run
pip install cython pytables
You should then be able to clone and install pyrho by running
git clone https://github.com/popgenmethods/pyrho.git pyrho
pip install pyrho/
You can check that everything is running smoothly by running
python -m pytest pyrho/tests/tests.py
NB: the first time you run pyrho, numba will compile and cache a number of functions, which can take up to ~30 seconds.
Usage
pyrho has a command line interface and consists of a number of separate commands. A typical workflow is to first use make_table to build a lookup table and then use hyperparam to find reasonable hyperparameter settings for your data. Finally, use optimize to infer a fine-scale recombination map from data. There is also a command, compute_r2, that computes statistics of the theoretical distribution of r².
make_table
Before performing any inference, you must first compute a lookup table. A standard use case would be
pyrho make_table --samplesize <n> --approx --moran_pop_size <N> \
--numthreads <par> --mu <mu> --outfile <outfile> \
--popsizes <size1>,<size2>,<size3> --epochtimes <breakpoint1>,<breakpoint2>
which indicates that we should compute a lookup table for a sample of size <n>, from a population where the present-day effective size is <size1> diploids, the size was <size2> diploids starting <breakpoint1> generations in the past, and so on, with a per-generation mutation rate <mu>. The --numthreads option tells pyrho to use <par> processors when computing the lookup table. Finally, --approx with --moran_pop_size tells pyrho to compute an approximate lookup table for a larger sample size <N> and then downsample to a table for size <n>. In general, we have found that while setting <N> about 25-50% larger than <n> results in accuracy indistinguishable from using the exact model, this added accuracy does not generally impact the inferred recombination rates. As such, we recommend setting <N> equal to <n>.
Without the --approx flag, pyrho can compute lookup tables for <n> up to about 50, whereas with the --approx flag, pyrho can handle sample sizes in the hundreds (e.g., <n> = 200, <N> = 200) with little loss in accuracy in the inferred recombination rates. If the two-locus likelihoods need to be extremely accurate, then it might be preferable to set, e.g., <n> = 200 and <N> = 256.
In v0.2.0 we altered the default behavior when computing likelihoods, using a slightly laxer convergence criterion in the numerics. This results in significantly faster runtimes at essentially no loss in accuracy, but if highly accurate two-locus likelihoods are required, the old behavior can be recovered using --high_accuracy_stationary.
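For concreteness, a hypothetical invocation might look like the following, where all values (sample size, demography, mutation rate, file name) are purely illustrative:
# Lookup table for 50 haploid samples under a three-epoch demography:
# 20,000 diploids at present, 10,000 from 1,000 generations ago, and
# 40,000 before 5,000 generations ago, with mu = 1.25e-8.
pyrho make_table --samplesize 50 --approx --moran_pop_size 50 \
--numthreads 4 --mu 1.25e-8 --outfile lookup_n50.hdf \
--popsizes 20000,10000,40000 --epochtimes 1000,5000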
The output is an hdf format table containing all of the pre-computed likelihoods needed to run hyperparam, optimize, and compute_r2.
Note that make_table used to consume significant amounts of memory, but this has been improved; e.g., <N> = 210 requires about 20GB of RAM using the --approx flag.
To see a full list of options and their meaning, run
pyrho make_table --help.
hyperparam
After computing a lookup table with make_table, it is a good idea to find reasonable settings for the main hyperparameters of pyrho: the window size and the smoothness penalty. This command simulates data, performs optimization under a number of different hyperparameter settings, and reports accuracy in terms of several metrics. A typical usage is
pyrho hyperparam --samplesize <n> --tablefile <make_table_output> \
--mu <mu> --ploidy <ploidy> \
--popsizes <size1>,<size2>,<size3> --epochtimes <breakpoint1>,<breakpoint2> \
--outfile <output_file>
where <n> is your haploid sample size, <make_table_output> is the output of running make_table, and <mu>, <size1>, ..., and <breakpoint1>, ... are as above. <ploidy> should be set to 1 if using phased data and 2 for unphased genotype data. Ploidies other than 1 or 2 are not currently supported.
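Continuing the hypothetical make_table example above (all file names and values remain illustrative):
pyrho hyperparam --samplesize 50 --tablefile lookup_n50.hdf \
--mu 1.25e-8 --ploidy 2 \
--popsizes 20000,10000,40000 --epochtimes 1000,5000 \
--outfile hyperparam_results.txt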
The output is a table containing various measures of accuracy of the inferred maps compared to the recombination maps used in the simulations (drawn from the HapMap recombination map). The optimal hyperparameters will depend on your application -- it may be important to maximize a particular measure of correlation or to minimize the L2 norm between the true and inferred maps. In any case, you can use the results in the output table to choose the hyperparameters for running optimize.
To see a full list of options and their meaning, run
pyrho hyperparam --help.
optimize
After computing a lookup table using make_table and choosing reasonable hyperparameters (optionally using hyperparam), you are ready to infer recombination maps from real data.
pyrho supports data in fasta format and LDhat's sites and locs format, but it is easiest to directly use VCF-formatted data. pyrho supports VCF, bgzipped VCF, and BCF files. If using LDhat's formats, see the note about using sites and locs below.
A typical usage is
pyrho optimize --vcffile <data> --windowsize <w> --blockpenalty <bpen> \
--tablefile <make_table_output> --ploidy <ploidy> --outfile <output_file> \
--numthreads <par>
with <data> being a VCF, bgzipped VCF, or BCF file containing a single chromosome, <w> and <bpen> being the hyperparameters chosen using hyperparam, and <ploidy> being 1 for phased data and 2 for unphased data.
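Continuing the same hypothetical example (the window size and block penalty below are placeholders -- use whatever hyperparam suggests for your data, and the file names are illustrative):
pyrho optimize --vcffile chr20.vcf.gz --windowsize 50 --blockpenalty 15 \
--tablefile lookup_n50.hdf --ploidy 2 --outfile chr20_rmap.txt \
--numthreads 4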
The output file has three columns -- the first column is the zero-indexed start of an interval, the second column is the end of the interval (non-inclusive), and the third column is r, which is the per-base per-generation recombination rate in that interval.
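As a quick sanity check on an inferred map, the implied total map length in Morgans can be computed from that three-column output (a sketch, assuming whitespace-delimited columns and the hypothetical output file from the example above):
# Sum (end - start) * r over all intervals: expected crossovers per
# generation, i.e., the map length in Morgans.
awk '{ M += ($2 - $1) * $3 } END { print M, "Morgans" }' chr20_rmap.txt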
To see a full list of options and their meaning, run
pyrho optimize --help.
pyrho optimize can be slow if there is a lot of missing data. Consider removing SNPs with more than a few missing genotypes to increase the speed. Another option is the --fast_missing flag, which uses a bit more memory and throws away some data, but can be substantially faster for datasets with lots of missing data and does not require pre-filtering SNPs for missingness.
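One way to do that pre-filtering, as a sketch (bcftools is not a pyrho dependency, and the 5% missingness threshold is an arbitrary illustration):
# Exclude sites where more than 5% of genotypes are missing.
bcftools view -e 'F_MISSING > 0.05' -O z -o chr20.filtered.vcf.gz chr20.vcf.gz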
A note about LDhat format input
The preferred input format for pyrho is a VCF.