ReLERNN

Recombination Landscape Estimation using Recurrent Neural Networks

====================================================================

ReLERNN uses deep learning to infer the genome-wide landscape of recombination from as few as four individually sequenced chromosomes, or from allele frequencies inferred by pooled sequencing. This repository contains the code and instructions required to run ReLERNN, and includes example files to ensure everything is working properly. The manuscript detailing ReLERNN can be found here.

Recommended installation on Linux

ReLERNN requires a CUDA-enabled NVIDIA GPU, Python 3.10+, and TensorFlow 2.18+.

pixi handles Python, CUDA, and all dependencies in one step:

$ git clone https://github.com/kr-colab/ReLERNN.git
$ cd ReLERNN
$ pixi install

This creates an environment with GPU support by default. If you are installing on an HPC cluster, you may need to set the CUDA override first: run export CONDA_OVERRIDE_CUDA=12.8 before pixi install (replace 12.8 with your CUDA version).

For CPU-only use:

$ pixi install -e cpu

To use ReLERNN, either activate the environment or prefix commands with pixi run:

# Option 1: activate the environment, then run commands directly
$ pixi shell
$ ReLERNN_SIMULATE --vcf input.vcf --genome genome.bed --projectDir ./output/ ...

# Option 2: prefix each command with pixi run
$ pixi run ReLERNN_SIMULATE --vcf input.vcf --genome genome.bed --projectDir ./output/ ...

All examples below show bare commands and assume an activated environment.

Testing ReLERNN

An example VCF file (5 contigs; 10 haploid chromosomes) and a shell script for running ReLERNN's four modules are located in $/ReLERNN/examples. To test ReLERNN, run the following command:

$ pixi run example

Provided everything worked as planned, $ReLERNN/examples/example_output/ should be populated with a few directories along with the files: example.PREDICT.txt and example.PREDICT.BSCORRECT.txt. The latter is the finalized output file with your recombination rate predictions and estimates of uncertainty.
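
If you would rather check the example output from a script than by eye, a minimal sketch along these lines may help. The two file names come from the paragraph above; the helper name missing_outputs is hypothetical and is not part of ReLERNN, and the check only looks for non-empty files, not for valid contents.

```python
from pathlib import Path

# File names taken from the README text; adjust if your run used another prefix.
EXPECTED_FILES = ("example.PREDICT.txt", "example.PREDICT.BSCORRECT.txt")


def missing_outputs(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in output_dir.

    Hypothetical helper (not part of ReLERNN): it only tests for presence
    and non-zero size, it does not validate the predictions themselves.
    """
    out = Path(output_dir)
    missing = []
    for name in expected:
        path = out / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling missing_outputs on your example_output directory should return an empty list after a successful run.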

The above example took 57 seconds to complete on a Xeon machine using four CPUs and one NVIDIA 2070 GPU. Note that the parameters used for this example were designed only to test the success of the installation, not to make accurate predictions. Please use the guidelines below for the best results when analyzing real data.

You can also test ReLERNN with pool-seq data by running the following command:

$ pixi run example-pool

Estimating a recombination landscape from individually sequenced chromosomes

The ReLERNN pipeline is executed using four commands: ReLERNN_SIMULATE, ReLERNN_TRAIN, ReLERNN_PREDICT, and the optional ReLERNN_BSCORRECT (see the Method flow diagram).

Before running ReLERNN

ReLERNN takes as input a VCF file of biallelic variants. Users should apply appropriate QC techniques (filtering low-quality variants, etc.) and remove non-biallelic variants before running ReLERNN. Small contigs (<< 250 SNPs) should not be included in the genome file passed to --genome, though they do not need to be removed from the VCF. ReLERNN also requires that the number of sampled chromosomes is identical across all contigs, and VCFs should be filtered accordingly. Hemizygous chromosomes or haploid samples in an otherwise diploid dataset should ideally be run separately using a separate VCF. It is possible to treat hemizygous chromosomes as "diploids with missing data" using the --forceDiploid option, but this is not recommended. It is now possible to run ReLERNN on VCFs with missing genotypes (coded as a .).
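
To decide which contigs belong in the genome file, it can help to count biallelic SNPs per contig first. The sketch below is one way to do that on an uncompressed VCF; it is not part of ReLERNN. The function names and the default cutoff of 250 are assumptions for illustration (the guideline above only says contigs with far fewer than 250 SNPs should be excluded), and the SNP test is deliberately simple: a single-base REF and a single-base ALT.

```python
from collections import Counter


def count_biallelic_snps(vcf_lines):
    """Count biallelic SNP records per contig from an iterable of VCF text lines.

    Hypothetical helper: a record counts only if REF is one base and ALT is
    exactly one single-base allele (so indels, multi-allelic sites, '.' and
    '*' alleles are all ignored).
    """
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, ref, alt = fields[0], fields[3].upper(), fields[4].upper()
        if len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT":
            counts[chrom] += 1
    return counts


def contigs_to_keep(counts, min_snps=250):
    """Return the sorted contig names with at least min_snps SNPs.

    The default of 250 is an assumed cutoff, not a value enforced by ReLERNN.
    """
    return sorted(chrom for chrom, n in counts.items() if n >= min_snps)
```

For a file on disk, pass the open file handle directly, e.g. count_biallelic_snps(open("input.vcf")), then use the result of contigs_to_keep to decide which rows to write to your BED file.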

If you want to make predictions based on equilibrium simulations, you can skip ahead to executing ReLERNN_SIMULATE. While ReLERNN is generally robust to demographic model misspecification, prediction accuracy may be improved by simulating the training set under a demographic history that accurately matches that of your sample. ReLERNN optionally takes the output files from three popular demographic history inference programs (stairwayplot_v1, SMC++, and MSMC), and simulates a training set under these histories. Note: for SMC++ use the .csv output (option -c in SMC++). It is up to the user to perform the proper due diligence to ensure that the population size histories reported by these programs are sound. In our opinion, unless you know exactly how these programs work and you expect your data to represent a history dramatically different from equilibrium, you are better off skipping this step and training ReLERNN on equilibrium simulations. Once you have run one of the demographic history inference programs listed above, simply provide the raw output file from that program to ReLERNN_SIMULATE using the --demographicHistory option.

Step 1) ReLERNN_SIMULATE

ReLERNN_SIMULATE reads your VCF file and splits it by chromosome. The chromosomes to be evaluated must be specified by providing a BED file of their positions using the --genome argument. A BED-formatted accessibility mask (with non-overlapping ascending windows) may optionally be provided using the --mask option. Use the --phased or --unphased flag to train using phased or unphased genotypes (the default is unphased). The VCF file must use the extension .vcf. The prefix of that file serves as the prefix for all output files (e.g. running ReLERNN on the file population7.vcf will generate the result file population7.PREDICT.txt). We strongly recommend the default setting for --maxWinSize: larger values can cause training to fail, and smaller values can result in lower accuracy.

You must provide an estimate of the per-base mutation rate for your sample (--assumedMu), along with an estimate of the generation time in years (--assumedGenTime). If you previously ran one of the demographic history inference programs listed above, use the same values that you used for them. This is also where you point to the output from that program, using --demographicHistory. If you are not simulating under an inferred history, simply omit this option.

You can also set the maximum recombination rate to be simulated using --upperRhoThetaRatio. If you have an a priori estimate of an upper bound on the ratio of rho to theta, set it here. Keep in mind that higher values will dramatically slow the coalescent simulations. We recommend using the default number of train/test/validation simulation examples, but you are free to simulate more. ReLERNN_SIMULATE then uses msprime to simulate 100k training examples and 1k validation and test examples.

All output files are generated in subdirectories within the path provided to --projectDir. You must use the same projectDir for all four ReLERNN commands. If you want to run ReLERNN on multiple populations/taxa, you can run them independently using a unique projectDir for each. This step is simulation heavy, and runtimes depend strongly on the inferred population size.
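
Because the accessibility mask must contain non-overlapping, ascending windows, it may be worth checking the BED file before starting a long simulation run. The following is a minimal sketch of such a check, not ReLERNN's own validation. The function name check_mask is hypothetical, and the check assumes standard BED semantics (zero-based, half-open intervals, so a window may start exactly where the previous one ended) and only compares windows within the same chromosome.

```python
def check_mask(bed_lines):
    """Return a list of problems found in a BED mask; empty if none.

    Hypothetical helper: within each chromosome, every window must have
    start < end and must begin at or after the end of the previous window.
    """
    problems = []
    last_end = {}  # chromosome name -> largest window end seen so far
    for lineno, line in enumerate(bed_lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue  # skip blank lines and optional BED header lines
        chrom, start, end = stripped.split()[:3]
        start, end = int(start), int(end)
        if start >= end:
            problems.append(f"line {lineno}: start {start} is not less than end {end}")
        prev = last_end.get(chrom)
        if prev is not None and start < prev:
            problems.append(
                f"line {lineno}: window overlaps or precedes the previous {chrom} window"
            )
        last_end[chrom] = end if prev is None else max(prev, end)
    return problems
```

An empty list means the mask passed this check; any returned strings identify the offending lines by their position in the file.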

The complete list of arguments used in ReLERNN_SIMULATE is found below:

ReLERNN_SIMULATE -h

usage: ReLERNN_SIMULATE [-h] [-v VCF] [-g GENOME] [-m MASK] [-d OUTDIR]
                        [-n DEM] [-u MU] [-l GENTIME] [-r UPRTR] [-t NCPU] [-s SEED]
                        [--phased] [--unphased] [--forceDiploid] [--phaseError PHASEERROR]
                        [--maxWinSize WINSIZEMX] [--maskThresh MASKTHRESH]
                        [--nTrain NTRAIN] [--nVali NVALI] [--nTest NTEST]

optional arguments:
  -h, --help            show this help message and exit
  -v VCF, --vcf VCF     Filtered and QC-checked VCF file. Important: Every row
                        must correspond to a biallelic SNP with no missing
                        data!
  -g GENOME, --genome GENOME
                        BED-formatted (i.e. zero-based) file corresponding to
                        chromosomes and positions to consider
  -m MASK, --mask MASK  BED-formatted file corresponding to inaccessible bases
  -d OUTDIR, --projectDir OUTDIR
                        Directory for all project output. NOTE: the same
                        projectDir must be used for all functions of ReLERNN
  -n DEM, --demographicHistory DEM
                        Output file from either stairwayplot, SMC++, or MSMC
  -u MU, --assumedMu MU
                        Assumed per-base mutation rate
  -l GENTIME, --assumedGenTime GENTIME
                        Assumed generation time (in years)
  -r UPRTR, --upperRhoThetaRatio UPRTR
                        Assumed upper bound for the ratio of rho to theta
  -t NCPU, --nCPU NCPU  Number of CPUs to use (defaults to total available cores)
  -s SEED, --seed SEED  Random seed
  --phased              VCF file is phased
  --unphased            VCF file is unphased
  --forceDiploid        Treats all samples as diploids
                        with missing data (bad idea; see README)
  --phaseError PHASEERROR
                        Fraction of bases simulated with incorrect phasing
  --maxWinSize WINSIZEMX
                        Max number of sites per window to train on. Important:
                        too many sites causes problems in training
  --maskThresh MASKTHRESH
                        Discard windows where >= maskThresh percent of sites
                        are inaccessible
  --nTrain NTRAIN       Number of training examples to simulate
  --nVali NVALI         Number of validation examples to simulate
  --nTest NTEST         Number
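
If you launch ReLERNN from a Python workflow, the flags listed above can be assembled into an argument list and passed to subprocess.run. The sketch below is only an illustration: the function names are hypothetical, all flag names are taken from the help text above, and the values in the usage note are placeholders rather than recommendations. It also encodes two rules stated earlier in this README: the input must end in .vcf, and the prediction file takes its prefix from the VCF file name.

```python
import shlex
from pathlib import Path


def simulate_command(vcf, genome, project_dir, mu, gen_time,
                     phased=False, mask=None, demographic_history=None):
    """Build an argument list for ReLERNN_SIMULATE (hypothetical helper).

    Only flags shown in the help text above are used; nothing is executed.
    """
    vcf = Path(vcf)
    if vcf.suffix != ".vcf":
        raise ValueError("ReLERNN_SIMULATE expects a file with the .vcf extension")
    cmd = [
        "ReLERNN_SIMULATE",
        "--vcf", str(vcf),
        "--genome", str(genome),
        "--projectDir", str(project_dir),
        "--assumedMu", str(mu),
        "--assumedGenTime", str(gen_time),
        "--phased" if phased else "--unphased",
    ]
    if mask is not None:
        cmd += ["--mask", str(mask)]
    if demographic_history is not None:
        cmd += ["--demographicHistory", str(demographic_history)]
    return cmd


def expected_prediction_file(vcf):
    """Name of the prediction file implied by the VCF prefix, per the README."""
    return Path(vcf).stem + ".PREDICT.txt"
```

For example, shlex.join(simulate_command("population7.vcf", "genome.bed", "./output/", mu=1e-8, gen_time=1)) gives a string you can inspect before running it, and expected_prediction_file("population7.vcf") names the file to look for once ReLERNN_PREDICT has finished. The mutation rate and generation time shown here are placeholder values.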
