SkillAgentSearch skills...

DisperseNN

Estimating dispersal using deep neural networks

Install / Use

/learn @kr-colab/DisperseNN
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Be sure to check out our newer method, disperseNN2.

disperseNN

disperseNN is a machine learning framework for predicting σ, the expected per generation displacement distance between offspring and their parent(s), from genetic variation data. disperseNN uses training data generated from simulations. See Smith et al. for a comprehensive description of the method.

Install requirements

First clone this repository:

git clone https://github.com/chriscrsmith/disperseNN.git
cd disperseNN/

Next create a conda environment using our provided yaml file. This will create a virtual environment, and then install all the dependencies.

mamba env create -f conda_env.yml

After conda has done its thing, activate the new environment to use disperseNN

conda activate disperseNN

Overview

disperseNN has two modes:

1. Prediction

Command line flag: --predict

Input options:

  • VCF
  • tree sequences
  • pre-processed numpy arrays

2. Training

Command line flag: --train

Input options:

  • tree sequences
  • pre-processed numpy arrays

Within each mode- predict or train- you may specify different types of input data, each requiring its own set of additional command line parameters. More details are below.

The pre-trained model

While disperseNN can be trained from scratch, we recommend trying the pre-trained model provided in this repository first (see Smith et al. for model details). Before handing an empirical VCF to disperseNN, it should undergo basic filtering steps to remove non-variant sites and indels; rare variants should be left in. Furthermore, the VCF should include only the individuals that you intend to analyze. At least 5,000 SNPs are required for the current model, but more than 5,000 SNPs can be left in the VCF because disperseNN will draw a random subset. Last, a .locs file should be prepared with two columns corresponding to the latitude and longitude of each individual.

To run the example commands, begin by setting up a new working directory:

mkdir -p temp_wd/

Below is an example command for estimating σ from a VCF file using a pre-trained model (should take <30s to run).

python disperseNN.py \
  --predict \
  --load_weights Saved_models/pretrained082522_model.hdf5 \
  --training_params Saved_models/pretrained082522_training_params.npy \
  --empirical Examples/VCFs/halibut \
  --num_reps 10 \
  --out temp_wd/out_vcf \
  --seed 12345

Explanation of command line values:

  • load_weights: this specifies the path to the saved model. The above command points to a particular pre-trained model, pretrained082522_model.hdf5, but instructions for how to train a new model from scratch are described below.
  • training_params: a single file with several parameters used to train the above model. Included are: the mean and standard deviation used to normalize the training targets, and the number of SNPs and maximum sample size used to train the above model. In this case, the number of SNPs is 5,000 and the max sample size is 100.
  • empirical: this flag is specific to analyzing empirical VCFs. Give it the shared prefix for the .vcf and .locs files (i.e. without '.vcf' or '.locs')
  • num_reps: number of repeated draws (of 5,000 SNPs in this case) from the VCF, with a single prediction per subset.
  • out: output prefix.
  • seed: random number seed.

In addition to printing information about the model architecture to standard output, this command will also create a new file, temp_wd/out_vcf_predictions.txt, containing:

Examples/VCFs/halibut_0 4.008712555
Examples/VCFs/halibut_1 2.7873949732
Examples/VCFs/halibut_2 3.7759448146
Examples/VCFs/halibut_3 3.2785118587
Examples/VCFs/halibut_4 2.6940501913
Examples/VCFs/halibut_5 2.8515263298
Examples/VCFs/halibut_6 3.1886536211
Examples/VCFs/halibut_7 2.5544670147
Examples/VCFs/halibut_8 2.7795463315
Examples/VCFs/halibut_9 3.9511181921

Where each line is one of the 10 predictions of σ using a random subset of 5K SNPs.

Instructions with example commands

Simulation

The pre-trained model provided may not be appropriate for your data. In this case, it is possible to train a new model from scratch from a simulated training set. We use the SLiM recipe SLiM_recipes/bat20.slim to generate training data (tree sequences). The model is adapted from Battey et al. (2020), but certain model parameters are specified on the command line.

As a demonstration, see the below example command (this simulation may run for several minutes, but feel free to kill it with ctrl-C; we don't need this output for any downstream steps):

slim -d SEED=12345 \
     -d sigma=0.2 \
     -d K=5 \
     -d mu=0 \
     -d r=1e-8 \
     -d W=50 \
     -d G=1e8 \
     -d maxgens=100000 \
     -d OUTNAME="'temp_wd/output'" \
     SLiM_recipes/bat20.slim
       # Note the two sets of quotes around the output name

Command line arguments are passed to SLiM using the -d flag followed by the variable name as it appears in the recipe file.

  • SEED - a random seed to reproduce the simulation results.
  • sigma - the dispersal parameter.
  • K - carrying capacity.
  • mu - per base per genertation mutation rate.
  • r - per base per genertation recombination rate.
  • W - the height and width of the geographic spatial boundaries.
  • G - total size of the simulated genome.
  • maxgens - number of generations to run simulation.
  • OUTNAME - prefix to name output files.

Note: after running SLiM for a fixed number of generations, the simulation is still not complete, as many trees will likely not have coalesced still. We usually finish the simulation backwards in time using mspime, e.g., "recapitation". This can either be done before training (recommended; see example below) or during training using the recapitate flag.

Simulation programs other than SLiM may be used to make training data, as long as the output is processed into tensors of the necessary shape. Given the strict format of the input files, we do not recommend users attempt to generate training data from sources other than SLiM.

In practice, we use 1000 or more simulations like the above one for a given training run (and then subsample from each simulation to achieve a training set of 50,000). Ideally simulations should be run on a high performance computing cluster.

Training

Below is an example command for the training step. This example uses tree sequences as input (again, feel free to kill this command).

python disperseNN.py \
  --train \
  --tree_list Examples/tree_list1.txt \
  --mutate True \
  --min_n 10 \
  --max_n 10 \
  --edge_width 3 \
  --sampling_width 1 \
  --num_snps 1000 \
  --repeated_samples 100 \
  --batch_size 10 \
  --threads 1 \
  --max_epochs 10 \
  --seed 12345 \
  --out temp_wd/out1
  • tree_list: list of paths to the tree sequences. σ values and habitat widths are extracted directly from the tree sequence.
  • mutate: add mutations to the tree sequence until the specified number of SNPs are obtained (5,000 in this case, specified inside the training params file).
  • min_n: specifies the minimum sample size.
  • max_n: paired with min_n to describe the range of sample sizes to drawn from. Set min_n equal to max_n to use a fixed sample size.
  • edge_width: this is the width of edge to 'crop' from the sides of the habitat. In other words, individuals are sampled edge_width distance from the sides of the habitat.
  • sampling_width: samples individuals from a restricted sampling window with width between 0 and 1, in proportion to the habitat width, after excluding edges.
  • num_snps: the number of SNPs to use as input for the CNN.
  • repeated_samples: this is the number of repeated draws of n individuals to take from each tree sequence. This let's us get away with fewer simulations.
  • batch_size: for the data generator. We find that batch_size=40 works well if the training set is larger.
  • threads: number of threads to use for multiprocessing during the data generation step.
  • max_epochs: maximum number of epochs to train for.
  • seed: random number seed.
  • out: output prefix.

This run will eventually print the training progress to stdout, while the model weights are saved to temp_wd/out1_model.hdf5.

Also, this example command is small-scale; in practice, you will need a training set of maybe 50,000, and you will want to train for longer than 10 epochs.

Prediction: tree sequences as input

If you want to predict σ in simulated tree sequences, such as those generated by msprime and SLiM, the predict command will look a bit different than predicting with a VCF as input. Below is an example command (should take <30s to run). Each command line flag is described in the preceding examples.

python disperseNN.py \
  --predict \
  --load_weights Saved_models/pretrained082522_model.hdf5 \
  --training_params Saved_models/pretrained082522_training_params.npy \
  --tree_list Examples/tree_list1.txt \
  --mutate True \
  --min_n 10 \
  --edge_w

Related Skills

View on GitHub
GitHub Stars7
CategoryDevelopment
Updated1y ago
Forks3

Languages

Python

Security Score

55/100

Audited on Aug 2, 2024

No findings