Genomatnn
Predicts adaptive introgression using a CNN trained on genotype matrices.
Install / Use
/learn @grahamgower/GenomatnnREADME
About
Genomatnn is a program for detecting archaic adaptive introgression from
genotype matrices, using a convolutional neutral network (CNN). The CNN is
trained with the tensorflow deep learning
framework, on simulations produced by the SLiM
engine in stdpopsim.
The trained CNN can then be used to predict genomic windows of adaptive
introgression from empirical datasets in the vcf or bcf formats.
If you find this software useful, please cite: Gower et al. (2021), https://doi.org/10.7554/eLife.64669
Pre-trained models corresponding to the publication can be found in the
pre-trained/ folder, and their predictions are in the predictions/ folder.
Installation
The most trouble-free way to use genomatnn is with a
conda virtual environment.
We assume you are using Linux.
MacOS may also work, but this only been minimally tested.
-
Clone the genomatnn git repository.
git clone https://github.com/grahamgower/genomatnn.git cd genomatnn -
Create a new conda environment using one of the supplied environment files. Use
environment.ymlif you intend to do CPU-only training, orenvironment-gpu.ymlfor GPU-based training.conda env create -f environment.yml -n genomatnn # Activate the conda environment. conda activate genomatnn # If using an old version of conda, it may be neccessary to use a different # command to activate the environment: #source activate genomatnn -
With the conda environment still activated, build/install genomatnn.
python setup.py build_ext -i python setup.py install -
Run the tests to check that installation was successful. The tests can take a few minutes to run.
nosetests -v tests
Please open a github issue if you have any trouble, or the tests fail. Be sure to include as much detail as possible.
Usage
The genomatnn command will provide a concise usage summary if it is invoked
without parameters. Alternately, the -h option will print more detailed
usage information. In the text below, the $ indicates the command prompt.
$ genomatnn
usage: genomatnn [-h] {sim,train,eval,apply,vcfplot} ...
genomatnn: error: the following arguments are required: subcommand
$ genomatnn -h
usage: genomatnn [-h] {sim,train,eval,apply,vcfplot} ...
Simulate, train, and apply a CNN to genotype matrices.
positional arguments:
{sim,train,eval,apply,vcfplot}
sim Simulate tree sequences.
train Train a CNN.
eval Evaluate trained CNN.
apply Apply trained CNN.
vcfplot Plot haplotype/genotype matrices from a VCF/BCF.
optional arguments:
-h, --help show this help message and exit
Here we see that genomatnn has several subcommands. sim to run simulations,
train to train a CNN, eval to produce evaluation plots for a trained CNN,
and apply to predict AI on empirical data by applying a trained CNN.
Each of the subcommands may also be invoked without parameters to produce
a concise usage summary, or invoked with the -h option to get more detailed
usage information.
$ genomatnn sim
usage: genomatnn sim [-h] [-j PARALLELISM] [-s SEED] [-v] [-n NUM_REPS] [-l]
conf.toml [modelspec]
genomatnn sim: error: the following arguments are required: conf.toml
$ genomatnn train
usage: genomatnn train [-h] [-j PARALLELISM] [-s SEED] [-v] [-c] conf.toml
genomatnn train: error: the following arguments are required: conf.toml
$ genomatnn apply
usage: genomatnn apply [-h] [-j PARALLELISM] [-s SEED] [-v] [-p] conf.toml nn.hdf5
genomatnn apply: error: the following arguments are required: conf.toml, nn.hdf5
$ genomatnn eval
usage: genomatnn eval [-h] [-j PARALLELISM] [-s SEED] [-v] [--no-reliability]
conf.toml nn.hdf5
genomatnn eval: error: the following arguments are required: conf.toml, nn.hdf5
$ genomatnn sim -h
usage: genomatnn sim [-h] [-j PARALLELISM] [-s SEED] [-v] [-n NUM_REPS] [-l]
conf.toml [modelspec]
positional arguments:
conf.toml Configuration file.
modelspec Model specification to simulate. If not provided, modelspecs from
the config file will be simulated
optional arguments:
-h, --help show this help message and exit
-j PARALLELISM, --parallelism PARALLELISM
Number of processes or threads to use for parallel things. E.g.
simultaneous simulations, or the number of threads used by
tensorflow when running on CPU. If set to zero, os.cpu_count() is
used. [default=0].
-s SEED, --seed SEED Seed for the random number generator [default=1836563304].
-v, --verbose Increase verbosity. Specify twice for messages from third party
libraries (e.g. tensorflow and matplotlib).
-n NUM_REPS, --num-reps NUM_REPS
Number of replicate simulations. For each replicate, one simulation
is run for each modelspec. [default=1]
-l, --list List available model specifications.
There are some options which are common to all subcommands, such as the
-j, -s, and -v options.
Configuration
From the usage information above, we can see that genomatnn subcommands
require a configuration file. A genomatnn analysis proceeds through
multiple steps, and each step needs to be configured consistently with the
previous step(s). The most extreme example of this, is that the number of
individuals that are simulated in the first sim step must match the
empirical data used in the final apply step. While it may at first appear
counterintuitive, this means that to do any simulations, we must first
describe which individuals will be used from the vcf(s), and how these
relate to the populations that will be simulated.
The configuration file uses the toml
format, and we provide an extensively commented
example configuration.
Worked example
Here we provide a worked example for detecting adaptive introgression in humans, where Neanderthals are the donor population and Europeans are the recipient. Create a new working directory. We will use this to store the configuration file, the example vcf, and all output files.
$ mkdir ai_example
$ cd ai_example
Copy the Nea_to_CEU.toml, 1000g.Nea.22.1mb.vcf.gz,
and 1000g.Nea.22.1mb.vcf.gz.csi files from the examples folder of
your cloned genomatnn repository. We will also need the YRI.indlist and
CEU.indlist files, which list the vcf's individual IDs for the YRI and
CEU populations.
$ export GENOMATNN=/path/to/the/cloned/genomatnn
$ cp $GENOMATNN/examples/Nea_to_CEU.toml .
$ cp $GENOMATNN/examples/1000g.Nea.22.1mb.vcf.gz{,.csi} .
$ cp $GENOMATNN/examples/{YRI,CEU}.indlist .
VCF and populations
We will lightly modify the example configuration file. First, change the
dir = ... line at the top to be the current directory. I.e.
dir = "."
Next, find the [vcf] section, and modify the chr = ... and file = ...
parameters to be:
chr = [22]
file = "1000g.Nea.${chr}.1mb.vcf.gz"
This vcf file contains genotype calls for a 1 Mbp region on chr22, for the YRI and CEU populations from the 1000 genomes project, plus the Altai and Vindija Neanderthals.
Scrolling down the config file, we see the [pop] section, which describes
which individuals in the vcf file correspond to which population labels.
The possible population labels here are defined by the demographic model
that will be simulated.
[pop]
# For each population, specify the individual IDs in the VCF. This can either
# be a list of IDs, or the name of a file containing the IDs (one per line).
# The population names must match those used for the simulations.
# The order of populations given here will be used for the ordering in the
# genotype matrices. It's recommended for the donor and recipient populations
# to be adjacent in the genotype matrices!
Nea = ["AltaiNeandertal", "Vindija33.19"]
CEU = "CEU.indlist"
YRI = "YRI.indlist"
Simulating
We are now ready to start simulating. We shall first list the model specifications that are currently defined by genomatnn.
$ genomatnn sim -l Nea_to_CEU.toml
HomSap/PapuansOutOfAfrica_10J19/Neutral/slim
HomSap/PapuansOutOfAfrica_10J19/Neutral/msprime
HomSap/PapuansOutOfAfrica_10J19/DFE
HomSap/PapuansOutOfAfrica_10J19/AI/Den1_to_Papuan
HomSap/PapuansOutOfAfrica_10J19/AI/Den2_to_Papuan
HomSap/PapuansOutOfAfrica_10J19/Sweep/Papuan
HomSap/HomininComposite_4G20/Neutral/slim
HomSap/HomininComposite_4G20/Neutral/msprime
HomSap/HomininComposite_4G20/DFE
HomSap/HomininComposite_4G20/AI/Nea_to_CEU
HomSap/HomininComposite_4G20/Sweep/CEU
This indicates that there are two demographic models available for the
HomSap species: PapuansOutOfAfrica_10J19 and HomininComposite_4G20.
For each of these demographic models, there are multiple lines, which
genomatnn refers to as modelspecs. These describe the different ways we can
simulate a given demographic model. We will use the following modelspecs:
HomSap/HomininComposite_4G20/Neutral/slim: only neutral mutations are simulated, and SLiM will be used for the simulation (rather than msprime).HomSap/HomininComposite_4G20/AI/Nea_to_CEU: adaptive introgression where a mutation is drawn in Neanderthals, passed to Europeans via admixture, and is then positively selected in Europeans.HomSap/HomininComposite_4G20/Sweep/CEU: a selective sweep in Europeans.
In our configuration file, we see that these modelspecs are used in the
[sim.tranche] section.
