About

Genomatnn is a program for detecting archaic adaptive introgression from genotype matrices, using a convolutional neutral network (CNN). The CNN is trained with the tensorflow deep learning framework, on simulations produced by the SLiM engine in stdpopsim. The trained CNN can then be used to predict genomic windows of adaptive introgression from empirical datasets in the vcf or bcf formats.

If you find this software useful, please cite: Gower et al. (2021), https://doi.org/10.7554/eLife.64669

Pre-trained models corresponding to the publication can be found in the pre-trained/ folder, and their predictions are in the predictions/ folder.

Installation

The most trouble-free way to use genomatnn is with a conda virtual environment. We assume you are using Linux. MacOS may also work, but this only been minimally tested.

Clone the genomatnn git repository.

git clone https://github.com/grahamgower/genomatnn.git
cd genomatnn

Create a new conda environment using one of the supplied environment files. Use environment.yml if you intend to do CPU-only training, or environment-gpu.yml for GPU-based training.

conda env create -f environment.yml -n genomatnn
# Activate the conda environment.
conda activate genomatnn
# If using an old version of conda, it may be neccessary to use a different
# command to activate the environment:
#source activate genomatnn

With the conda environment still activated, build/install genomatnn.
```
python setup.py build_ext -i
python setup.py install
```
Run the tests to check that installation was successful. The tests can take a few minutes to run.
```
nosetests -v tests
```

Please open a github issue if you have any trouble, or the tests fail. Be sure to include as much detail as possible.

Usage

The genomatnn command will provide a concise usage summary if it is invoked without parameters. Alternately, the -h option will print more detailed usage information. In the text below, the $ indicates the command prompt.

$ genomatnn
usage: genomatnn [-h] {sim,train,eval,apply,vcfplot} ...
genomatnn: error: the following arguments are required: subcommand

$ genomatnn -h
usage: genomatnn [-h] {sim,train,eval,apply,vcfplot} ...

Simulate, train, and apply a CNN to genotype matrices.

positional arguments:
  {sim,train,eval,apply,vcfplot}
    sim                 Simulate tree sequences.
    train               Train a CNN.
    eval                Evaluate trained CNN.
    apply               Apply trained CNN.
    vcfplot             Plot haplotype/genotype matrices from a VCF/BCF.

optional arguments:
  -h, --help            show this help message and exit

Here we see that genomatnn has several subcommands. sim to run simulations, train to train a CNN, eval to produce evaluation plots for a trained CNN, and apply to predict AI on empirical data by applying a trained CNN.

Each of the subcommands may also be invoked without parameters to produce a concise usage summary, or invoked with the -h option to get more detailed usage information.

$ genomatnn sim
usage: genomatnn sim [-h] [-j PARALLELISM] [-s SEED] [-v] [-n NUM_REPS] [-l]
                     conf.toml [modelspec]
genomatnn sim: error: the following arguments are required: conf.toml

$ genomatnn train 
usage: genomatnn train [-h] [-j PARALLELISM] [-s SEED] [-v] [-c] conf.toml
genomatnn train: error: the following arguments are required: conf.toml

$ genomatnn apply
usage: genomatnn apply [-h] [-j PARALLELISM] [-s SEED] [-v] [-p] conf.toml nn.hdf5
genomatnn apply: error: the following arguments are required: conf.toml, nn.hdf5

$ genomatnn eval
usage: genomatnn eval [-h] [-j PARALLELISM] [-s SEED] [-v] [--no-reliability]
                      conf.toml nn.hdf5
genomatnn eval: error: the following arguments are required: conf.toml, nn.hdf5

$ genomatnn sim -h
usage: genomatnn sim [-h] [-j PARALLELISM] [-s SEED] [-v] [-n NUM_REPS] [-l]
                     conf.toml [modelspec]

positional arguments:
  conf.toml             Configuration file.
  modelspec             Model specification to simulate. If not provided, modelspecs from
                        the config file will be simulated

optional arguments:
  -h, --help            show this help message and exit
  -j PARALLELISM, --parallelism PARALLELISM
                        Number of processes or threads to use for parallel things. E.g.
                        simultaneous simulations, or the number of threads used by
                        tensorflow when running on CPU. If set to zero, os.cpu_count() is
                        used. [default=0].
  -s SEED, --seed SEED  Seed for the random number generator [default=1836563304].
  -v, --verbose         Increase verbosity. Specify twice for messages from third party
                        libraries (e.g. tensorflow and matplotlib).
  -n NUM_REPS, --num-reps NUM_REPS
                        Number of replicate simulations. For each replicate, one simulation
                        is run for each modelspec. [default=1]
  -l, --list            List available model specifications.

There are some options which are common to all subcommands, such as the -j, -s, and -v options.

Configuration

From the usage information above, we can see that genomatnn subcommands require a configuration file. A genomatnn analysis proceeds through multiple steps, and each step needs to be configured consistently with the previous step(s). The most extreme example of this, is that the number of individuals that are simulated in the first sim step must match the empirical data used in the final apply step. While it may at first appear counterintuitive, this means that to do any simulations, we must first describe which individuals will be used from the vcf(s), and how these relate to the populations that will be simulated. The configuration file uses the toml format, and we provide an extensively commented example configuration.

Worked example

Here we provide a worked example for detecting adaptive introgression in humans, where Neanderthals are the donor population and Europeans are the recipient. Create a new working directory. We will use this to store the configuration file, the example vcf, and all output files.

$ mkdir ai_example
$ cd ai_example

Copy the Nea_to_CEU.toml, 1000g.Nea.22.1mb.vcf.gz, and 1000g.Nea.22.1mb.vcf.gz.csi files from the examples folder of your cloned genomatnn repository. We will also need the YRI.indlist and CEU.indlist files, which list the vcf's individual IDs for the YRI and CEU populations.

$ export GENOMATNN=/path/to/the/cloned/genomatnn
$ cp $GENOMATNN/examples/Nea_to_CEU.toml .
$ cp $GENOMATNN/examples/1000g.Nea.22.1mb.vcf.gz{,.csi} .
$ cp $GENOMATNN/examples/{YRI,CEU}.indlist .

VCF and populations

We will lightly modify the example configuration file. First, change the dir = ... line at the top to be the current directory. I.e.

dir = "."

Next, find the [vcf] section, and modify the chr = ... and file = ... parameters to be:

chr = [22]
file = "1000g.Nea.${chr}.1mb.vcf.gz"

This vcf file contains genotype calls for a 1 Mbp region on chr22, for the YRI and CEU populations from the 1000 genomes project, plus the Altai and Vindija Neanderthals.

Scrolling down the config file, we see the [pop] section, which describes which individuals in the vcf file correspond to which population labels. The possible population labels here are defined by the demographic model that will be simulated.

[pop]
# For each population, specify the individual IDs in the VCF. This can either
# be a list of IDs, or the name of a file containing the IDs (one per line).
# The population names must match those used for the simulations.
# The order of populations given here will be used for the ordering in the
# genotype matrices. It's recommended for the donor and recipient populations
# to be adjacent in the genotype matrices!
Nea = ["AltaiNeandertal", "Vindija33.19"]
CEU = "CEU.indlist"
YRI = "YRI.indlist"

Simulating

We are now ready to start simulating. We shall first list the model specifications that are currently defined by genomatnn.

$ genomatnn sim -l Nea_to_CEU.toml 
HomSap/PapuansOutOfAfrica_10J19/Neutral/slim
HomSap/PapuansOutOfAfrica_10J19/Neutral/msprime
HomSap/PapuansOutOfAfrica_10J19/DFE
HomSap/PapuansOutOfAfrica_10J19/AI/Den1_to_Papuan
HomSap/PapuansOutOfAfrica_10J19/AI/Den2_to_Papuan
HomSap/PapuansOutOfAfrica_10J19/Sweep/Papuan
HomSap/HomininComposite_4G20/Neutral/slim
HomSap/HomininComposite_4G20/Neutral/msprime
HomSap/HomininComposite_4G20/DFE
HomSap/HomininComposite_4G20/AI/Nea_to_CEU
HomSap/HomininComposite_4G20/Sweep/CEU

This indicates that there are two demographic models available for the HomSap species: PapuansOutOfAfrica_10J19 and HomininComposite_4G20. For each of these demographic models, there are multiple lines, which genomatnn refers to as modelspecs. These describe the different ways we can simulate a given demographic model. We will use the following modelspecs:

HomSap/HomininComposite_4G20/Neutral/slim: only neutral mutations are simulated, and SLiM will be used for the simulation (rather than msprime).
HomSap/HomininComposite_4G20/AI/Nea_to_CEU: adaptive introgression where a mutation is drawn in Neanderthals, passed to Europeans via admixture, and is then positively selected in Europeans.
HomSap/HomininComposite_4G20/Sweep/CEU: a selective sweep in Europeans.

In our configuration file, we see that these modelspecs are used in the [sim.tranche] section.

Genomatnn

Install / Use

README