
AdaptiPhy

This computational method has been published in BMC Genomics as: "Identifying branch-specific positive selection throughout the regulatory genome using an appropriate proxy neutral"


<p align="center"> <img src="adaptiphy_logo.png" alt="Adaptiphy logo" width="400"> </p>

AdaptiPhy: Implementation with snakemake

This code updates the existing AdaptiPhy 2.0 pipeline to run using snakemake, allowing the user to plug in data at the beginning of the pipeline and have the intermediate steps performed automatically and reproducibly.

If you are a snakemake 1.0 user, note that we have deprecated the original version; those files can be found in this repo under the folder 'deprecated'.

1. Downloading the AdaptiPhy pipeline 2.0 to run with snakemake

1a. Recommended: Download the tarball release from the "Releases" page

To use the AdaptiPhy 2.3 compressed static release, navigate to the Releases menu on the right-hand side of this README landing page, then download and unzip the file. Please note that the test data used to confirm that AdaptiPhy is functioning properly must be downloaded separately from Zenodo and moved into a subdirectory called "data".

1b. Not 100% recommended: Clone the git repo

To clone this repo from the command line into your working directory, use:

git clone https://github.com/wodanaz/adaptiPhy

You will need to add your files to the data/ directory before running the snakemake pipeline for the first time (or download the test data from Zenodo as described above). Read on for more info about the necessary file structure in this folder!

2. Dependencies: Installing snakemake in a conda environment

The majority of the conda packages required by the pipeline are loaded automatically, so you will not need to install them manually. However, you will need to create a top-level environment containing snakemake and python in order to run this pipeline.

Please install it in a directory of your choosing where you keep conda environments. You can use the following .yaml file structure:

nano snakemake.yml 
name: snakemake
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - snakemake=9.1
  - python=3.11
  - snakemake-executor-plugin-slurm
  - bzip2

Note that snakemake-executor-plugin-slurm and bzip2 may be excluded if you have no intention of running AdaptiPhy on a SLURM-managed computing cluster and/or prefer to decompress files with a tool other than bzip2.
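For example, if you have no need for SLURM support and are happy with your system's decompression tools, a trimmed environment file (a sketch based on the one above, with the two optional packages dropped) could look like:

```yaml
name: snakemake
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - snakemake=9.1
  - python=3.11
```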

To initialize a conda environment from this .yaml file:

conda env create --file snakemake.yml 

To invoke or activate this environment to run AdaptiPhy 2.0, load it with:

conda activate snakemake

Advanced: If you don't set a --prefix, make sure that your .condarc file has a specified location to save environments to. If you are using a shared computing space, not specifying a stable install location can result in a loss of environment files.

3. Change the necessary file parameters

To run the snakemake pipeline either interactively or through a job manager like SLURM, you will need to update some file paths and other information in your copy of this repository.

  1. ./config.yaml: This is the file where you will need to update the most information. You will need:
    • windows: path to a list of ATAC peaks or similar targets (your query genome coordinates) in BED file format

    • num_replicates and min_frac: most users will not need to adjust these parameters. num_replicates determines the number of reference alignments sampled by HyPhy to compare the query alignment against. min_frac determines the filter percentage for screening out high N/missing sequence alignments in the reference.

    • tree_topology and foreground_branches: provide a phylogenetic tree in Newick format and specify which branches are focal. For the first parameter, use standard Newick format. For the second, provide a vector of the focal branch names from the Newick tree.

    • maf_pattern and fa_pattern: provide paths to files (wildcards permitted) for one or many chromosomes. Note that both the .maf (multi alignment) and .fa (nucleotide) files are required.

    • neutral_set: provide a path to a .txt file that contains paths to neutral proxy files, or set this parameter to "goodalignments.txt" if running AdaptiPhy in local mode. More on this later!

    • chromosomes: provide a vector of chromosomes to examine.

      Example:

# INPUT SPLITS ##############################################################################################################
windows: "data/ncHAE.v2.bed" # thurman.bed is a large run; that file contains > 100K windows, which implies > 1M replicates. Try a smaller bed file first ... ncHAE.v2.bed contains 2408 human accelerated elements
num_replicates: 10
min_frac: 0.9

# TREE TOPOLOGY #############################################################################################################
tree_topology: "(rheMac3,(ponAbe2,(gorGor3,(panTro4,hg19))))"
#tree_topology_named: "(rheMac3,(ponAbe2,(gorGor3,(panTro4,hg19)Node6)))"  # OPTIONAL, add the internal node id for phyloFit, to know the name assigned by ADAPTIPHY/HYPHY, please look into the readme.md
foreground_branches: ["hg19", "panTro4"]  # if running an inner branch as foreground, please add branch here (i.e. foreground_branches: ["hg19", "Node6"] )

# GENOME TARGET FILES #######################################################################################################
# provide the input file to be split by phast's msa_split here. this file can be in a .fasta, phylip, mpm, maf, or ss file
# format. msa_split will try to guess the contents.
# if this fails, the snakefile may need to be modified to have an --in-format parameter specifying the file type. We typically
# provide a MAF file.
maf_pattern: "data/{chrom}.primate.maf"
#if providing a MAF file, provide the reference sequence location here.
fa_pattern: "data/{chrom}.fa"

# LOCAL VS GLOBAL RUN SPECIFICATION ##########################################################################################
# If running a local version of adaptiphy, no neutral sequence is required. Set the parameter below to "goodalignments.txt".
# If running a global version of adaptiphy, provide a neutral set file. Keep in mind that if you perform a local run of
# adaptiphy (i.e. set neutral_set to "goodalignments.txt") on a whole-genome dataset, your neutral set is effectively
# a random sampling of the genome, which may not have a significant effect (see Berrio et al., BMC Genomics) but caveat emptor.
neutral_set: "neutral_smk/neutralset.txt"
#options are: local = "goodalignments.txt", global = path to neutral set
chromosomes: ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19","chr20","chr21","chr22", "chrX"]
# "chr" if there is only one sequence in the file provided (i.e. a viral genome with a single chromosome), or specific
# chromosomes to target if using a multi-chromosome genome (i.e. "chr19", etc.)
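As a quick sanity check on tree_topology and foreground_branches, you can verify that every foreground branch name actually appears in the Newick string before launching the pipeline. The helper below is a hypothetical sketch, not part of AdaptiPhy; it only performs a naive label extraction, without parsing branch lengths or tree structure:

```python
import re

def newick_labels(tree):
    """Return all leaf and internal-node labels found in a Newick string."""
    return set(re.findall(r"[A-Za-z][A-Za-z0-9_.]*", tree))

def missing_foreground(tree_topology, foreground_branches):
    """List foreground branches that are absent from the tree."""
    labels = newick_labels(tree_topology)
    return [b for b in foreground_branches if b not in labels]

# Example values from the config.yaml above (internal node named for phyloFit).
tree_topology = "(rheMac3,(ponAbe2,(gorGor3,(panTro4,hg19)Node6)))"
print(missing_foreground(tree_topology, ["hg19", "Node6"]))  # → []
```

An empty list means every focal branch is spelled consistently with the tree; a typo such as "hg38" would be reported before any HyPhy jobs are launched.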
  2. data/: your input data lives in this folder. To run the AdaptiPhy pipeline, this folder must contain:
    • a folder containing MAF and .fa files, matching the specified 'pattern' paths in your config.yaml file from the previous step
    • a file of target windows/peak calls, matching the 'windows' path in your config.yaml file from the previous step
    • if running AdaptiPhy in global mode (more on this later), a .txt file containing a list of paths to neutral proxy .fa files and a directory containing those neutral proxy .fa files

The example below shows the expected data structure for a run that identifies the neutral proxy and the fast-evolving regions:

.
|-- adaptiphy-launch-slurm.sh
|-- config.yaml
|-- data
|   |-- chr10.fa
|   |-- chr10.primate.maf
|   |-- chr11.fa
|   |-- chr11.primate.maf
     ...
|   |-- hg19
|   |-- ncHAE.v2.bed
|   |-- ncHAE.v3.bed
|   |-- thurman.bed
|   `-- thurman.v3.bed
|-- envs
|   `-- biopython.yaml
|-- local_data
|   |-- allpeaks.bed
|   |-- peaks.bed
|   |-- SARS_CoV_2.fasta
|   `-- SARS_CoV_2.multiple.fasta
|-- neutral_smk
|   |-- config.yaml
|   |-- data
|   |   |-- chr10.masked.fa
|   |   |-- chr10.masked.maf
|   |   |-- chr11.masked.fa
|   |   |-- chr11.masked.maf
         ...
|   |   |-- hg19.fa
|   |   `-- hg19.fa.fai
|   |-- envs
|   |   `-- biopython.yaml
|   |-- neutrality-launch-smk.sh
|   |-- scripts
|   |   |-- parse_neutral.py
|   |   `-- select_and_filter_neutral.py
|   |-- slurm_general
|   |   `-- config.yaml
|   `-- Snakefile
|-- scripts
|   |-- alt4-fgrnd_spec.model
|   |-- bf_generator.py
|   |-- calculate_zeta.py
|   |-- DictGen.py
|   |-- extract_res.py
|   |-- null4-fgrnd_spec.model
|   `-- select_and_filter.py
|-- slurm_general
|   `-- config.yaml
`-- Snakefile
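Before submitting a run, it can help to confirm that the paths in config.yaml resolve to real files in this layout. The following pre-flight check is a hypothetical sketch (the missing_inputs helper and the example dict are not part of the pipeline); it expands the {chrom} wildcard for each chromosome and reports anything missing:

```python
from pathlib import Path

def missing_inputs(config, root="."):
    """Return a list of configured input paths that do not exist under root."""
    root = Path(root)
    missing = []
    if not (root / config["windows"]).exists():
        missing.append(config["windows"])
    for chrom in config["chromosomes"]:
        for key in ("maf_pattern", "fa_pattern"):
            path = config[key].format(chrom=chrom)
            if not (root / path).exists():
                missing.append(path)
    return missing

# Mirrors (a subset of) the example config.yaml above.
example = {
    "windows": "data/ncHAE.v2.bed",
    "maf_pattern": "data/{chrom}.primate.maf",
    "fa_pattern": "data/{chrom}.fa",
    "chromosomes": ["chr10", "chr11"],
}
```

Running missing_inputs(example) from the repository root and getting back an empty list confirms the data/ folder matches the config before snakemake is invoked.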

The next two files to modify are only required if you intend to run AdaptiPhy 2.0 with the job handler SLURM. Ignore these if you will only be running AdaptiPhy locally, or on an interactive node/on a different job scheduler in interactive mode.

  3. ./adaptiphy-launch-slurm.sh (optional): update this script if you are planning on using SLURM as a job manager to run the AdaptiPhy snakemake pipeline (preferred).
    • modify the header of this file to point to your snakemake conda env and email. Example:
#!/usr/bin/env bash
#SBATCH --mail-type=END
#SBATCH --mail-user=email@university.edu
#SBATCH -N 1
#SBATCH --account=sciencelab
#SBATCH --partition=common
#SBATCH --mem=10G
#SBATCH -J adaptiphy
#SBATCH --time=3-00:00:00

set -euo pipefail

source ~/miniconda3/etc/profile.d/conda.sh
conda activate snakemake

snakemake \
  --profile slurm_general \
  --use-conda \
  --conda-prefix /path/to/your/conda/directories \
  --keep-going
  4. slurm_general/config.yaml (optional): update this file if you are planning on using SLURM as a job manager to run the AdaptiPhy snakemake pipeline (preferred).