ExDFOIL
Automated, exhaustive application of DFOIL to multi-individual RADseq data using GNU Parallel
Install / Use
/learn @slambert-git/ExDFOILREADME
Exhaustive D<sub>FOIL</sub> (ExD<sub>FOIL</sub>)
This repository contains scripts and example data for the "exhaustive" D<sub>FOIL</sub> (https://github.com/jbpease/dfoil) analyses of Lambert et al. "Inferring historical introgression using RADseq and D<sub>FOIL</sub>: power and pitfalls revealed in a case study of spiny lizards (Sceloporus)" (in press, Molecular Ecology Resources). The idea behind "ExD<sub>FOIL</sub>" is simply the application of the D<sub>FOIL</sub> method for inferring introgression to all combinations of sequences from a multiple sequence alignment that meet the assumptions of D<sub>FOIL</sub>, according to a user-defined, rooted phylogenetic tree. The ExD<sub>FOIL</sub> helper scripts will automate and parallelize this process, useful when the number of unique combinations of individuals to consider is large (e.g., >200,000 as in the manuscript). Example scripts for summarizing and visualizing the results over phylogenetic and geographic space in R are also provided. Scripts to run single-site-per-locus and bootstrap resampling analyses are found in the AppendixS2 folder. Scripts for converting from .vcf to TreeMix format (using individuals or populations) are found in the AppendixS3 folder.
What's it for?
In the manuscript, we combine this ExD<sub>FOIL</sub> approach with RADseq data to assess the evidence for nuclear introgression between two spiny lizard species for which mitochondrial data independently suggests recurrent introgression. We think the ExD<sub>FOIL</sub> approach could also be used as a tool for data exploration without a specific a-priori hypothesis for introgression.
Disclaimer / License
ExD<sub>FOIL</sub> is a collection of R and shell scripts for 1) selecting appropriate combinations of taxa for D<sub>FOIL</sub> (https://github.com/jbpease/dfoil) given a rooted phylogenetic tree, 2) running D<sub>FOIL</sub> tests in parallel using GNU parallel (https://www.gnu.org/software/parallel/), and 3) summarizing and visualizing the results in R.
It is not standalone software, and it does not alter the underlying D<sub>FOIL</sub> method in any way. It is provided under a GNU General Public License (see LICENSE.txt).
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Dependencies
Unix shell (tested with GNU bash v4.3.48(1)-release and 4.1.2(2)-release), GNU awk (tested with v3.1.7 and v4.1.3), GNU parallel (tested on 20160722 and 20141022), R v3.4.1 and the packages stringr v1.0.0 / phytools v0.6-44 / ape v5.0 / ggtree v1.10, and D<sub>FOIL</sub> https://github.com/jbpease/dfoil (commit on 09-19-2018). Also required is selectSeqs.pl by Dr. Naoki Takebayashi (found in Scripts/). All versions provided are those tested on; older or younger versions may or may not work.
Note that the scripts for Stage 2) assume that you have the scripts from Scripts/ in your working directory with executable permissions (or $PATH), as well as the scripts of D<sub>FOIL</sub>.
Quick Start
Select taxa:
To just select taxa for D<sub>FOIL</sub>, you will need 1) a list of taxa (mynamesfile) and 2) a rooted phylogenetic tree (mytreefile):
DFOIL_Picker.R mynamesfile mytreefile
Run exhaustive D<sub>FOIL</sub>:
To run all D<sub>FOIL</sub> tests in parallel, you will need 1) A list of taxon combinations (mynames) formatted like the output of DFOIL_Picker.R 2) a fasta file (myfasta) and 3) a file containing the name of the outgroup sequence (myoutgroup).
## convert fasta files to count files
fasta2foiler.sh mynames myfasta myoutgroup
## run dfoil
dfoiler_alt.sh
## summarize the results
parallel 'summarizer_alt.sh < <(echo {})' ::: counts/*.counts
cat summary_alt/*.summary_alt > all.summary.alt.txt
Tutorial
This section will guide the user through the selection of taxa, application of D<sub>FOIL</sub> in parallel, and subsequent visual summaries of results conducted for Lambert et al. (submitted) step-by-step. It is divided into three major stages: 1) Selection of Taxa; 2) Execution of D<sub>FOIL</sub> using GNU parallel; and 3) Summarizing and Visualizing Results. Example input files provided, allowing the user to reproduce the analyses of the targeted dataset with reduced-individual sampling from Lambert et al. (submitted). Note however that there are >30,000 tests to run in this case, and >100,000 output files will be produced. Using 28 cores on the University of Arizona HPC, running Stage 2) for the example data took between 1.5 and 2 hours. Computational time should scale roughly linearly with the number of processors.
Stage 1: Selection of Taxa
The R script DFOIL_Picker.R will accept as standard input the file paths of (1) a list of taxa and (2) a rooted phylogenetic tree; and return all unique sets of four taxa that are arranged in a symmetric tree, listing the younger clade first. It is important that the younger clade is listed first because the D<sub>FOIL</sub> script fasta2dfoil.py expects this, and if the order is switched, ancestral introgression signatures are not interpretable (Pease & Hahn 2015). The members of each possible symmetrical four-taxon subtrees are written to a line of ./DFOIL_picked.txt as e.g. "taxon1 taxon2 taxon3 taxon4".
Note that DFOIL_Picker.R will not check if your tree is rooted for you! It can be executed as follows:
chmod +x DFOIL_Picker.R
Rscript DFOIL_Picker.R mynamesfile mytreefile
Where mynamesfile is the path to a plain-text file of taxon names, each on a separate line, and mytreefile is the path to a Newick-format treefile that contains (at least) all of the taxa in the names file. Example input files are provided as ornatus_names.txt and ornatus_tree.txt.
For the manuscript, I excluded combinations that did not have at least one representative of S. ornatus and one representative of S. oberon, using grep on the output of DFOIL_Picker.R:
grep "oberon" DFOIL_picked.txt | grep "ornatus" > names_filtered.txt
You may want to do something similar if there are hypotheses that you're not interested in testing; for example, tests involving only one species. We will use this filtered file, which should contain 32,368 unique combinations, for the subsequent stages of the tutorial.
Stage 2: Running D<sub>FOIL</sub> with GNU Parallel
Getting site-pattern counts
We then use the D<sub>FOIL</sub> script fasta2dfoil.py to get the site-pattern count data for each of our tests, parallelizing using GNU parallel. This step uses the script fasta2foiler.sh with three command line arguments:
fasta2foiler.sh mynames myfasta myoutgroup
mynames is the path to the file containing names as formatted by DFOIL_Picker.R (e.g., names_filtered.txt if following the tutorial)
myfasta is the path to a fasta file containing (at least) all individuals listed ("oberon_ornatus_reduced.fasta" for the example data).
myoutgroup is the path to a file with one line, containing the name of the outgroup sequence. ("minor_OG.txt" for the example data).
The fasta2foiler.sh command will subset the fasta file containing all individuals to contain only the individuals in each test using N. Takebayashi's perl script selectSeqs.pl. This script will also ensure that the taxa in the correct order for D<sub>FOIL</sub>. However, process substitution <(command here) is used to pass the output directly to fasta2dfoil.py, so these subsetted fasta files are not written permanently to disk. Files containing the site-pattern counts for each test are written as e.g., taxon1.taxon2.taxon3.taxon4.counts in a folder ./counts/.
Here is the content of the fasta2foiler.sh script:
#!/bin/bash
mkdir -p counts/
names=$1
fasta=$2
outgroup=$3
cat $names | parallel --env outgroup --env fasta 'fasta2dfoil.py <(selectSeqs.pl -f <(echo {} | tr " " $"\n"; cat '$outgroup') '$fasta') --out counts/$(echo {} | tr " " ".").counts --names $(echo {} | tr " " ","),$(cat '$outgroup')'
Please note that running the example data will take a while, there are >30,000 tests, and this step is the longest of any, per-test. Removing any sequences from the fasta file that are not used at all in the tests should save some computational time (you can use selectSeqs.pl for this). Using 28 processors on the UA HPC, this step still takes approximately one hour for the example data.
You could of course save the .fasta files to disk; this would save computational time if the counts needed to be re-done (at the expense of disk space). There is a script in the AppendixS2 folder that will accept a directory of fasta files rather than a single fasta( ). However, selectSeqs.pl is still run on the resulting fastas in order to ensure the taxa are in the correct order for D<sub>FOIL</sub>; this step could also be ran in advance to save more compuational time.
Running D<sub>FOIL</sub>
We then run dfoil.py on each of the count files using the wrapper script dfoiler_alt.sh. In the same directory containing your "names/" and "counts/" directories, execute, e.g.:
dfoiler_alt.sh
Here are the contents of the dfoiler_alt.sh script:
#!/bin/bash
mkdir -p dfoil
mkdir -p precheck
find ./counts/ -type f -name '*.counts' | parallel python dfoil.py --mode dfoilalt --infile {} --out
