AdmixPipe v3: A Method for Parsing and Filtering VCF and PLINK Files for Admixture Analysis

A pipeline that accepts VCF and PLINK files to run through Admixture

Citing AdmixPipe

If using version 3.0+ of this pipeline, please cite the following paper:

S.M. Mussmann, M.R. Douglas, T.K. Chafin, M.E. Douglas 2023. ADMIXPIPE v3: Facilitating Population Structure Delimitation from SNP Data. Bioinformatics Advances 3(1):vbad168. DOI: 10.1093/bioadv/vbad168

If using v2.x or earlier, please cite the following paper:

S.M. Mussmann, M.R. Douglas, T.K. Chafin, M.E. Douglas 2020. AdmixPipe: population analyses in ADMIXTURE for non-model organisms. BMC Bioinformatics 21:337. DOI: 10.1186/s12859-020-03701-4

IMPORTANT CHANGES IN v3.2 (updated 11-November-2023)

This README.md file is for AdmixPipe v3.2, which has several changes, bug fixes, and enhancements to existing modules. Two new modules were also developed for the following purposes:

submission of admixturePipeline.py output to the CLUMPAK pipeline.
assessment of the best K using the evalAdmix package (http://www.popgen.dk/software/index.php/EvalAdmix).

Some outputs from AdmixPipe v2.0 are not compatible with v3.2 because JSON files are now used to record data and file names from early parts of the pipeline that are needed by later modules. If you require the v2.0 scripts for any reason, they are still available from the prior releases on this page (v2.0.2 was the final release of AdmixPipe v2.0).

Other important notes for v3.2:

AdmixPipe v3.2 requires Python 3.
Some command line options have changed slightly, especially the long-form options. You can retrieve the current list of options for any module by executing it with the --help option.
A Docker container is now the preferred method for installation. The code in this GitHub repository may sometimes contain new features that have not yet been added to the Docker container.
CLUMPAK is now installed in the Docker container.
The submitClumpak.py module will submit admixturePipeline.py outputs to the Docker container installation of CLUMPAK ('Main pipeline' and 'BestK' pipeline).
The data processing and plotting functions of the cvSum.py module underwent a complete rewrite for v3.2.
PLINK .bed and .ped files are accepted as direct input. Individual-based missing data filtering is not enabled for PLINK files.
The '-r / --remove' option was removed from admixturePipeline.py. This option became redundant because individuals not listed in your popmap are now automatically filtered by both VCFtools and PLINK.

Installation & Setup for AdmixPipe v3:

Docker Setup

This pipeline was written for Unix-based operating systems, such as the various Linux distributions and Mac OS X. As of v3.2, we have achieved greater platform independence and ease of installation through development of a Docker container. This is the preferred method for running AdmixPipe. To get started, install Docker on your machine and pull the Docker image using the following command:

docker pull mussmann/admixpipe:3.2

Launch the container by first placing the runDocker.sh script in the folder from which you want to run the container. Then execute the script as shown below.

./runDocker.sh

This script creates a folder named "data" in the directory on your machine from which you launched the Docker container. You can put any input files for AdmixPipe v3.2 into this folder and they will be accessible inside the container (in /app/data/). Any outputs written to this folder and any of its subdirectories will still be accessible after you exit the container. If you write any output to other locations inside the container, they will be lost upon exit. All required AdmixPipe modules and dependency programs have been configured within the container and, unless noted otherwise, will function with the commands provided throughout the remainder of this documentation.
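The folder mapping described above can be illustrated with a minimal Python sketch. It assumes only what this README states (the host "data" folder appears at /app/data/ inside the container). The function name `container_path` and the example host paths are hypothetical and are not part of AdmixPipe.

```python
from pathlib import PurePosixPath

# Mount point inside the container, as stated in this README.
CONTAINER_DATA = PurePosixPath("/app/data")


def container_path(host_file, host_data_dir):
    """Return where a file under the host data folder appears in the container.

    Raises ValueError when the file is outside the data folder, because
    such files are not shared with the container.
    """
    relative = PurePosixPath(host_file).relative_to(PurePosixPath(host_data_dir))
    return CONTAINER_DATA / relative


# Hypothetical host paths, for illustration only.
print(container_path("/home/user/project/data/input.vcf",
                     "/home/user/project/data"))
```

The comparison is purely lexical, so it is only a way to reason about which files will be visible; it does not check that the files exist.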

If running the runDocker.sh script on your machine requires sudo permission, you can create a docker users group and add your username to that group. The following commands accomplish this when run from your own user account. To add a different user, replace ${USER} with their username:

sudo groupadd docker
sudo usermod -aG docker ${USER}

If you add a docker users group you may need to restart your computer before the changes take effect.

Manual Setup

Manual installation of AdmixPipe v3.2 is not advised due to the many required dependencies. However, if you insist upon installing the pipeline manually, detailed instructions are provided at the end of this guide.

Running AdmixPipe v3

AdmixPipe v3 is composed of five different modules. Follow the links below in the table of contents to find specific instructions for running each module. More detailed manual installation instructions are also provided if you cannot or do not want to use the Docker container.

1. admixturePipeline.py: <a name="admixturepipeline"></a>

This module takes standard genotype data files (VCF or BED/PED) as input, conducts filtering according to user-specified parameters, performs all necessary file conversions, and finally executes Admixture on the filtered dataset according to user-specified parameters.

New feature in AdmixPipe v3.2: This module now filters individuals that are absent from your popmap file. For example, if you want to exclude an individual sample from your analysis, just leave it out of your popmap file and it will be removed from your dataset before admixture is executed.
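The effect of leaving a sample out of the popmap can be sketched in Python as follows. This is an illustration of the behavior described above, not the pipeline's own code (the README states that the actual filtering is done by VCFtools and PLINK). The function names and the sample and population names are hypothetical.

```python
def read_popmap(lines):
    """Parse tab-delimited 'sample<TAB>population' lines into a dict."""
    popmap = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample, population = line.split("\t")
        popmap[sample] = population
    return popmap


def split_samples(dataset_samples, popmap):
    """Separate samples listed in the popmap from those that are absent."""
    kept = [s for s in dataset_samples if s in popmap]
    removed = [s for s in dataset_samples if s not in popmap]
    return kept, removed


# ind3 is deliberately left out of the popmap, so it would be filtered.
popmap = read_popmap(["ind1\tpopA", "ind2\tpopA", "ind4\tpopB"])
kept, removed = split_samples(["ind1", "ind2", "ind3", "ind4"], popmap)
print(kept)     # ['ind1', 'ind2', 'ind4']
print(removed)  # ['ind3']
```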

Usage: <a name="admixusage"></a>

You can run the program to print help options with the following command:

./admixturePipeline.py -h

Required options:<a name="admixoptions"></a>

-m / --popmap: Specify a tab-delimited population map (sample --> population). Click here for an example. The popmap is converted to a population list that can be input into a pipeline such as CLUMPAK (http://clumpak.tau.ac.il/) for visualization of data.

One of the following three options is also required:

-b / --bed: Specify a binary PLINK file (.bed) for input. This option disables some individual sample-based filtering options in the program.
-p / --ped: Specify a text-based PLINK file (.ped) for input. The file should have been produced using the --recode12 option in PLINK. This option disables some individual sample-based filtering options in the program.
-v / --vcf: Specify a VCF file for input.
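The requirement for a popmap plus one input file can be sketched as a small Python helper that assembles the command line. It uses only the short flags documented above (-m, -b, -p, -v). The helper name `build_command` and the file names are hypothetical, and treating the three input options as mutually exclusive is an assumption made for this sketch.

```python
import shlex


def build_command(popmap, vcf=None, bed=None, ped=None):
    """Assemble an admixturePipeline.py argument list.

    Assumption: exactly one of vcf, bed, or ped must be supplied.
    """
    inputs = {"-v": vcf, "-b": bed, "-p": ped}
    chosen = [(flag, path) for flag, path in inputs.items() if path is not None]
    if len(chosen) != 1:
        raise ValueError("provide exactly one of vcf, bed, or ped")
    flag, path = chosen[0]
    return ["./admixturePipeline.py", "-m", popmap, flag, path]


# Hypothetical file names, for illustration only.
cmd = build_command("popmap.txt", vcf="input.vcf")
print(shlex.join(cmd))  # ./admixturePipeline.py -m popmap.txt -v input.vcf
```

Building the command as a list (rather than a single string) means it could be passed directly to `subprocess.run` without shell quoting concerns.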

Optional arguments:

-n / --np: Specify the number of processors. Currently the only multithreaded program is Admixture.

Admixture optional arguments:

-c / --cv: Specify the cross-validation number for the Admixture program. See the Admixture manual for more information (default = 20).
-H / --haploid: If you want to perform haploid data analysis in Admixture, you can use this option to provide information in the same manner it would be provided to Admixture (e.g., -H "*", see page 10 of the Admixture user manual).
-k / --minK: Specify the minimum K value to be tested (default = 1).
-K / --maxK: Specify the maximum K value to be tested (default = 20).
-R / --rep: Specify the number of replicates for each K value (default = 20).
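Because every K value from minK through maxK is replicated, the total number of Admixture runs grows quickly. A minimal sketch of the arithmetic follows, assuming one Admixture run per replicate per K value. The function name `admixture_run_count` is hypothetical.

```python
def admixture_run_count(min_k=1, max_k=20, reps=20):
    """Number of Admixture runs implied by -k, -K, and -R.

    Defaults mirror the documented defaults (minK = 1, maxK = 20, rep = 20).
    """
    if min_k < 1 or max_k < min_k or reps < 1:
        raise ValueError("expected 1 <= min_k <= max_k and reps >= 1")
    return (max_k - min_k + 1) * reps


print(admixture_run_count())           # 400 runs with the defaults
print(admixture_run_count(1, 10, 10))  # 100 runs for K = 1..10, 10 replicates
```

Narrowing the K range or lowering the replicate count is therefore the most direct way to shorten a run.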

General filtering options (enabled for both VCFtools and PLINK):

-a / --maf: Enter a minimum frequency for the minor allele frequency filter. (default = off, specify a value between 0.0 and 1.0 to turn it on).
-M / --mac: Enter the minimum count for the minor allele filter. (default = off, specify a positive integer to turn it on).
-S / --snpcov: Filter SNPs based on proportion of allowable missing data. Feature added by tkchafin. (default = 0.1; defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed; input = float).
-t / --thin: Filter loci by thinning out any loci falling within the specified proximity to one another, measured in basepairs. (default = off, specify an integer greater than 0 to turn it on).
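The quantities behind these filters can be sketched in Python on toy data. This is an illustration only: the pipeline delegates filtering to VCFtools and PLINK, and the function names, the genotype encoding (alternate allele count 0, 1, or 2 per diploid sample, with None for missing), and the treatment of values exactly at a threshold are assumptions made for this sketch. The thinning function also assumes all positions are on a single chromosome.

```python
def call_rate(genotypes):
    """Proportion of samples with a non-missing genotype at one site."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_stats(genotypes):
    """Return (minor allele count, minor allele frequency) at one site.

    Only called (non-missing) diploid genotypes contribute alleles.
    """
    called = [g for g in genotypes if g is not None]
    total = 2 * len(called)
    alt = sum(called)
    mac = min(alt, total - alt)
    return mac, (mac / total if total else 0.0)


def thin(positions, min_distance):
    """Greedy thinning: keep a site only if it lies at least
    min_distance bp beyond the previously kept site."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_distance:
            kept.append(pos)
    return kept


site = [0, 1, None, 2, 0, 0, None, 1, 0, 0]  # 10 samples, 2 missing
print(call_rate(site))           # 0.8, which passes the default -S 0.1
print(minor_allele_stats(site))  # (4, 0.25)
print(thin([100, 150, 400, 450, 900], 200))  # [100, 400, 900]
```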

VCFtools filtering options:

-B / --bi: Turns the biallelic filter off (default = on; we do not recommend turning this setting off because ADMIXTURE only processes biallelic SNPs).
-C / --indcov: Filter samples based on maximum allowable missing data. Feature added by tkchafin. (default = 0.9, input = float).
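The individual coverage filter can be sketched the same way. The following minimal Python example assumes that -C is a ceiling on each sample's proportion of missing genotypes and that a sample exactly at the ceiling is retained; both points are assumptions for illustration. The function name `samples_passing_indcov` and the toy genotypes (None marks a missing call) are hypothetical.

```python
def samples_passing_indcov(genotypes_by_sample, max_missing=0.9):
    """Return samples whose missing-data proportion does not exceed max_missing.

    The default mirrors the documented default for -C / --indcov.
    """
    kept = []
    for sample, genotypes in genotypes_by_sample.items():
        missing = sum(g is None for g in genotypes) / len(genotypes)
        if missing <= max_missing:
            kept.append(sample)
    return kept


toy = {
    "ind1": [0, 1, 2, 0],              # 0% missing
    "ind2": [None, None, None, None],  # 100% missing
    "ind3": [None, 1, None, 0],        # 50% missing
}
print(samples_passing_indcov(toy))       # ['ind1', 'ind3']
print(samples_passing_indcov(toy, 0.25))  # ['ind1']
```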

Example:

The preferred usage of the program is to provide a .vcf file as input. The following command will run the program from K values 1 through 10, conducting 10 repetitions at each K value. Admixture will use 16 processors for execution during each re
