WGSassign

Population assignment methods for whole-genome sequence and genotype likelihood data

WGSassign was developed to conduct standard population assignment methods using genotype likelihood data in Beagle format. This software facilitates the estimation of allele frequencies across reference populations, allows users to perform leave-one-out cross validation with reference populations, and assign individuals of unknown origin to reference populations. See below for descriptions of the functions.

For a more in-depth tutorial on using WGSassign with example data sets, as well as the data analysis components in R, see the workshop hands-on materials from ConGen 2024: https://github.com/mgdesaix/ConGen-2024

Get WGSassign and build

Dependencies

The WGSassign software relies on the following Python packages that you can install through mamba/conda (recommended) or pip:

numpy (<=1.22.3, current WGSassign functionality fails with >=1.22.4)
cython
scipy

You can create an environment through conda/mamba easily as follows using the environment.yml file provided within WGSassign (see below).

Install and build

Installation of WGSassign is simply:

git clone https://github.com/mgdesaix/WGSassign.git
cd WGSassign
mamba env create -f environment.yml
conda activate WGSassign
python setup.py build_ext --inplace
pip3 install -e .

You can now run WGSassign with the WGSassign command.

Note: WGSassign has been tested and used primarily on Linux systems. The standard compiler (clang) on MacOSX does not support openmp. To compile and use this on MacOSX you can use Homebrew to install gcc and openmp and then direct the compiler to use that. Doing so requires setting up some C paths, etc, before running the lines.

python setup.py build_ext --inplace
pip3 install -e .

Setting those paths and running the above lines can all be done within a subshell. See the file ./Mac-compile-and-pip-install.sh for an example.

Usage

Running WGSassign

WGSassign works directly on genotype likelihood files. Beagle genotype likelihood files can be generated from BAM files using ANGSD.

Options:

--% WGSassign
usage: WGSassign [-h] [-b FILE] [-t INT] [-o OUTPUT] [--maf_iter INT] [--maf_tole FLOAT] [--pop_af_IDs FILE] [--get_reference_af] [--pop_names FILE]
                 [--ne_obs] [--loo] [--loo_downsampled_beagle FILE] [--pop_af_file FILE] [--get_pop_like] [--get_assignment_z_score]
                 [--get_reference_z_score] [--ind_ad_file FILE] [--allele_count_threshold INT] [--single_read_threshold] [--ind_start INT]
                 [--ind_end INT] [--pop_like FILE] [--pop_like_IDs FILE] [--get_em_mix] [--get_mcmc_mix] [--mixture_iter INT]

options:
  -h, --help            show this help message and exit
  -b FILE, --beagle FILE
                        Filepath to genotype likelihoods in gzipped Beagle format from ANGSD
  -t INT, --threads INT
                        Number of threads
  -o OUTPUT, --out OUTPUT
                        Prefix for output files
  --maf_iter INT        Maximum iterations for minor allele frequencies estimation - EM (200)
  --maf_tole FLOAT      Tolerance for minor allele frequencies estimation update - EM (1e-4)
  --pop_af_IDs FILE     Filepath to individual IDs and populations for beagle
  --get_reference_af    Estimate allele frequencies for reference populations
  --pop_names FILE      Filepath to population names of allele frequency file
  --ne_obs              Estimate population and individuals effective sample sizes
  --loo                 Perform leave-one-out cross validation
  --loo_downsampled_beagle FILE
                        Optional Beagle file of downsampled genotype likelihoods to use for LOO assignment. (To test accuracy when assigned samples
                        have lower coverage)
  --pop_af_file FILE    Filepath to reference population allele frequencies
  --get_pop_like        Estimate log likelihood of individual assignment to each reference population
  --get_assignment_z_score
                        Calculate z-score for individuals
  --get_reference_z_score
                        Calculate z-score for individuals
  --ind_ad_file FILE    Filepath to individual allele depths, tab-delimited, .txt or .gz
  --allele_count_threshold INT
                        Minimum number of loci needed to keep a specific allele count combination
  --single_read_threshold
                        Use only loci with a single read. Helpful for computational efficiency when individuals's sequencing depths vary.
  --ind_start INT       Start analysis at this individual index (0-index: i.e. 0 starts with the 1st individual)
  --ind_end INT         End analysis at this individual index (0-index: i.e. If you have 10 individuals, 9 is the 10th individual)
  --pop_like FILE       Filepath to population assignment log likelihood file
  --pop_like_IDs FILE   Filepath to IDs for population assignment log likelihood file
  --get_em_mix          Estimate mixture proportions with EM algorithm
  --get_mcmc_mix        Estimate mixture proportions with MCMC algorithm
  --mixture_iter INT    Maximum iterations mixture estimation - EM (200)

Reference population allele frequencies

To create a file of allele frequencies across all reference populations is straightforward: provide 1) a beagle file and 2) a reference ID file. The reference ID file needs to be tab delimited and have 2 columns, with each row being an individual and the 2nd column specifying the reference population the individual belongs to. It is essential that the sample order in the ID file (row-wise) reflects the sample order in the Beagle file (column-wise); though header sample names in the beagle file don't have to match the first column of the ID file. The easiest way to keep the proper order is to create a reference file from whatever BAM list you used to create the beagle file in ANGSD.

Allele frequency is parallelized across loci, so feel free to crank up the threads with --threads.

Specifying --get_reference_af produces the reference population allele file, .pop_af.npy. It is a numpy binary with L (# loci) rows x K (reference population) columns.

The following code examples can be run using the data provided with WGSassign in the data/ directory. Note that population assignment with bird data is typically such that "breeding" individuals are of known origin and "nonbreeding" (e.g. wintering) individuals are of unknown breeding origin. Now you know why the following example variables and files are named such.

breeding_beagle=./data/amre.breeding.ind85.ds_2x.sites-filter.top_50_each.beagle.gz
breeding_IDs=./data/amre.breeding.ind85.reference_k5.IDs.txt
outname=./out/amre.breeding.ind85.ds_2x.sites-filter.top_50_each
# Estimate reference population allele frequencies using 20 threads
# Output = ${outname}.pop_af.npy (numpy binary matrix of size L (# loci) rows x K (ref pops) columns)
WGSassign --beagle ${breeding_beagle} --pop_af_IDs ${breeding_IDs} --get_reference_af --out ${outname} --threads 20

Note: Numpy binaries are produced without text headers. WGSassign will output a text file .pop_names.txt that specifies the order of the populations provided in the .pop_af.npy file.

Reference effective sample sizes

When producing the reference population allele frequency file, WGSassign also has other features you can specify. --ne_obs adds the calculation of the observed Fisher information across loci and the output is a numpy binary of .fisher_obs.npy of the same L x K dimensions. The effective sample size of the reference populations for each locus are also calculated and the output is a numpy binary .ne_obs.npy of the same dimensions. The mean effective sample sizes of the reference populations is provided as output in a text file, ne_obs.txt, with the first row being the population names and the second row being the effective sample size of the population. .ne_ind.txt is a single column of each individual's effective sample size (same order as the file you provide to --pop_af_ID).

WGSassign --beagle ${breeding_beagle} --pop_af_IDs ${breeding_IDs} --get_reference_af --ne_obs --out ${outname} --threads 20

Leave-one-out cross validation

Cross-validation is an important technique for determining assignment accuracy, but recalculating allele frequencies each time you remove an individual can be slow. Fortunately, WGSassign is fast enough to provide leave-one-out assignment in a reasonable amount of time even for large of beagle files (ex. ~ 5 million SNPs and 180 individuals = 30 min; 600k SNPs and 80 individuals = <1 min). When producing reference population allele frequencies (as described above), you can add --loo to specify the calculation of leave-one-out assignment likelihoods for each individual to each reference population. For LOO assignment, the allele frequency of the reference population the individual belongs to is recalculated without the individual, and then the individual is assigned to the different reference populations. The log-likelihoods of LOO assignment are output in a text file with N (individual) rows x K (reference population) columns.

WGSassign --beagle ${breeding_beagle} --pop_af_IDs ${breeding_IDs} --get_reference_af --loo --out ${outname} --threads 20

Leave-one-out cross validation and test with downsampled GLs for samples being assigned

Once a reference baseline of genotype likelihoods is available, it will often be of interest to know whether samples could still be assigned to that reference, even if the samples to be assigned were sequenced at much lower depth. By combining the --loo_downsampled_beagle FILE option with the --loo option you can get the reference allele frequencies estimated from the full-

WGSassign

Install / Use

README

WGSassign

Get WGSassign and build

Dependencies

Install and build

Usage

Running WGSassign

Reference population allele frequencies

Reference effective sample sizes

Leave-one-out cross validation

Leave-one-out cross validation and test with downsampled GLs for samples being assigned

Related Skills