WGSassign
Population assignment from genotype likelihoods for low-coverage whole-genome sequencing data
Install / Use
/learn @mgdesaix/WGSassignREADME
WGSassign
Population assignment methods for whole-genome sequence and genotype likelihood data
WGSassign was developed to conduct standard population assignment methods using genotype likelihood data in Beagle format. This software facilitates the estimation of allele frequencies across reference populations, allows users to perform leave-one-out cross validation with reference populations, and assign individuals of unknown origin to reference populations. See below for descriptions of the functions.
For a more in-depth tutorial on using WGSassign with example data sets, as well as the data analysis components in R, see the workshop hands-on materials from ConGen 2024: https://github.com/mgdesaix/ConGen-2024
Get WGSassign and build
Dependencies
The WGSassign software relies on the following Python packages that you can install through mamba/conda (recommended) or pip:
- numpy (<=1.22.3, current WGSassign functionality fails with >=1.22.4)
- cython
- scipy
You can create an environment through conda/mamba easily as follows using the environment.yml file provided within WGSassign (see below).
Install and build
Installation of WGSassign is simply:
git clone https://github.com/mgdesaix/WGSassign.git
cd WGSassign
mamba env create -f environment.yml
conda activate WGSassign
python setup.py build_ext --inplace
pip3 install -e .
You can now run WGSassign with the WGSassign command.
Note: WGSassign has been tested and used primarily on Linux systems.
The standard compiler (clang) on MacOSX does not support openmp. To
compile and use this on MacOSX you can use Homebrew to install gcc and openmp
and then direct the compiler to use that. Doing so requires setting up some
C paths, etc, before running the lines.
python setup.py build_ext --inplace
pip3 install -e .
Setting those paths and running the above lines can all be done within a subshell.
See the file ./Mac-compile-and-pip-install.sh for an example.
Usage
Running WGSassign
WGSassign works directly on genotype likelihood files. Beagle genotype likelihood files can be generated from BAM files using ANGSD.
Options:
--% WGSassign
usage: WGSassign [-h] [-b FILE] [-t INT] [-o OUTPUT] [--maf_iter INT] [--maf_tole FLOAT] [--pop_af_IDs FILE] [--get_reference_af] [--pop_names FILE]
[--ne_obs] [--loo] [--loo_downsampled_beagle FILE] [--pop_af_file FILE] [--get_pop_like] [--get_assignment_z_score]
[--get_reference_z_score] [--ind_ad_file FILE] [--allele_count_threshold INT] [--single_read_threshold] [--ind_start INT]
[--ind_end INT] [--pop_like FILE] [--pop_like_IDs FILE] [--get_em_mix] [--get_mcmc_mix] [--mixture_iter INT]
options:
-h, --help show this help message and exit
-b FILE, --beagle FILE
Filepath to genotype likelihoods in gzipped Beagle format from ANGSD
-t INT, --threads INT
Number of threads
-o OUTPUT, --out OUTPUT
Prefix for output files
--maf_iter INT Maximum iterations for minor allele frequencies estimation - EM (200)
--maf_tole FLOAT Tolerance for minor allele frequencies estimation update - EM (1e-4)
--pop_af_IDs FILE Filepath to individual IDs and populations for beagle
--get_reference_af Estimate allele frequencies for reference populations
--pop_names FILE Filepath to population names of allele frequency file
--ne_obs Estimate population and individuals effective sample sizes
--loo Perform leave-one-out cross validation
--loo_downsampled_beagle FILE
Optional Beagle file of downsampled genotype likelihoods to use for LOO assignment. (To test accuracy when assigned samples
have lower coverage)
--pop_af_file FILE Filepath to reference population allele frequencies
--get_pop_like Estimate log likelihood of individual assignment to each reference population
--get_assignment_z_score
Calculate z-score for individuals
--get_reference_z_score
Calculate z-score for individuals
--ind_ad_file FILE Filepath to individual allele depths, tab-delimited, .txt or .gz
--allele_count_threshold INT
Minimum number of loci needed to keep a specific allele count combination
--single_read_threshold
Use only loci with a single read. Helpful for computational efficiency when individuals's sequencing depths vary.
--ind_start INT Start analysis at this individual index (0-index: i.e. 0 starts with the 1st individual)
--ind_end INT End analysis at this individual index (0-index: i.e. If you have 10 individuals, 9 is the 10th individual)
--pop_like FILE Filepath to population assignment log likelihood file
--pop_like_IDs FILE Filepath to IDs for population assignment log likelihood file
--get_em_mix Estimate mixture proportions with EM algorithm
--get_mcmc_mix Estimate mixture proportions with MCMC algorithm
--mixture_iter INT Maximum iterations mixture estimation - EM (200)
Reference population allele frequencies
To create a file of allele frequencies across all reference populations is straightforward: provide 1) a beagle file and 2) a reference ID file. The reference ID file needs to be tab delimited and have 2 columns, with each row being an individual and the 2nd column specifying the reference population the individual belongs to. It is essential that the sample order in the ID file (row-wise) reflects the sample order in the Beagle file (column-wise); though header sample names in the beagle file don't have to match the first column of the ID file. The easiest way to keep the proper order is to create a reference file from whatever BAM list you used to create the beagle file in ANGSD.
Allele frequency is parallelized across loci, so feel free to crank up the threads with --threads.
Specifying --get_reference_af produces the reference population allele file, .pop_af.npy. It is a numpy binary with L (# loci) rows x K (reference population) columns.
The following code examples can be run using the data provided with WGSassign in the data/ directory. Note that population assignment with bird data is typically such that "breeding" individuals are of known origin and "nonbreeding" (e.g. wintering) individuals are of unknown breeding origin. Now you know why the following example variables and files are named such.
breeding_beagle=./data/amre.breeding.ind85.ds_2x.sites-filter.top_50_each.beagle.gz
breeding_IDs=./data/amre.breeding.ind85.reference_k5.IDs.txt
outname=./out/amre.breeding.ind85.ds_2x.sites-filter.top_50_each
# Estimate reference population allele frequencies using 20 threads
# Output = ${outname}.pop_af.npy (numpy binary matrix of size L (# loci) rows x K (ref pops) columns)
WGSassign --beagle ${breeding_beagle} --pop_af_IDs ${breeding_IDs} --get_reference_af --out ${outname} --threads 20
Note: Numpy binaries are produced without text headers. WGSassign will output a text file .pop_names.txt that specifies the order of the populations provided in the .pop_af.npy file.
Reference effective sample sizes
When producing the reference population allele frequency file, WGSassign also has other features you can specify. --ne_obs adds the calculation of the observed Fisher information across loci and the output is a numpy binary of .fisher_obs.npy of the same L x K dimensions. The effective sample size of the reference populations for each locus are also calculated and the output is a numpy binary .ne_obs.npy of the same dimensions. The mean effective sample sizes of the reference populations is provided as output in a text file, ne_obs.txt, with the first row being the population names and the second row being the effective sample size of the population. .ne_ind.txt is a single column of each individual's effective sample size (same order as the file you provide to --pop_af_ID).
WGSassign --beagle ${breeding_beagle} --pop_af_IDs ${breeding_IDs} --get_reference_af --ne_obs --out ${outname} --threads 20
Leave-one-out cross validation
Cross-validation is an important technique for determining assignment accuracy, but
recalculating allele frequencies each time you remove an individual can be slow.
Fortunately, WGSassign is fast enough to provide leave-one-out assignment in a
reasonable amount of time even for large of beagle files (ex. ~ 5 million
SNPs and 180 individuals = 30 min; 600k SNPs and 80 individuals = <1 min).
When producing reference population allele frequencies (as described above),
you can add --loo to specify the calculation of leave-one-out assignment likelihoods
for each individual to each reference population. For LOO assignment, the allele
frequency of the reference population the individual belongs to is
recalculated without the individual, and then the individual is assigned to the
different reference populations. The log-likelihoods of LOO assignment are output
in a text file with N (individual) rows x K (reference population) columns.
WGSassign --beagle ${breeding_beagle} --pop_af_IDs ${breeding_IDs} --get_reference_af --loo --out ${outname} --threads 20
Leave-one-out cross validation and test with downsampled GLs for samples being assigned
Once a reference baseline of genotype likelihoods is available, it will often be of interest to know whether
samples could still be assigned to that reference, even if the samples to be assigned
were sequenced at much lower depth. By combining the --loo_downsampled_beagle FILE option with
the --loo option you can get the reference allele frequencies estimated from the full-
Related Skills
node-connect
335.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
82.5kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
335.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
82.5kCommit, push, and open a PR
