MAPGD
A program for the Maximum-likelihood analysis of population genomic data.
Download MAPGD
MAPGD version 0.5
(C) Michael Lynch and Matthew Ackerman
<b>Visit the MAPGD development page</b>
WARNING --- mapgd relatedness may show unexpected behavior. Use with care.
<h2> Contents </h2>Introduction
Quick start
FAQ
Changes
Basic Usage
Commands
Common Options
Input/Output
Log-likelihood ratios
Example Analysis
Other Useful Programs
Performance
Statistical
Computational
Misc.
IU users
Sanger users
References
<h2> Introduction </h2>MAPGD is a series of related programs that estimate allele frequency, heterozygosity, Hardy-Weinberg disequilibrium, linkage disequilibrium and identity-by-descent (IBD) coefficients from population genomic data using a statistically rigorous maximum likelihood approach. It is primarily useful for the analysis of low coverage population genomic data or for the analysis of pooled data (where many individuals are used to prepare a single sample).
<h2> Quick start </h2> <h6>Installation</h6>After clicking the "Download MAPGD.zip" button you will be prompted to save or open the file MAPGD-master.zip. Save this file to the directory of your choice, then go to that directory in a terminal, for example /home/matthew/Downloads/.
Then type:
unzip MAPGD-master.zip
cd MAPGD-master/
./configure
make
The program can be installed for all users of a computer by typing:
sudo make install
You will be prompted for your super-user password. If you do not have a super-user password for the system on which you are installing the software, you can type:
make install DESTDIR=~/bin/
This will install the software in the ~/bin/ directory which should allow you to use the software by simply typing 'mapgd'. If not, add the following line to your .bashrc file:
PATH=$PATH:~/bin
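If you would rather not append a duplicate entry every time your .bashrc is read, the PATH update can be guarded. This is a minimal sketch assuming a POSIX-compatible shell; the function name add_to_path is made up for illustration and is not part of mapgd.

```shell
# Append a directory to PATH only if it is not already listed.
# (add_to_path is a hypothetical helper, not something mapgd provides.)
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

add_to_path "$HOME/bin"
export PATH

# Report whether the shell can now find the mapgd executable.
if command -v mapgd >/dev/null 2>&1; then
    echo "mapgd found at: $(command -v mapgd)"
else
    echo "mapgd is not on PATH yet; check where 'make install' placed it"
fi
```

Calling add_to_path a second time with the same directory leaves PATH unchanged.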
A quick test to make sure everything is working correctly can be conducted by typing:
make test
This will output a series of lines ending in PASS or FAIL to your terminal. Ideally all of the lines should say PASS.
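If the test output is long, it may be easier to save it and count the results. The sketch below assumes only what is stated above, namely that each result line ends in PASS or FAIL; the function name summarise_tests and the sample test names are hypothetical.

```shell
# Count lines ending in PASS or FAIL in a saved log file.
# Returns non-zero if any line ends in FAIL.
summarise_tests() {
    log="$1"
    passed=$(grep -c 'PASS[[:space:]]*$' "$log" || true)
    failed=$(grep -c 'FAIL[[:space:]]*$' "$log" || true)
    echo "passed=$passed failed=$failed"
    [ "$failed" -eq 0 ]
}

# Demonstration on a made-up log; the test names are placeholders.
sample_log=$(mktemp)
printf '%s\n' 'example_a PASS' 'example_b FAIL' 'example_c PASS' > "$sample_log"
summarise_tests "$sample_log" || echo "at least one test failed"
rm -f "$sample_log"
```

With a real build you could capture the output first, for example with make test 2>&1 | tee test.log, and then run summarise_tests test.log.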
<h6>Mac installation</h6>Mac users may not have developer tools installed by default, or may not have agreed to the Xcode license. You may have to install and configure Xcode before using mapgd. Once you have Xcode you can type:
make noomp
<h6> Using mapgd </h6>
mapgd works through a number of commands, each with its own associated help.
Generally you will start by creating mpileup files with the samtools mpileup command, converting these mpileup files to the pro format with mapgd's proview command, and then beginning your analysis. Running the script "make_and_align_reads.sh" in the src/test/ directory will simulate a genomics study and then run mapgd on the simulated data. You may also want to look at an Example Analysis.
For analyzing individually labeled data you will probably want a command like this:
samtools mpileup -q 25 -Q 25 -B individual1.sort.bam individual2.sort.bam \
| mapgd proview -H seq1.header | mapgd allele \
| mapgd filter -p 22 -E 0.01 -c 50 -C 200 -o allelefrequency-filtered.map
The filter command will limit the output to sites where the log-likelihood ratio of polymorphism is greater than 22 (-p 22), the error rate is less than 0.01 (-E 0.01) and the population coverage is between 50 and 200 (-c 50 -C 200).
And for analyzing pooled data you will probably want a command like this:
samtools mpileup -q 5 -Q 5 -B population1.sort.bam population2.sort.bam \
| mapgd proview -H seq1.header | mapgd pool -a 22 -o allelefrequency-filtered.pol
The -a option will limit the output to sites where the log-likelihood ratio of polymorphism is greater than 22, which is a relatively stringent criterion.
In the case where the allele command is being used to estimate the seven genotypic correlation coefficients, named pipes can be used for the I/O redirection:
mkfifo map
mkfifo pro
samtools mpileup -q 5 -Q 5 -B population1.sort.bam population2.sort.bam \
| mapgd proview -H seq1.header | tee pro | mapgd allele | mapgd filter -p 22 -E 0.01 -c 50 -C 200 > map &
mapgd genotype -p pro -m map | mapgd relatedness > population-rel.out
And linkage disequilibrium can be calculated in a similar manner:
mkfifo map
mkfifo pro
samtools mpileup -q 5 -Q 5 -B population1.sort.bam population2.sort.bam \
| mapgd proview -H seq1.header | tee pro | mapgd allele | mapgd filter -p 22 -E 0.01 -c 50 -C 200 > map &
mapgd linkage -p pro -m map > population-lnk.out
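The named-pipe pattern in the two examples above can be tried out with standard tools before running it on real data. The following is a toy sketch only: seq, awk and paste are stand-ins for the samtools and mapgd steps, and the pipe names raw and filtered are arbitrary.

```shell
# Scratch directory for the two named pipes (toy example, not mapgd itself).
workdir=$(mktemp -d)
mkfifo "$workdir/raw" "$workdir/filtered"

# Producer, run in the background: 'seq' stands in for samtools and proview,
# 'awk' for the allele and filter steps. tee copies the unfiltered stream
# into one pipe while the filtered stream goes to the other.
seq 1 10 | tee "$workdir/raw" | awk '$1 % 2 == 0' > "$workdir/filtered" &

# Consumer: 'paste' stands in for a command that reads both streams.
paste "$workdir/raw" "$workdir/filtered" > "$workdir/combined.txt"

wait    # let the background producer finish
cat "$workdir/combined.txt"
rm -f "$workdir/raw" "$workdir/filtered"
```

Each writer blocks until a reader opens the same pipe, which is why the producer is started in the background before the consumer runs.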
<h2> FAQ </h2>
<b> Why don't you provide information on indels? </b>
We currently do not have a likelihood model to account for errors in calling indels, so we cannot incorporate indels into our program at this time. This is on our TODO list, but may not occur for some time.
<b> How long does it take to run? </b>
Typical benchmarks with 16 threads on a 2.6 GHz processor put us at around 18,000 sites per second for 96 simulated individuals at 10x coverage. This means that a typical invertebrate population will take around three hours to analyze on a good computer, and a vertebrate genome might take a few days. If you have managed to sequence Paris japonica, you're looking at three months of computation time if you run it on a single computer. However, mapgd is designed to be used in a cluster computing environment, and can make use of multiple nodes to dramatically reduce computation time. Running mapgd on 96 individuals with 150 Gbp genomes should be possible if a large number of nodes (say 50) are used.
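These estimates follow from dividing the number of sites by the quoted throughput. A rough sketch of the arithmetic, assuming the 18,000 sites per second figure holds throughout a run; the 200 Mbp and 3 Gbp genome sizes are illustrative assumptions, and estimate_hours is a made-up helper name:

```shell
# Estimate single-machine run time in hours for a genome of a given size
# (in base pairs), assuming the quoted throughput of 18,000 sites per second.
estimate_hours() {
    awk -v sites="$1" -v rate=18000 'BEGIN { printf "%.1f\n", sites / rate / 3600 }'
}

estimate_hours 200000000      # assumed 200 Mbp invertebrate genome
estimate_hours 3000000000     # assumed 3 Gbp vertebrate genome
estimate_hours 150000000000   # 150 Gbp genome, the size quoted above
```

Under these assumptions the three calls give about 3.1 hours, 46.3 hours (roughly two days) and 2314.8 hours (roughly 96 days), in line with the figures above.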
<b> Help, I can't get the program to compile. </b>
MAPGD requires a compiler that supports the C++11 standard. If you have a C++11 compiler and mapgd will still not compile on your system, please e-mail me (Matthew Ackerman) so that I can work to correct the problem. I have compiled mapgd on:
- Ubuntu Linux, 14.04
- Red Hat Enterprise Linux 6.x
- A version of the Cray Linux Environment.
- OS X Yosemite.
To compile on OS X, you may need to type 'make noomp' because the default OS X compiler does not support OpenMP. You may be able to obtain a compiler that supports OpenMP by typing:
brew install gcc --without-multilib
- Windows: mapgd is available on Windows systems as a pair of binaries (mapgd-win32 and mapgd-win64) in the bin directory. These files are cross-compiled with MinGW and are not extensively tested, so use at your own peril. The relatedness command is unavailable to Windows users at the current time.
<b> Help, the program keeps crashing/hanging </b>
The first thing you should do is e-mail me (for contact information type 'mapgd -h'). I will open a bug report and we can begin discussing how to fix the program.
<b> How can I help? </b>
We have lots to do, and need plenty of help!
- Create an Issue.
If you have any problems at all using this code, please click the Issues button on the right-hand menu to make us aware of the problem. Whether you can't compile and run the program, or you locate one of the plethora of typos in the documentation and help files, please let us know!
- Write a script to evaluate the statistical performance of a command.
One of the most important tasks we have right now is to compare the performance of mapgd to other programs that are available and make sure that our program is as good as it can be. Whether it is a comparison of computational efficiency or statistical accuracy, we need a script to assess it!
- Anything you can think of!
If you can do any of these things (or anything else that you think might help), you can contribute to the project by typing:
git clone https://github.com/LynchLab/MAPGD/
This will create a clone of the repository that you can play around with on your own. Then, if you make a change that might help, you can mark the file to be changed by typing:
git add FILENAME
Then type a short comment describing the change you have made using the command:
git commit -m "COMMENT"
Finally, show me your changes by typing:
git push
<h2> Changes </h2>
LD should be fixed, but relatedness now always uses the home-grown Newton-Raphson. It should be a little more stable than before, but please pay attention to the PASS/FAIL flags.
<h3> Changes from 0.4 </h3>I'm breaking LD with this update.
Several flags have been changed for greater consistency between commands. Binary flags are up and working, and VCF input/output is up and working.
Likelihood equations for LD have been changed to account for departure from HWE.
<h3> Changes from 0.3 </h3>There have been a lot of changes from 0.3. The format of input and output files has changed, and previous formats are no longer supported. The name of the 'ei' command has been changed to allele, and the 'ep' and 'cp' commands are now both part of the 'pooled' command. A standard file interface has been created (map-file) which handles
