IBDGem
Program for positive genetic identification and IBD detection from low-coverage sequencing data
IBDGem is an identity analysis tool designed to work with low-coverage sequencing data. The program compares sequence information from a poor sample (such as a forensic or ancient specimen) to genotype information from one or more samples generated independently via deep sequencing or microarrays. At each biallelic SNP, IBDGem calculates the probability of observing the sequencing data given that they come from an individual who has 0, 1, or 2 identical-by-descent chromosomes with the person providing the genotypes. In other words, the program evaluates the likelihood that the genotypes' source individual could also have generated the DNA sample of interest.
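To make the three models concrete, here is a minimal Python sketch of a per-site likelihood under IBD0, IBD1, and IBD2. It is a generic textbook formulation, not IBDGem's actual implementation. The binomial read model, the fixed error rate `err`, and the function names are all assumptions made for illustration.

```python
from math import comb

def read_likelihood(n_ref, n_alt, dosage, err=0.02):
    """P(reads | genotype with `dosage` ALT alleles), binomial model (assumed)."""
    # Probability that a single read shows the ALT allele for dosage 0, 1, 2.
    p_alt = (err, 0.5, 1.0 - err)[dosage]
    return comb(n_ref + n_alt, n_alt) * p_alt**n_alt * (1.0 - p_alt)**n_ref

def ibd_likelihoods(n_ref, n_alt, gt, alt_freq, err=0.02):
    """Return (L_IBD0, L_IBD1, L_IBD2) at one biallelic SNP.

    gt is the known genotype as two alleles (0 = REF, 1 = ALT);
    alt_freq is the population frequency of the ALT allele.
    """
    f = alt_freq
    lik = [read_likelihood(n_ref, n_alt, d, err) for d in (0, 1, 2)]
    # IBD0: the source genotype is drawn from the population (Hardy-Weinberg).
    ibd0 = (1 - f) ** 2 * lik[0] + 2 * f * (1 - f) * lik[1] + f**2 * lik[2]
    # IBD1: one allele is shared with gt (either one, equally likely),
    # the other is drawn from the population.
    ibd1 = sum(0.5 * ((1 - f) * lik[a] + f * lik[a + 1]) for a in gt)
    # IBD2: the source genotype equals gt.
    ibd2 = lik[sum(gt)]
    return ibd0, ibd1, ibd2
```

For example, three ALT reads at a site where the genotyped individual is homozygous ALT and the ALT allele is rare should favor IBD2 over IBD1 over IBD0.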
Table of Contents
- Installation
- Usage
- Example run
- Main program (ibdgem.c)
- Estimating IBD proportions for relatedness detection (hiddengem.c)
- Auxiliary files
Installation
To compile
After downloading the latest package from Releases, extract the source code file. Then, navigate to the resulting IBDGem directory and type:
make
Add the IBDGem directory to your $PATH with:
export PATH=$(pwd):$PATH
Separate modules can also be compiled individually by typing:
make [module_name]
For example, to compile the hiddengem module, type:
make hiddengem
To remove object files and executables
In the IBDGem main directory, type:
make clean
Usage
Input
The main program (ibdgem.c) calculates likelihoods of IBD states at each SNP. Two types of input are required:
- Pileup file containing sequence information of the unidentified sample.
- A VCF or 3 files in the IMPUTE reference-panel format (with extensions
.hap, .legend, and .indv) containing genotype data from one or more test individuals.
The Pileup file can be generated from a BAM file with samtools:
samtools mpileup --output-MQ --output [out.pileup] [in.bam]
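To inspect the resulting Pileup file, a short Python sketch like the one below can count the reads supporting each allele at a site. It follows the standard samtools pileup encoding and is not taken from the IBDGem source; the function name is hypothetical. Because the command above does not pass a reference FASTA, the reference column is typically N and bases appear as letters, so the sketch counts both letters and the `.`/`,` match symbols.

```python
import re

def count_alleles(bases, ref, alt):
    """Count reads supporting REF and ALT in one pileup base string."""
    # Drop read-start markers (^ plus one mapping-quality character)
    # and read-end markers ($).
    bases = re.sub(r"\^.", "", bases).replace("$", "")
    kept = []
    i = 0
    while i < len(bases):
        c = bases[i]
        if c in "+-":
            # Indel: +N or -N followed by N inserted/deleted bases; skip it.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        kept.append(c.upper())
        i += 1
    # '.' and ',' mean "matches the reference" when a FASTA was supplied.
    n_ref = sum(c in ".," or c == ref.upper() for c in kept)
    n_alt = sum(c == alt.upper() for c in kept)
    return n_ref, n_alt

# Pileup columns: chromosome, position, reference base, depth, bases, qualities
# (plus mapping qualities when --output-MQ is used).
line = "chr1\t100\tN\t5\t^IAa$G+2TTc.\tIIIII\tIIIII"
fields = line.split("\t")
print(count_alleles(fields[4], ref="A", alt="G"))
```

REF and ALT would normally come from the genotype input (VCF or .legend file) at the same position.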
Using the IMPUTE format lets you first filter genotypes by various metrics with vcftools before passing them to IBDGem, if desired. The program also runs faster on IMPUTE files. The 3 IMPUTE files can be generated from a VCF file with:
vcftools [ --vcf [in.vcf] | --gzvcf [in.vcf.gz] ] \
--max-alleles 2 --min-alleles 2 \
--max-missing 1 \
--out [out-prefix] \
--IMPUTE
If, instead of a VCF or IMPUTE files, you have a general format genotype file with 4 columns:
rsID, chrom, allele1, allele2, you can use the gt2vcf.py script included in this repository to
first convert the genotype file to VCF format (see the Auxiliary files section).
Linkage disequilibrium (LD) mode
Starting from version 2.0, IBDGem provides an option (--LD) to take linkage disequilibrium among alleles
into account when calculating the likelihood of the IBD0 and IBD1 models. To do this, the program uses phased
genotypes from a reference set of samples, which consists of either all samples from the VCF/IMPUTE files
(default) or a specific subset of those samples specified via --background-list.
For IBD0, IBDGem will then compare the Pileup data against the genotypes of these background individuals and
take the average to be the likelihood of the data under the IBD0 model, over a genomic segment (determined via
--window-size). For IBD1, IBDGem creates a pseudo diploid genotype by combining one haplotype from the
target individual and one haplotype from each individual in the reference panel, over a genomic segment, then
compares the Pileup data against these pseudo genotypes, taking the average over all possible pseudo genotype
combinations (4 combinations per reference individual) to be the likelihood of the IBD1 model under LD.
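The averaging described above can be sketched in Python as follows. This is an illustration of the idea only, not IBDGem's code: the per-site read model (`site_lik`, with an assumed error rate `err`), the data layout, and all function names are hypothetical.

```python
from itertools import product
from math import prod

def site_lik(n_ref, n_alt, dosage, err=0.02):
    """P(reads | dosage of ALT alleles) at one site (assumed read model)."""
    p_alt = (err, 0.5, 1.0 - err)[dosage]
    return p_alt**n_alt * (1.0 - p_alt) ** n_ref

def segment_lik(reads, hap_a, hap_b):
    """Likelihood of the reads over a segment given two haplotypes."""
    return prod(site_lik(r, a, x + y)
                for (r, a), x, y in zip(reads, hap_a, hap_b))

def ld_likelihoods(reads, target, background):
    """reads: list of (n_ref, n_alt) per site in the segment.
    target: (hap0, hap1) of the subject individual, lists of 0/1.
    background: list of (hap0, hap1) pairs, one per reference individual.
    """
    # IBD2: compare against the subject's own genotype.
    ibd2 = segment_lik(reads, *target)
    # IBD0: average over the genotypes of the background individuals.
    ibd0 = sum(segment_lik(reads, h0, h1)
               for h0, h1 in background) / len(background)
    # IBD1: pair each target haplotype with each haplotype of every
    # background individual (4 pseudo genotypes per individual), then average.
    pseudo = [segment_lik(reads, t, b)
              for ref_haps in background
              for t, b in product(target, ref_haps)]
    ibd1 = sum(pseudo) / len(pseudo)
    return ibd0, ibd1, ibd2
```

Because whole haplotypes from real individuals are used over the segment, correlations between nearby alleles are carried into the IBD0 and IBD1 likelihoods without modeling LD explicitly.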
The program can be run by customizing the general command:
ibdgem [--LD] -H [hap-file] -L [legend-file] -I [indv-file] -P [pileup-file] [other options...]
Or:
ibdgem [--LD] -V [vcf-file] -P [pileup-file] [other options...]
Important Notes:
- IBDGem is designed to be run on data from one chromosome at a time. If your VCF contains multiple chromosomes, it should be split into single-chromosome files (e.g., using vcftools with the --chr option) before being used as input. Similarly, IMPUTE-format files should also be single-chromosome. The Pileup file doesn't need to be divided; simply specify the chromosome you want to run on via the --chromosome/-c option in the IBDGem command.
- IBDGem will skip sites where the genotype is missing (./.) for any of the samples in the genotype input. To maximize the number of compared sites per sample, either impute the missing sites or subset the VCF by sample, then supply a separate allele frequency file with 3 columns: CHR, POS, FREQ (no header) through the --allele-freqs/-A option. This lets the program use the provided allele frequencies in its calculations instead of estimating them from the genotypes in the VCF.
- By default, IBDGem will infer allele frequencies using the genotypes of all individuals in the VCF/IMPUTE files. This, however, can lead to decreased accuracy in likelihood calculation if the number of individuals is small (i.e., fewer than 50). In this case, it is recommended that the user provide allele frequencies calculated from a larger reference panel (such as the 1000 Genomes) in a separate file via the --allele-freqs option. This is especially important when running the program under the regular, non-LD mode, where likelihoods of the IBD0 & IBD1 models are calculated on a per-site basis rather than per-haplotype and are thus more dependent on allele frequencies. Similarly, in LD mode, it is important to provide at least 50 samples as reference genotypes to accurately model the IBD0 & IBD1 states.
- When running IBDGem under LD mode, make sure that the genotype individuals used as background are unrelated to each other and to the Pileup individual. If there is any relatedness, the IBD0 likelihoods will be inflated and LLR(IBD2/IBD0) will be reduced. If you have a VCF file with several subject individuals that you want to compare the Pileup data to and a number of reference individuals that you want to use for background model calculation, you can explicitly specify the list of subject individuals via --sample-list and the background individuals via --background-list. On the other hand, if you have only one subject individual and multiple reference individuals in the VCF, and the Pileup data is suspected to be from that one subject individual, you can set the Pileup sample name to be the same as the VCF ID of the subject individual via --pileup-name, and the program will automatically use all other samples in the VCF as background, without having to explicitly specify them with --background-list.
- For relatedness detection under LD mode, the genotypes in the reference panel and of the target individual must be phased, since the calculation of IBD1 involves combining phased haplotypes. This, however, is not necessary for direct comparison cases (self-vs-self or self-vs-random), as IBD0 calculation under LD does not use phased information.
- When converting between VCF and IMPUTE: because the --IMPUTE argument in vcftools requires phased data, but IBDGem does not need phase information outside of relatedness detection in LD mode, one can superficially modify the VCF to change the genotype notation (A0/A1 to A0|A1) with the bash command:
sed "/^##/! s/\//|/g" unphased.vcf > mockphased.vcf
The resulting VCF file can then be converted to IMPUTE normally with vcftools --IMPUTE.
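Note that the sed command above rewrites every "/" on all lines that do not start with ##, including any "/" that happens to appear in the INFO or other non-genotype columns. If that is a concern for your VCF, a more conservative alternative is to touch only the GT subfield of each sample column. The Python sketch below shows one way to do this; it is not part of the IBDGem package, and the function name is hypothetical.

```python
def mock_phase_line(line):
    """Rewrite '/' as '|' in the GT subfield of each sample column only."""
    if line.startswith("#"):
        return line  # leave meta-information and header lines unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return line  # no FORMAT/sample columns to modify
    fmt_keys = fields[8].split(":")
    if "GT" not in fmt_keys:
        return line
    gt_idx = fmt_keys.index("GT")
    for i in range(9, len(fields)):
        sub = fields[i].split(":")
        if gt_idx < len(sub):
            sub[gt_idx] = sub[gt_idx].replace("/", "|")
            fields[i] = ":".join(sub)
    return "\t".join(fields) + "\n"

# Usage (file names as in the sed example above):
# with open("unphased.vcf") as src, open("mockphased.vcf", "w") as dst:
#     for line in src:
#         dst.write(mock_phase_line(line))
```

As with the sed approach, the output only looks phased; it should not be used where true phase is required.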
Output
IBDGem generates 2 tab-delimited files:
- Table file (*.tab.txt) with information about each site in the following fields:
CHR, rsID, POS, REF, ALT, AF, DP, SQ_NREF, SQ_NALT, GT_A0, GT_A1, LIBD0, LIBD1, LIBD2.
Each row corresponds to a single SNP, and the last 3 columns correspond to the likelihoods
of models IBD0, IBD1, and IBD2, respectively (these likelihoods are NOT calculated under
LD mode, even with the --LD option specified).
- Summary file (*.summary.txt) with information about each genomic segment in the
following fields: SEGMENT, START, END, LIBD0, LIBD1, LIBD2, NUM_SITES. START and
END correspond to the physical coordinates of the first and last SNP in the segment,
respectively. NUM_SITES corresponds to the number of SNPs within the segment over
which the likelihoods for models IBD0, IBD1, and IBD2 are aggregated, which can be set using
--window-size when running IBDGem (these likelihoods would be calculated under LD mode if the --LD option was specified).
From these files, the log-likelihood ratio (LLR) between any two models at any given site/segment can be calculated. For example, in the special case of determining whether the sequence data derives from the same individual as the genotype data versus the model of it coming from an unrelated individual, we simply generate LLRs between the IBD2 and IBD0 models.
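As an illustration, the following Python sketch computes a per-segment LLR from the lines of a summary file. It assumes the seven tab-delimited columns listed above, in that order, and that the LIBD columns hold raw (non-log) likelihoods; if your file stores log-likelihoods, subtract the two values directly instead. The function name is hypothetical, and any line whose likelihood fields are not numeric (such as a header line) is skipped.

```python
import math

FIELDS = ["SEGMENT", "START", "END", "LIBD0", "LIBD1", "LIBD2", "NUM_SITES"]

def segment_llrs(lines, num="LIBD2", den="LIBD0"):
    """Return [(segment, log(L_num / L_den)), ...] for each usable row."""
    result = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != len(FIELDS):
            continue  # not a summary row
        row = dict(zip(FIELDS, parts))
        try:
            l_num, l_den = float(row[num]), float(row[den])
        except ValueError:
            continue  # header line or malformed value
        if l_num <= 0 or l_den <= 0:
            continue  # log undefined; skip rather than guess
        result.append((row["SEGMENT"], math.log(l_num) - math.log(l_den)))
    return result

# Usage (the file name is a placeholder):
# with open("sample.summary.txt") as fh:
#     for segment, llr in segment_llrs(fh):
#         print(segment, llr)
```

A positive LLR(IBD2/IBD0) for a segment means the data are better explained by the genotyped individual than by an unrelated one; passing num="LIBD1" gives the analogous comparison for the IBD1 model.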
Example run
In the supplementary directory, you will find the ibdgem-test folder with a test suite of inputs
and outputs for test-running the program. The IMPUTE files contain genotype data at 200 SNP sites for
3 samples: sample1, sample2, and sample3. The test1.pileup, test2.pileup,
and test3.pileup files contain the corresponding sequence data.
To perform a te
