IBDGem

Program for positive genetic identification and IBD detection from low-coverage sequencing data


IBDGem is an identity analysis tool designed to work with low-coverage sequencing data. The program compares sequence information from a low-quality sample (such as a forensic or ancient specimen) to genotype information from one or more samples generated independently via deep sequencing or microarrays. At each biallelic SNP, IBDGem calculates the probability of observing the sequencing data given that they come from an individual who shares 0, 1, or 2 identical-by-descent (IBD) chromosomes with the person who provided the genotypes. In other words, the program evaluates the likelihood that the source individual of the genotypes could also have generated the DNA sample of interest.
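To make the three models concrete, here is a minimal Python sketch of a per-site likelihood calculation. It is not IBDGem's implementation: it assumes a simple symmetric sequencing error rate (err), Hardy-Weinberg proportions for unknown alleles, and hypothetical function names, and it ignores base qualities and the binomial coefficient (which cancels in likelihood ratios).

```python
def read_likelihood(n_ref, n_alt, alt_copies, err=0.02):
    """P(read counts | genotype carrying alt_copies ALT alleles), up to a constant."""
    # Probability that a single read shows ALT, allowing for sequencing error
    p_alt = (alt_copies / 2) * (1 - err) + (1 - alt_copies / 2) * err
    return (p_alt ** n_alt) * ((1 - p_alt) ** n_ref)


def ibd_likelihoods(n_ref, n_alt, known_alt_copies, alt_freq, err=0.02):
    """Return (L_IBD0, L_IBD1, L_IBD2) for one biallelic SNP.

    known_alt_copies: ALT allele count (0, 1, or 2) in the genotyped individual.
    alt_freq: population ALT allele frequency (assumed input).
    """
    f = alt_freq
    # IBD2: the sample carries exactly the known genotype
    l2 = read_likelihood(n_ref, n_alt, known_alt_copies, err)

    # IBD0: both alleles are drawn from the population (Hardy-Weinberg)
    hwe = {0: (1 - f) ** 2, 1: 2 * f * (1 - f), 2: f ** 2}
    l0 = sum(p * read_likelihood(n_ref, n_alt, k, err) for k, p in hwe.items())

    # IBD1: one allele is shared with the known genotype, the other is drawn
    # from the population
    p_shared_alt = known_alt_copies / 2
    l1 = 0.0
    for shared, p_s in ((0, 1 - p_shared_alt), (1, p_shared_alt)):
        for other, p_o in ((0, 1 - f), (1, f)):
            l1 += p_s * p_o * read_likelihood(n_ref, n_alt, shared + other, err)
    return l0, l1, l2
```

For instance, with three ALT reads, a homozygous-ALT genotype, and an ALT frequency of 0.1, this sketch ranks IBD2 above IBD1 above IBD0, which matches the intuition that rare alleles shared with the genotyped individual are the most informative.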

Installation
Usage
Example run
Main program (ibdgem.c)
Estimating IBD proportions for relatedness detection (hiddengem.c)
Auxiliary files

Installation

To compile

After downloading the latest package from Releases, extract the source code file. Then, navigate to the resulting IBDGem directory and type:

make

Add the IBDGem directory to your $PATH with:

export PATH=$(pwd):$PATH

Separate modules can also be compiled individually by typing:

make [module_name]

For example, to compile the hiddengem module, type:

make hiddengem

To remove object files and executables

In the IBDGem main directory, type:

make clean

Usage

Input

The main program (ibdgem.c) calculates likelihoods of IBD states at each SNP. Two types of input are required:

Pileup file containing sequence information of the unidentified sample.
A VCF file, or 3 files in the IMPUTE reference-panel format (with extensions .hap, .legend, and .indv), containing genotype data from one or more test individuals.

The Pileup file can be generated from a BAM file with samtools:

samtools mpileup --output-MQ --output [out.pileup] [in.bam]
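For orientation, the Python sketch below shows how reference and alternate read counts at a site could be tallied from one line of the resulting Pileup file. It is an illustration only, not IBDGem's own parser: the function name count_alleles is hypothetical, and it assumes the standard samtools pileup column order (chromosome, position, reference base, depth, read bases, qualities).

```python
import re


def count_alleles(pileup_line, alt):
    """Return (chrom, pos, n_ref, n_alt) for one pileup line and a given ALT base."""
    fields = pileup_line.rstrip("\n").split("\t")
    chrom, pos, bases = fields[0], fields[1], fields[4]

    # Drop read-start markers ('^' plus a mapping-quality character) first,
    # because the quality character may itself look like '+', '-' or '$'
    bases = re.sub(r"\^.", "", bases).replace("$", "")

    kept = []
    i = 0
    while i < len(bases):
        c = bases[i]
        if c in "+-":
            # Indel: '+N' or '-N' followed by N inserted/deleted bases; skip all
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        kept.append(c)
        i += 1

    # '.' and ',' mean a match to the reference on the forward/reverse strand
    n_ref = sum(c in ".," for c in kept)
    n_alt = sum(c.upper() == alt.upper() for c in kept)
    return chrom, int(pos), n_ref, n_alt
```

Bases that match neither allele (and placeholder characters such as '*') are simply not counted in this sketch.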

The IMPUTE format has the advantage that genotypes can first be filtered on various metrics with vcftools before being passed to IBDGem, if desired. The program also runs faster on IMPUTE files. The 3 IMPUTE files can be generated from a VCF file with:

vcftools  [ --vcf [in.vcf] | --gzvcf [in.vcf.gz] ] \
          --max-alleles 2 --min-alleles 2 \
          --max-missing 1 \
          --out [out-prefix] \
          --IMPUTE

If, instead of a VCF or IMPUTE files, you have a general-format genotype file with 4 columns (rsID, chrom, allele1, allele2), you can use the gt2vcf.py script included in this repository to first convert it to VCF format (see the Auxiliary files section).

Linkage disequilibrium (LD) mode

Starting with version 2.0, IBDGem provides an option (--LD) to take linkage disequilibrium among alleles into account when calculating the likelihoods of the IBD0 and IBD1 models. To do this, the program uses phased genotypes from a reference set of samples, which consists of either all samples in the VCF/IMPUTE files (default) or a subset of those samples specified via --background-list.

For IBD0, IBDGem compares the Pileup data against the genotypes of each of these background individuals over a genomic segment (whose size is set via --window-size) and takes the average as the likelihood of the data under the IBD0 model.

For IBD1, IBDGem creates pseudo-diploid genotypes over a genomic segment by combining one haplotype from the target individual with one haplotype from each individual in the reference panel. It then compares the Pileup data against these pseudo-genotypes and takes the average over all possible combinations (4 per reference individual) as the likelihood of the IBD1 model under LD.
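The averaging described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the program's code: the per-site read model (site_lik with a fixed error rate), the data layout (haplotypes as lists of 0/1 alleles, read counts as (n_ref, n_alt) pairs), and all function names are hypothetical.

```python
from itertools import product
from math import prod


def site_lik(n_ref, n_alt, alt_copies, err=0.02):
    """Likelihood of the read counts at one site given the ALT allele count."""
    p_alt = (alt_copies / 2) * (1 - err) + (1 - alt_copies / 2) * err
    return p_alt ** n_alt * (1 - p_alt) ** n_ref


def segment_lik(counts, genotype):
    """Product of per-site likelihoods over one genomic segment."""
    return prod(site_lik(r, a, g) for (r, a), g in zip(counts, genotype))


def ibd0_ld(counts, background):
    """Average segment likelihood over the background individuals' genotypes.

    background: list of (hap0, hap1) pairs, one pair per individual.
    """
    liks = [segment_lik(counts, [a + b for a, b in zip(h0, h1)])
            for h0, h1 in background]
    return sum(liks) / len(liks)


def ibd1_ld(counts, target, background):
    """Average over pseudo-diploid genotypes: one target haplotype combined
    with one haplotype of each background individual (2 x 2 = 4 per individual)."""
    liks = []
    for ref in background:
        for t_hap, r_hap in product(target, ref):
            pseudo = [a + b for a, b in zip(t_hap, r_hap)]
            liks.append(segment_lik(counts, pseudo))
    return sum(liks) / len(liks)
```

Because whole haplotypes are combined rather than independent per-site frequencies, correlations between neighboring alleles in the reference panel carry through to the segment likelihood.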

The program can be run by customizing the general command:

ibdgem [--LD] -H [hap-file] -L [legend-file] -I [indv-file] -P [pileup-file] [other options...]

Or:

ibdgem [--LD] -V [vcf-file] -P [pileup-file] [other options...]

Important Notes:

IBDGem is designed to be run on data from one chromosome at a time. If your VCF contains multiple chromosomes, it should be split into single-chromosome files (e.g., using vcftools with the --chr option) before being used as input. Similarly, IMPUTE-format files should be single-chromosome. The Pileup file doesn't need to be divided; simply specify the chromosome you want to run on via the --chromosome/-c option in the IBDGem command.
IBDGem will skip sites where the genotype is missing (./.) for any of the samples in the genotype input. To maximize the number of compared sites per sample, either impute the missing sites or subset the VCF by sample, then supply a separate allele frequency file with 3 columns: CHR, POS, FREQ (no header) through the --allele-freqs/-A option. This is so that the program can use the provided allele frequencies in its calculations instead of having to estimate them from the genotypes in the VCF.
By default, IBDGem will infer allele frequencies using the genotypes of all individuals in the VCF/IMPUTE files. This, however, can lead to decreased accuracy in likelihood calculation if the number of individuals is small (i.e. fewer than 50). Thus, it is recommended in this case that the user provides allele frequencies calculated from a larger reference panel (such as the 1000 Genomes) in a separate file via the --allele-freqs option. This is important when running the program under the regular, non-LD mode, where likelihoods of the IBD0 & IBD1 models are calculated on a per-site basis rather than per-haplotype and are thus more dependent on allele frequencies. Similarly, in LD mode, it is important to provide at least 50 samples as reference genotypes to accurately model the IBD0 & IBD1 states.
When running IBDGem under LD mode, it is important to make sure that the genotype individuals that are used as background are unrelated to each other and to the Pileup individual. If there is any relatedness, the IBD0 likelihoods will be inflated and LLR(IBD2/IBD0) will be reduced. In the case that you have a VCF file with several subject individuals that you want to compare the Pileup data to and a number of reference individuals that you want to use for background model calculation, you can explicitly specify the list of subject individuals via --sample-list and the background individuals via --background-list. On the other hand, if you have only one subject individual and multiple reference individuals in the VCF, and the Pileup data is suspected to be from that one subject individual, you can set the Pileup sample name to be the same as the VCF ID of the subject individual via --pileup-name, and the program will automatically use all other samples in the VCF as background, without having to explicitly specify them with --background-list.
For relatedness detection under LD mode, it is required that the genotypes in the reference panel and of the target individual are phased since the calculation of IBD1 involves combining phased haplotypes. This, however, is not necessary for direct comparison cases (self-vs-self or self-vs-random) as IBD0 calculation under LD does not use phased information.
When converting from VCF to IMPUTE, note that the --IMPUTE argument in vcftools requires phased data. In cases where IBDGem does not need phase information, one can superficially modify the VCF to change the genotype notation (A0/A1 to A0|A1) with the bash command:

sed "/^##/! s/\//|/g" unphased.vcf > mockphased.vcf

The resulting VCF file can then be converted to IMPUTE normally with vcftools --IMPUTE.
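As an illustration of the 3-column allele frequency file (CHR, POS, FREQ, no header) mentioned in the notes above, the Python sketch below derives frequencies from the genotypes of an uncompressed, single-chromosome VCF. Treat it as a hedged example: the function names are hypothetical, and it assumes that FREQ refers to the ALT allele frequency, which the text above does not state explicitly. In practice the frequencies would ideally come from a large reference panel, as recommended above.

```python
import re


def alt_freqs_from_vcf(vcf_lines):
    """Return a list of (chrom, pos, alt_freq) for biallelic sites."""
    rows = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt, fmt = fields[0], fields[1], fields[4], fields[8]
        if "," in alt:
            continue  # keep biallelic sites only
        gt_idx = fmt.split(":").index("GT")
        n_alt = n_called = 0
        for sample in fields[9:]:
            gt = sample.split(":")[gt_idx]
            for allele in re.split(r"[/|]", gt):
                if allele == ".":
                    continue  # missing allele: not counted
                n_called += 1
                n_alt += allele == "1"
        if n_called:
            rows.append((chrom, int(pos), n_alt / n_called))
    return rows


def write_freq_file(rows, path):
    """Write tab-delimited CHR, POS, FREQ lines without a header."""
    with open(path, "w") as out:
        for chrom, pos, freq in rows:
            out.write(f"{chrom}\t{pos}\t{freq:.6f}\n")
```

The resulting file could then be passed to IBDGem through the --allele-freqs/-A option.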

Output

IBDGem generates 2 tab-delimited files:

Table file (*.tab.txt) with information about each site in the following fields: CHR, rsID, POS, REF, ALT, AF, DP, SQ_NREF, SQ_NALT, GT_A0, GT_A1, LIBD0, LIBD1, LIBD2. Each row corresponds to a single SNP, and the last 3 columns correspond to the likelihoods of models IBD0, IBD1, and IBD2, respectively (these likelihoods are NOT calculated under LD mode, even when the --LD option is specified).
Summary file (*.summary.txt) with information about each genomic segment in the following fields: SEGMENT, START, END, LIBD0, LIBD1, LIBD2, NUM_SITES. START and END correspond to the physical coordinates of the first and last SNP in the segment, respectively. NUM_SITES is the number of SNPs within the segment over which the likelihoods for models IBD0, IBD1, and IBD2 are aggregated; the segment size can be set using --window-size when running IBDGem (these likelihoods are calculated under LD mode if the --LD option is specified).

From these files, the log-likelihood ratio (LLR) between any two models at any given site/segment can be calculated. For example, to determine whether the sequence data derive from the same individual as the genotype data rather than from an unrelated individual, we simply compute LLRs between the IBD2 and IBD0 models.
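A per-segment LLR(IBD2/IBD0) could be computed from the summary file along the following lines. This Python sketch rests on assumptions that should be checked against your own output: that the columns appear in the order listed above, that any header line starts with "#" or with SEGMENT, and that the natural logarithm is wanted. Since the text does not say whether the LIBD columns hold raw or log-scale likelihoods, the hypothetical values_are_log flag covers both cases.

```python
import math

COLUMNS = ["SEGMENT", "START", "END", "LIBD0", "LIBD1", "LIBD2", "NUM_SITES"]


def segment_llrs(lines, values_are_log=False):
    """Return a list of (segment, LLR(IBD2/IBD0)) from summary-file lines."""
    results = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and anything that looks like a header
        if not line.strip() or line.startswith("#") or fields[0] == "SEGMENT":
            continue
        rec = dict(zip(COLUMNS, fields))
        l0, l2 = float(rec["LIBD0"]), float(rec["LIBD2"])
        if values_are_log:
            llr = l2 - l0                      # already on the log scale
        else:
            llr = math.log(l2) - math.log(l0)  # raw likelihoods
        results.append((rec["SEGMENT"], llr))
    return results
```

An open file handle for a *.summary.txt file can be passed directly as lines. Note that with raw likelihoods a value of exactly 0 would make math.log fail, so such segments would need separate handling.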

Example run

In the supplementary directory, you will find the ibdgem-test folder with a test suite of inputs and outputs for test-running the program. The IMPUTE files contain genotype data at 200 SNP sites for 3 samples: sample1, sample2, and sample3. The test1.pileup, test2.pileup, and test3.pileup files contain the corresponding sequence data.

To perform a te


