Bgt

Flexible genotype query among 30,000+ samples whole-genome

Generate Convert Improve

Install / Use

/learn @lh3/Bgt

About this skill

Quality Score

0/100

README

<a name="started"></a>Getting Started

Connect to a public BGT server

curl -s 'http://bgtdemo.herokuapp.com/'
curl -s 'http://bgtdemo.herokuapp.com/?a=(impact=="HIGH")&s=(population=="FIN")&f=(AC>0)'
curl -s 'http://bgtdemo.herokuapp.com/?t=CHROM,POS,END,REF,ALT,AC/AN&f=(AC>1)&r=20'

For the last query, the last line is "*", indicating the result is incomplete. Note that this web app is using Heroku's free tier. It is restricted to one CPU only and put to sleep when the app is idle. There is an overhead of wakeup. Heroku also forces free apps to sleep for "6 hours in a 24 hour period". I don't know how exactly this works.

Run BGT locally

# Installation
git clone https://github.com/lh3/bgt.git
cd bgt; make
# Download demo BCF (1st 1Mbp of chr11 from 1000g), and convert to BGT
wget -O- http://bit.ly/BGTdemo | tar xf -
./bgt import 1kg11-1M.bgt 1kg11-1M.raw.bcf
gzip -dc 1kg11-1M.raw.samples.gz > 1kg11-1M.bgt.spl  # sample meta data
# Get all sample genotypes
./bgt view -C 1kg11-1M.bgt | less -S
# Get genotypes of HG00171 and HG00173 in region 11:100,000-200,000
./bgt view -s,HG00171,HG00173 -f'AC>0' -r 11:100000-200000 1kg11-1M.bgt
# Get alleles high-frequency in CEU but absent from YRI
./bgt view -s'population=="CEU"' -s'population=="YRI"' -f'AC1/AN1>=0.1&&AC2==0' -G 1kg11-1M.bgt
# Select high-impact sites (var annotation provided with -d)
./bgt view -d anno11-1M.fmf.gz -a'impact=="HIGH"' -CG 1kg11-1M.bgt

Set up your web server

# Compile the server; Go compiler required
make bgt-server
GOMAXPROCS=4 ./bgt-server -d anno11-1M.fmf.gz 1kg11-1M.bgt 2> server.log &
curl -s '0.0.0.0:8000' | less -S  # help
curl -s '0.0.0.0:8000/?a=(impact=="HIGH")&s=(population=="FIN")&f=(AC>0)'

Getting Started
Users' Guide
Further Notes
- Other genotype formats
- Performance evaluation

<a name="guide"></a>Users' Guide

BGT is a compact file format for efficiently storing and querying whole-genome genotypes of tens to hundreds of thousands of samples. It can be considered as an alternative to genotype-only BCFv2. BGT is more compact in size, more efficient to process, and more flexible on query.

BGT comes with a command line tool and a web application which largely mirrors the command line uses. The tool supports expressive and powerful query syntax. The "Getting Started" section shows a few examples.

<a name="model"></a>1. Data model overview

BGT models a genotype data set as a matrix of genotypes with rows indexed by site and columns by sample. Each BGT database keeps a genetype matrix and a sample annotation file. Site annotations are kept in a separate file which is intended to be shared across multiple BGT databases. This model is different from VCF in that VCF 1) keeps sample information in the header and 2) stores site annotations in INFO along with genotypes which are not meant to be shared across VCFs.

<a name="import"></a>2. Import

A BGT database always has a genotype matrix and sample names, which are acquired from VCF/BCF. Site annotations and sample phenotypes are optional but are recommended. Flexible meta data query is a distinguishing feature of BGT.

<a name="igenotype"></a>2.1 Import genotypes

# Import BCFv2
bgt import prefix.bgt in.bcf
# Import VCF with "##contig" header lines
bgt import -S prefix.bgt in.vcf.gz
# Import VCF without "##contig" header lines
bgt import -St ref.fa.fai prefix.bgt in.vcf.gz

During import, BGT separates multiple alleles on one VCF line. It discards all INFO fields and FORMAT fields except GT. See section 2.3 about how to use variant annotations with BGT.

<a name="iphenotype"></a>2.2 Import sample phenotypes

After importing VCF/BCF, BGT generates prefix.bgt.spl text file, which for now only has one column of sample names. You can add pheotype data to this file in a format like (fields TAB-delimited):

sample1   gender:Z:M    height:f:1.73     region:Z:WestEurasia     foo:i:10
sample2   gender:Z:F    height:f:1.64     region:Z:WestEurasia     bar:i:20

where each meta annotation takes a format key:type:value with type being Z for a string, f for a real number and i for an integer. We call this format Flat Metadata Format or FMF in brief. You can get samples matching certain conditions with:

bgt fmf prefix.bgt.spl 'height>1.7&&region=="WestEurasia"'
bgt fmf prefix.bgt.spl 'mass/height**2>25&&region=="WestEurasia"'

You can most common arithmetic and logical operators in the condition.

<a name="isite"></a>2.3 Import site annotations

Site annotations are also kept in a FMF file like:

11:209621:1:T  effect:Z:missense_variant   gene:Z:RIC8A  CCDS:Z:CCDS7690.1  CDSpos:i:347
11:209706:1:T  effect:Z:synonymous_variant gene:Z:RIC8A  CCDS:Z:CCDS7690.1  CDSpos:i:432

We provide a script misc/vep2fmf.pl to convert the VEP output with the --pick option to FMF.

Note that due to an implementation limitation, we recommend to use a subset of "important" variants with BGT, for example:

gzip -dc vep-all.fmf.gz | grep -v "effect:Z:MODIFIER" | gzip > vep-important.fmf.gz

Using the full set of variants is fine, but is much slower with the current implementation.

<a name="query"></a>3. Query

A BGT query is composed of output and conditions. The output is VCF by default or can be a TAB-delimited table if requsted. Conditions include genotype-independent site selection with option -r and -a (e.g. variants in a region), genotype-independent sample selection with option -s (e.g. a list of samples), and genotype-dependent site selection with option -f (e.g. allele frequency among selected samples above a threshold). BGT has limited support of genotype-dependent sample selection (e.g. samples having an allele).

BGT has an important concept "sample group". On the command line, each option -s creates a sample group. The #-th option -s populates a pair of AC# and AN# aggregate variables. These variables can be used in output or genotype-dependent site selection.

<a name="givs"></a>3.1 Genotype-independent site selection

# Select by a region
bgt view -r 11:100,000-200,000 1kg11-1M.bgt > out.vcf
# Select by regions in a BED (BGT will read through the entire BGT)
bgt view -B regions.bed 1kg11-1M.bgt > out.vcf
# Select a list of alleles (if on same chr, use random access)
bgt view -a,11:151344:1:G,11:110992:AACTT:A,11:160513::G 1kg11-1M.bgt
# Select by annotations (-d specifies the site annotation database)
bgt view -d anno11-1M.fmf.gz -a'impact=="HIGH"' -CG 1kg11-1M.bgt

It should be noted that in the last command line, BGT will read through the entire annotation file to find the list of matching alleles. It may take several minutes if the site annotation files contains 100 million lines. That is why we recommend to use a subset of important alleles (section 2.3).

<a name="giss"></a>3.2 Genotype-independent sample selection

# Select a list of samples
bgt view -s,HG00171,HG00173 1kg11-1M.bgt
# Select by phenotypes (see also section 2.2)
bgt view -s'population=="CEU"' 1kg11-1M.bgt
# Create sample groups (there will be AC1/AN1 and AC2/AN2 in VCF INFO)
bgt view -s'population=="CEU"' -s'population=="YRI"' -G 1kg11-1M.bgt

<a name="gdvs"></a>3.3 Genotype-dependent site selection

# Select by allele frequency
bgt view -f'AN>0&&AC/AN>.05' 1kg11-1M.bgt
# Select by group frequnecy
bgt view -s'population=="CEU"' -s'population=="YRI"' -f'AC1>10&&AC2==0' -G 1kg11-1M.bgt

Of course, we can mix all the three types of conditions in one command line:

bgt view -G -s'population=="CEU"' -s'population=="YRI"' -f'AC1/AN1>.1&&AC2==0' \
         -r 11:100,000-500,000 -d anno11-1M.fmf.gz -a'CDSpos>0' 1kg11-1M.bgt

<a name="tabout"></a>3.4 Tabular output

# Output position, sequence and allele counts
bgt view -t CHROM,POS,REF,ALT,AC1,AC2 -s'population=="CEU"' -s'population=="YRI"' 1kg11-1M.bgt

<a name="miscout"></a>3.5 Miscellaneous output

# Get samples having a set of alleles (option -S)
bgt view -S -a,11:151344:1:G,11:110992:AACTT:A,11:160513::G -s'population=="CEU"' 1kg11-1M.bgt
# Count haplotypes
bgt view -Hd anno11-1M.fmf.gz -a'gene=="SIRT3"' -f 'AC/AN>.01' 1kg11-1M.bgt
# Count haplotypes in multiple populations
bgt view -Hd anno11-1M.fmf.gz -a'gene=="SIRT3"' -f 'AC/AN>.01' \
         -s'region=="Africa"' -s'region=="EastAsia"' 1kg11-1M.bgt

<a name="server"></a>4. BGT server

In addition to a command line tool, we also provide a prototype web application for genotype query. The query syntax is similar to bgt view as is shown in "Getting Started", but with some notable differences:

The server uses .and. for the logical AND operator && (as & is a special character to HTML).
The server can't load a list of samples from a local file (for security).
The server doesn't support BCF output for now (can be implemented on request).
The server doesn't output genotypes by default (option g required for server).
The server loads site annotations into RAM (for real-time response but requiring more memory).
By default (tunable), the server processes up to 10 million genotypes and then truncates the result.
The server may forbid the output of genotypes of some samples (see below).

<a name="privacy"></a>4.1 Privacy

The BGT server implements a simple mechanism to keep the privacy of samples or a subset of samples. It i

Related Skills

node-connect

353.3k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

111.7k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

353.3k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

353.3k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。

lh3

View profile

View on GitHub

GitHub Stars94

CategoryDevelopment

Updated7mo ago

Forks11

lh3/bgt

Languages

Security Score

92/100

Audited on Sep 9, 2025

No findings

Bgt

Install / Use

README

<a name="started"></a>Getting Started

Connect to a public BGT server

Run BGT locally

Set up your web server

Table of Contents

<a name="guide"></a>Users' Guide

<a name="model"></a>1. Data model overview

<a name="import"></a>2. Import

<a name="igenotype"></a>2.1 Import genotypes

<a name="iphenotype"></a>2.2 Import sample phenotypes

<a name="isite"></a>2.3 Import site annotations

<a name="query"></a>3. Query

<a name="givs"></a>3.1 Genotype-independent site selection

<a name="giss"></a>3.2 Genotype-independent sample selection

<a name="gdvs"></a>3.3 Genotype-dependent site selection

<a name="tabout"></a>3.4 Tabular output

<a name="miscout"></a>3.5 Miscellaneous output

<a name="server"></a>4. BGT server

<a name="privacy"></a>4.1 Privacy

Related Skills