MuPBWT

A PBWT-based light index for UK Biobank scale genotype data.

Generate Convert Improve

Install / Use

/learn @dlcgold/MuPBWT

About this skill

Quality Score

0/100

README

μ-PBWT

A PBWT-based light index for UK Biobank scale genotype data.

Conda install

μ-PBWT is available for Gnu/Linux on conda (bioconda channel):

conda install -c bioconda mupbwt

Build from source

Prepare the cmake for building the current project in ‘.’ into the ‘build’ folder

cmake -S . -B build

Build μ-PBWT:

cmake --build build

Install from source (optional)

Install μ-PBWT (default in /usr/local/bin/, sudo required):

cmake --install build

Use --prefix <path> for custom path.

Usage

File format supported:

BCF/VCF
MaCS

cd build


Usage: ./mupbwt [options]

Options:
  -i, --input_file <path>	 vcf/bcf file for panel
  -s, --save <path>	  path to save index
  -l, --load <path>	 path to load index
  -o, --output <path>	 path to query output
  -q, --query <path>	 path to query file (vcf/bcf)
  -m, --macs	use macs as file format for both input and query file
  -v, --verbose	 extra prints
  -d, --details	 print memory usage details
  -h, --help	 show this help message and exit

Build the index:

./mupbwt -i <input file> -s <index file>

Query the index:

./mupbwt -l <index file> -q <query file> -o <output file>

Query without save the index:

./mupbwt -i <input file> -q <query file> -o <output file>

Query and save the index:

./mupbwt -i <input file> -s <index file> -q <query file> -o <output file>

Using examples in sample_data:

./mupbwt -i sample_data/panel.bcf -s sample_data/index.ser
./mupbwt -l sample_data/index.ser -q sample_data/query.bcf -o sample_data/sample_data_results 
./mupbwt -i sample_data/panel.bcf -q sample_data/query.bcf -o sample_data/sample_data_results
./mupbwt -i sample_data/panel.bcf -s sample_data/index.ser -q sample_data/query.bcf -o sample_data/sample_data_results

Load the index and print details to stdout:

./mupbwt -l <index file> -d

An output example is:

> ./mupbwt -l sample_data/index.ser -d
built/loaded in: 0.015628 s

----
Total haplotypes: 900
Total sites: 499
----
Total runs: 27512
Average runs: 55
----
run: 0.0386925 megabytes
thr: 0.0387306 megabytes
uv: 0.0380135 megabytes
samples: 0.0833178 megabytes
rlpbwt (mapping): 0.201148 megabytes
phi panels: 0.414757 megabytes
phi support: 0.126385 megabytes
phi data structure (panels + support): 0.541142 megabytes
rlpbwt: 0.74229 megabytes
----
estimated dense size: 36.4132 megabytes
----

Input

Only bialleic case is supported. In case of vcf/bcf bcftools can be used to filter the input:

bcftools view -m2 -M2  -v snps <input vcf/vcf> > <filtered vcf/bcf file>

Output

Output file follow the standard proposed in Durbin's PBWT. Each row contain a SMEM:

MATCH   <query index>   <row index> <staring column>    <ending column> <SMEM length>

For example:

MATCH	99	150	414	430	17

Row index and query index are incrementally so the name of the sample and the precise haplotype can be calculated using the output of bcftools.

The command:

bcftools query -l <input vcf/bcf> > samples.txt

store in samples.txt all the samples name, in order. So, for example, row indices 0 and 1 corresponds to the two haplotypes of the first sample, row indices 2 and 3 to the second one etc...

Eventually you can use script/mem_sample.py:

> python mem_sample.py -h
usage: mem_sample.py [-h] [-i INPUT] [-p PANEL] [-q QUERIES] [-o OUTPUT]

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        SMEM file in Durbin's format
  -p PANEL, --panel PANEL
                        panel as VCF/BCF (optional)
  -q QUERIES, --queries QUERIES
                        queries as VCF/BCF (optional)
  -o OUTPUT, --output OUTPUT
                        output file

Esample:

python mem_sample.py -i sample_data/sample_data_results -p sample_data/panel.bcf -q sample_data/query.bcf -o sample_data/sample_data_results_new

Only one between PANEL and QUERIES can be specified. New SMEM file will contain in each row:

MATCH   <querySample_haplotype>   <panelSample_haplotype> <staring column>    <ending column> <SMEM length>

For example, assuming both PANEL and QUERIES:

MATCH	1318026_1	4919834_0	414	430	17

Results

Results on high-coverage whole genome sequencing data from UK Biobank (chromosome 20):

| Region | #Samples | #Sites | Size BCF (GB) | μ-PBWT (GB) | Construction time (hh:mm) | Construction memory peak (GB) | |-------------------------|--------------|--------------|-------------------|-----------------|-------------------------------|-----------------------------------| | chr20:60061-4060065 | 150119 | 865267 | 1.9 | 0.88 | 06:25 | 2.27 | | chr20:4060066-8060066 | 150119 | 880899 | 2 | 0.85 | 06:28 | 2.22 | | chr20:8060067-12515479 | 150119 | 961591 | 2.1 | 0.77 | 07:04 | 2.05 | | chr20:12515480-16768988 | 150119 | 917468 | 2 | 0.73 | 06:47 | 1.97 | | chr20:16768989-21050967 | 150119 | 931010 | 2 | 0.71 | 06:53 | 1.92 | | chr20:21050968-31549151 | 150119 | 1919134 | 4.2 | 1.20 | 13:54 | 3.06 | | chr20:31549152-38282825 | 150119 | 1436549 | 2.8 | 0.99 | 10:25 | 2.63 | | chr20:38282826-43181963 | 150119 | 1056144 | 2.2 | 0.76 | 07:42 | 2.06 | | chr20:43181964-47619489 | 150119 | 955970 | 2 | 0.79 | 06:56 | 2.09 | | chr20:47619490-51789198 | 150119 | 923178 | 2 | 0.80 | 06:44 | 2.12 | | chr20:51789199-55789212 | 150119 | 911452 | 2 | 0.81 | 06:45 | 2.13 | | chr20:55789213-59874964 | 150119 | 925442 | 2 | 0.84 | 06:49 | 2.20 | | chr20:59874965-64334101 | 150119 | 1096089 | 2.4 | 0.93 | 08:00 | 2.42 | | Total | 150119 | 13780193 | 29.6 | 11.08 | - | 29.15 |

Results on 1000 Genome Project phase 3 data including the average number of runs per site:

| Chr | #Samples | #Sites | #Runs/site | Size BCF (GB) | μ-PBWT (GB) | Construction time (hh:mm) | Construction memory peak (GB) | |-----------|--------------|--------------|----------------|-------------------|-----------------|-------------------------------|-----------------------------------| | 1 | 2454 | 6196151 | 11 | 0.78 | 1.44 | 00:19 | 4.59 | | 2 | 2454 | 6786300 | 10 | 0.84 | 1.47 | 00:21 | 4.76 | | 3 | 2454 | 5584397 | 10 | 0.71 | 1.20 | 00:18 | 4.24 | | 4 | 2454 | 5480936 | 10 | 0.71 | 1.19 | 00:17 | 4.28 | | 5 | 2454 | 5037955 | 9 | 0.63 | 1.08 | 00:16 | 4.22 | | 6 | 2454 | 4800101 | 10 | 0.64 | 1.06 | 00:15 | 4.28 | | 7 | 2454 | 4517734 | 10 | 0.58 | 1.03 | 00:14 | 4.34 | | 8 | 2454 | 4417368 | 10 | 0.56 | 0.97 | 00:14 | 4.30 | | 9 | 2454 | 3414848 | 11 | 0.43 | 0.81 | 00:11 | 2.54 | | 10 | 2454 | 3823786 | 10 | 0.50 | 0.87 | 00:12 | 2.77 | | 11 | 2454 | 3877543 | 10 | 0.49 | 0.84

Related Skills

node-connect

347.6k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

108.4k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

347.6k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

347.6k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。