Tabixpy
Tabix reader written 100% in Python
tabixpy
Tabix parser written in Python 3.
Website
https://pypi.org/project/tabixpy/
Install
pip install tabixpy
Usage
import tabixpy
ingz = "example.vcf.gz"
data = tabixpy.readTabix(ingz)
tabixpy.save(data, ingz, compress=True)
JSON Schema
https://www.liquid-technologies.com/online-json-schema-validator
https://www.jsonschemavalidator.net/
Example output
{
"n_ref": 1,
"format": 2,
"col_seq": 1,
"col_beg": 2,
"col_end": 0,
"meta": "#",
"skip": 0,
"l_nm": 11,
"names": [
"SL2.50ch00"
],
"refs": [{
"ref_n": 0,
"ref_name": "SL2.50ch00",
"n_bin": 86,
"bins": [{
"bin_n": 0,
"bin": 4681,
"n_chunk": 1,
"chunks": {
"chunk_begin": [{
"bin_n": 0,
"chunk_n": 0,
"real": 0,
"bytes": 29542,
"block_len": 8031,
"bin_pos": 280,
"first_pos": 280,
"last_pos": 1506
}],
"chunk_end": [{
"bin_n": 0,
"chunk_n": 0,
"real": 124525,
"bytes": 19630,
"block_len": 9015,
"bin_pos": 16388,
"first_pos": 16141,
"last_pos": 17830
}]
},
"chunks_begin": {
"bin_n": 0,
"chunk_n": 0,
"real": 0,
"bytes": 29542,
"block_len": 8031,
"bin_pos": 280,
"first_pos": 280,
"last_pos": 1506
},
"chunks_end": {
"bin_n": 0,
"chunk_n": 0,
"real": 124525,
"bytes": 19630,
"block_len": 9015,
"bin_pos": 16388,
"first_pos": 16141,
"last_pos": 17830
}
},
{
"bin_n": 85,
"bin": 4766,
"n_chunk": 1,
"chunks": {
"chunk_begin": [{
"bin_n": 84,
"chunk_n": 0,
"real": 7021611,
"bytes": 4631,
"block_len": 6621,
"bin_pos": 1392700,
"first_pos": 1392519,
"last_pos": 1393974
}],
"chunk_end": [{
"bin_n": 85,
"chunk_n": 0,
"real": 7039684,
"bytes": 0,
"block_len": -1,
"bin_pos": -1,
"first_pos": -1,
"last_pos": -1
}]
},
"chunks_begin": {
"bin_n": 84,
"chunk_n": 0,
"real": 7021611,
"bytes": 4631,
"block_len": 6621,
"bin_pos": 1392700,
"first_pos": 1392519,
"last_pos": 1393974
},
"chunks_end": {
"bin_n": 85,
"chunk_n": 0,
"real": 7039684,
"bytes": 0,
"block_len": -1,
"bin_pos": -1,
"first_pos": -1,
"last_pos": -1
}
}
],
"bins_begin": {
"bin_n": 0,
"chunk_n": 0,
"real": 0,
"bytes": 29542,
"block_len": 8031,
"bin_pos": 280,
"first_pos": 280,
"last_pos": 1506
},
"bins_end": {
"bin_n": 84,
"chunk_n": 0,
"real": 7021611,
"bytes": 4631,
"block_len": 6621,
"bin_pos": 1392700,
"first_pos": 1392519,
"last_pos": 1393974
},
"first_block": {
"bin_n": 0,
"chunk_n": 0,
"real": 0,
"bytes": 29542,
"block_len": 8031,
"bin_pos": 280,
"first_pos": 280,
"last_pos": 1506
},
"last_block": {
"bin_n": -1,
"chunk_n": -1,
"real": 7035849,
"bytes": 0,
"block_len": 3835,
"bin_pos": -1,
"first_pos": 1395124,
"last_pos": 1395638
},
"n_intv": 86,
"intvs": [{
"bin_n": 0,
"chunk_n": 0,
"real": 0,
"bytes": 29542,
"block_len": 8031,
"bin_pos": 280,
"first_pos": 280,
"last_pos": 1506
},
{
"bin_n": 84,
"chunk_n": 0,
"real": 7021611,
"bytes": 4631,
"block_len": 6621,
"bin_pos": 1392700,
"first_pos": 1392519,
"last_pos": 1393974
}
]
}],
"n_no_coor": null,
"__format_name__": "TBJ",
"__format_ver__": 5
}
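As a rough illustration of how the structure above can be consumed, the sketch below loads an uncompressed JSON index and lists each reference with its bin and interval counts. It relies only on the keys shown in the example output (`refs`, `ref_name`, `n_bin`, `n_intv`). The helper names `load_index_json` and `summarize_index` are hypothetical and are not part of the tabixpy API, and the file is assumed to be plain, uncompressed JSON.

```python
import json


def summarize_index(index: dict) -> list:
    """Return (ref_name, n_bin, n_intv) for every reference in the index dict."""
    return [(ref["ref_name"], ref["n_bin"], ref["n_intv"]) for ref in index["refs"]]


def load_index_json(path: str) -> dict:
    # Assumption: `path` points to an uncompressed JSON file with the layout above.
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a loaded index, using values from the example.
    sample = {"refs": [{"ref_name": "SL2.50ch00", "n_bin": 86, "n_intv": 86}]}
    for name, n_bin, n_intv in summarize_index(sample):
        print(name, n_bin, n_intv)
```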
Timing
2020-06-12 11:30:34,716 - tabixpy - INFO - reading annotated_tomato_150.100000.vcf.gz
2020-06-12 11:30:34,738 - tabixpy - INFO - saving annotated_tomato_150.100000.vcf.gz.tbj
,024
2020-06-12 11:31:16,506 - tabixpy - INFO - reading annotated_tomato_150.vcf.bgz
2020-06-12 11:31:24,152 - tabixpy - INFO - saving annotated_tomato_150.vcf.bgz.tbj
8,646
File Sizes
TBI:      Tabix index
TBK:      Binary TabixPy 'all chunks' index
TBJ:      Compressed JSON Tabix index
TBJ.json: Uncompressed JSON Tabix index
6.8M annotated_tomato_150.100000.vcf.gz
1.1K annotated_tomato_150.100000.vcf.gz.tbi
5.9K annotated_tomato_150.100000.vcf.gz.tbk 6.5X
8.9K annotated_tomato_150.100000.vcf.gz.tbj 8.1X
104K annotated_tomato_150.100000.vcf.gz.tbj.json 94.5X
44M annotated_tomato_150.SL2.50ch00-01-02.vcf.gz
4.3K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbi
40K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbk 9.3X
39K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbj 9.1X
468K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbj.json 108.3X
5.6G annotated_tomato_150.vcf.bgz
727K annotated_tomato_150.vcf.bgz.tbi
Tabix
https://samtools.github.io/hts-specs/tabix.pdf
Field      Description                                       Type         Value
---------------------------------------------------------------------------------------
magic      Magic string                                      char[4]      TBI\1
n_ref      # sequences                                       int32_t
format     Format (0: generic; 1: SAM; 2: VCF)               int32_t
col_seq    Column for the sequence name                      int32_t
col_beg    Column for the start of a region                  int32_t
col_end    Column for the end of a region                    int32_t
meta       Leading character for comment lines               int32_t
skip       # lines to skip at the beginning                  int32_t
l_nm       Length of concatenated sequence names             int32_t
names      Concatenated names, each zero terminated          char[l_nm]
======================= List of indices (n=n_ref) =======================
n_bin      # distinct bins (for the binning index)           int32_t
======================= List of distinct bins (n=n_bin) =======================
bin        Distinct bin number                               uint32_t
n_chunk    # chunks                                          int32_t
======================= List of chunks (n=n_chunk) =======================
cnk_beg    Virtual file offset of the start of the chunk     uint64_t
cnk_end    Virtual file offset of the end of the chunk       uint64_t
n_intv     # 16kb intervals (for the linear index)           int32_t
======================= List of distinct intervals (n=n_intv) =======================
ioff       File offset of the first record in the interval   uint64_t
n_no_coor  (optional) # unmapped reads without coordinates   uint64_t
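To make the layout concrete, here is a minimal sketch that decodes the fixed header fields and the sequence names with Python's `struct` module. It assumes the index has already been decompressed (BGZF is a series of gzip members, so `gzip.open` can read it) and that all integers are little-endian, as the specification states. The function names are hypothetical and are not tabixpy's API. `split_virtual_offset` follows the BGZF convention from the SAM specification, where the upper 48 bits hold the compressed block offset and the lower 16 bits the offset inside the uncompressed block.

```python
import gzip
import struct

HEADER_FIELDS = ("n_ref", "format", "col_seq", "col_beg",
                 "col_end", "meta", "skip", "l_nm")


def parse_tabix_header(raw: bytes) -> dict:
    """Decode the magic string, the eight int32 fields, and the names."""
    if raw[:4] != b"TBI\x01":
        raise ValueError(f"not a tabix index: magic={raw[:4]!r}")
    # Eight little-endian int32 values follow the 4-byte magic string.
    header = dict(zip(HEADER_FIELDS, struct.unpack_from("<8i", raw, 4)))
    # meta is stored as an int32 holding the comment character code.
    header["meta"] = chr(header["meta"])
    start = 4 + 8 * 4
    blob = raw[start:start + header["l_nm"]]
    # Names are concatenated and each one is zero-terminated.
    header["names"] = [n.decode("ascii") for n in blob.split(b"\x00") if n]
    return header


def read_tabix_header(path: str) -> dict:
    # gzip.open transparently reads the concatenated gzip members of a BGZF file.
    with gzip.open(path, "rb") as handle:
        return parse_tabix_header(handle.read())


def split_virtual_offset(voffset: int) -> tuple:
    """Split a 64-bit virtual offset into (compressed block offset, offset within block)."""
    return voffset >> 16, voffset & 0xFFFF
```

The per-reference bin, chunk, and interval lists follow directly after the names and could be read the same way with `struct.unpack_from` and a running offset.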
Notes
- The index file is BGZF compressed.
- All integers are little-endian.
- When (format&0x10000) is true, the coordinate follows the BED rule (i.e. half-closed-half-open and zero based); otherwise, the coordinate follows the GFF rule (closed and one based).
- For the SAM format, the end of a region equals POS plus the reference length in the alignment, inferred from CIGAR. For the VCF format, the end of a region equals POS plus the size of the deletion.
- Field col_beg may equal col_end, and in this case, the end of a region is end=beg+1.
- Example:
  - For GFF, format=0, col_seq=1, col_beg=4, col_end=5, meta='#' and skip=0.
  - For BED, format=0x10000, col_seq=1, col_beg=2, col_end=3, meta='#' and skip=0.
- Given a zero-based, half-closed and half-open region [beg, end), the bin number is calculated with the following C function:
int reg2b
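The binning rule can be sketched in Python as follows. This is a transcription of the `reg2bin` function from the tabix specification linked above, not code taken from tabixpy. `gff_to_half_open` is a hypothetical helper added here to show the conversion from the GFF rule (closed, one based) to the zero-based, half-open form that `reg2bin` expects.

```python
def reg2bin(beg: int, end: int) -> int:
    """Bin number for a zero-based, half-open region [beg, end)."""
    end -= 1  # work with the last covered position
    # Try the smallest bins (16 kb) first, then each larger level in turn.
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0  # the region spans more than one top-level bin


def gff_to_half_open(start: int, end: int) -> tuple:
    """Convert a one-based closed interval to a zero-based half-open one."""
    return start - 1, end


if __name__ == "__main__":
    # First 16 kb window, then the window starting at 85 * 16384.
    print(reg2bin(0, 16384))
    print(reg2bin(85 << 14, (85 << 14) + 1))
```

The smallest-level bins start at 4681, so the `bin` values 4681 and 4766 in the example output correspond to 16 kb windows 0 and 85.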
