Tabixpy

Tabix reader written 100% in Python

tabixpy

Tabix parser written in Python 3.

Website

https://pypi.org/project/tabixpy/

Install

pip install tabixpy

Usage

import tabixpy
ingz = "example.vcf.gz"
data = tabixpy.readTabix(ingz)
tabixpy.save(data, ingz, compress=True)

JSON Schema

https://jsonschema.net/home

https://www.liquid-technologies.com/online-json-schema-validator

https://www.jsonschemavalidator.net/

Example output

{
    "n_ref": 1,
    "format": 2,
    "col_seq": 1,
    "col_beg": 2,
    "col_end": 0,
    "meta": "#",
    "skip": 0,
    "l_nm": 11,
    "names": [
        "SL2.50ch00"
    ],
    "refs": [{
        "ref_n": 0,
        "ref_name": "SL2.50ch00",
        "n_bin": 86,
        "bins": [{
                "bin_n": 0,
                "bin": 4681,
                "n_chunk": 1,
                "chunks": {
                    "chunk_begin": [{
                        "bin_n": 0,
                        "chunk_n": 0,
                        "real": 0,
                        "bytes": 29542,
                        "block_len": 8031,
                        "bin_pos": 280,
                        "first_pos": 280,
                        "last_pos": 1506
                    }],
                    "chunk_end": [{
                        "bin_n": 0,
                        "chunk_n": 0,
                        "real": 124525,
                        "bytes": 19630,
                        "block_len": 9015,
                        "bin_pos": 16388,
                        "first_pos": 16141,
                        "last_pos": 17830
                    }]
                },
                "chunks_begin": {
                    "bin_n": 0,
                    "chunk_n": 0,
                    "real": 0,
                    "bytes": 29542,
                    "block_len": 8031,
                    "bin_pos": 280,
                    "first_pos": 280,
                    "last_pos": 1506
                },
                "chunks_end": {
                    "bin_n": 0,
                    "chunk_n": 0,
                    "real": 124525,
                    "bytes": 19630,
                    "block_len": 9015,
                    "bin_pos": 16388,
                    "first_pos": 16141,
                    "last_pos": 17830
                }
            },
            {
                "bin_n": 85,
                "bin": 4766,
                "n_chunk": 1,
                "chunks": {
                    "chunk_begin": [{
                        "bin_n": 84,
                        "chunk_n": 0,
                        "real": 7021611,
                        "bytes": 4631,
                        "block_len": 6621,
                        "bin_pos": 1392700,
                        "first_pos": 1392519,
                        "last_pos": 1393974
                    }],
                    "chunk_end": [{
                        "bin_n": 85,
                        "chunk_n": 0,
                        "real": 7039684,
                        "bytes": 0,
                        "block_len": -1,
                        "bin_pos": -1,
                        "first_pos": -1,
                        "last_pos": -1
                    }]
                },
                "chunks_begin": {
                    "bin_n": 84,
                    "chunk_n": 0,
                    "real": 7021611,
                    "bytes": 4631,
                    "block_len": 6621,
                    "bin_pos": 1392700,
                    "first_pos": 1392519,
                    "last_pos": 1393974
                },
                "chunks_end": {
                    "bin_n": 85,
                    "chunk_n": 0,
                    "real": 7039684,
                    "bytes": 0,
                    "block_len": -1,
                    "bin_pos": -1,
                    "first_pos": -1,
                    "last_pos": -1
                }
            }
        ],
        "bins_begin": {
            "bin_n": 0,
            "chunk_n": 0,
            "real": 0,
            "bytes": 29542,
            "block_len": 8031,
            "bin_pos": 280,
            "first_pos": 280,
            "last_pos": 1506
        },
        "bins_end": {
            "bin_n": 84,
            "chunk_n": 0,
            "real": 7021611,
            "bytes": 4631,
            "block_len": 6621,
            "bin_pos": 1392700,
            "first_pos": 1392519,
            "last_pos": 1393974
        },
        "first_block": {
            "bin_n": 0,
            "chunk_n": 0,
            "real": 0,
            "bytes": 29542,
            "block_len": 8031,
            "bin_pos": 280,
            "first_pos": 280,
            "last_pos": 1506
        },
        "last_block": {
            "bin_n": -1,
            "chunk_n": -1,
            "real": 7035849,
            "bytes": 0,
            "block_len": 3835,
            "bin_pos": -1,
            "first_pos": 1395124,
            "last_pos": 1395638
        },
        "n_intv": 86,
        "intvs": [{
                "bin_n": 0,
                "chunk_n": 0,
                "real": 0,
                "bytes": 29542,
                "block_len": 8031,
                "bin_pos": 280,
                "first_pos": 280,
                "last_pos": 1506
            },
            {
                "bin_n": 84,
                "chunk_n": 0,
                "real": 7021611,
                "bytes": 4631,
                "block_len": 6621,
                "bin_pos": 1392700,
                "first_pos": 1392519,
                "last_pos": 1393974
            }
        ]
    }],
    "n_no_coor": null,
    "__format_name__": "TBJ",
    "__format_ver__": 5
}
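
The nested layout above can be walked with the standard json module. The following is a minimal sketch, assuming an uncompressed index (the .tbj.json form) with the keys shown in the example; `summarize_index` is a hypothetical helper, not part of tabixpy, and the dictionary is a trimmed copy of the example rather than a real file.

```python
import json

def summarize_index(index):
    """Return one (ref_name, n_bin, n_intv, first_pos, last_pos) tuple per reference.

    Hypothetical helper: it only relies on the keys visible in the example output.
    """
    rows = []
    for ref in index["refs"]:
        rows.append((
            ref["ref_name"],
            ref["n_bin"],
            ref["n_intv"],
            ref["first_block"]["first_pos"],
            ref["last_block"]["last_pos"],
        ))
    return rows

# Trimmed copy of the example output; a real index would come from json.load(fh).
example = json.loads("""
{
    "n_ref": 1,
    "names": ["SL2.50ch00"],
    "refs": [{
        "ref_n": 0,
        "ref_name": "SL2.50ch00",
        "n_bin": 86,
        "n_intv": 86,
        "first_block": {"real": 0, "first_pos": 280, "last_pos": 1506},
        "last_block": {"real": 7035849, "first_pos": 1395124, "last_pos": 1395638}
    }],
    "__format_name__": "TBJ",
    "__format_ver__": 5
}
""")

for row in summarize_index(example):
    print(row)
```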

Timing

2020-06-12 11:30:34,716 - tabixpy -   INFO - reading annotated_tomato_150.100000.vcf.gz
2020-06-12 11:30:34,738 - tabixpy -   INFO - saving  annotated_tomato_150.100000.vcf.gz.tbj
                   ,024
2020-06-12 11:31:16,506 - tabixpy -   INFO - reading annotated_tomato_150.vcf.bgz
2020-06-12 11:31:24,152 - tabixpy -   INFO - saving  annotated_tomato_150.vcf.bgz.tbj
                  8,646

File Sizes

TBI Tabix Index

TBK Binary TabixPy 'all chunks' index

TBJ Compressed JSON Tabix index

TBJ.json Uncompressed JSON Tabix Index

6.8M annotated_tomato_150.100000.vcf.gz
1.1K annotated_tomato_150.100000.vcf.gz.tbi
5.9K annotated_tomato_150.100000.vcf.gz.tbk                  6.5X
8.9K annotated_tomato_150.100000.vcf.gz.tbj                  8.1X
104K annotated_tomato_150.100000.vcf.gz.tbj.json            94.5X

 44M annotated_tomato_150.SL2.50ch00-01-02.vcf.gz
4.3K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbi
 40K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbk        9.3X
 39K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbj        9.1X
468K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbj.json 108.3X

5.6G annotated_tomato_150.vcf.bgz
727K annotated_tomato_150.vcf.bgz.tbi

Tabix

https://samtools.github.io/hts-specs/tabix.pdf

Field                   Description                                     Type     Value
---------------------------------------------------------------------------------------
magic                   Magic string                                    char[4]  TBI\1
n_ref                   # sequences                                     int32_t
format                  Format (0: generic; 1: SAM; 2: VCF)             int32_t
col_seq                 Column for the sequence name                    int32_t
col_beg                 Column for the start of a region                int32_t
col_end                 Column for the end of a region                  int32_t
meta                    Leading character for comment lines             int32_t
skip                    # lines to skip at the beginning                int32_t
l_nm                    Length of concatenated sequence names           int32_t
names                   Concatenated names, each zero terminated        char[l_nm]
======================= List of indices (n=n_ref )            =======================
    n_bin               # distinct bins (for the binning index)         int32_t
======================= List of distinct bins (n=n_bin)       =======================
        bin             Distinct bin number                             uint32_t
        n_chunk         # chunks                                        int32_t
======================= List of chunks (n=n_chunk)            =======================
            cnk_beg     Virtual file offset of the start of the chunk   uint64_t
            cnk_end     Virtual file offset of the end of the chunk     uint64_t
    n_intv              # 16kb intervals (for the linear index)         int32_t
======================= List of distinct intervals (n=n_intv) =======================
        ioff            File offset of the first record in the interval uint64_t
n_no_coor (optional)    # unmapped reads without coordinates set        uint64_t
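
The fixed part of the layout above (magic through names) can be read with the standard struct module. This is a minimal sketch and not tabixpy's own code: `parse_tabix_header` and `split_virtual_offset` are hypothetical names, and the input is a synthetic header built in memory rather than a real .tbi file. It relies on BGZF being readable as ordinary multi-member gzip.

```python
import gzip
import struct

HEADER_KEYS = ("n_ref", "format", "col_seq", "col_beg",
               "col_end", "meta", "skip", "l_nm")

def parse_tabix_header(raw):
    """Parse magic, the eight int32 fields and the names from decompressed bytes."""
    if raw[:4] != b"TBI\x01":
        raise ValueError("not a tabix index: bad magic")
    # Eight little-endian int32 values follow the 4-byte magic.
    header = dict(zip(HEADER_KEYS, struct.unpack_from("<8i", raw, 4)))
    start = 4 + 8 * 4
    blob = raw[start:start + header["l_nm"]]
    # Names are concatenated and each one is zero terminated.
    header["names"] = [n.decode("ascii") for n in blob.split(b"\0") if n]
    header["meta"] = chr(header["meta"])
    return header

def split_virtual_offset(voff):
    """Split a BGZF virtual offset into (compressed block offset, offset within block)."""
    return voff >> 16, voff & 0xFFFF

# Synthetic header mirroring the example output (no real file is read here).
names = b"SL2.50ch00\0"
payload = (b"TBI\x01"
           + struct.pack("<8i", 1, 2, 1, 2, 0, ord("#"), 0, len(names))
           + names)
header = parse_tabix_header(gzip.decompress(gzip.compress(payload)))
print(header)
```

The `real` and `bytes` values in the example output look like the two halves returned by `split_virtual_offset`, but that mapping is an assumption made here, not something the README states.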

Notes

The index file is BGZF compressed.
All integers are little-endian.
When (format&0x10000) is true, the coordinate follows the BED rule (i.e. half-closed-half-open and zero based); otherwise, the coordinate follows the GFF rule (closed and one based).
For the SAM format, the end of a region equals POS plus the reference length in the alignment, inferred from CIGAR. For the VCF format, the end of a region equals POS plus the size of the deletion.
Field col_beg may equal col_end, and in this case, the end of a region is end=beg+1.
Example:
- For GFF, format=0, col_seq=1, col_beg=4, col_end=5, meta=‘#’ and skip=0.
- For BED, format=0x10000, col_seq=1, col_beg=2, col_end=3, meta=‘#’ and skip=0.
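
The coordinate rules in these notes can be sketched in a few lines of Python. This is an illustration only: `BED_RULE_FLAG`, `decode_format` and `to_zero_based_half_open` are hypothetical names, and treating the low 16 bits of format as the preset (0: generic; 1: SAM; 2: VCF) is an assumption based on the flag being 0x10000.

```python
BED_RULE_FLAG = 0x10000  # hypothetical name for the bit tested by (format & 0x10000)
PRESETS = {0: "generic", 1: "SAM", 2: "VCF"}

def decode_format(fmt):
    """Return (preset name, True if coordinates follow the BED rule)."""
    return PRESETS.get(fmt & 0xFFFF, "unknown"), bool(fmt & BED_RULE_FLAG)

def to_zero_based_half_open(beg, end, fmt):
    """Convert column values to a zero-based, half-open region [beg, end)."""
    if fmt & BED_RULE_FLAG:
        return beg, end      # BED rule: already zero based and half open
    return beg - 1, end      # GFF rule: one based and closed, so only beg shifts

print(decode_format(0x10000))
print(to_zero_based_half_open(4, 5, 0))
```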
Given a zero-based, half-closed and half-open region [beg, end), the bin number is calculated with the following C function:

int reg2b
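
The C listing is cut off in this copy. As a stand-in, here is a Python sketch of the binning scheme as it is described in the tabix specification linked above (five levels, with 16 kb windows at the finest level). It is a translation written for this README, not code taken from tabixpy.

```python
def reg2bin(beg, end):
    """Bin number for the zero-based, half-open region [beg, end)."""
    end -= 1  # work with the last base actually covered
    # Try the smallest window first (2**14 = 16 kb), then 8x larger ones.
    for shift, bits in ((14, 15), (17, 12), (20, 9), (23, 6), (26, 3)):
        if beg >> shift == end >> shift:
            # ((1 << bits) - 1) // 7 is the number of the first bin at this level.
            return ((1 << bits) - 1) // 7 + (beg >> shift)
    return 0  # the region spans the top-level windows, so it falls in bin 0

print(reg2bin(0, 1))
print(reg2bin(1392699, 1392700))
```

Under this scheme the two bin values in the example output, 4681 and 4766, are the first and the 86th 16 kb bins.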
