Tabixpy

Tabix reader written 100% in Python


Tabix parser written in Python 3.

Website

https://pypi.org/project/tabixpy/

Install

pip install tabixpy

Usage

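import tabixpy

# Input file whose Tabix index should be parsed
ingz = "example.vcf.gz"
# Parse the Tabix index information for this file
data = tabixpy.readTabix(ingz)
# Save the parsed index in compressed form (see the .tbj files under File Sizes)
tabixpy.save(data, ingz, compress=True)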

JSON Schema

https://jsonschema.net/home

https://www.liquid-technologies.com/online-json-schema-validator

https://www.jsonschemavalidator.net/

Example output

{
    "n_ref": 1,
    "format": 2,
    "col_seq": 1,
    "col_beg": 2,
    "col_end": 0,
    "meta": "#",
    "skip": 0,
    "l_nm": 11,
    "names": [
        "SL2.50ch00"
    ],
    "refs": [{
        "ref_n": 0,
        "ref_name": "SL2.50ch00",
        "n_bin": 86,
        "bins": [{
                "bin_n": 0,
                "bin": 4681,
                "n_chunk": 1,
                "chunks": {
                    "chunk_begin": [{
                        "bin_n": 0,
                        "chunk_n": 0,
                        "real": 0,
                        "bytes": 29542,
                        "block_len": 8031,
                        "bin_pos": 280,
                        "first_pos": 280,
                        "last_pos": 1506
                    }],
                    "chunk_end": [{
                        "bin_n": 0,
                        "chunk_n": 0,
                        "real": 124525,
                        "bytes": 19630,
                        "block_len": 9015,
                        "bin_pos": 16388,
                        "first_pos": 16141,
                        "last_pos": 17830
                    }]
                },
                "chunks_begin": {
                    "bin_n": 0,
                    "chunk_n": 0,
                    "real": 0,
                    "bytes": 29542,
                    "block_len": 8031,
                    "bin_pos": 280,
                    "first_pos": 280,
                    "last_pos": 1506
                },
                "chunks_end": {
                    "bin_n": 0,
                    "chunk_n": 0,
                    "real": 124525,
                    "bytes": 19630,
                    "block_len": 9015,
                    "bin_pos": 16388,
                    "first_pos": 16141,
                    "last_pos": 17830
                }
            },
            {
                "bin_n": 85,
                "bin": 4766,
                "n_chunk": 1,
                "chunks": {
                    "chunk_begin": [{
                        "bin_n": 84,
                        "chunk_n": 0,
                        "real": 7021611,
                        "bytes": 4631,
                        "block_len": 6621,
                        "bin_pos": 1392700,
                        "first_pos": 1392519,
                        "last_pos": 1393974
                    }],
                    "chunk_end": [{
                        "bin_n": 85,
                        "chunk_n": 0,
                        "real": 7039684,
                        "bytes": 0,
                        "block_len": -1,
                        "bin_pos": -1,
                        "first_pos": -1,
                        "last_pos": -1
                    }]
                },
                "chunks_begin": {
                    "bin_n": 84,
                    "chunk_n": 0,
                    "real": 7021611,
                    "bytes": 4631,
                    "block_len": 6621,
                    "bin_pos": 1392700,
                    "first_pos": 1392519,
                    "last_pos": 1393974
                },
                "chunks_end": {
                    "bin_n": 85,
                    "chunk_n": 0,
                    "real": 7039684,
                    "bytes": 0,
                    "block_len": -1,
                    "bin_pos": -1,
                    "first_pos": -1,
                    "last_pos": -1
                }
            }
        ],
        "bins_begin": {
            "bin_n": 0,
            "chunk_n": 0,
            "real": 0,
            "bytes": 29542,
            "block_len": 8031,
            "bin_pos": 280,
            "first_pos": 280,
            "last_pos": 1506
        },
        "bins_end": {
            "bin_n": 84,
            "chunk_n": 0,
            "real": 7021611,
            "bytes": 4631,
            "block_len": 6621,
            "bin_pos": 1392700,
            "first_pos": 1392519,
            "last_pos": 1393974
        },
        "first_block": {
            "bin_n": 0,
            "chunk_n": 0,
            "real": 0,
            "bytes": 29542,
            "block_len": 8031,
            "bin_pos": 280,
            "first_pos": 280,
            "last_pos": 1506
        },
        "last_block": {
            "bin_n": -1,
            "chunk_n": -1,
            "real": 7035849,
            "bytes": 0,
            "block_len": 3835,
            "bin_pos": -1,
            "first_pos": 1395124,
            "last_pos": 1395638
        },
        "n_intv": 86,
        "intvs": [{
                "bin_n": 0,
                "chunk_n": 0,
                "real": 0,
                "bytes": 29542,
                "block_len": 8031,
                "bin_pos": 280,
                "first_pos": 280,
                "last_pos": 1506
            },
            {
                "bin_n": 84,
                "chunk_n": 0,
                "real": 7021611,
                "bytes": 4631,
                "block_len": 6621,
                "bin_pos": 1392700,
                "first_pos": 1392519,
                "last_pos": 1393974
            }
        ]
    }],
    "n_no_coor": null,
    "__format_name__": "TBJ",
    "__format_ver__": 5
}
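
As a rough illustration of how a structure like the one above might be traversed, the sketch below walks a trimmed-down dictionary with the same keys and collects one row per bin. The `sample` dictionary and the `summarize_bins` helper are hypothetical and not part of tabixpy; only the key names and values are taken from the example output.

```python
# Hypothetical, trimmed-down copy of the example output (only the keys used below).
sample = {
    "names": ["SL2.50ch00"],
    "refs": [{
        "ref_n": 0,
        "ref_name": "SL2.50ch00",
        "n_bin": 86,
        "bins": [
            {"bin_n": 0, "bin": 4681,
             "chunks_begin": {"real": 0, "first_pos": 280, "last_pos": 1506}},
            {"bin_n": 85, "bin": 4766,
             "chunks_begin": {"real": 7021611, "first_pos": 1392519,
                              "last_pos": 1393974}},
        ],
    }],
}


def summarize_bins(index):
    """Return (ref_name, bin, real, first_pos, last_pos) for every listed bin."""
    rows = []
    for ref in index["refs"]:
        for b in ref["bins"]:
            begin = b["chunks_begin"]
            rows.append((ref["ref_name"], b["bin"], begin["real"],
                         begin["first_pos"], begin["last_pos"]))
    return rows


for row in summarize_bins(sample):
    print(row)
```

The same loop should work on a full structure as long as it keeps the `refs` / `bins` / `chunks_begin` nesting shown in the example.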

Timing

2020-06-12 11:30:34,716 - tabixpy -   INFO - reading annotated_tomato_150.100000.vcf.gz
2020-06-12 11:30:34,738 - tabixpy -   INFO - saving  annotated_tomato_150.100000.vcf.gz.tbj
                   ,024
2020-06-12 11:31:16,506 - tabixpy -   INFO - reading annotated_tomato_150.vcf.bgz
2020-06-12 11:31:24,152 - tabixpy -   INFO - saving  annotated_tomato_150.vcf.bgz.tbj
                  8,646

File Sizes

TBI       Tabix index
TBK       Binary TabixPy 'all chunks' index
TBJ       Compressed JSON Tabix index
TBJ.json  Uncompressed JSON Tabix index

6.8M annotated_tomato_150.100000.vcf.gz
1.1K annotated_tomato_150.100000.vcf.gz.tbi
5.9K annotated_tomato_150.100000.vcf.gz.tbk                  6.5X
8.9K annotated_tomato_150.100000.vcf.gz.tbj                  8.1X
104K annotated_tomato_150.100000.vcf.gz.tbj.json            94.5X

 44M annotated_tomato_150.SL2.50ch00-01-02.vcf.gz
4.3K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbi
 40K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbk        9.3X
 39K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbj        9.1X
468K annotated_tomato_150.SL2.50ch00-01-02.vcf.gz.tbj.json 108.3X

5.6G annotated_tomato_150.vcf.bgz
727K annotated_tomato_150.vcf.bgz.tbi

Tabix

https://samtools.github.io/hts-specs/tabix.pdf

Field                   Description                                     Type     Value
---------------------------------------------------------------------------------------
magic                   Magic string                                    char[4]  TBI\1
n_ref                   # sequences                                     int32_t
format                  Format (0: generic; 1: SAM; 2: VCF)             int32_t
col_seq                 Column for the sequence name                    int32_t
col_beg                 Column for the start of a region                int32_t
col_end                 Column for the end of a region                  int32_t
meta                    Leading character for comment lines             int32_t
skip                    # lines to skip at the beginning                int32_t
l_nm                    Length of concatenated sequence names           int32_t
names                   Concatenated names, each zero terminated        char[l_nm]
======================= List of indices (n=n_ref )            =======================
    n_bin               # distinct bins (for the binning index)         int32_t
======================= List of distinct bins (n=n_bin)       =======================
        bin             Distinct bin number                             uint32_t
        n_chunk         # chunks                                        int32_t
======================= List of chunks (n=n_chunk)            =======================
            cnk_beg     Virtual file offset of the start of the chunk   uint64_t
            cnk_end     Virtual file offset of the end of the chunk     uint64_t
    n_intv              # 16kb intervals (for the linear index)         int32_t
======================= List of distinct intervals (n=n_intv) =======================
        ioff            File offset of the first record in the interval uint64_t
n_no_coor (optional)    # unmapped reads without coordinates set        uint64_t
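
To make the layout above concrete, here is a minimal sketch that reads the fixed header fields and the sequence names with Python's standard `gzip` and `struct` modules. It is not tabixpy's implementation: `read_tabix_header` and `split_virtual_offset` are hypothetical helpers, and the offset split assumes the BGZF convention from the SAM specification (upper 48 bits are the compressed block offset, lower 16 bits the offset inside the uncompressed block), which this README does not spell out.

```python
import gzip
import io
import struct


def read_tabix_header(fileobj):
    """Parse the header fields and names from *decompressed* index bytes."""
    magic = fileobj.read(4)
    if magic != b"TBI\x01":
        raise ValueError("not a tabix index: bad magic %r" % magic)
    fields = ("n_ref", "format", "col_seq", "col_beg",
              "col_end", "meta", "skip", "l_nm")
    # Eight little-endian int32 values follow the magic string.
    header = dict(zip(fields, struct.unpack("<8i", fileobj.read(32))))
    # `meta` is stored as an integer character code.
    header["meta"] = chr(header["meta"])
    # Names are concatenated, each one zero-terminated.
    raw = fileobj.read(header["l_nm"])
    header["names"] = [n.decode("ascii") for n in raw.split(b"\0") if n]
    return header


def split_virtual_offset(voffset):
    """Split a virtual offset into (compressed offset, offset within block).

    Assumes the BGZF convention: voffset = coffset << 16 | uoffset.
    """
    return voffset >> 16, voffset & 0xFFFF


# Build a tiny synthetic index in memory so the sketch runs without a real file.
names = b"SL2.50ch00\0"
payload = (b"TBI\x01"
           + struct.pack("<8i", 1, 2, 1, 2, 0, ord("#"), 0, len(names))
           + names)
with gzip.open(io.BytesIO(gzip.compress(payload)), "rb") as handle:
    header = read_tabix_header(handle)
print(header)
```

Because BGZF files are a series of standard gzip members, passing a real `.tbi` path to `gzip.open(path, "rb")` should work the same way; the in-memory payload is used here only to keep the example self-contained.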

Notes

  • The index file is BGZF compressed.

  • All integers are little-endian.

  • When (format&0x10000) is true, the coordinate follows the BED rule (i.e. half-closed-half-open and zero based); otherwise, the coordinate follows the GFF rule (closed and one based).

  • For the SAM format, the end of a region equals POS plus the reference length in the alignment, inferred from CIGAR. For the VCF format, the end of a region equals POS plus the size of the deletion.

  • Field col_beg may equal col_end, and in this case, the end of a region is end=beg+1.

  • Example:

    • For GFF, format=0, col_seq=1, col_beg=4, col_end=5, meta='#' and skip=0.
    • For BED, format=0x10000, col_seq=1, col_beg=2, col_end=3, meta='#' and skip=0.
  • Given a zero-based, half-closed and half-open region [beg, end), the bin number is calculated with the following C function:

int reg2b
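
The C listing is cut off in this copy. As a stand-in, the following Python sketch transcribes the binning scheme from the tabix specification linked above; it is not tabixpy's own code, and `gff_to_half_open` is a hypothetical helper added only to show the coordinate conversion described in the notes.

```python
def reg2bin(beg, end):
    """Bin number for the zero-based, half-open region [beg, end)."""
    end -= 1  # make the end coordinate inclusive
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


def gff_to_half_open(start, stop):
    """Convert one-based, closed coordinates to zero-based, half-open."""
    return start - 1, stop


print(reg2bin(0, 16384))                             # 4681
print(reg2bin(*gff_to_half_open(1392700, 1392700)))  # 4766
```

The two printed values match the `bin` numbers 4681 and 4766 in the example output: the smallest bins each cover 16 kb (2^14 positions) and are numbered from 4681 upward.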
