Bamnostic
a pure Python multi-version tolerant, runtime and OS-agnostic Binary Alignment Map (BAM) file parser and random access tool
Install / Use
/learn @betteridiot/BamnosticREADME
|Platform | Build Status |
|:--------|:------------:|
|Windows | |
|conda |
|
|Host | Downloads |
|:----|:---------:|
|PyPI | |
|conda|
|
BAMnostic
a pure Python, OS-agnositic Binary Alignment Map (BAM) file parser and random access tool.
Note:
Documentation can be found at here or by going to this address: http://bamnostic.readthedocs.io. Documentation was made available through Read the Docs.
Installation
There are 4 methods of installation available (choose one):</br>
Through the conda package manager (Anaconda Cloud)
# first, add the conda-forge channel to your conda build
conda config --add channels conda-forge
# now bamnostic is available for install
conda install --solver=libmamba bamnostic
Through the Python Package Index (PyPI)
pip install bamnostic
# or, if you don't have superuser access
pip install --user bamnostic
Through pip+Github
# again, use --user if you don't have superuser access
pip install -e git+https://github.com/betteridiot/bamnostic.git#egg=bamnostic
# or, if you don't have superuser access
pip install --user -e git+https://github.com/betteridiot/bamnostic.git#bamnostic#egg=bamnostic
Traditional GitHub clone
git clone https://github.com/betteridiot/bamnostic.git
cd bamnostic
pip install -e .
# or, if you don't have superuser access
pip install --user -e .
Quickstart
Bamnostic is meant to be a reduced drop-in replacement for pysam. As such it has much the same API as pysam with regard to BAM-related operations. </br>
Note: the pileup() method is not supported at this time. </br></br>
Importing
>>> import bamnostic as bs
Loading your BAM file (Note: CRAM format are not supported at this time)
Bamnostic comes with an example BAM (and respective BAI) file just to play around with the output. Note, however, that the example BAM file does not contain many reference contigs. Therefore, random access is limited. This example file is made availble through bamnostic.example_bam, which is a just a string path to the BAM file within the package.
>>> bam = bs.AlignmentFile(bs.example_bam, 'rb')
Get the header
Note: this will print out the SAM header. If the SAM header is not in the BAM file, it will print out the dictionary representation of the BAM header. It is a dictionary of refID keys with contig names and length tuple values.
>>> bam.header
{0: ('chr1', 1575), 1: ('chr2', 1584)}
Data validation through head()
>>>bam.head(n=2)
[EAS56_57:6:190:289:82 69 chr1 100 0 * = 100 0 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:C:192,
EAS56_57:6:190:289:82 137 chr1 100 73 35M = 100 0 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; MF:C:64 Aq:C:0 NM:C:0 UQ:C:0 H0:C:1 H1:C:0]
Getting the first read
>>> first_read = next(bam)
>>> print(first_read)
EAS56_57:6:190:289:82 69 chr1 100 0 * = 100 0 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:C:192
Exploring the read
# read name
>>> print(first_read.read_name)
EAS56_57:6:190:289:82
# 0-based position
>>> print(first_read.pos)
99
# nucleotide sequence
>>> print(first_read.seq)
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
# Read FLAG
>>> print(first_read.flag)
69
# decoded FLAG
>>> bs.utils.flag_decode(first_read.flag)
[(1, 'read paired'), (4, 'read unmapped'), (64, 'first in pair')]
Random Access
>>> for i, read in enumerate(bam.fetch('chr2', 1, 100)):
... if i >= 3:
... break
... print(read)
B7_591:8:4:841:340 73 chr2 1 99 36M * 0 0 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;; MF:C:18 Aq:C:77 NM:C:0 UQ:C:0 H0:C:1 H1:C:0
EAS54_67:4:142:943:582 73 chr2 1 99 35M * 0 0 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA <<<<<<;<<<<<<:<<;<<<<;<<<;<<<:;<<<5 MF:C:18 Aq:C:41 NM:C:0 UQ:C:0 H0:C:1 H1:C:0
EAS54_67:6:43:859:229 153 chr2 1 66 35M * 0 0 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA +37<=<.;<<7.;77<5<<0<<<;<<<27<<<<<< MF:C:32 Aq:C:0 NM:C:0 UQ:C:0 H0:C:1 H1:C:0
Introduction
Next-Generation Sequencing
The field of genomics requires sequencing data produced by Next-Generation sequencing (NGS) platforms (such as Illumina). These data take the form of millions of short strings that represent the nucleotide sequences (A, T, C, or G) of the sample fragments processed by the NGS platform. More information regarding the NGS workflow can be found here </br></br> An example of a single entry (known as FASTQ) can be seen below (FASTQ Format):</br>
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
Each entry details the read name, lenght, string representation, and quality of each aligned base along the read.
SAM/BAM Format
The data from the NGS platforms are often aligned to reference genome. That is, each entry goes through an alignment algorithm that finds the best position that the entry matches along a known reference sequence. The alignment step extends the original entry with a sundry of additional attributes. A few of the included attributes are contig, position, and Compact Idiosyncratic Gapped Alignment Report (CIGAR) string. The modified entry is called the An example Sequence Alignment Map (SAM) entry can be see below (SAM format):
@HD VN:1.5 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
There are many benefits to the SAM format: human-readable, each entry is contained to a single line (supporting simple stream analysis), concise description of the read's quality and position, and a file header metadata that supports integrity and reproducibility. </br></br> Additionally, a compressed form of the SAM format was designed in parallel. It is called the Binary Alignment Map (BAM). Using a series of clever byte encoding of each SAM entry, the data are compressed into specialized, concatenated GZIP blocks called Blocked GNU Zip Format (BGZF) blocks. Each BGZF block contains a finite amount of data (≈65Kb). While the whole file is GZIP compatible, each individual block is also independently GZIP compatible. This data structure, ultimately, makes the file larger than just a normal GZIP file, but it also allow for random access within the file though the use of a BAM Index file (BAI).
BAI
The BAI file, often produced via samtools, requires the BAM file to be sorted prior to indexing. Using a modified R-tree binning strategy, each reference contig is divided into sequential, non-overlapping bins. That is a parent bin may contain numerous children, but none of the children bins overlap another's assigned interval. Each BAM entry is then assigned to the bin that fully contains it. A visual description of the binning strategy can be found here. Each bin is comprised of chunks, and each chunk contains its respective start and stop byte positions within the BAM file. </br> In addition to the bin index, a linear index is produced as well. Again, the reference contig is divided into equally sized windows (covering ≈16Kbp/each).
Related Skills
node-connect
343.1kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
90.0kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
343.1kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
343.1kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
