Pyfastx
a python package for fast random access to sequences from plain and gzipped FASTA/Q files
pyfastx
#######
.. image:: https://github.com/lmdu/pyfastx/actions/workflows/wheel.yml/badge.svg
   :target: https://github.com/lmdu/pyfastx/actions/workflows/wheel.yml
   :alt: Action

.. image:: https://readthedocs.org/projects/pyfastx/badge/?version=latest
   :target: https://pyfastx.readthedocs.io/en/latest/?badge=latest
   :alt: Readthedocs

.. image:: https://codecov.io/gh/lmdu/pyfastx/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/lmdu/pyfastx
   :alt: Codecov

.. image:: https://img.shields.io/pypi/v/pyfastx.svg
   :target: https://pypi.org/project/pyfastx
   :alt: PyPI

.. image:: https://img.shields.io/pypi/wheel/pyfastx.svg
   :target: https://pypi.org/project/pyfastx
   :alt: Wheel

.. image:: https://app.codacy.com/project/badge/Grade/80790fa30f444d9d9ece43689d512dae
   :target: https://app.codacy.com/gh/lmdu/pyfastx/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade
   :alt: Codacy

.. image:: https://img.shields.io/pypi/implementation/pyfastx
   :target: https://pypi.org/project/pyfastx
   :alt: Language

.. image:: https://img.shields.io/pypi/pyversions/pyfastx.svg
   :target: https://pypi.org/project/pyfastx
   :alt: Pyver

.. image:: https://img.shields.io/pypi/dm/pyfastx
   :target: https://pypi.org/project/pyfastx
   :alt: Downloads

.. image:: https://img.shields.io/pypi/l/pyfastx
   :target: https://pypi.org/project/pyfastx
   :alt: License

.. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat
   :target: http://bioconda.github.io/recipes/pyfastx/README.html
   :alt: Bioconda

Citation:
Lianming Du, Qin Liu, Zhenxin Fan, Jie Tang, Xiuyue Zhang, Megan Price, Bisong Yue, Kelei Zhao. Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files. Briefings in Bioinformatics, 2021, 22(4):bbaa368 <https://doi.org/10.1093/bib/bbaa368>_.
.. contents:: Table of Contents
Introduction
pyfastx is a lightweight Python C extension that gives users random access to sequences in plain and gzipped FASTA/Q files. The module aims to provide simple APIs for extracting sequences from FASTA files and reads from FASTQ files by identifier or index number. To avoid consuming an excessive amount of memory, pyfastx builds indexes and stores them in a sqlite3 database file for random access. In addition, pyfastx can parse both standard FASTA (each sequence is spread over multiple lines of the same length) and nonstandard FASTA (each sequence is spread over one or more lines of differing lengths). The module uses `kseq.h <https://github.com/attractivechaos/klib/blob/master/kseq.h>`_, written by `@attractivechaos <https://github.com/attractivechaos>`_ in the `klib <https://github.com/attractivechaos/klib>`_ project, to parse plain FASTA/Q files, and zran.c, written by `@pauldmccarthy <https://github.com/pauldmccarthy>`_ in the `indexed_gzip <https://github.com/pauldmccarthy/indexed_gzip>`_ project, to index gzipped files for random access.
This project was heavily inspired by `@mdshw5 <https://github.com/mdshw5>`_'s project `pyfaidx <https://github.com/mdshw5/pyfaidx>`_ and `@brentp <https://github.com/brentp>`_'s project `pyfasta <https://github.com/brentp/pyfasta>`_.
Features
- Single file for the Python extension
- Lightweight, memory efficient for parsing FASTA/Q file
- Fast random access to sequences from gzipped FASTA/Q file
- Read sequences from FASTA file line by line
- Calculate N50 and L50 of sequences in FASTA file
- Calculate GC content and nucleotides composition
- Extract reverse, complement and antisense sequences
- Excellent compatibility, support for parsing nonstandard FASTA file
- Support for FASTQ quality score conversion
- Provide command line interface for splitting FASTA/Q file
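To make the terms in the feature list concrete, here is a minimal pure-Python sketch of what reverse, complement, and antisense (reverse complement) mean for a DNA string. The helper names are hypothetical and this is not pyfastx's implementation; it only handles the four unambiguous bases.

```python
# Hypothetical helpers for illustration only; pyfastx does this in C.
# Translation table for the four unambiguous DNA bases (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Return the sequence read backwards."""
    return seq[::-1]


def complement(seq):
    """Return the base-paired strand in the same orientation."""
    return seq.translate(_COMPLEMENT)


def antisense(seq):
    """Return the reverse complement of the sequence."""
    return complement(seq)[::-1]


dna = "ATGCCG"
print(reverse(dna))     # GCCGTA
print(complement(dna))  # TACGGC
print(antisense(dna))   # CGGCAT
```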
Installation
Currently, pyfastx supports Python 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14. Make sure you have installed both pip <https://pip.pypa.io/en/stable/installing/>_ and Python before starting.
You can install pyfastx via the Python Package Index (PyPI)
::
pip install pyfastx
Update pyfastx module
::
pip install -U pyfastx
FASTX
New in pyfastx 0.8.0.
pyfastx provides a simple and fast Python binding for kseq.h to iterate over sequences or reads in a FASTA/Q file. The FASTX object automatically detects the input format (FASTA or FASTQ) and returns a tuple whose shape depends on that format.
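As an illustration of what automatic format detection can look like, the sketch below peeks at the first non-blank line of a plain or gzipped file: FASTA records start with ``>`` and FASTQ records start with ``@``. The function ``guess_format`` is hypothetical, and this is an assumption about one reasonable approach, not a description of how pyfastx detects the format internally.

```python
import gzip


def guess_format(path):
    """Guess 'fasta' or 'fastq' from the first non-blank line (hypothetical helper)."""
    # Gzip streams begin with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"

    opener = gzip.open if is_gzip else open
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            break  # first real line is neither kind of header
    raise ValueError("not a recognizable FASTA/Q file: %s" % path)
```

With pyfastx itself you do not need such a helper; passing ``format="fasta"`` or ``format="fastq"`` to ``pyfastx.Fastx`` overrides the detection when required.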
FASTA sequences iteration
When iterating over a FASTX object opened on a FASTA file, a tuple (name, seq) is returned for each sequence.
.. code:: python
>>> import pyfastx
>>> fa = pyfastx.Fastx('tests/data/test.fa.gz')
>>> for name, seq in fa:
...     print(name)
...     print(seq)
>>> # always output uppercase sequence
>>> for item in pyfastx.Fastx('tests/data/test.fa', uppercase=True):
...     print(item)
>>> # manually specify sequence format
>>> for item in pyfastx.Fastx('tests/data/test.fa', format="fasta"):
...     print(item)
If you want the sequence comment, set comment to True. New in pyfastx 0.9.0.
.. code:: python
>>> fa = pyfastx.Fastx('tests/data/test.fa.gz', comment=True)
>>> for name, seq, comment in fa:
...     print(name)
...     print(seq)
...     print(comment)
The comment is the content of the header line after the first space or tab character.
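The name/comment split described above can be sketched in plain Python as follows. ``split_header`` and the example headers are hypothetical and only illustrate the rule; pyfastx performs this split itself when ``comment=True`` is set.

```python
def split_header(header):
    """Split a FASTA/FASTQ header line into (name, comment); hypothetical helper."""
    text = header.rstrip("\r\n")
    # Drop the leading record marker ('>' for FASTA, '@' for FASTQ) if present.
    if text[:1] in (">", "@"):
        text = text[1:]
    # split(None, 1) splits once, at the first run of spaces or tabs.
    parts = text.split(None, 1)
    name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return name, comment


print(split_header(">seq1 example mRNA sequence"))  # ('seq1', 'example mRNA sequence')
print(split_header("@read7\tsample=A"))             # ('read7', 'sample=A')
print(split_header(">seq2"))                        # ('seq2', '')
```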
FASTQ reads iteration
When iterating over a FASTX object opened on a FASTQ file, a tuple (name, seq, qual) is returned for each read.
.. code:: python
>>> fq = pyfastx.Fastx('tests/data/test.fq.gz')
>>> for name, seq, qual in fq:
...     print(name)
...     print(seq)
...     print(qual)
If you want the read comment, set comment to True. New in pyfastx 0.9.0.
.. code:: python
>>> fq = pyfastx.Fastx('tests/data/test.fq.gz', comment=True)
>>> for name, seq, qual, comment in fq:
...     print(name)
...     print(seq)
...     print(qual)
...     print(comment)
The comment is the content of the header line after the first space or tab character.
FASTA
Read FASTA file
Read a plain or gzipped FASTA file and build an index that supports random access to its sequences.
.. code:: python
>>> import pyfastx
>>> fa = pyfastx.Fasta('test/data/test.fa.gz')
>>> fa
<Fasta> test/data/test.fa.gz contains 211 seqs
.. note:: Building the index may take some time; how long depends on the size of the FASTA file. Once the index is built, you can randomly access any sequence in the FASTA file. The index file is reused the next time you read sequences from the same FASTA file, which saves time.
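To show why an on-disk index makes random access cheap, here is a toy sketch that records each sequence's byte position in a sqlite3 database and later seeks straight to it. The table layout, column names, and the functions ``build_index`` and ``fetch_sequence`` are all assumptions made for this illustration; pyfastx's real index schema is different, and gzipped input additionally needs a gzip access index (zran.c), which this sketch does not cover.

```python
import os
import sqlite3
import tempfile


def build_index(fasta_path, index_path):
    """Store (name, byte_start, byte_span) of every record body in SQLite."""
    db = sqlite3.connect(index_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS seq "
        "(name TEXT PRIMARY KEY, byte_start INTEGER, byte_span INTEGER)"
    )
    db.execute("DELETE FROM seq")
    rows = []
    name, start, pos = None, 0, 0
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if name is not None:
                    rows.append((name, start, pos - start))
                # The name is the header text up to the first whitespace.
                name = line[1:].split()[0].decode()
                start = pos + len(line)  # body begins after the header line
            pos += len(line)
    if name is not None:
        rows.append((name, start, pos - start))
    db.executemany("INSERT INTO seq VALUES (?, ?, ?)", rows)
    db.commit()
    return db


def fetch_sequence(fasta_path, db, name):
    """Seek directly to one record using the stored byte position."""
    row = db.execute(
        "SELECT byte_start, byte_span FROM seq WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        raw = handle.read(row[1])
    # Remove line breaks so multi-line records become one string.
    return b"".join(raw.split()).decode()


with tempfile.TemporaryDirectory() as workdir:
    fasta = os.path.join(workdir, "toy.fa")
    with open(fasta, "w", newline="\n") as out:
        out.write(">a first record\nACGT\nAC\n>b\nGGG\n")
    db = build_index(fasta, os.path.join(workdir, "toy.fa.idx"))
    print(fetch_sequence(fasta, db, "a"))  # ACGTAC
    print(fetch_sequence(fasta, db, "b"))  # GGG
    db.close()
```

Because the index lives in a file, a later run could open it with ``sqlite3.connect`` and skip the scan entirely, which is the time saving the note above describes.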
FASTA records iteration
The fastest way to iterate over a plain or gzipped FASTA file is to skip building the index. In that case the iteration returns a tuple containing the name and the sequence.
.. code:: python
>>> import pyfastx
>>> for name, seq in pyfastx.Fasta('test/data/test.fa.gz', build_index=False):
...     print(name, seq)
You can also iterate sequence object from FASTA object like this:
.. code:: python
>>> import pyfastx
>>> for seq in pyfastx.Fasta('test/data/test.fa.gz'):
...     print(seq.name)
...     print(seq.seq)
...     print(seq.description)
Iteration with build_index=True (the default) returns sequence objects, which let you access the attributes of each sequence. New in pyfastx 0.6.3.
Get FASTA information
.. code:: python
>>> # get sequence counts in FASTA
>>> len(fa)
211
>>> # get total sequence length of FASTA
>>> fa.size
86262
>>> # get GC content of DNA sequence of FASTA
>>> fa.gc_content
43.529014587402344
>>> # get GC skew of DNA sequences in FASTA
>>> # New in pyfastx 0.3.8
>>> fa.gc_skew
0.004287730902433395
>>> # get composition of nucleotides in FASTA
>>> fa.composition
{'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}
>>> # get fasta type (DNA, RNA, or protein)
>>> fa.type
'DNA'
>>> # check fasta file is gzip compressed
>>> fa.is_gzip
True
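The GC figures above can be cross-checked against the reported composition using the conventional definitions: GC content is (G + C) / total bases as a percentage, and GC skew is (G - C) / (G + C). That pyfastx uses exactly these formulas is an assumption here, although the numbers agree with the output shown; the last digits differ slightly from the printed values, most likely because of floating-point precision.

```python
# Composition copied from the fa.composition output above.
composition = {'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}

total = sum(composition.values())
gc = composition['G'] + composition['C']

# GC content as a percentage of all counted bases.
gc_content = 100 * gc / total
# GC skew: excess of G over C, normalised by the G + C count.
gc_skew = (composition['G'] - composition['C']) / gc

print(total)                 # 86262, matching fa.size
print(round(gc_content, 3))  # 43.529
print(round(gc_skew, 6))     # 0.004288
```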
Get longest and shortest sequence
New in pyfastx 0.3.0
.. code:: python
>>> # get longest sequence
>>> s = fa.longest
>>> s
<Sequence> JZ822609.1 with length of 821
>>> s.name
'JZ822609.1'
>>> len(s)
821
>>> # get shortest sequence
>>> s = fa.shortest
>>> s
<Sequence> JZ822617.1 with length of 118
>>> s.name
'JZ822617.1'
>>> len(s)
118
Calculate N50 and L50
New in pyfastx 0.3.0
Calculate the assembly N50 and L50; the result is returned as a tuple (N50, L50). Learn more about `N50 and L50 <https://www.molecularecologist.com/2017/03/whats-n50/>`_.
.. code:: python
>>> # get FASTA N50 and L50
>>> fa.nl(50)
(516, 66)
>>> # get FASTA N90 and L90
>>> fa.nl(90)
(231, 161)
>>> # get FASTA N75 and L75
>>> fa.nl(75)
(365, 117)
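For readers unfamiliar with these statistics, the sketch below computes them from a list of sequence lengths using the usual definition: sort lengths from longest to shortest and accumulate until the running total reaches the requested percentage of the total length. Nx is the length of the sequence at which that happens and Lx is how many sequences were needed. The function ``nl_from_lengths`` and the toy lengths are hypothetical; this is not pyfastx's code.

```python
def nl_from_lengths(lengths, percent=50):
    """Return (Nx, Lx) for a list of sequence lengths; hypothetical helper."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    threshold = sum(lengths) * percent / 100
    running = 0
    # Walk from the longest sequence to the shortest.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count


toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total length 54
print(nl_from_lengths(toy_lengths, 50))  # (8, 3): 10 + 9 + 8 = 27 >= 27
print(nl_from_lengths(toy_lengths, 90))  # (4, 7): first seven sum to 49 >= 48.6
```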
Get sequence mean and median length
New in pyfastx 0.3.0
.. code:: python
>>> # get sequence average length
>>> fa.mean
408
>>> # get sequence median length
>>> fa.median
430
Get sequence counts
New in pyfastx 0.3.0
Get counts of sequences whose length >= specified length
.. code:: python
>>> # get counts of sequences with length >= 200 bp
>>> fa.count(200)
173
>>> # get counts of sequences with length >= 500 bp
>>> fa.count(500)
70
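The length statistics in the last two sections (mean, median, and counts above a threshold) are simple functions of the list of sequence lengths. A small sketch with made-up lengths, using only the standard library, shows the idea; ``count_at_least`` is a hypothetical helper and the numbers are not taken from the test file.

```python
import statistics


def count_at_least(lengths, min_length):
    """Count sequences whose length is >= min_length (hypothetical helper)."""
    return sum(1 for n in lengths if n >= min_length)


toy_lengths = [120, 250, 400, 450, 800]  # made-up sequence lengths

print(statistics.mean(toy_lengths))      # 404
print(statistics.median(toy_lengths))    # 400
print(count_at_least(toy_lengths, 200))  # 4
print(count_at_least(toy_lengths, 500))  # 1
```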
Get subsequences
Subsequences can be retrieved from a FASTA file by passing either a single (start, end) interval or a list of such intervals.
.. code:: python
>>> # get subsequence with start and end position
>>> interval = (1, 10)
>>> fa.fetch('JZ822577.1', interval)
'CTCTAGAGAT'
>>> # get subsequences with a list of start and end position
>>> intervals = [(1, 10), (50, 60)]
>>> fa.fetch('JZ822577.1', intervals)
'CTCTAGAGATTTTAGTTTGAC'
>>> # get subsequences with rever
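The outputs above suggest that coordinates are 1-based and inclusive: the interval (1, 10) yields 10 bases, (50, 60) yields 11, and a list of intervals is concatenated in order. The sketch below reproduces that convention on an ordinary Python string. ``slice_intervals`` is a hypothetical helper and the alphabet string is a stand-in, not real sequence data.

```python
def slice_intervals(seq, intervals):
    """Join 1-based, inclusive (start, end) slices of seq; hypothetical helper."""
    # Accept a single (start, end) tuple as well as a list of them.
    if isinstance(intervals, tuple):
        intervals = [intervals]
    pieces = []
    for start, end in intervals:
        if start < 1 or end < start:
            raise ValueError("invalid interval: (%d, %d)" % (start, end))
        # 1-based inclusive [start, end] maps to Python's seq[start - 1:end].
        pieces.append(seq[start - 1:end])
    return "".join(pieces)


toy = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(slice_intervals(toy, (1, 10)))              # ABCDEFGHIJ
print(slice_intervals(toy, [(1, 3), (24, 26)]))   # ABCXYZ
```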
