Pyfastx

A Python package for fast random access to sequences from plain and gzipped FASTA/Q files


pyfastx
#######

.. image:: https://github.com/lmdu/pyfastx/actions/workflows/wheel.yml/badge.svg
   :target: https://github.com/lmdu/pyfastx/actions/workflows/wheel.yml
   :alt: Action

.. image:: https://readthedocs.org/projects/pyfastx/badge/?version=latest
   :target: https://pyfastx.readthedocs.io/en/latest/?badge=latest
   :alt: Readthedocs

.. image:: https://codecov.io/gh/lmdu/pyfastx/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/lmdu/pyfastx
   :alt: Codecov

.. image:: https://img.shields.io/pypi/v/pyfastx.svg
   :target: https://pypi.org/project/pyfastx
   :alt: PyPI

.. image:: https://img.shields.io/pypi/wheel/pyfastx.svg
   :target: https://pypi.org/project/pyfastx
   :alt: Wheel

.. image:: https://app.codacy.com/project/badge/Grade/80790fa30f444d9d9ece43689d512dae
   :target: https://app.codacy.com/gh/lmdu/pyfastx/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade
   :alt: Codacy

.. image:: https://img.shields.io/pypi/implementation/pyfastx
   :target: https://pypi.org/project/pyfastx
   :alt: Language

.. image:: https://img.shields.io/pypi/pyversions/pyfastx.svg
   :target: https://pypi.org/project/pyfastx
   :alt: Pyver

.. image:: https://img.shields.io/pypi/dm/pyfastx
   :target: https://pypi.org/project/pyfastx
   :alt: Downloads

.. image:: https://img.shields.io/pypi/l/pyfastx
   :target: https://pypi.org/project/pyfastx
   :alt: License

.. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat
   :target: http://bioconda.github.io/recipes/pyfastx/README.html
   :alt: Bioconda

Citation: Lianming Du, Qin Liu, Zhenxin Fan, Jie Tang, Xiuyue Zhang, Megan Price, Bisong Yue, Kelei Zhao. Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files. `Briefings in Bioinformatics, 2021, 22(4):bbaa368 <https://doi.org/10.1093/bib/bbaa368>`_.

.. contents:: Table of Contents

Introduction

pyfastx is a lightweight Python C extension that gives users random access to sequences in plain and gzipped FASTA/Q files. The module aims to provide simple APIs for extracting sequences from FASTA files and reads from FASTQ files by identifier or index number. To avoid consuming an excessive amount of memory, pyfastx builds indexes and stores them in a sqlite3 database file for random access. In addition, pyfastx can parse both standard FASTA (each sequence is spread over multiple lines of the same length) and nonstandard FASTA (a sequence is spread over one or more lines of differing lengths). The module uses `kseq.h <https://github.com/attractivechaos/klib/blob/master/kseq.h>`_, written by `@attractivechaos <https://github.com/attractivechaos>`_ in the `klib <https://github.com/attractivechaos/klib>`_ project, to parse plain FASTA/Q files, and zran.c, written by `@pauldmccarthy <https://github.com/pauldmccarthy>`_ in the `indexed_gzip <https://github.com/pauldmccarthy/indexed_gzip>`_ project, to index gzipped files for random access.

This project was heavily inspired by `@mdshw5 <https://github.com/mdshw5>`_'s project `pyfaidx <https://github.com/mdshw5/pyfaidx>`_ and `@brentp <https://github.com/brentp>`_'s project `pyfasta <https://github.com/brentp/pyfasta>`_.
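The indexing idea described in the introduction can be illustrated with a minimal pure-Python sketch. This is not pyfastx's implementation or its index schema: the table layout and the helper names ``build_index`` and ``fetch_sequence`` are assumptions made for illustration, and the sketch handles plain (uncompressed) FASTA only. It stores the byte offset and byte length of each record's sequence lines in sqlite3, then seeks directly to one record instead of reading the whole file.

```python
import os
import sqlite3
import tempfile


def build_index(fasta_path, db_path):
    """Store (name, byte offset, byte length) of every record's sequence lines."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE seq (name TEXT PRIMARY KEY, offset INTEGER, nbytes INTEGER)"
    )
    name, start, pos = None, 0, 0
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if name is not None:
                    conn.execute(
                        "INSERT INTO seq VALUES (?, ?, ?)", (name, start, pos - start)
                    )
                # The record name is the header text up to the first whitespace.
                name = line[1:].split()[0].decode()
                start = pos + len(line)
            pos += len(line)
    if name is not None:
        conn.execute("INSERT INTO seq VALUES (?, ?, ?)", (name, start, pos - start))
    conn.commit()
    return conn


def fetch_sequence(fasta_path, conn, name):
    """Seek to one record and return its sequence with line breaks removed."""
    row = conn.execute(
        "SELECT offset, nbytes FROM seq WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    offset, nbytes = row
    with open(fasta_path, "rb") as handle:
        handle.seek(offset)
        raw = handle.read(nbytes)
    return b"".join(raw.split()).decode()


# Made-up two-record FASTA file, used only to exercise the sketch.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fa")
    with open(path, "w", newline="\n") as out:
        out.write(">seq1 first\nACGT\nAC\n>seq2\nGGGTTT\n")
    index = build_index(path, ":memory:")
    demo_seq1 = fetch_sequence(path, index, "seq1")
    demo_seq2 = fetch_sequence(path, index, "seq2")
    index.close()

print(demo_seq1, demo_seq2)
```

Because only the index is held in the database, memory use stays independent of the FASTA file size, which is the motivation given above.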

Features

- Single file for the Python extension
- Lightweight, memory efficient for parsing FASTA/Q file
- Fast random access to sequences from gzipped FASTA/Q file
- Read sequences from FASTA file line by line
- Calculate N50 and L50 of sequences in FASTA file
- Calculate GC content and nucleotides composition
- Extract reverse, complement and antisense sequences
- Excellent compatibility, support for parsing nonstandard FASTA file
- Support for FASTQ quality score conversion
- Provide command line interface for splitting FASTA/Q file
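As a rough illustration of the reverse, complement and antisense feature listed above, here is a pure-Python sketch. It does not use pyfastx; the function names are hypothetical, and it assumes that "antisense" means the reverse complement and that only unambiguous DNA bases are present.

```python
# Translation table for unambiguous DNA bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Return the sequence read backwards."""
    return seq[::-1]


def complement(seq):
    """Return the base-wise complement, in the original order."""
    return seq.translate(COMPLEMENT)


def antisense(seq):
    """Return the reverse complement (assumed meaning of 'antisense')."""
    return complement(seq)[::-1]


dna = "ATGCC"  # made-up example sequence
print(reverse(dna), complement(dna), antisense(dna))
```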

Installation

Currently, pyfastx supports Python 3.8 through 3.14. Make sure that both `pip <https://pip.pypa.io/en/stable/installing/>`_ and Python are installed before starting.

You can install pyfastx from the Python Package Index (PyPI):

pip install pyfastx

To update the pyfastx module:

pip install -U pyfastx

FASTX

New in pyfastx 0.8.0.

pyfastx provides a simple and fast Python binding for kseq.h to iterate over the sequences or reads in a FASTA/Q file. The FASTX object automatically detects the input format (FASTA or FASTQ) and returns a tuple whose shape depends on that format.

FASTA sequences iteration

When iterating over the sequences of a FASTX object, a tuple (name, seq) is returned for each sequence.

.. code:: python

>>> import pyfastx
>>> fa = pyfastx.Fastx('tests/data/test.fa.gz')
>>> for name, seq in fa:
...     print(name)
...     print(seq)

>>> # always output uppercase sequence
>>> for item in pyfastx.Fastx('tests/data/test.fa', uppercase=True):
...     print(item)

>>> # manually specify the sequence format
>>> for item in pyfastx.Fastx('tests/data/test.fa', format="fasta"):
...     print(item)

If you also want the sequence comment, set comment to True. New in pyfastx 0.9.0.

.. code:: python

>>> fa = pyfastx.Fastx('tests/data/test.fa.gz', comment=True)
>>> for name, seq, comment in fa:
...     print(name)
...     print(seq)
...     print(comment)

The comment is the content of header line after the first white space or tab character.
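The name/comment split described above can be sketched in plain Python. This is only an illustration of the rule, not pyfastx code: ``split_header`` is a hypothetical helper, the example header text is made up, and returning an empty string when no comment exists is an assumption of the sketch.

```python
def split_header(header):
    """Split a FASTA/Q header line into (name, comment)."""
    text = header.rstrip("\r\n")
    # Drop the leading record marker ('>' for FASTA, '@' for FASTQ) if present.
    if text[:1] in (">", "@"):
        text = text[1:]
    # Split once on the first run of whitespace (spaces or tabs).
    parts = text.split(None, 1)
    name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return name, comment


with_comment = split_header(">JZ822577.1 made-up description text")
without_comment = split_header("@read1")
print(with_comment, without_comment)
```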

FASTQ reads iteration

When iterating over the reads of a FASTX object, a tuple (name, seq, qual) is returned for each read.

.. code:: python

>>> fq = pyfastx.Fastx('tests/data/test.fq.gz')
>>> for name, seq, qual in fq:
...     print(name)
...     print(seq)
...     print(qual)

If you also want the read comment, set comment to True. New in pyfastx 0.9.0.

.. code:: python

>>> fq = pyfastx.Fastx('tests/data/test.fq.gz', comment=True)
>>> for name, seq, qual, comment in fq:
...     print(name)
...     print(seq)
...     print(qual)
...     print(comment)

The comment is the content of header line after the first white space or tab character.
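To make the shape of the (name, seq, qual) tuples concrete, here is a minimal pure-Python sketch of FASTQ iteration. It is not how pyfastx or kseq.h parses files: ``iter_fastq`` is a hypothetical helper, the reads are made up, and the sketch assumes the simple layout of exactly four lines per read.

```python
import io


def iter_fastq(handle):
    """Yield (name, seq, qual) from a FASTQ stream with four lines per read."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # the '+' separator line carries no data we need
        qual = handle.readline().rstrip("\r\n")
        # The name is the header text after '@', up to the first whitespace.
        name = header[1:].split(None, 1)[0]
        yield name, seq, qual


demo = io.StringIO("@read1 sample comment\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n")
reads = list(iter_fastq(demo))
for name, seq, qual in reads:
    print(name, seq, qual)
```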

FASTA

Read FASTA file

Read a plain or gzipped FASTA file and build an index, which enables random access to the sequences in the file.

.. code:: python

>>> import pyfastx
>>> fa = pyfastx.Fasta('test/data/test.fa.gz')
>>> fa
<Fasta> test/data/test.fa.gz contains 211 seqs

.. note:: Building the index may take some time, depending on the size of the FASTA file. Once the index has been built, you can randomly access any sequence in the FASTA file. The index file is reused the next time you read sequences from the same FASTA file, which saves time.

FASTA records iteration

The fastest way to iterate over a plain or gzipped FASTA file is to skip building the index. In that case, each iteration returns a tuple containing the name and the sequence.

.. code:: python

>>> import pyfastx
>>> for name, seq in pyfastx.Fasta('test/data/test.fa.gz', build_index=False):
...     print(name, seq)

You can also iterate over the sequence objects of a FASTA object like this:

.. code:: python

>>> import pyfastx
>>> for seq in pyfastx.Fasta('test/data/test.fa.gz'):
...     print(seq.name)
...     print(seq.seq)
...     print(seq.description)

Iteration with build_index=True (the default) returns sequence objects, which let you access the attributes of each sequence. New in pyfastx 0.6.3.

Get FASTA information

.. code:: python

>>> # get sequence counts in FASTA
>>> len(fa)
211

>>> # get total sequence length of FASTA
>>> fa.size
86262

>>> # get GC content of DNA sequence of FASTA
>>> fa.gc_content
43.529014587402344

>>> # get GC skew of DNA sequences in FASTA
>>> # New in pyfastx 0.3.8
>>> fa.gc_skew
0.004287730902433395

>>> # get composition of nucleotides in FASTA
>>> fa.composition
{'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}

>>> # get fasta type (DNA, RNA, or protein)
>>> fa.type
'DNA'

>>> # check fasta file is gzip compressed
>>> fa.is_gzip
True
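The GC figures above can be reproduced from the composition dictionary with a short pure-Python sketch. The formulas are assumptions chosen because they match the values shown: GC content as (G + C) divided by the total of A, C, G and T, in percent, and GC skew as (G - C) / (G + C). ``gc_stats`` is a hypothetical helper, not part of pyfastx.

```python
def gc_stats(composition):
    """Return (GC content in percent, GC skew) from a base-count mapping."""
    g = composition.get("G", 0)
    c = composition.get("C", 0)
    total = sum(composition.get(base, 0) for base in "ACGT")
    if total == 0 or g + c == 0:
        raise ValueError("composition must contain G or C counts")
    gc_content = (g + c) / total * 100
    gc_skew = (g - c) / (g + c)
    return gc_content, gc_skew


# Composition copied from the fa.composition output shown above.
composition = {'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}
gc_content, gc_skew = gc_stats(composition)
print(round(gc_content, 3), round(gc_skew, 7))
```

The results agree with ``fa.gc_content`` and ``fa.gc_skew`` above to several decimal places; small differences in the last digits come from floating-point precision.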

Get longest and shortest sequence

New in pyfastx 0.3.0

.. code:: python

>>> # get longest sequence
>>> s = fa.longest
>>> s
<Sequence> JZ822609.1 with length of 821

>>> s.name
'JZ822609.1'

>>> len(s)
821

>>> # get shortest sequence
>>> s = fa.shortest
>>> s
<Sequence> JZ822617.1 with length of 118

>>> s.name
'JZ822617.1'

>>> len(s)
118

Calculate N50 and L50

New in pyfastx 0.3.0

Calculate the assembly N50 and L50; the result is returned as a tuple (N50, L50). Learn more about `N50,L50 <https://www.molecularecologist.com/2017/03/whats-n50/>`_.

.. code:: python

>>> # get FASTA N50 and L50
>>> fa.nl(50)
(516, 66)

>>> # get FASTA N90 and L90
>>> fa.nl(90)
(231, 161)

>>> # get FASTA N75 and L75
>>> fa.nl(75)
(365, 117)
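For readers unfamiliar with these statistics, the following pure-Python sketch computes them from a list of sequence lengths, using the common definition: sort the lengths from longest to shortest and accumulate them until the running total reaches the requested percentage of the total length. The function mirrors the ``nl`` name for readability but is a hypothetical stand-in, not pyfastx's implementation, and the lengths are made up.

```python
def nl(lengths, quantile=50):
    """Return (N, L) for the given percentage of the total length."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * quantile / 100
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            # N is the length that crosses the threshold; L is how many
            # sequences were needed to get there.
            return length, count
    raise ValueError("lengths must not be empty")


lengths = [2, 3, 4, 5, 6]  # made-up sequence lengths, total 20
n50, l50 = nl(lengths, 50)
n90, l90 = nl(lengths, 90)
print((n50, l50), (n90, l90))
```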

Get sequence mean and median length

New in pyfastx 0.3.0

.. code:: python

>>> # get sequence average length
>>> fa.mean
408

>>> # get sequence median length
>>> fa.median
430

Get sequence counts

New in pyfastx 0.3.0

Get the number of sequences whose length is greater than or equal to the specified length.

.. code:: python

>>> # get counts of sequences with length >= 200 bp
>>> fa.count(200)
173

>>> # get counts of sequences with length >= 500 bp
>>> fa.count(500)
70
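The same threshold count can be written in a few lines of plain Python. ``count_at_least`` is a hypothetical helper and the lengths are made up; the sketch only illustrates the ">= specified length" rule stated above.

```python
def count_at_least(lengths, min_length):
    """Count the sequences whose length is >= min_length."""
    return sum(1 for length in lengths if length >= min_length)


lengths = [118, 200, 350, 500, 821]  # made-up sequence lengths
print(count_at_least(lengths, 200), count_at_least(lengths, 500))
```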

Get subsequences

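Subsequences can be retrieved from a FASTA file with a single (start, end) interval or a list of such intervals. As the outputs below show, coordinates are 1-based and both ends are included.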

.. code:: python

>>> # get subsequence with start and end position
>>> interval = (1, 10)
>>> fa.fetch('JZ822577.1', interval)
'CTCTAGAGAT'

>>> # get subsequences with a list of start and end position
>>> intervals = [(1, 10), (50, 60)]
>>> fa.fetch('JZ822577.1', intervals)
'CTCTAGAGATTTTAGTTTGAC'

>>> # get subsequences with rever
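The interval arithmetic behind the forward-strand examples can be sketched in plain Python, assuming 1-based, end-inclusive coordinates (consistent with the 10-base and 11-base results shown). This ``fetch`` is a hypothetical stand-in that works on an in-memory string, not the pyfastx method, and it covers only the forward-strand case.

```python
def fetch(sequence, intervals):
    """Concatenate subsequences given 1-based, end-inclusive intervals."""
    # Accept a single (start, end) pair as well as a list of pairs.
    if len(intervals) == 2 and all(isinstance(x, int) for x in intervals):
        intervals = [intervals]
    pieces = []
    for start, end in intervals:
        if start < 1 or end < start:
            raise ValueError(f"invalid interval: ({start}, {end})")
        # 1-based inclusive [start, end] maps to the Python slice [start-1:end].
        pieces.append(sequence[start - 1:end])
    return "".join(pieces)


alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # made-up stand-in for a sequence
single = fetch(alphabet, (1, 10))
joined = fetch(alphabet, [(1, 3), (24, 26)])
print(single, joined)
```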
