pyfastx
#######

.. image:: https://github.com/lmdu/pyfastx/actions/workflows/wheel.yml/badge.svg
   :target: https://github.com/lmdu/pyfastx/actions/workflows/wheel.yml
   :alt: Action

.. image:: https://readthedocs.org/projects/pyfastx/badge/?version=latest
   :target: https://pyfastx.readthedocs.io/en/latest/?badge=latest
   :alt: Readthedocs

.. image:: https://codecov.io/gh/lmdu/pyfastx/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/lmdu/pyfastx
   :alt: Codecov

.. image:: https://img.shields.io/pypi/v/pyfastx.svg
   :target: https://pypi.org/project/pyfastx
   :alt: PyPI

.. image:: https://img.shields.io/pypi/wheel/pyfastx.svg
   :target: https://pypi.org/project/pyfastx
   :alt: Wheel

.. image:: https://app.codacy.com/project/badge/Grade/80790fa30f444d9d9ece43689d512dae
   :target: https://app.codacy.com/gh/lmdu/pyfastx/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade
   :alt: Codacy

.. image:: https://img.shields.io/pypi/implementation/pyfastx
   :target: https://pypi.org/project/pyfastx
   :alt: Language

.. image:: https://img.shields.io/pypi/pyversions/pyfastx.svg
   :target: https://pypi.org/project/pyfastx
   :alt: Pyver

.. image:: https://img.shields.io/pypi/dm/pyfastx
   :target: https://pypi.org/project/pyfastx
   :alt: Downloads

.. image:: https://img.shields.io/pypi/l/pyfastx
   :target: https://pypi.org/project/pyfastx
   :alt: License

.. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat
   :target: http://bioconda.github.io/recipes/pyfastx/README.html
   :alt: Bioconda

Citation: `Lianming Du, Qin Liu, Zhenxin Fan, Jie Tang, Xiuyue Zhang, Megan Price, Bisong Yue, Kelei Zhao. Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files. Briefings in Bioinformatics, 2021, 22(4):bbaa368 <https://doi.org/10.1093/bib/bbaa368>`_.

.. contents:: Table of Contents

Introduction
============

pyfastx is a lightweight Python C extension that gives users random access to sequences in plain and gzipped FASTA/Q files. The module aims to provide simple APIs for extracting sequences from FASTA files and reads from FASTQ files by identifier or index number. To avoid consuming excessive memory, pyfastx builds an index stored in a sqlite3 database file and uses it for random access. In addition, pyfastx can parse both standard FASTA (each sequence is spread over multiple lines of the same length) and nonstandard FASTA (each sequence is spread over one or more lines of differing lengths). The module uses `kseq.h <https://github.com/attractivechaos/klib/blob/master/kseq.h>`_ written by `@attractivechaos <https://github.com/attractivechaos>`_ in the `klib <https://github.com/attractivechaos/klib>`_ project to parse plain FASTA/Q files, and zran.c written by `@pauldmccarthy <https://github.com/pauldmccarthy>`_ in the `indexed_gzip <https://github.com/pauldmccarthy/indexed_gzip>`_ project to index gzipped files for random access.
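To illustrate why the standard layout helps random access: when every full line of a record holds the same number of bases, the byte offset of any base follows from simple arithmetic, as in faidx-style indexes. The sketch below is a pure-Python illustration of that arithmetic only. It is not pyfastx's C implementation, and the function name ``base_offset`` and the toy record are hypothetical.

```python
def base_offset(seq_start, line_bases, line_bytes, pos):
    """Return the byte offset of the 0-based base `pos` within a record.

    seq_start  -- byte offset of the first base (just after the header line)
    line_bases -- number of bases on each full sequence line
    line_bytes -- number of bytes on each full line, including the terminator
    """
    full_lines, within_line = divmod(pos, line_bases)
    return seq_start + full_lines * line_bytes + within_line


# Toy record (hypothetical): 6 bases per line, 7 bytes per line with "\n".
record = b">seq1\nACGTAC\nGTACGT\nAC\n"
seq_start = record.index(b"\n") + 1
sequence = b"ACGTACGTACGTAC"

# Each base is located directly, without scanning the preceding lines.
located = bytes(
    record[base_offset(seq_start, 6, 7, i)] for i in range(len(sequence))
)
print(located == sequence)
```

For a nonstandard record, where line lengths differ, this shortcut does not apply and an index has to account for the irregular line breaks in some other way.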

This project was heavily inspired by `@mdshw5 <https://github.com/mdshw5>`_'s project `pyfaidx <https://github.com/mdshw5/pyfaidx>`_ and `@brentp <https://github.com/brentp>`_'s project `pyfasta <https://github.com/brentp/pyfasta>`_.

Features
========

  • Single file for the Python extension
  • Lightweight and memory efficient when parsing FASTA/Q files
  • Fast random access to sequences in gzipped FASTA/Q files
  • Read sequences from a FASTA file line by line
  • Calculate N50 and L50 of the sequences in a FASTA file
  • Calculate GC content and nucleotide composition
  • Extract reverse, complement and antisense sequences
  • Excellent compatibility, with support for parsing nonstandard FASTA files
  • Support for FASTQ quality score conversion
  • Command line interface for splitting FASTA/Q files
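The reverse, complement and antisense operations named in the list relate to each other as sketched below. This is a pure-Python illustration of the definitions, not pyfastx's API: the helper names are hypothetical, and it assumes plain DNA where characters other than A, C, G and T (for example N) are left unchanged.

```python
# Translation table for DNA bases; any other character passes through as-is.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Return the sequence read backwards."""
    return seq[::-1]


def complement(seq):
    """Return the base-wise complement, in the original order."""
    return seq.translate(_COMPLEMENT)


def antisense(seq):
    """Return the reverse complement, i.e. the opposite strand read 5' to 3'."""
    return complement(seq)[::-1]


print(reverse("AACG"))     # GCAA
print(complement("AACG"))  # TTGC
print(antisense("AACG"))   # CGTT
```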

Installation
============

Currently, pyfastx supports Python 3.8, 3.9, 3.10, 3.11, 3.12, 3.13 and 3.14. Make sure both `pip <https://pip.pypa.io/en/stable/installing/>`_ and Python are installed before starting.

You can install pyfastx from the Python Package Index (PyPI):

::

    pip install pyfastx

To update the pyfastx module:

::

    pip install -U pyfastx

FASTX
=====

New in pyfastx 0.8.0.

Pyfastx provides a simple and fast Python binding for kseq.h to iterate over the sequences or reads in a FASTA/Q file. The FASTX object automatically detects the input format (FASTA or FASTQ) and returns a tuple whose fields depend on that format.
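As a rough illustration of format detection, the two formats can be told apart by the first character of the first record: ``>`` starts a FASTA header and ``@`` starts a FASTQ header. The sketch below is pure Python and is not the code pyfastx runs (that logic lives in its C extension); the function name ``detect_format`` is hypothetical.

```python
import io


def detect_format(handle):
    """Guess 'fasta' or 'fastq' from the first non-empty line of a text handle."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        raise ValueError("unrecognized record header: %r" % line[:20])
    raise ValueError("input contains no records")


print(detect_format(io.StringIO(">seq1 demo\nACGT\n")))    # fasta
print(detect_format(io.StringIO("@read1\nACGT\n+\nIIII\n")))  # fastq
```

For a gzipped file, the same check would apply to the decompressed text, for example via ``gzip.open(path, "rt")``.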

FASTA sequences iteration
-------------------------

Iterating over a FASTX object created from a FASTA file yields a tuple (name, seq) for each sequence.

.. code:: python

    >>> import pyfastx
    >>> fa = pyfastx.Fastx('tests/data/test.fa.gz')
    >>> for name, seq in fa:
    ...     print(name)
    ...     print(seq)

    >>> # always output uppercase sequence
    >>> for item in pyfastx.Fastx('tests/data/test.fa', uppercase=True):
    ...     print(item)

    >>> # manually specify sequence format
    >>> for item in pyfastx.Fastx('tests/data/test.fa', format="fasta"):
    ...     print(item)

If you want the sequence comment, set comment to True. New in pyfastx 0.9.0.

.. code:: python

    >>> fa = pyfastx.Fastx('tests/data/test.fa.gz', comment=True)
    >>> for name, seq, comment in fa:
    ...     print(name)
    ...     print(seq)
    ...     print(comment)

The comment is the content of the header line after the first whitespace or tab character.
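The name/comment split described above can be sketched in pure Python as follows. This only illustrates the rule as stated; the function name ``split_header`` is hypothetical, and returning an empty string when a header has no comment is an assumption, not documented pyfastx behaviour.

```python
def split_header(header):
    """Split a FASTA/FASTQ header line into (name, comment).

    The name runs up to the first space or tab; the comment is the rest.
    """
    header = header.rstrip("\r\n")
    if header[:1] in (">", "@"):
        header = header[1:]  # drop the record marker
    parts = header.split(None, 1)  # split once on the first whitespace run
    name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""  # assumption: empty if absent
    return name, comment


print(split_header(">JZ822577.1 contig1 cDNA library"))
print(split_header("@read1\tlane=3"))
```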

FASTQ reads iteration
---------------------

Iterating over a FASTX object created from a FASTQ file yields a tuple (name, seq, qual) for each read.

.. code:: python

    >>> fq = pyfastx.Fastx('tests/data/test.fq.gz')
    >>> for name, seq, qual in fq:
    ...     print(name)
    ...     print(seq)
    ...     print(qual)

If you want the read comment, set comment to True. New in pyfastx 0.9.0.

.. code:: python

    >>> fq = pyfastx.Fastx('tests/data/test.fq.gz', comment=True)
    >>> for name, seq, qual, comment in fq:
    ...     print(name)
    ...     print(seq)
    ...     print(qual)
    ...     print(comment)

The comment is the content of the header line after the first whitespace or tab character.

FASTA
=====

Read FASTA file
---------------

Read a plain or gzipped FASTA file and build an index that supports random access to the FASTA file.

.. code:: python

    >>> import pyfastx
    >>> fa = pyfastx.Fasta('test/data/test.fa.gz')
    >>> fa
    <Fasta> test/data/test.fa.gz contains 211 seqs

.. note::
    Building the index may take some time, depending on the size of the FASTA file. Once the index is built, you can randomly access any sequence in the FASTA file. The index file is reused the next time you read sequences from the same FASTA file, which saves time.

FASTA records iteration
-----------------------

The fastest way to iterate over a plain or gzipped FASTA file is to skip building the index. Each iteration then returns a tuple containing the name and the sequence.

.. code:: python

    >>> import pyfastx
    >>> for name, seq in pyfastx.Fasta('test/data/test.fa.gz', build_index=False):
    ...     print(name, seq)

You can also iterate over sequence objects from a FASTA object like this:

.. code:: python

    >>> import pyfastx
    >>> for seq in pyfastx.Fasta('test/data/test.fa.gz'):
    ...     print(seq.name)
    ...     print(seq.seq)
    ...     print(seq.description)

Iteration with build_index=True (the default) returns sequence objects, which let you access the attributes of each sequence. New in pyfastx 0.6.3.

Get FASTA information
---------------------

.. code:: python

    >>> # get sequence counts in FASTA
    >>> len(fa)
    211

    >>> # get total sequence length of FASTA
    >>> fa.size
    86262

    >>> # get GC content of DNA sequence of FASTA
    >>> fa.gc_content
    43.529014587402344

    >>> # get GC skew of DNA sequences in FASTA
    >>> # New in pyfastx 0.3.8
    >>> fa.gc_skew
    0.004287730902433395

    >>> # get composition of nucleotides in FASTA
    >>> fa.composition
    {'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}

    >>> # get fasta type (DNA, RNA, or protein)
    >>> fa.type
    'DNA'

    >>> # check fasta file is gzip compressed
    >>> fa.is_gzip
    True
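The GC content and GC skew shown above can be related to the composition with the usual definitions: GC content is (G + C) / (A + C + G + T) as a percentage, and GC skew is (G - C) / (G + C). The sketch below recomputes both from the composition dictionary in pure Python. The helper names are hypothetical, and counting only A, C, G and T in the denominator is an assumption about how ambiguous bases would be treated.

```python
def gc_content(composition):
    """GC content in percent, counting only A, C, G and T (assumption)."""
    gc = composition.get("G", 0) + composition.get("C", 0)
    total = sum(composition.get(base, 0) for base in "ACGT")
    return 100.0 * gc / total


def gc_skew(composition):
    """GC skew, defined as (G - C) / (G + C)."""
    g = composition.get("G", 0)
    c = composition.get("C", 0)
    return (g - c) / (g + c)


# Composition copied from the example output above.
composition = {'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}

print(sum(composition.values()))  # 86262, the same as fa.size above
print(gc_content(composition))
print(gc_skew(composition))
```

Both results closely match ``fa.gc_content`` and ``fa.gc_skew`` in the example; they differ only in the trailing decimal places.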

Get longest and shortest sequence
---------------------------------

New in pyfastx 0.3.0

.. code:: python

    >>> # get longest sequence
    >>> s = fa.longest
    >>> s
    <Sequence> JZ822609.1 with length of 821

    >>> s.name
    'JZ822609.1'

    >>> len(s)
    821

    >>> # get shortest sequence
    >>> s = fa.shortest
    >>> s
    <Sequence> JZ822617.1 with length of 118

    >>> s.name
    'JZ822617.1'

    >>> len(s)
    118

Calculate N50 and L50
---------------------

New in pyfastx 0.3.0

Calculate assembly N50 and L50. The result is returned as (N50, L50). Learn more about `N50,L50 <https://www.molecularecologist.com/2017/03/whats-n50/>`_.

.. code:: python

    >>> # get FASTA N50 and L50
    >>> fa.nl(50)
    (516, 66)

    >>> # get FASTA N90 and L90
    >>> fa.nl(90)
    (231, 161)

    >>> # get FASTA N75 and L75
    >>> fa.nl(75)
    (365, 117)
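For readers unfamiliar with these statistics, the sketch below computes them from a plain list of lengths using the common definition: sort the lengths from longest to shortest and accumulate them until the running total reaches the requested percentage of the total length. The length reached at that point is the N value and the number of sequences used is the L value. This is a pure-Python illustration with a hypothetical function, not pyfastx's implementation, and how pyfastx handles ties at the threshold is not stated in the source.

```python
def nl(lengths, quantile=50):
    """Return (N, L) for the given percentage of the total length."""
    if not 0 < quantile <= 100:
        raise ValueError("quantile must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * quantile / 100
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("no sequences given")


# Toy lengths (hypothetical): total is 54, so the N50 threshold is 27.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(nl(toy_lengths, 50))  # (8, 3): 10 + 9 + 8 = 27
print(nl(toy_lengths, 90))  # (4, 7): 10 + 9 + 8 + 7 + 6 + 5 + 4 = 49 >= 48.6
```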

Get sequence mean and median length
-----------------------------------

New in pyfastx 0.3.0

.. code:: python

    >>> # get sequence average length
    >>> fa.mean
    408

    >>> # get sequence median length
    >>> fa.median
    430

Get sequence counts
-------------------

New in pyfastx 0.3.0

Get the number of sequences whose length is >= the specified length.

.. code:: python

    >>> # get counts of sequences with length >= 200 bp
    >>> fa.count(200)
    173

    >>> # get counts of sequences with length >= 500 bp
    >>> fa.count(500)
    70
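The threshold is inclusive, so a sequence whose length equals the cutoff is counted. A minimal pure-Python equivalent over a list of lengths, with a hypothetical helper name and toy data, looks like this:

```python
def count_at_least(lengths, min_length):
    """Count the lengths that are greater than or equal to min_length."""
    return sum(1 for length in lengths if length >= min_length)


# Toy lengths (hypothetical), not taken from the test file.
toy_lengths = [118, 200, 430, 500, 821]
print(count_at_least(toy_lengths, 200))  # 4, the 200 bp sequence is included
print(count_at_least(toy_lengths, 500))  # 2
```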

Get subsequences
----------------

Subsequences can be retrieved from a FASTA file by passing a single (start, end) interval or a list of such intervals. Coordinates are 1-based and both ends are inclusive; the pieces from a list of intervals are concatenated in order.

.. code:: python

    >>> # get subsequence with start and end position
    >>> interval = (1, 10)
    >>> fa.fetch('JZ822577.1', interval)
    'CTCTAGAGAT'

    >>> # get subsequences with a list of start and end position
    >>> intervals = [(1, 10), (50, 60)]
    >>> fa.fetch('JZ822577.1', intervals)
    'CTCTAGAGATTTTAGTTTGAC'

    >>> # get subsequences with rever
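The coordinate convention in the examples above, where (1, 10) yields 10 bases and (50, 60) yields 11, can be mimicked on an ordinary Python string as sketched below. This is only an illustration of 1-based inclusive intervals mapped onto Python's 0-based slices. The function ``fetch_from_string`` is hypothetical and is not part of pyfastx.

```python
def fetch_from_string(sequence, intervals):
    """Join the 1-based, inclusive (start, end) intervals of a string."""
    if isinstance(intervals, tuple):
        intervals = [intervals]  # accept a single interval as well as a list
    pieces = []
    for start, end in intervals:
        if start < 1 or end < start:
            raise ValueError("invalid interval: (%d, %d)" % (start, end))
        # 1-based inclusive [start, end] maps to the 0-based slice [start-1:end]
        pieces.append(sequence[start - 1:end])
    return "".join(pieces)


alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(fetch_from_string(alphabet, (1, 10)))              # ABCDEFGHIJ
print(fetch_from_string(alphabet, [(1, 3), (24, 26)]))   # ABCXYZ
```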
