Pyfaidx

Efficient pythonic random access to fasta subsequences

Generate Convert Improve

Install / Use

/learn @mdshw5/Pyfaidx

About this skill

Quality Score

0/100

README

Description

Samtools provides a function "faidx" (FAsta InDeX), which creates a small flat index file ".fai" allowing for fast random access to any subsequence in the indexed FASTA file, while loading a minimal amount of the file in to memory. This python module implements pure Python classes for indexing, retrieval, and in-place modification of FASTA files using a samtools compatible index. The pyfaidx module is API compatible with the pygr_ seqdb module. A command-line script "faidx_" is installed alongside the pyfaidx module, and facilitates complex manipulation of FASTA files without any programming knowledge.

.. _pygr: https://github.com/cjlee112/pygr

If you use pyfaidx in your publication, please cite:

Shirley MD, Ma Z, Pedersen B, Wheelan S. Efficient "pythonic" access to FASTA files using pyfaidx <https://dx.doi.org/10.7287/peerj.preprints.970v1>_. PeerJ PrePrints 3:e1196. 2015.

.. _Shirley MD: http://github.com/mdshw5 .. _Ma Z: http://github.com/azalea .. _Pedersen B: http://github.com/brentp .. _Wheelan S: http://github.com/swheelan

Installation

This package is tested under Linux and macOS using Python 3.7+, and and is available from the PyPI:

pip install pyfaidx  # add --user if you don't have root

or download a release <https://github.com/mdshw5/pyfaidx/releases>_ and:

pip install .

If using pip install --user make sure to add /home/$USER/.local/bin to your $PATH (on linux) or /Users/$USER/Library/Python/{python version}/bin (on macOS) if you want to run the faidx script.

Python 2.6 and 2.7 users may choose to use a package version from v0.7.2 <https://github.com/mdshw5/pyfaidx/releases/tag/v0.7.2.2>_ or earier.

Usage

.. code:: python

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> genes
Fasta("tests/data/genes.fasta")  # set strict_bounds=True for bounds checking

Acts like a dictionary.

.. code:: python

>>> genes.keys()
('AB821309.1', 'KF435150.1', 'KF435149.1', 'NR_104216.1', 'NR_104215.1', 'NR_104212.1', 'NM_001282545.1', 'NM_001282543.1', 'NM_000465.3', 'NM_001282549.1', 'NM_001282548.1', 'XM_005249645.1', 'XM_005249644.1', 'XM_005249643.1', 'XM_005249642.1', 'XM_005265508.1', 'XM_005265507.1', 'XR_241081.1', 'XR_241080.1', 'XR_241079.1')

>>> genes['NM_001282543.1'][200:230]
>NM_001282543.1:201-230
CTCGTTCCGCGCCCGCCATGGAACCGGATG

>>> genes['NM_001282543.1'][200:230].seq
'CTCGTTCCGCGCCCGCCATGGAACCGGATG'

>>> genes['NM_001282543.1'][200:230].name
'NM_001282543.1'

# Start attributes are 1-based
>>> genes['NM_001282543.1'][200:230].start
201

# End attributes are 0-based
>>> genes['NM_001282543.1'][200:230].end
230

>>> genes['NM_001282543.1'][200:230].fancy_name
'NM_001282543.1:201-230'

>>> len(genes['NM_001282543.1'])
5466

Note that start and end coordinates of Sequence objects are [1, 0]. This can be changed to [0, 0] by passing one_based_attributes=False to Fasta or Faidx. This argument only affects the Sequence .start/.end attributes, and has no effect on slicing coordinates.

Indexes like a list:

.. code:: python

>>> genes[0][:50]
>AB821309.1:1-50
ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAAC

Slices just like a string:

.. code:: python

>>> genes['NM_001282543.1'][200:230][:10]
>NM_001282543.1:201-210
CTCGTTCCGC

>>> genes['NM_001282543.1'][200:230][::-1]
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> genes['NM_001282543.1'][200:230][::3]
>NM_001282543.1:201-230
CGCCCCTACA

>>> genes['NM_001282543.1'][:]
>NM_001282543.1:1-5466
CCCCGCCCCT........

Slicing start and end coordinates are 0-based, just like Python sequences.

Complements and reverse complements just like DNA

.. code:: python

>>> genes['NM_001282543.1'][200:230].complement
>NM_001282543.1 (complement):201-230
GAGCAAGGCGCGGGCGGTACCTTGGCCTAC

>>> genes['NM_001282543.1'][200:230].reverse
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> -genes['NM_001282543.1'][200:230]
>NM_001282543.1 (complement):230-201
CATCCGGTTCCATGGCGGGCGCGGAACGAG

Fasta objects can also be accessed using method calls:

.. code:: python

>>> genes.get_seq('NM_001282543.1', 201, 210)
>NM_001282543.1:201-210
CTCGTTCCGC

>>> genes.get_seq('NM_001282543.1', 201, 210, rc=True)
>NM_001282543.1 (complement):210-201
GCGGAACGAG

Spliced sequences can be retrieved from a list of [start, end] coordinates: TODO update this section

.. code:: python

# new in v0.5.1
segments = [[1, 10], [50, 70]]
>>> genes.get_spliced_seq('NM_001282543.1', segments)
>gi|543583786|ref|NM_001282543.1|:1-70
CCCCGCCCCTGGTTTCGAGTCGCTGGCCTGC

.. _keyfn:

Custom key functions provide cleaner access:

.. code:: python

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', key_function = lambda x: x.split('.')[0])
>>> genes.keys()
dict_keys(['NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
>>> genes['NR_104212'][:10]
>NR_104212:1-10
CCCCGCCCCT

You can specify a character to split names on, which will generate additional entries:

.. code:: python

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', split_char='.', duplicate_action="first") # default duplicate_action="stop"
>>> genes.keys()
dict_keys(['.1', 'NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])

If your key_function or split_char generates duplicate entries, you can choose what action to take:

.. code:: python

# new in v0.4.9
>>> genes = Fasta('tests/data/genes.fasta', split_char="|", duplicate_action="longest")
>>> genes.keys()
dict_keys(['gi', '563317589', 'dbj', 'AB821309.1', '', '557361099', 'gb', 'KF435150.1', '557361097', 'KF435149.1', '543583796', 'ref', 'NR_104216.1', '543583795', 'NR_104215.1', '543583794', 'NR_104212.1', '543583788', 'NM_001282545.1', '543583786', 'NM_001282543.1', '543583785', 'NM_000465.3', '543583740', 'NM_001282549.1', '543583738', 'NM_001282548.1', '530384540', 'XM_005249645.1', '530384538', 'XM_005249644.1', '530384536', 'XM_005249643.1', '530384534', 'XM_005249642.1', '530373237','XM_005265508.1', '530373235', 'XM_005265507.1', '530364726', 'XR_241081.1', '530364725', 'XR_241080.1', '530364724', 'XR_241079.1'])

Filter functions (returning True) limit the index:

.. code:: python

# new in v0.3.8
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', filt_function = lambda x: x[0] == 'N')
>>> genes.keys()
dict_keys(['NR_104212', 'NM_001282543', 'NR_104216', 'NR_104215', 'NM_001282549', 'NM_000465', 'NM_001282545', 'NM_001282548'])
>>> genes['XM_005249644']
KeyError: XM_005249644 not in tests/data/genes.fasta.

Or just get a Python string:

.. code:: python

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', as_raw=True)
>>> genes
Fasta("tests/data/genes.fasta", as_raw=True)

>>> genes['NM_001282543.1'][200:230]
CTCGTTCCGCGCCCGCCATGGAACCGGATG

You can make sure that you always receive an uppercase sequence, even if your fasta file has lower case

.. code:: python

>>> from pyfaidx import Fasta
>>> reference = Fasta('tests/data/genes.fasta.lower', sequence_always_upper=True)
>>> reference['gi|557361099|gb|KF435150.1|'][1:70]

>gi|557361099|gb|KF435150.1|:2-70
TGACATCATTTTCCACCTCTGCTCAGTGTTCAACATCTGACAGTGCTTGCAGGATCTCTCCTGGACAAA

You can also perform line-based iteration, receiving the sequence lines as they appear in the FASTA file:

.. code:: python

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> for line in genes['NM_001282543.1']:
...   print(line)
CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC
...

Sequence names are truncated on any whitespace. This is a limitation of the indexing strategy. However, full names can be recovered:

.. code:: python

# new in v0.3.7
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> for record in genes:
...   print(record.name)
...   print(record.long_name)
...
gi|563317589|dbj|AB821309.1|
gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
gi|557361099|gb|KF435150.1|
gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
gi|557361097|gb|KF435149.1|
gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds
...

# new in v0.4.9
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', read_long_names=True)
>>> for record in genes:
...   print(record.name)
...
gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds

Records can be accessed efficien

Related Skills

node-connect

335.2k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

claude-opus-4-5-migration

82.5k

Migrate prompts and code from Claude Sonnet 4.0, Sonnet 4.5, or Opus 4.1 to Opus 4.5

frontend-design

82.5k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

model-usage

335.2k

Use CodexBar CLI local cost usage to summarize per-model usage for Codex or Claude, including the current (most recent) model or a full model breakdown. Trigger when asked for model-level usage/cost data from codexbar, or when you need a scriptable per-model summary from codexbar cost JSON.

mdshw5

View profile

View on GitHub

GitHub Stars482

CategoryDevelopment

Updated5d ago

Forks76

mdshw5/pyfaidx

Languages

Python

Security Score

85/100

Audited on Mar 19, 2026

No findings