🐍🟡♦️🟦 PyHMMER

Cython bindings and Python interface to HMMER3.

🗺️ Overview

HMMER is a biological sequence analysis tool that uses profile hidden Markov models to search for sequence homologs. HMMER3 is developed and maintained by the Eddy/Rivas Laboratory at Harvard University.

pyhmmer is a Python package, implemented using the Cython language, that provides bindings to HMMER3. It directly interacts with the HMMER internals, which has the following advantages over CLI wrappers:

single dependency: If your software or your analysis pipeline is distributed as a Python package, you can add pyhmmer as a dependency to your project, and stop worrying about the HMMER binaries being properly setup on the end-user machine.
no intermediate files: Everything happens in memory, in Python objects you have control on, making it easier to pass your inputs to HMMER without needing to write them to a temporary file. Output retrieval is also done in memory, via instances of the pyhmmer.plan7.TopHits class.
no input formatting: The Easel object model is exposed in the pyhmmer.easel module, and you have the possibility to build a DigitalSequence object yourself to pass to the HMMER pipeline. This is useful if your sequences are already loaded in memory, for instance because you obtained them from another Python library (such as Pyrodigal or Biopython).
no output parsing: HMMER3 is notorious for its numerous output files and its fixed-width tabular output, which is hard to parse (even Bio.SearchIO.HmmerIO is struggling on some sequences).
efficient: Using pyhmmer to launch hmmsearch on sequences and HMMs in disk storage is typically as fast as directly using the hmmsearch binary (see the Benchmarks section). pyhmmer.hmmer.hmmsearch uses a different parallelisation strategy compared to the hmmsearch binary from HMMER, which can help getting the most of multiple CPUs when annotating smaller sequence databases.

This library is still a work-in-progress. It follows semantic-versioning, so API changes will be documented, but past v0.10 the API has been more or less stable. It should already pack enough features to run biological analyses or workflows involving hmmsearch, hmmscan, nhmmer, phmmer, hmmbuild and hmmalign.

🔧 Installing

pyhmmer can be installed from PyPI, which hosts some pre-built CPython wheels for Linux and MacOS on x86-64 and Arm64, as well as the code required to compile from source with Cython:

$ pip install pyhmmer

Compilation for UNIX PowerPC is not tested in CI, but should work out of the box. Note than non-UNIX operating systems (such as Windows) are not supported by HMMER.

A Bioconda package is also available:

$ conda install -c bioconda pyhmmer

See the Installation page of the documentation to find other ways to install pyhmmer.

🔖 Citation

PyHMMER is scientific software, with a published paper in the Bioinformatics. Please cite both PyHMMER and HMMER if you are using it in an academic work, for instance as:

PyHMMER (Larralde et al., 2023), a Python library binding to HMMER (Eddy, 2011).

Detailed references are available on the Publications page of the online documentation.

📖 Documentation

A complete API reference can be found in the online documentation, or directly from the command line using pydoc:

$ pydoc pyhmmer.easel
$ pydoc pyhmmer.plan7

💡 Example

Use pyhmmer to run hmmsearch to search for Type 2 PKS domains (t2pks.hmm) inside proteins extracted from the genome of Anaerococcus provencensis (938293.PRJEB85.HG003687.faa). This will produce an iterable over TopHits that can be used for further sorting/querying in Python. Processing happens in parallel using Python threads, and a TopHits object is yielded for every HMM passed in the input iterable.

import pyhmmer

with pyhmmer.easel.SequenceFile("pyhmmer/tests/data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
    sequences = seq_file.read_block()

with pyhmmer.plan7.HMMFile("pyhmmer/tests/data/hmms/txt/t2pks.hmm") as hmm_file:
    for hits in pyhmmer.hmmsearch(hmm_file, sequences, cpus=4):
      print(f"HMM {hits.query.name} found {len(hits)} hits in the target sequences")

Have a look at more in-depth examples such as building a HMM from an alignment, analysing the active site of a hit, or fetching marker genes from a genome in the Examples page of the online documentation.

💭 Feedback

⚠️ Issue Tracker

Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing in on a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

⏱️ Benchmarks

Benchmarks were run on a [i7-10710U CPU](https://ark.intel.com/content/www

Pyhmmer

Install / Use

README