🐍🟡♦️🟦 PyHMMER 
Cython bindings and Python interface to HMMER3.
🗺️ Overview
HMMER is a biological sequence analysis tool that uses profile hidden Markov models to search for sequence homologs. HMMER3 is developed and maintained by the Eddy/Rivas Laboratory at Harvard University.
pyhmmer is a Python package, implemented using the Cython
language, that provides bindings to HMMER3. It directly interacts with the
HMMER internals, which has the following advantages over CLI wrappers:
- single dependency: If your software or your analysis pipeline is distributed as a Python package, you can add pyhmmer as a dependency to your project, and stop worrying about the HMMER binaries being properly set up on the end-user machine.
- no intermediate files: Everything happens in memory, in Python objects you control, making it easier to pass your inputs to HMMER without needing to write them to a temporary file. Output retrieval is also done in memory, via instances of the pyhmmer.plan7.TopHits class.
- no input formatting: The Easel object model is exposed in the pyhmmer.easel module, and you can build a DigitalSequence object yourself to pass to the HMMER pipeline. This is useful if your sequences are already loaded in memory, for instance because you obtained them from another Python library (such as Pyrodigal or Biopython).
- no output parsing: HMMER3 is notorious for its numerous output files and its fixed-width tabular output, which is hard to parse (even Bio.SearchIO.HmmerIO struggles on some sequences).
- efficient: Using pyhmmer to launch hmmsearch on sequences and HMMs in disk storage is typically as fast as directly using the hmmsearch binary (see the Benchmarks section). pyhmmer.hmmer.hmmsearch uses a different parallelisation strategy compared to the hmmsearch binary from HMMER, which can help get the most out of multiple CPUs when annotating smaller sequence databases.
This library is still a work in progress. It follows semantic versioning,
so API changes will be documented, but past v0.10 the API has been more or
less stable. It should already pack enough features to run biological analyses
or workflows involving hmmsearch, hmmscan, nhmmer, phmmer, hmmbuild
and hmmalign.
🔧 Installing
pyhmmer can be installed from PyPI,
which hosts some pre-built CPython wheels for Linux and macOS on x86-64 and Arm64, as well as the code required to compile from source with Cython:
$ pip install pyhmmer
Compilation for UNIX PowerPC is not tested in CI, but should work out of the box. Note that non-UNIX operating systems (such as Windows) are not supported by HMMER.
A Bioconda package is also available:
$ conda install -c bioconda pyhmmer
See the Installation page
of the documentation to find other ways to install pyhmmer.
🔖 Citation
PyHMMER is scientific software, with a published paper in Bioinformatics. Please cite both PyHMMER and HMMER if you are using it in academic work, for instance as:
PyHMMER (Larralde et al., 2023), a Python library binding to HMMER (Eddy, 2011).
Detailed references are available on the Publications page of the online documentation.
📖 Documentation
A complete API reference can
be found in the online documentation, or
directly from the command line using
pydoc:
$ pydoc pyhmmer.easel
$ pydoc pyhmmer.plan7
💡 Example
Use pyhmmer to run hmmsearch to search for Type 2 PKS domains
(t2pks.hmm)
inside proteins extracted from the genome of Anaerococcus provencensis
(938293.PRJEB85.HG003687.faa).
This will produce an iterable over
TopHits that can be used for further sorting/querying in Python.
Processing happens in parallel using Python threads, and a TopHits
object is yielded for every HMM passed in the input iterable.
import pyhmmer

with pyhmmer.easel.SequenceFile("pyhmmer/tests/data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
    sequences = seq_file.read_block()

with pyhmmer.plan7.HMMFile("pyhmmer/tests/data/hmms/txt/t2pks.hmm") as hmm_file:
    for hits in pyhmmer.hmmsearch(hmm_file, sequences, cpus=4):
        print(f"HMM {hits.query.name} found {len(hits)} hits in the target sequences")
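The "further sorting/querying" mentioned above could look roughly like the sketch below. It assumes that iterating over a TopHits yields hit objects exposing name, score and evalue attributes, as pyhmmer.plan7.Hit does. The helper name significant_hits and the 1e-5 cutoff are illustrative choices, not part of pyhmmer, and the stand-in objects exist only so the snippet runs without any HMMER data.

```python
from types import SimpleNamespace


def significant_hits(hits, max_evalue=1e-5):
    """Return (name, score, evalue) tuples for hits under the cutoff, best score first.

    `hits` can be any iterable of objects with `name`, `score` and `evalue`
    attributes; a pyhmmer TopHits is assumed to satisfy this.
    """
    kept = [hit for hit in hits if hit.evalue <= max_evalue]
    kept.sort(key=lambda hit: hit.score, reverse=True)
    return [(hit.name, hit.score, hit.evalue) for hit in kept]


# Hypothetical stand-ins for real hits, used only to make the sketch runnable.
demo_hits = [
    SimpleNamespace(name="protA", score=120.5, evalue=1e-30),
    SimpleNamespace(name="protB", score=15.2, evalue=0.3),
    SimpleNamespace(name="protC", score=210.0, evalue=1e-60),
]

for name, score, evalue in significant_hits(demo_hits):
    print(f"{name}\t{score:.1f}\t{evalue:.1e}")
```

With real data, the same helper would be called on each TopHits object yielded by the hmmsearch loop instead of on demo_hits.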
Have a look at more in-depth examples such as building an HMM from an alignment, analysing the active site of a hit, or fetching marker genes from a genome in the Examples page of the online documentation.
💭 Feedback
⚠️ Issue Tracker
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
🏗️ Contributing
Contributions are more than welcome! See CONTRIBUTING.md for more details.
⏱️ Benchmarks
Benchmarks were run on a [i7-10710U CPU](https://ark.intel.com/content/www
