BoPep: Navigating protein sequence landscape with Bayesian optimization

This repository contains the code for BoPep, a method suite for identifying and generating proteins and peptides using Bayesian optimization (BO). While it started as a method specialized for peptide-binder optimization (the Pep in BoPep), it has since expanded to include proteins and other objectives (not necessarily related to binding). BoPep's strength is its modular design; the modules are described in more detail below. To string these modules together, we have currently implemented three methods in the BoPep framework:

BoPep search: This is the core method of BoPep, which uses Bayesian optimization to navigate large datasets in search of proteins that optimize some objective. It currently supports two types of search: a peptidome search, in which a pre-defined peptide list is provided as the search space, and a proteome search, in which a protein list is provided as the search space. For the proteome search, peptides are sampled from the proteome.
BoRF: A design module for generating a large dataset using a diffusion pipeline, which is then searched with BoPep.
BoGA: A module that generates proteins using a surrogate-model-guided evolutionary algorithm. In the literature, these types of methods have been referred to as "machine learning-guided directed evolution" or "surrogate-assisted evolutionary algorithms".

BoPep search and BoRF are showcased in the BoPep preprint, and the genetic algorithm (BoGA) is showcased in the BoGA preprint.

NOTE: We are currently working on updating the documentation for BoPep with more instructive examples and API references. All current examples can be found in /examples.

Installation

To run bopep locally, you will need to clone this repository and install its dependencies, which vary depending on what you want to optimize for. If you wish to use AlphaFold and/or Boltz, you will need to install LocalColabFold and/or Boltz. PyRosetta is always needed, as are the other dependencies listed in requirements.txt. Follow the steps below to set up your environment:

Step 1: Clone the Repository

First, clone the repository to your local machine (not available on pip yet):

git clone https://github.com/ErikHartman/bopep.git
cd bopep

Step 2: Set Up a Virtual Environment

It’s recommended to set up a virtual environment to keep dependencies isolated:

python -m venv bopep_env # Or python3
source bopep_env/bin/activate

Step 3: Install Dependencies

(a) If you would like to dock/fold with AlphaFold, install LocalColabFold: LocalColabFold is a fantastic package that allows you to run ColabFold locally. Follow the installation procedure in the LocalColabFold repository to install it.

Remember to export the PATH variable and make sure colabfold_batch is callable by running:

# For bash or zsh
# e.g. export PATH="/home/moriwaki/Desktop/localcolabfold/colabfold-conda/bin:$PATH"
export PATH="/path/to/your/localcolabfold/colabfold-conda/bin:$PATH"

colabfold_batch --help

This should work if you follow the instructions in the LocalColabFold git repo.

(b) If you would like to dock with Boltz, install Boltz: Boltz-2 is another nice way of docking through co-folding. Follow the Boltz installation instructions to install Boltz-2. Make sure you can run boltz after installing.

Install PyRosetta: PyRosetta is freely available for academic users and is used to score complexes. Commercial use requires purchasing a license.
- Go to the PyRosetta download page and read up on the terms for the license.
- Install PyRosetta in your environment using the pyrosetta-installer with pip:
```
pip install pyrosetta-installer
```
- Then run:
```
python -c 'import pyrosetta_installer; pyrosetta_installer.install_pyrosetta(skip_if_installed=True)'
```

OPTIONAL: Install RFdiffusion and ProteinMPNN: If you wish to run BoRF, you need to install RFdiffusion and ProteinMPNN. To install these packages, follow the instructions in their respective repositories (ProteinMPNN and RFdiffusion).

Install Remaining Dependencies: Finally, install the remaining dependencies required for bopep:
```
pip install -r requirements.txt
```
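
Once the steps above are done, a quick check along the following lines can confirm that the external tools are reachable. This is a minimal sketch and not part of the bopep package: `check_environment` is a hypothetical helper, and the executable names `colabfold_batch` and `boltz` and the module name `pyrosetta` are simply the ones mentioned in the installation steps.

```python
import importlib.util
import shutil

# Executables named in the installation steps; which ones you need
# depends on your use case (AlphaFold via ColabFold and/or Boltz).
EXECUTABLES = ("colabfold_batch", "boltz")


def check_environment(executables=EXECUTABLES, modules=("pyrosetta",)):
    """Hypothetical helper: report which tools and modules can be found."""
    report = {}
    for exe in executables:
        # shutil.which returns the full path if the executable is on PATH, else None.
        report[exe] = shutil.which(exe) is not None
    for mod in modules:
        # find_spec locates a top-level module without importing it.
        report[mod] = importlib.util.find_spec(mod) is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A missing `colabfold_batch` usually means the PATH export from step (a) was not applied in the current shell.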

Modules

BoPep is modular, and the different modules can be strung together into different algorithms. The main modules cover surrogate modelling, structure prediction, scoring (objective functions) and embedding.

Surrogate modelling

The task of the surrogate model is to learn a mapping between the embedding and the score. We use probabilistic deep learning models to do so. The architectures include: BiGRU, BiLSTM and MLP.

The probabilistic modalities include: an ensemble of networks, MC dropout, deep evidential regression and mean-variance estimation.
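
To illustrate what "probabilistic" means here, the sketch below shows two of the ideas in their simplest form: turning the predictions of several ensemble members (or several MC dropout passes) into a mean and an uncertainty, and the Gaussian negative log-likelihood typically used to train mean-variance estimation models. The function names and the toy "networks" are assumptions for illustration and do not reflect the bopep API.

```python
import math
import statistics


def ensemble_predict(models, x):
    """Return (mean, std) over the predictions of several models for one input.

    The spread between members serves as the uncertainty estimate.
    """
    preds = [model(x) for model in models]
    return statistics.fmean(preds), statistics.pstdev(preds)


def gaussian_nll(y, mu, var):
    """Negative log-likelihood of y under N(mu, var).

    A mean-variance estimation network outputs both mu and var and is
    trained by minimising this quantity.
    """
    return 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)


# Three toy stand-ins for trained networks: they agree at x = 0 and
# disagree more the further x moves away from it.
models = [lambda x: 1.0 * x, lambda x: 1.2 * x, lambda x: 0.8 * x]

if __name__ == "__main__":
    print(ensemble_predict(models, 0.0))   # members agree: zero spread
    print(ensemble_predict(models, 10.0))  # members disagree: larger spread
```

The uncertainty is what lets the search trade off exploiting sequences predicted to score well against exploring sequences the model knows little about.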

Structure prediction

Docking is performed to predict the complex structure given a target and a sequence. We support docking via co-folding using AlphaFold 2 implemented via ColabFold and/or Boltz-2. For monomer structure prediction, we support AlphaFold 2 implemented via ColabFold.

Scoring

Scoring of complexes defines the fitness/reward which is maximized during search. When searching for binders, it should correlate with binding probability/affinity. We have defined a score called benchmark_objective_v1 using data from PDBBind and symbolic regression. You can also provide your own objective function easily.

Embedding

Embedding serves to create a navigable space of sequence representations. We support embedding through AAindex and ESM-2. You can also create your own embeddings and pass them to the BO search function.
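
As a rough illustration of a physicochemical embedding, the sketch below maps a sequence to a fixed-length vector using the Kyte-Doolittle hydropathy scale, which is one of the indices catalogued in AAindex. The `embed` function and the choice of summary statistics (mean, min, max) are assumptions made for this example; how bopep aggregates AAindex features, and what input format the search function expects, is not specified in this README.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def embed(sequence):
    """Hypothetical embedding: [mean, min, max] hydropathy of a sequence.

    The output length is fixed regardless of sequence length, so
    sequences of different lengths land in the same space.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    # A KeyError here signals a non-standard residue.
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [sum(values) / len(values), min(values), max(values)]


if __name__ == "__main__":
    for seq in ("IR", "GIGKFLHSAKKFGKAFVGEIMNS"):
        print(seq, embed(seq))
```

A single index gives a very coarse representation; in practice several indices or a learned embedding such as ESM-2 would be used.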

Algorithms

Search datasets

The core of the BoPep framework lies in searching large datasets for binders. It does so by training a surrogate model to predict whether a sequence will bind to a given target, based on embeddings and previous docking experiments. During the search, the sequence dataset is navigated with the help of surrogate models.
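
The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the bopep implementation: each candidate is represented by a single number standing in for its embedding, `toy_objective` stands in for docking plus scoring, the surrogate is a nearest-neighbour stand-in rather than a deep network, and the acquisition function is an upper confidence bound (the README does not state which acquisition functions bopep uses).

```python
import random


def toy_objective(x):
    """Stand-in for docking + scoring: expensive in reality, cheap here."""
    return -(x - 0.7) ** 2


def surrogate_predict(x, observed):
    """Nearest-neighbour stand-in for a probabilistic surrogate.

    Mean: score of the closest evaluated candidate.
    Uncertainty: distance to that candidate.
    """
    nearest_x, nearest_y = min(observed.items(), key=lambda kv: abs(kv[0] - x))
    return nearest_y, abs(nearest_x - x)


def bo_search(candidates, n_init=3, n_iter=5, batch_size=2, kappa=1.0, seed=0):
    rng = random.Random(seed)
    # Start from a few randomly chosen, fully evaluated candidates.
    observed = {x: toy_objective(x) for x in rng.sample(candidates, n_init)}
    for _ in range(n_iter):
        pool = [x for x in candidates if x not in observed]
        if not pool:
            break

        def ucb(x):
            mean, uncertainty = surrogate_predict(x, observed)
            return mean + kappa * uncertainty

        # Evaluate only the candidates the acquisition function ranks highest.
        for x in sorted(pool, key=ucb, reverse=True)[:batch_size]:
            observed[x] = toy_objective(x)
    return observed


candidates = [i / 20 for i in range(21)]
observed = bo_search(candidates)
best = max(observed, key=observed.get)

if __name__ == "__main__":
    print(f"evaluated {len(observed)} of {len(candidates)} candidates, best x = {best}")
```

The point of the loop is that only a fraction of the dataset is ever passed to the expensive objective; the surrogate decides which fraction.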

Protein design with RFdiffusion + Relax + ProteinMPNN, followed by search (BoRF)

We also provide a way to generate large candidate datasets using an RFdiffusion + ProteinMPNN + FastRelax pipeline. By sampling lengths and hotspots, we generate a large, diverse dataset of candidate sequences. We can then apply the BoPep search to this dataset, which makes binder design less computationally expensive.
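
The sampling of lengths and hotspots can be pictured as follows. This is a sketch only: `sample_design_specs`, its parameters, and the residue numbers are hypothetical, and the snippet does not call RFdiffusion or ProteinMPNN. It only shows how varying the binder length and the targeted interface residues yields a diverse set of design specifications.

```python
import random


def sample_design_specs(n, length_range, interface_residues, n_hotspots_range, seed=0):
    """Hypothetical helper: draw n (length, hotspots) design specifications.

    length_range and n_hotspots_range are inclusive (min, max) tuples;
    interface_residues are target residue numbers to choose hotspots from.
    """
    rng = random.Random(seed)
    specs = []
    for _ in range(n):
        length = rng.randint(*length_range)
        k = rng.randint(*n_hotspots_range)
        hotspots = sorted(rng.sample(interface_residues, k))
        specs.append({"length": length, "hotspots": hotspots})
    return specs


if __name__ == "__main__":
    # Illustrative residue numbers, not taken from any real target.
    for spec in sample_design_specs(3, (10, 20), [30, 33, 34, 57, 60, 61], (2, 4)):
        print(spec)
```

Each specification would then be handed to the design pipeline, and the resulting sequences pooled into the dataset that BoPep searches.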

Protein design with machine learning guided evolutionary optimization

The surrogate models can also be leveraged for protein design in an evolutionary optimization loop. The BoGA algorithm uses the surrogate model to prioritize candidates generated by the mutation algorithm, improving the efficiency of the evolutionary search.
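
The idea of surrogate-guided evolution can be sketched as below. This is a toy illustration, not the BoGA implementation: `toy_objective` (counting K and R residues) stands in for the expensive objective, the surrogate is a nearest-neighbour stand-in based on Hamming distance rather than a deep model, and all function names and parameters are assumptions.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def toy_objective(seq):
    """Stand-in for the expensive objective (e.g. folding + scoring)."""
    return seq.count("K") + seq.count("R")


def mutate(seq, rng):
    """Single random point substitution."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def surrogate_score(seq, evaluated):
    """Nearest-neighbour stand-in: score of the closest evaluated sequence."""
    nearest = min(evaluated, key=lambda s: hamming(s, seq))
    return evaluated[nearest]


def surrogate_guided_ga(start, generations=10, n_offspring=50,
                        n_evaluate=5, n_parents=5, seed=0):
    rng = random.Random(seed)
    evaluated = {start: toy_objective(start)}
    for _ in range(generations):
        parents = sorted(evaluated, key=evaluated.get, reverse=True)[:n_parents]
        # Generate many cheap offspring, dropping anything already evaluated.
        offspring = {mutate(rng.choice(parents), rng) for _ in range(n_offspring)}
        offspring = sorted(offspring - set(evaluated))
        # The surrogate ranks the offspring; only the top few are truly evaluated.
        ranked = sorted(offspring, key=lambda s: surrogate_score(s, evaluated),
                        reverse=True)
        for seq in ranked[:n_evaluate]:
            evaluated[seq] = toy_objective(seq)
    return evaluated


if __name__ == "__main__":
    result = surrogate_guided_ga("GIGAFLHSAG")
    best = max(result, key=result.get)
    print(f"best sequence {best} with score {result[best]} "
          f"after {len(result)} evaluations")
```

Here 50 offspring are proposed per generation but at most 5 are evaluated, which is where the efficiency gain over a plain evolutionary search comes from.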

Changelog

We're currently keeping an informal changelog here. As the project goes public we will switch to best practices.

27 Nov 2025 $\rightarrow$ current

I consider this version 1 due to major changes in the workflow. Here, the package expanded to sequence optimization OVERALL. This included unconditional and sequence generation. As such, many APIs and variables were changed from "peptide" to "sequence". Some examples in /examples may still be outdated and be called "peptide" or "pep".

Beginning $\rightarrow$ 27 Nov 2025

I consider this version 0. The main experiments were run to showcase BoPep and BoGA. This version focused on peptide binder searching and design (hence the name BoPep).

Cite

If you use BoPep, please cite:

@article{Hartman2025,
  title = {Navigating the peptide sequence space in search for peptide binders with BoPep},
  url = {http://dx.doi.org/10.1101/2025.01.20.633551},
  DOI = {10.1101/2025.01.20.633551},
  publisher = {openRxiv},
  author = {Hartman,  Erik and Samsudin,  Firdaus and Siljehag Alencar,  Malcolm and Tang,  Di and Bond,  Peter J and Schmidtchen,  Artur and Malmstrom,  Johan},
  year = {2025},
}

If you use BoGA, please cite:

@misc{Hartman2026,
  doi = {10.48550/ARXIV.2603.02753},
  url = {https://arxiv.org/abs/2603.02753},
  author = {Hartman,  Erik and Tang,  Di and Malmstr\"{o}m,  Johan},
  title = {Deep learning-guided evolutionary optimization for protein design},
  publisher = {arXiv},
  year = {2026},
}

Additionally, please cite the relevant papers below for your use case:

@article{Mirdita2022,
  title = {ColabFold: making protein folding accessible to all},
  volume = {19},
  ISSN = {1548-7105},
  url = {http://dx.doi.org/10.1038/s41592-022-01488-1},
  DOI = {10.1038/s41592-022-01488-1},
  number = {6},
  journal = {Nature Methods},
  publisher = {Springer Science and Business Media LLC},
  author = {Mirdita,  Milot and Sch\"{u}tze,  Konstantin and Moriwaki,  Yoshitaka and Heo,  Lim and Ovchinnikov,  Sergey and Steinegger,  Martin},
  year = {2022},
  month = may,
  pages = {679–682}
}

@article{Evans2021,
  title = {Protein complex prediction with AlphaFold-Multimer},
  url = {http://dx.d
