
<img src="assets/caliby.jpg" alt="Caliby" width="175"/>

Caliby

Official repository for Caliby, a Potts model-based protein sequence design method that can condition on structural ensembles. For more details, read our preprint: Ensemble-conditioned protein sequence design with Caliby

This repository contains code for sequence design, ensemble generation with Protpardelle-1c, ensemble-conditioned sequence design, and sequence scoring.

Both this repository and Caliby are still under active development, so please reach out if you have any questions or feature requests! Most of the training and dataset preprocessing code needed to re-train Caliby is already included in this repository, and we plan to add more detailed instructions in the future.

<img src="assets/sampling_gif.gif" alt="Sequence design trajectory" width="600"/>

Installation

Follow the instructions below to set up the environment. After installing the environment, edit env_setup.sh to point to your environment directory, and run source env_setup.sh before running any scripts (see the example scripts in examples/scripts).

Option 1: uv installation (preferred)

To run the scripts in this repository, we recommend using uv for package management. If you don't already have uv installed, follow the official installation instructions here.

Then, run the following commands to install the dependencies:

# Clone the repository.
git clone https://github.com/ProteinDesignLab/caliby.git
cd caliby

# Create and activate the environment.
ENV_DIR=envs  # or any other directory of your choice
mkdir -p ${ENV_DIR}
uv venv ${ENV_DIR}/caliby -p python3.12
source ${ENV_DIR}/caliby/bin/activate

# Install Caliby from the local checkout.
uv pip install -e .

Or install directly from GitHub without cloning:

uv pip install "git+https://github.com/ProteinDesignLab/caliby.git"

Option 2: Apptainer installation

If the above workflow does not work for you and you need to run Caliby within an Apptainer container instead, first set ENV_DIR and IMG in build_apptainer.sh to the environment directory and the image path, respectively. Then run ./build_apptainer.sh to download the container and install the environment within it.

After setting up the container and environment, you can run Caliby scripts within the container by wrapping the script in apptainer exec --nv. For example, to run examples/scripts/seq_des.sh within the container, you can run:

IMG=${PWD}/containers/pytorch_24.12.sif
apptainer exec --nv \
  ${IMG} \
  bash -lc '
  source env_setup.sh
  ./examples/scripts/seq_des.sh
'

Download model weights

Model weights are hosted on HuggingFace. Weights are automatically downloaded on first run, so you can skip this step if you prefer. To pre-download all weights at once, run:

./download_model_params.sh

This will download the weights into the model_params/ directory (configurable via MODEL_PARAMS_DIR in env_setup.sh). These weights include the Caliby models, ProteinMPNN, and the Protpardelle-1c model.

We offer the following model checkpoints, specified via the ckpt_name_or_path argument:

Sequence design:

| Model | ckpt_name_or_path | Description |
|-------|---------|-------------|
| Caliby | caliby (default) | Default model trained on all chains in the PDB with 0.3Å Gaussian noise. Trained on monomers only. |
| SolubleCaliby | soluble_caliby | Analog to SolubleMPNN (Goverde et al., 2024) trained by excluding all annotated transmembrane proteins. Trained on monomers only. |
| SolubleCaliby v1 | soluble_caliby_v1 | SolubleCaliby trained on both monomers and interfaces |

Sidechain packing:

| Model | ckpt_name_or_path | Description |
|-------|---------|-------------|
| Caliby packer (0.0Å) | caliby_packer_000 | Sidechain packer trained with 0.0Å noise |
| Caliby packer (0.1Å) | caliby_packer_010 | Sidechain packer trained with 0.1Å noise (recommended for most cases) |
| Caliby packer (0.3Å) | caliby_packer_030 | Sidechain packer trained with 0.3Å noise |

If ckpt_name_or_path does not end with .ckpt, it is treated as a model name and automatically resolved and downloaded. If it ends with .ckpt, it is treated as a file path (e.g., ckpt_name_or_path=/path/to/custom_model.ckpt).
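The rule above can be illustrated with a minimal Python sketch. The function name classify_ckpt is hypothetical and is not part of Caliby's API; it only mirrors the stated ".ckpt suffix means file path" convention and does not perform any download.

```python
def classify_ckpt(ckpt_name_or_path: str) -> tuple[str, str]:
    """Mirror the documented rule: a value ending in '.ckpt' is a file path,
    anything else is a model name to be resolved and downloaded.

    Hypothetical helper for illustration only; not Caliby's own code.
    """
    if ckpt_name_or_path.endswith(".ckpt"):
        return ("path", ckpt_name_or_path)
    return ("name", ckpt_name_or_path)


# A model name from the tables above, and a custom checkpoint path.
print(classify_ckpt("soluble_caliby"))
print(classify_ckpt("/path/to/custom_model.ckpt"))
```

Note that under this rule a local checkpoint saved with a different extension would be interpreted as a model name, so custom checkpoints should keep the .ckpt suffix.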

Usage

Sequence design

To design sequences for a set of PDBs, see examples/scripts/seq_des.sh. This script takes an input_cfg.pdb_dir and designs sequences for all PDBs in that directory.

To design sequences for a subset of PDBs within the directory, see examples/scripts/seq_des_subset.sh. This script takes an input_cfg.pdb_dir and an input_cfg.pdb_name_list (a list of filenames, with extensions, to use from the directory) and designs sequences only for the PDBs specified in the list.
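As a rough sketch of preparing such a list, the Python snippet below collects filenames (with extensions) from a directory and writes them to a file. The on-disk format is an assumption here: one filename per line in a plain-text file. The helper name write_pdb_name_list and the accepted suffixes are also hypothetical, so check examples/scripts/seq_des_subset.sh for the format Caliby actually expects.

```python
import tempfile
from pathlib import Path


def write_pdb_name_list(pdb_dir, keep_stems, out_path, suffixes=(".pdb", ".cif")):
    """Write the selected structure filenames (with extensions) to out_path.

    ASSUMPTION: one filename per line; the real list format may differ.
    """
    names = sorted(
        p.name
        for p in Path(pdb_dir).iterdir()
        if p.suffix in suffixes and p.stem in keep_stems
    )
    Path(out_path).write_text("".join(f"{name}\n" for name in names))
    return names


# Demo on a throwaway directory containing three empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("a.pdb", "b.pdb", "c.pdb"):
        (Path(tmp) / fname).touch()
    list_path = Path(tmp) / "pdb_name_list.txt"
    print(write_pdb_name_list(tmp, {"a", "c"}, list_path))
    print(list_path.read_text())
```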

Backbone ensemble generation with Protpardelle-1c

We found that running sequence design on synthetic ensembles generated by Protpardelle-1c partial diffusion, rather than on a single static structure, produces sequences that are both more diverse and more likely to be predicted by AlphaFold2 to fold into the target structure. To generate ensembles with Protpardelle-1c in a format compatible with Caliby, we have provided a script in examples/scripts/generate_ensembles.sh.

For each PDB provided in input_cfg.pdb_dir, this script will generate num_samples_per_pdb samples with Protpardelle-1c partial diffusion. For ensemble-conditioned sequence design, we recommend generating at least 32 samples per PDB, but 16 or 8 samples can also give good results.

Ensemble-conditioned sequence design

Sequence design with synthetic Protpardelle-1c ensembles

After you've generated ensembles with Protpardelle-1c, you can run examples/scripts/seq_des_ensemble.sh, which performs ensemble-conditioned sequence design on all ensembles in input_cfg.conformer_dir. To run on a subset of the ensembles, use examples/scripts/seq_des_ensemble_subset.sh and provide an input_cfg.pdb_name_list file.

Providing your own ensembles

The Protpardelle-1c ensemble generation script described above will produce a directory structure that is compatible with the seq_des_ensemble.sh script, but if you want to provide your own ensembles, you should format your ensembles as described below.

Given a top-level directory passed into the seq_des_ensemble.sh script (e.g., cc95-epoch3490-sampling_partial_diffusion-ss1.0-schurn0-ccstart0.0-dx0.0-dy0.0-dz0.0-rewind150), each subdirectory is named <PDB_KEY>, representing one ensemble. Inside each ensemble subdirectory, the following files are expected:

  • Primary conformer: The original structure file (identified by <PDB_KEY>.pdb or <PDB_KEY>.cif), which serves as the default representative conformer and is always included in the ensemble.
  • Additional conformers: All other .pdb or .cif files in the directory are treated as conformer files and are ordered by natural sort order using Python's natsort library.
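Before passing your own ensembles to seq_des_ensemble.sh, it can help to check that the directory matches the layout above. The following Python sketch does that under stated assumptions: check_ensemble_dir is a hypothetical helper (not part of Caliby), and it only checks filenames, not file contents.

```python
import tempfile
from pathlib import Path

STRUCT_SUFFIXES = (".pdb", ".cif")


def check_ensemble_dir(top_dir):
    """Map each <PDB_KEY> subdirectory to its number of additional conformers.

    Raises FileNotFoundError if a subdirectory has no <PDB_KEY>.pdb or
    <PDB_KEY>.cif primary conformer. Hypothetical helper, not Caliby code.
    """
    counts = {}
    for sub in sorted(p for p in Path(top_dir).iterdir() if p.is_dir()):
        key = sub.name
        primary = next(
            (sub / f"{key}{s}" for s in STRUCT_SUFFIXES if (sub / f"{key}{s}").is_file()),
            None,
        )
        if primary is None:
            raise FileNotFoundError(f"{sub}: no {key}.pdb or {key}.cif found")
        # Every other structure file in the subdirectory counts as a conformer.
        counts[key] = sum(
            1 for p in sub.iterdir() if p.suffix in STRUCT_SUFFIXES and p != primary
        )
    return counts


# Demo: one ensemble named "1abc" with a primary and two extra conformers.
with tempfile.TemporaryDirectory() as tmp:
    ens = Path(tmp) / "1abc"
    ens.mkdir()
    for fname in ("1abc.pdb", "sample_0.pdb", "sample_1.pdb"):
        (ens / fname).touch()
    print(check_ensemble_dir(tmp))
```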

The sequence design script takes a max_num_conformers argument (default: 32) that sets the maximum number of conformers to include in the ensemble. The primary conformer is always included. Additional conformers are then added, in order, until the ensemble contains max_num_conformers conformers or no conformers remain. We recommend using 32 conformers, but 8 or 16 conformers can also give good results.
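The selection rule described above can be sketched in a few lines of Python. This is an illustration rather than Caliby's implementation: select_conformers is a hypothetical name, and natural_key is a standard-library approximation of natsort's ordering that may differ from natsort in edge cases.

```python
import re


def natural_key(name):
    """Split digit runs from text so 'sample_10' sorts after 'sample_2'.

    Approximates natsort with the standard library only (an assumption).
    """
    return [int(tok) if tok.isdigit() else tok.lower() for tok in re.split(r"(\d+)", name)]


def select_conformers(primary, others, max_num_conformers=32):
    """Primary conformer first, then additional conformers in natural order,
    truncated so the ensemble has at most max_num_conformers members."""
    ordered = sorted(others, key=natural_key)
    return [primary] + ordered[: max(0, max_num_conformers - 1)]


files = ["sample_10.pdb", "sample_2.pdb", "sample_1.pdb"]
print(select_conformers("1abc.pdb", files, max_num_conformers=3))
```

With max_num_conformers=3, the example keeps the primary conformer plus sample_1.pdb and sample_2.pdb, and drops sample_10.pdb because it comes last in natural order.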

Scoring

All sequence design scripts automatically save the global energy of the sequences they design in the resulting seq_des_outputs.csv file. Additionally, Caliby can score sequences on a set of input PDBs via examples/scripts/score.sh. This will produce a score_outputs.csv file that contains a column U, which is the energy computed from the Potts model.
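As a small example of working with the scoring output, the Python sketch below sorts rows of a score_outputs.csv-style file by the U column. Only the U column is documented above; the name column and the values in the demo are made up for illustration, and treating lower energy as more favorable is an assumption based on the usual Potts model convention.

```python
import csv
import io


def rank_by_energy(handle):
    """Return CSV rows sorted by ascending Potts energy (column 'U').

    ASSUMPTION: lower U is more favorable; confirm against the preprint.
    """
    rows = list(csv.DictReader(handle))
    return sorted(rows, key=lambda row: float(row["U"]))


# Demo with in-memory data; 'name' is a hypothetical column and the
# energies are invented. For a real file, pass open("score_outputs.csv").
demo = io.StringIO("name,U\ndesign_a,-3.5\ndesign_b,-10.0\ndesign_c,1.25\n")
for row in rank_by_energy(demo):
    print(row["name"], row["U"])
```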

You can also score a sequence against an ensemble of backbones via examples/scripts/score_ensemble.sh, where the ensembles should be provided in the same format as described in the previous section. When scoring a sequence against an ensemble, the sequence corresponding to the primary conformer will be scored, and the sequences of the additional conformers will be ignored.

Multichain scoring: When providing sequences for multichain structures in score_inputs_csv, chains should be separa
