ProteinWorkshop
Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
Protein Workshop
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>

This repository provides the code for the protein structure representation learning benchmark detailed in the paper Evaluating Representation Learning on the Protein Structure Universe (ICLR 2024).
In the benchmark, we implement numerous featurisation schemes, datasets for self-supervised pre-training and downstream evaluation, pre-training tasks, and auxiliary tasks.
The benchmark can be used as a working template for a protein representation learning research project, a library of drop-in components for use in your projects, or as a CLI tool for quickly running protein representation learning evaluation and pre-training configurations.
Processed datasets and pre-trained weights are made available. Downloading datasets is not required; upon first run all datasets will be downloaded and processed from their respective source.
Configuration files to run the experiments described in the manuscript are provided in the proteinworkshop/config/sweeps/ directory.
Installation
Below, we outline how to set up a virtual environment for proteinworkshop. These installation instructions currently target Linux-like systems with NVIDIA CUDA support; Windows and macOS are not officially supported.
From PyPI
proteinworkshop is available for install from PyPI. This enables training of specific configurations via the CLI or using individual components from the benchmark, such as datasets, featurisers, or transforms, as drop-ins to other projects. Make sure to install PyTorch (specifically version 2.1.2 or newer) using its official pip installation instructions, with CUDA support as desired.
# install `proteinworkshop` from PyPI
pip install proteinworkshop
# install PyTorch Geometric using the (now-installed) CLI
workshop install pyg
# set a custom data directory for file downloads; otherwise, all data will be downloaded to `site-packages`
export DATA_PATH="where/you/want/data/" # e.g., `export DATA_PATH="proteinworkshop/data"`
However, for full exploration we recommend cloning the repository and building from source.
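To illustrate the DATA_PATH setting used in the PyPI instructions above, the following Python sketch reads the variable and creates the target directory if it does not exist yet. The helper name `resolve_data_dir` and its fallback value are hypothetical and not part of proteinworkshop; the package's own path handling may differ.

```python
import os
from pathlib import Path


def resolve_data_dir(default: str = "proteinworkshop/data") -> Path:
    """Return the directory named by DATA_PATH, or a fallback (hypothetical helper)."""
    # DATA_PATH is the environment variable exported in the commands above.
    raw = os.environ.get("DATA_PATH", default)
    data_dir = Path(raw).expanduser()
    # Create the directory up front so that downloads have somewhere to land.
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir
```

Calling `resolve_data_dir()` before a run is one way to confirm that the variable is set in the current shell and that the location is writable.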
Building from source
With a local virtual environment activated (e.g., one created with conda create -n proteinworkshop python=3.10):
- Clone and install the project:
  git clone https://github.com/a-r-j/ProteinWorkshop
  cd ProteinWorkshop
  pip install -e .
- Install PyTorch (specifically version 2.1.2 or newer) using its official pip installation instructions, with CUDA support as desired:
  # e.g., to install PyTorch with CUDA 11.8 support on Linux:
  pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118
- Then use the newly installed proteinworkshop CLI to install PyTorch Geometric:
  workshop install pyg
- Configure paths in .env (optional, will override default paths if set). See .env.example for an example.
- Download PDB data:
  python proteinworkshop/scripts/download_pdb_mmtf.py
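The steps above require PyTorch 2.1.2 or newer. As a quick sanity check before installing PyTorch Geometric, the sketch below compares the installed `torch` version against that minimum using only the standard library. The function names are hypothetical and the version parsing is deliberately simple (it ignores local suffixes such as `+cu118` and pre-release tags); it is not how proteinworkshop itself validates its environment.

```python
import re
from importlib import metadata

MINIMUM_TORCH = (2, 1, 2)  # minimum version stated in the installation steps


def parse_version(version: str) -> tuple:
    """Turn a string such as '2.1.2+cu118' into (2, 1, 2)."""
    core = version.split("+", 1)[0]  # drop a local suffix like '+cu118'
    numbers = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    numbers += [0] * (3 - len(numbers))  # pad '2.2' to (2, 2, 0)
    return tuple(numbers)


def torch_is_recent_enough(minimum: tuple = MINIMUM_TORCH) -> bool:
    """Return True if torch is installed and at least `minimum`."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return False  # torch is not installed in this environment
    return parse_version(installed) >= minimum
```

If `torch_is_recent_enough()` returns False, revisit the PyTorch installation step before running `workshop install pyg`.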
Tutorials
We provide a five-part tutorial series of Jupyter notebooks with examples of how to use and extend proteinworkshop, as outlined below.
- Training a new model
- Customizing an existing dataset
- Adding a new dataset
- Adding a new model
- Adding a new task
Quickstart
Downloading datasets
Datasets can either be built from the source structures or downloaded from Zenodo. Datasets will be built from source the first time a dataset is used in a run (or by calling the appropriate setup() method in the corresponding datamodule). We provide a CLI tool for downloading datasets:
workshop download <DATASET_NAME>
workshop download pdb
workshop download cath
workshop download afdb_rep_v4
# etc..
If you wish to build datasets from source, we recommend first downloading the entire PDB (in MMTF format, c. 24 Gb) so that shared PDB data can be reused as much as possible:
workshop download pdb
# or
python proteinworkshop/scripts/download_pdb_mmtf.py
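When several datasets are needed, the download commands above can be scripted. The following sketch assembles one `workshop download <DATASET_NAME>` invocation per dataset and, unless run in dry-run mode, executes them in order. The function names and the `dry_run` flag are hypothetical conveniences; only the `workshop download` command and the three dataset names come from the examples above.

```python
import shutil
import subprocess

DATASETS = ["pdb", "cath", "afdb_rep_v4"]  # dataset names shown above


def download_commands(names):
    """Build one `workshop download <name>` argument list per dataset."""
    return [["workshop", "download", name] for name in names]


def run_downloads(names, dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    if not dry_run and shutil.which("workshop") is None:
        raise RuntimeError("`workshop` CLI not found on PATH; install proteinworkshop first.")
    for command in download_commands(names):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failed download.
            subprocess.run(command, check=True)


run_downloads(DATASETS)  # dry run: only prints the commands
```

Listing `pdb` first mirrors the recommendation to fetch the shared PDB data before building other datasets from source.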
Training a model
Launching an experiment minimally requires specification of a dataset, structural encoder, and task (devices can be specified with trainer=cpu/gpu):
workshop train dataset=cath encoder=egnn task=inverse_folding trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding trainer=cpu # or trainer=gpu
This command uses the default configuration in proteinworkshop/config/train.yaml, whose values can be overridden by passing equivalently named options on the command line. For instance, you can use a different input featurisation using the features option, or set the display name of your experiment on wandb using the name option:
workshop train dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu # or trainer=gpu
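Because every option is passed as a `key=value` override, a training command can also be assembled programmatically, which is convenient when launching several configurations from a script. The sketch below builds the argument list for the example above; `build_train_command` is a hypothetical helper, and the option names are simply those used in this section.

```python
import shlex


def build_train_command(overrides: dict) -> list:
    """Assemble `workshop train key=value ...` as an argument list."""
    command = ["workshop", "train"]
    for key, value in overrides.items():
        command.append(f"{key}={value}")
    return command


overrides = {
    "dataset": "cath",
    "encoder": "egnn",
    "task": "inverse_folding",
    "features": "ca_bb",
    "name": "MY-EXPT-NAME",
    "trainer": "cpu",
}
command = build_train_command(overrides)
# Show the command exactly as it would be typed in a shell.
print(shlex.join(command))
```

To launch the run rather than print it, the list can be passed to `subprocess.run(command, check=True)`.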
Finetuning a model
Finetuning a model additionally requires specification of a checkpoint.
workshop finetune dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/finetune.py dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu # or trainer=gpu
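A missing or mistyped checkpoint path is an easy mistake to make when finetuning. As a minimal sketch, the hypothetical helper below verifies that the checkpoint file exists before assembling the `workshop finetune` command with the same `key=value` overrides shown above; it is not part of proteinworkshop.

```python
from pathlib import Path


def build_finetune_command(ckpt_path: str, **overrides) -> list:
    """Assemble `workshop finetune ...` after checking the checkpoint exists."""
    checkpoint = Path(ckpt_path)
    if not checkpoint.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {checkpoint}")
    command = ["workshop", "finetune", f"ckpt_path={checkpoint}"]
    # Remaining options are passed through unchanged as key=value overrides.
    command += [f"{key}={value}" for key, value in overrides.items()]
    return command
```

For example, `build_finetune_command("PATH/TO/CHECKPOINT", dataset="cath", encoder="egnn", task="inverse_folding", trainer="cpu")` raises `FileNotFoundError` until the placeholder is replaced with a real file.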
Running a sweep/experiment
We can make use of the hydra wandb sweeper plugin to configure experiments as sweeps, allowing searches over hyperparameters, arc
