Lobster
LBSTER: Language models for Biological Sequence Transformation and Evolutionary Representation
LBSTER 🦞
lobster is a "batteries included" language model library for proteins and other biological sequences. Led by Nathan Frey, Karina Zadorozhny, Taylor Joren, Sidney Lisanza, Aya Abdelsalam Ismail, Joseph Kleinhenz, and Allen Goodman, with many valuable contributions from contributors across Prescient Design, Genentech.
This repository contains training code and access to pre-trained language models for biological sequence data.
Usage
<!--- image credit: Amy Wang -->
<p align="center"> <img src="https://raw.githubusercontent.com/prescient-design/lobster/refs/heads/main/assets/lobster.png" width=200px> </p>
<details open><summary><b>Table of contents</b></summary>

- Why you should use LBSTER
- Citations
- Install instructions
- Models
- Notebooks
- MCP Server
- Training and inference
- Reinforcement Learning with UME
- Contributing
</details>
Why you should use LBSTER <a name="why-use"></a>
- LBSTER is built for pre-training models quickly from scratch. It is "batteries included." This is most useful if you need to control the pre-training data mixture and embedding space, or want to experiment with novel pre-training objectives and fine-tuning strategies.
- LBSTER is a living, open-source library that will be periodically updated with new code and pre-trained models from the Frey Lab at Prescient Design, Genentech. The Frey Lab works on real therapeutic molecule design problems and LBSTER models and capabilities reflect the demands of real-world drug discovery campaigns.
- LBSTER is built with beignet, a standard library for biological research, and integrated with cortex, a modular framework for multitask modeling, guided generation, and multi-modal models.
- LBSTER supports concepts; we have a concept-bottleneck protein language model, CB-LBSTER, which supports 718 concepts.
Citations <a name="citations"></a>
If you use the code and/or models, please cite the relevant papers.
For the lbster code base cite: Cramming Protein Language Model Training in 24 GPU Hours
@article{Frey2024.05.14.594108,
author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
title = {Cramming Protein Language Model Training in 24 GPU Hours},
elocation-id = {2024.05.14.594108},
year = {2024},
doi = {10.1101/2024.05.14.594108},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108},
eprint = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108.full.pdf},
journal = {bioRxiv}
}
For the cb-lbster code base cite: Concept Bottleneck Language Models for Protein Design
@article{ismail2024conceptbottlenecklanguagemodels,
title={Concept Bottleneck Language Models For protein design},
author={Aya Abdelsalam Ismail and Tuomas Oikarinen and Amy Wang and Julius Adebayo and Samuel Stanton and Taylor Joren and Joseph Kleinhenz and Allen Goodman and Héctor Corrada Bravo and Kyunghyun Cho and Nathan C. Frey},
year={2024},
eprint={2411.06090},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2411.06090},
}
Install <a name="install"></a>
Install uv and run
uv sync
Optional dependency groups
For different optional dependencies, run
uv sync --extra <group name 1> --extra <group name 2>
where <group name i> can be one of
- struct-gpu, struct-cpu for Latent Generator dependencies for GPU or CPU, respectively
- mgm for UME dependencies
- flash for flash-attention on GPU
- mcp for MCP servers
- trl for transformer reinforcement learning
Recommended installation of all optional dependencies on a CPU:
uv sync --extra mgm --extra mcp --extra struct-cpu --extra trl
Recommended installation of all optional dependencies on a GPU:
uv sync --extra mgm --extra mcp --extra struct-gpu --extra trl
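Because the extras combine freely, the command can also be assembled programmatically. The following is a minimal sketch and not part of lobster: `build_uv_sync_command` is a hypothetical helper, and its set of group names is copied from the list above.

```python
# Hypothetical helper (not part of lobster): builds a `uv sync` command
# from the optional dependency groups named in this README.
KNOWN_GROUPS = {"struct-gpu", "struct-cpu", "mgm", "flash", "mcp", "trl"}


def build_uv_sync_command(groups):
    """Return the `uv sync` command line that installs the given extras."""
    unknown = [g for g in groups if g not in KNOWN_GROUPS]
    if unknown:
        raise ValueError(f"unknown dependency group(s): {unknown}")
    parts = ["uv", "sync"]
    for group in groups:
        # Each extra is passed with its own --extra flag.
        parts += ["--extra", group]
    return " ".join(parts)


print(build_uv_sync_command(["mgm", "mcp", "struct-cpu", "trl"]))
# prints: uv sync --extra mgm --extra mcp --extra struct-cpu --extra trl
```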
To use the environment, you can either activate it...
source .venv/bin/activate
python -c "import lobster"
lobster_train data.path_to_fasta="test_data/query.fasta"
... or run with uv run:
uv run python -c "import lobster"
uv run lobster_train data.path_to_fasta="test_data/query.fasta"
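The `data.path_to_fasta` override points the training entry point at a FASTA file. To try it on your own sequences, a small input file can be written with the standard library. This is a minimal sketch, assuming `lobster_train` accepts plain FASTA (a `>` header line followed by the sequence). `write_fasta`, the record names, and the sequences are illustrative and not part of lobster.

```python
import tempfile
from pathlib import Path


def write_fasta(records, path, line_width=60):
    """Write (name, sequence) pairs to `path` in FASTA format."""
    lines = []
    for name, sequence in records:
        lines.append(f">{name}")
        # Wrap long sequences so each line holds at most `line_width` residues.
        for start in range(0, len(sequence), line_width):
            lines.append(sequence[start:start + line_width])
    Path(path).write_text("\n".join(lines) + "\n")
    return Path(path)


# Illustrative records; replace with real sequences.
records = [
    ("example_1", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("example_2", "GSHMLEDPVDAFKLLERARSG"),
]
fasta_path = write_fasta(records, Path(tempfile.mkdtemp()) / "my_sequences.fasta")
print(fasta_path.read_text())
```

The resulting path could then be passed as the `data.path_to_fasta` value in the commands above.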
Main models you should use <a name="main-models"></a>
Pretrained Models
Masked LMs
| Shorthand | #params | Dataset | Description | Model checkpoint |
|---------|------------|---------|------------------------------------------------------------|-------------|
| Lobster_24M | 24 M | uniref50 | 24M parameter protein Masked LLM trained on uniref50 | lobster_24M |
| Lobster_150M | 150 M | uniref50 | 150M parameter protein Masked LLM trained on uniref50 | lobster_150M |
CB LMs
| Shorthand | #params | Dataset | Description | Model checkpoint |
|---------|------------|---------|------------------------------------------------------------|-------------|
| cb_Lobster_24M | 24 M | uniref50+SwissProt | 24M parameter protein concept bottleneck model with 718 concepts | cb_lobster_24M |
| cb_Lobster_150M | 150 M | uniref50+SwissProt | 150M parameter protein concept bottleneck model with 718 concepts | cb_lobster_150M |
| cb_Lobster_650M | 650 M | uniref50+SwissProt | 650M parameter protein concept bottleneck model with 718 concepts | cb_lobster_650M |
| cb_Lobster_3B | 3 B | uniref50+SwissProt | 3B parameter protein concept bottleneck model with 718 concepts | cb_lobster_3B |
Loading a pre-trained model
from lobster.model import LobsterPMLM, LobsterPCLM, LobsterCBMPMLM

# Load pre-trained models from their checkpoint identifiers
masked_language_model = LobsterPMLM("asalam91/lobster_24M")
concept_bottleneck_masked_language_model = LobsterCBMPMLM("asalam91/cb_lobster_24M")
# Replace the placeholder string with the path to your checkpoint file
causal_language_model = LobsterPCLM.load_from_checkpoint("<path to ckpt>")
3D, cDNA, and dynamic models use the same classes.
Models
- LobsterPMLM: masked language model (BERT-style encoder-only architecture)
- LobsterCBMPMLM: concept bottleneck masked language model (BERT-style encoder-only architecture with a concept bottleneck and a linear decoder)
- LobsterPCLM: causal language model (Llama-style decoder-only architecture)
- LobsterPLMFold: structure prediction language models (pre-trained encoder + structure head)
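The difference between the encoder-only and decoder-only models above comes down to the training objective. The sketch below illustrates the two objectives on a toy sequence at the character level. It is a simplified illustration, not lobster's implementation: `mask_for_mlm`, `causal_pairs`, the `<mask>` token string, and the 15% masking rate are assumptions made for the example.

```python
import random

MASK = "<mask>"  # assumed placeholder token, not lobster's tokenizer vocabulary


def mask_for_mlm(sequence, mask_rate=0.15, seed=0):
    """Masked objective: hide some residues and predict them from both sides."""
    if not sequence:
        return [], {}
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    inputs = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = sequence[pos]  # residue the model must recover
        inputs[pos] = MASK
    return inputs, targets


def causal_pairs(sequence):
    """Causal objective: predict each residue from the prefix before it."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


sequence = "MKTAYIAKQR"
inputs, targets = mask_for_mlm(sequence)
print(inputs, targets)
print(causal_pairs(sequence)[:3])
```

In the masked case the model sees context on both sides of each hidden position, while in the causal case it only ever sees the prefix.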
Notebooks <a name="notebooks"></a>
Representation learning
Check out this Jupyter notebook tutorial for an example of how to extract embedding representations from different models.
Concept Interventions
Check out this Jupyter notebook tutorial for an example of how to intervene on different concepts with our concept-bottleneck model class.
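Conceptually, a concept bottleneck routes the encoder representation through a layer of named concept activations before a linear decoder, so editing one activation changes the output in a traceable way. The toy sketch below illustrates that idea in plain Python with two made-up concepts and hand-picked weights. It is an illustration under stated assumptions, not the CB-LBSTER architecture or API: the concept names, the weights, and the `intervene` helper are all hypothetical.

```python
# Toy concept bottleneck: hidden state -> concept activations -> linear decoder.
# All names and numbers are hypothetical and chosen for readability.
CONCEPTS = ["hydrophobicity", "charge"]  # made-up concept names

# One row per concept, mapping a 3-dimensional hidden state to an activation.
W_CONCEPT = [
    [1.0, 0.0, -1.0],
    [0.0, 2.0, 0.0],
]
# One row per output logit, mapping concept activations to logits.
W_DECODER = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 1.0],
]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def concepts_from_hidden(hidden):
    """Project the hidden state onto the named concepts."""
    return dict(zip(CONCEPTS, matvec(W_CONCEPT, hidden)))


def decode(concepts):
    """Linear decoder that sees only the concept activations."""
    return matvec(W_DECODER, [concepts[name] for name in CONCEPTS])


def intervene(concepts, name, value):
    """Return a copy of the activations with one concept overwritten."""
    edited = dict(concepts)
    edited[name] = value
    return edited


hidden = [0.5, 1.0, 0.25]
concepts = concepts_from_hidden(hidden)
print(concepts)                                     # {'hydrophobicity': 0.25, 'charge': 2.0}
print(decode(concepts))                             # [0.25, 2.0, 1.75]
print(decode(intervene(concepts, "charge", 0.0)))   # [0.25, 0.0, -0.25]
```

Because the decoder sees only the concept activations, setting `charge` to zero changes exactly the logits that depend on that concept.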
MCP Integration <a name="mcp-integration"></a>
Lobster supports Model Context Protocol (MCP) for seamless integration with Claude Desktop, Cursor, and other AI tools:
# Install with MCP support
uv sync --extra mcp
# Setup Claude Desktop integration
uv run lobster_mcp_setup
Setup for Cursor
Option 1: One-Click Install (Recommended)
Click the button above to automatically add the Lobster MCP server to Cursor.
Requirements:
- Cursor installed
- [uv](https://docs.astral.sh/uv/
