ProteomeLM
ProteomeLM: A proteome-scale language model allowing fast prediction of protein-protein interactions and gene essentiality across taxa
Overview
ProteomeLM is a transformer-based language model that reasons on entire proteomes from species spanning the tree of life. Unlike existing protein language models that operate on individual sequences, ProteomeLM learns contextualized protein representations by leveraging the functional constraints present at the proteome scale.
Key Contributions
- Proteome-scale modeling: First language model to process entire proteomes across eukaryotes and prokaryotes, capturing inter-protein dependencies and functional constraints
- Ultra-fast PPI screening: Screens whole interactomes orders of magnitude faster than classic coevolution-based methods, enabling proteome-wide interaction analysis
- State-of-the-art performance: Achieves superior results on protein-protein interaction prediction across species and benchmarks through attention-based interaction detection
- Gene essentiality prediction: novel capability to predict essential genes that generalizes across diverse taxa
- Attention-based insights: Spontaneously captures protein-protein interactions in attention coefficients without explicit training on interaction data
- Hierarchical learning: Leverages OrthoDB taxonomic hierarchy for structured representation learning across the tree of life
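To make the attention-based interaction signal concrete, here is a minimal, generic sketch of how raw attention weights can be aggregated into a symmetric protein-pair score matrix. The tensor layout and the max-over-heads aggregation are illustrative assumptions, not ProteomeLM's actual internals.

```python
import numpy as np

def attention_to_ppi_scores(attn: np.ndarray) -> np.ndarray:
    """Aggregate attention weights into a symmetric protein-pair score matrix.

    attn: array of shape (n_heads, n_proteins, n_proteins), where
    attn[h, i, j] is the attention paid by protein i to protein j in head h.
    (Illustrative shapes, not ProteomeLM's internal layout.)
    """
    # Symmetrize each head: an interaction signal should not depend on direction.
    sym = 0.5 * (attn + attn.transpose(0, 2, 1))
    # Keep the strongest signal across heads for each pair.
    scores = sym.max(axis=0)
    # Zero the diagonal: self-attention is not an interaction.
    np.fill_diagonal(scores, 0.0)
    return scores

rng = np.random.default_rng(0)
attn = rng.random((4, 5, 5))          # 4 heads, 5 proteins
scores = attention_to_ppi_scores(attn)
assert scores.shape == (5, 5)
assert np.allclose(scores, scores.T)  # symmetric by construction
```

The resulting matrix can then be thresholded or fed to a downstream classifier to rank candidate interacting pairs.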
🚀 Quick Start
Installation
```bash
# Clone the repository
git clone https://github.com/Bitbol-Lab/ProteomeLM.git
cd ProteomeLM

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate   # Linux/macOS
# venv\Scripts\activate    # Windows

# Install dependencies
pip install -r requirements.txt
```
🤗 Pre-trained Models
All ProteomeLM models are available on Hugging Face Hub. Choose the appropriate model size for your use case:
| Model | Parameters | Size | Hugging Face | Description |
|-------|------------|------|--------------|-------------|
| ProteomeLM-XS | 5.66M | 11.3MB | Bitbol-Lab/ProteomeLM-XS | Ultra-lightweight for quick inference |
| ProteomeLM-S | 36.9M | 73.8MB | Bitbol-Lab/ProteomeLM-S | Small model balancing speed and accuracy |
| ProteomeLM-M | 112M | 225MB | Bitbol-Lab/ProteomeLM-M | Medium model for most applications (may not fit the largest proteomes) |
| ProteomeLM-L | 328M | 656MB | Bitbol-Lab/ProteomeLM-L | Large model for maximum performance (fits even the largest proteomes) |
Training Dataset
The training dataset is also available on Hugging Face:
- ProteomeLM-dataset: Preprocessed OrthoDB embeddings and hierarchical data
Repository Structure
```
ProteomeLM/
├── 📄 __init__.py                  # Package initialization
├── 📄 setup.py                     # Package setup script
├── 📋 requirements.txt             # Python dependencies
├── 📄 LICENSE                      # Apache 2.0 license
├── 📄 README.md                    # Project documentation
├── 📄 paper.pdf                    # Research paper
├── 🐳 Dockerfile                   # Container configuration
├── 📁 configs/                     # Training configuration files
│   └── proteomelm.yaml             # Base configuration
├── 📁 proteomelm/                  # Core model implementation
│   ├── __init__.py                 # Package initialization
│   ├── cli.py                      # Command-line interface
│   ├── config_manager.py           # Configuration management
│   ├── modeling_proteomelm.py      # ProteomeLM model architecture
│   ├── trainer.py                  # Custom training logic
│   ├── train.py                    # Training functions
│   ├── dataloaders.py              # Data loading utilities
│   ├── encode_dataset.py           # Dataset encoding
│   ├── utils.py                    # Utility functions
│   └── ppi/                        # PPI-specific components
│       ├── __init__.py             # Package initialization
│       ├── config.py               # PPI configuration
│       ├── data_processing.py      # Data preprocessing
│       ├── evaluation.py           # Performance evaluation
│       ├── experiment_runner.py    # Experiment management
│       ├── feature_extraction.py   # Feature engineering
│       ├── main.py                 # Main PPI runner
│       ├── model.py                # PPI models
│       └── utils.py                # PPI utilities
├── 📁 experiments/                 # Research experiments
│   ├── __init__.py                 # Package initialization
│   ├── fast_orthodb_matching.py    # Ortholog matching utilities
│   ├── nb_plots.ipynb              # Analysis notebook
│   └── interactomes/               # Interactome analysis
│       ├── human.ipynb             # Human interactome analysis
│       └── pathogens.ipynb         # Pathogen interactome analysis
├── 📁 notebooks/                   # Analysis notebooks
│   ├── ppi_prediction.ipynb        # PPI prediction notebook
│   └── notebooks_utils.py          # Notebook utilities
├── 📁 weights/                     # Pre-trained model weights
│   ├── ProteomeLM-XS/              # Extra-small model weights
│   ├── ProteomeLM-S/               # Small model weights
│   ├── ProteomeLM-M/               # Medium model weights
│   └── ProteomeLM-L/               # Large model weights
├── 📁 data/                        # Data storage
│   ├── interactomes/               # Interaction data
│   │   ├── logistic_regression_model_human.pkl
│   │   └── logistic_regression_model_pathogens.pkl
│   └── orthodb12_raw/              # OrthoDB raw data
│       ├── odb12v0_aa.fasta.gz     # Amino acid sequences
│       ├── odb12v0_OG2genes.tab    # Gene-ortholog mapping
│       └── odb12v0_OG_pairs.tab    # Ortholog pairs
└── 📁 img/                         # Documentation images
    └── main_fig.png                # Main figure
```
🔧 Usage
Quick Start: Fast PPI prediction
For interactive PPI prediction with multiple data sources, use our comprehensive Jupyter notebook:
```bash
# Launch the interactive PPI prediction notebook
jupyter notebook notebooks/ppi_prediction.ipynb
```
The notebook provides a flexible framework supporting:
Data Sources:
- Local FASTA files: Upload your own protein sequences
- STRING database: Download sequences by organism ID (e.g., "9606" for human)
- UniProt database: Download sequences by taxon ID
- UniProt IDs: Fetch specific protein sequences by accession
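As a standalone sketch of the last data source, the snippet below fetches FASTA records from UniProt's REST endpoint (`https://rest.uniprot.org/uniprotkb/<accession>.fasta`) and parses them; it is a generic illustration, not the notebook's actual code, and the helper names are our own.

```python
import urllib.request

UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{acc}.fasta"

def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a {header: sequence} dict."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        records[header] = "".join(chunks)
    return records

def fetch_uniprot_sequences(accessions):
    """Download sequences for a list of UniProt accessions (needs network)."""
    records = {}
    for acc in accessions:
        with urllib.request.urlopen(UNIPROT_FASTA_URL.format(acc=acc)) as resp:
            records.update(parse_fasta(resp.read().decode()))
    return records

# Offline demonstration of the parser on a toy record:
demo = parse_fasta(">sp|P00000|DEMO toy protein\nMKT\nAILV\n")
assert demo == {"sp|P00000|DEMO toy protein": "MKTAILV"}
```

For bulk downloads by organism or taxon ID, the notebook's STRING/UniProt options are the better route; this helper is for a handful of specific accessions.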
Key Features:
- Automated ProteomeLM feature extraction using attention mechanisms
- Pre-trained logistic regression models for PPI prediction
- STRING annotation comparison and evaluation
- Comprehensive visualization and analysis
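The repository ships pre-trained scikit-learn logistic regression models (see `data/interactomes/`). As a dependency-free stand-in with the same role, here is a plain-NumPy logistic regression fitted on a toy one-dimensional feature (e.g., an attention-derived pair score); the data and hyperparameters are illustrative.

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, n_iter=2000):
    """Minimal logistic regression via gradient descent on the log-loss."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)         # gradient w.r.t. weights
        grad_b = (p - y).mean()                 # gradient w.r.t. bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(np.asarray(X, dtype=float) @ w + b)))

# Toy example: a single feature that is higher for interacting pairs.
X = np.array([[0.1], [0.2], [0.15], [0.8], [0.9], [0.85]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = fit_logistic_regression(X, y)
probs = predict_proba(X, w, b)
assert probs[:3].max() < probs[3:].min()  # interacting pairs score higher
```

In practice you would load the shipped `.pkl` models with scikit-learn instead; this sketch only shows the shape of the classification step.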
Gene Essentiality Prediction
TODO
Training ProteomeLM
Train a new model from scratch or fine-tune existing weights:
```bash
# Using the CLI interface
python -m proteomelm.cli train --config configs/proteomelm.yaml

# Multi-GPU distributed training
torchrun --nproc_per_node=4 -m proteomelm.cli train \
    --config configs/proteomelm.yaml \
    --distributed

# Fine-tune from a Hugging Face model
python -m proteomelm.cli train --config configs/proteomelm.yaml --pretrained Bitbol-Lab/ProteomeLM-M
```
Docker Deployment
For containerized execution:
```bash
# Build the container
docker build -t proteomelm:latest .

# Run training
docker run --gpus all -v $(pwd):/workspace proteomelm:latest \
    python train.py --config configs/proteomelm.yaml
```
Loading Models
```python
# From Hugging Face Hub (recommended)
from proteomelm import ProteomeLMForMaskedLM

model_xs = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-XS")
model_s = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-S")
model_m = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-M")
model_l = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-L")

# From local weights (after git clone)
model = ProteomeLMForMaskedLM.from_pretrained("weights/ProteomeLM-M")
```
Citation
If you use ProteomeLM in your research, please cite our paper:
```bibtex
@article{malbranke2025proteomelm,
  title={ProteomeLM: A proteome-scale language model allowing fast prediction of protein-protein interactions and gene essentiality across taxa},
  author={Malbranke, Cyril and Zalaffi, Gionata Paolo and Bitbol, Anne-Florence},
  journal={bioRxiv},
  pages={2025.08.01.668221},
  year={2025},
  publisher={Cold Spring Harbor Laboratory},
  doi={10.1101/2025.08.01.668221},
  url={https://www.biorxiv.org/content/10.1101/2025.08.01.668221v1}
}
```
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Acknowledgments
- EvolutionaryScale