ProtProfileMD

Protein Language Modeling beyond static folds reveals sequence-encoded flexibility

Manuscript: https://www.biorxiv.org/content/10.64898/2026.01.21.700698v1

Abstract

Motivation: Proteins function through motion. Yet, most discoveries still commence with static representations of protein structures. Here, we investigated the feasibility of leveraging protein dynamics to improve homology detection.

Results: We introduce ProtProfileMD, a sequence-to-3D-probability model that predicts, from an amino acid sequence, a profile of discrete structural representations capturing protein dynamics. We applied supervised parameter-efficient finetuning of the ProstT5 protein Language Model (pLM) to predict per-residue distributions over Foldseek's 3Di alphabet derived from motions observed in molecular dynamics. This original result reveals that the 3Di tokens, despite being coarse-grained descriptors of 3D structure, still offer sufficient resolution to capture aspects of conformational changes. This is evidenced by a correlation between fluctuations in the 3D protein structure over the course of a molecular dynamics trajectory and the entropy of 3Di states. Based on this insight, we introduce a proof-of-concept for making remote homology detection of proteins more sensitive by leveraging a protein's distinctive dynamic fingerprint captured by our model. Our method recovers flexibility signals with a fidelity that is biologically relevant, improving search and complementing protein structure predictions, for example, by flagging flexible, disordered, or other functionally relevant regions.

Availability and Implementation: ProtProfileMD is available at github.com/finnlueth/ProtProfileMD. The associated training data and model weights are available at https://huggingface.co/datasets/finnlueth/ProtProfileMD and https://huggingface.co/finnlueth/ProtProfileMD.
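The abstract relates fluctuations along a molecular dynamics trajectory to the entropy of 3Di states. As a minimal sketch of that quantity, assuming a profile is given as one row of 20 probabilities per residue (one per 3Di state), the per-residue Shannon entropy could be computed as follows. The function name and the toy profile are hypothetical and not part of the ProtProfileMD code base; the manuscript may use a different logarithm base or normalization.

```python
import math


def per_residue_entropy(profile):
    """Return the Shannon entropy (in bits) of each row of a per-residue profile.

    `profile` is assumed to be a list of rows, each holding 20 non-negative
    weights (one per 3Di state). Rows are renormalized to sum to 1.
    """
    entropies = []
    for row in profile:
        total = sum(row)
        probs = [x / total for x in row]
        # Terms with p == 0 contribute nothing (0 * log 0 is taken as 0).
        entropies.append(-sum(p * math.log2(p) for p in probs if p > 0))
    return entropies


# Toy profile (made up): one residue locked in a single 3Di state,
# one residue spread uniformly over all 20 states.
toy_profile = [
    [1.0] + [0.0] * 19,
    [0.05] * 20,
]
toy_entropy = per_residue_entropy(toy_profile)
```

In this toy case the first residue has zero entropy and the second reaches the maximum for a 20-letter alphabet, log2(20) bits. Higher values would correspond to residues that visit more 3Di states.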

Figure: ProtProfileMD schematic

Setup

Initialize the virtual environment with all dependencies. The uv package manager is required for installation.

[!NOTE] A CUDA-capable device is recommended. CPU and MPS may work but are not supported.

uv sync

source .venv/bin/activate

Profile Generation

Predict FlexProfiles for any single-line FASTA file. The paths to the input FASTA file and the output TSV file are required. Optionally, set the batch size and resume from (append to) an existing TSV file.

python ./scripts/model_inference.py --input <INPUT AA FASTA FILE> --output <OUTPUT PROFILE TSV FILE> --resume_from_tsv True --batch_size 4
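If your FASTA file wraps sequences over several lines, it may need to be flattened first. The following is a minimal sketch, assuming "single-line" means that each record consists of one header line followed by exactly one sequence line. The helper `linearize_fasta` is hypothetical and not part of this repository.

```python
def linearize_fasta(text):
    """Join wrapped sequence lines so every record is a header plus one sequence line."""
    records = []  # list of [header, list of sequence chunks]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records.append([line, []])
        elif records:
            records[-1][1].append(line)  # sequence chunk of the current record

    out_lines = []
    for header, chunks in records:
        out_lines.append(header)
        out_lines.append("".join(chunks))
    return "\n".join(out_lines) + "\n" if out_lines else ""


# Toy input (made up): the first record is wrapped over two lines.
wrapped = ">seq1 example\nMKT\nAYI\n\n>seq2\nGSH\n"
single_line = linearize_fasta(wrapped)
```

To convert a file, read it with `open(path).read()`, pass the text through `linearize_fasta`, and write the result to a new file that is then given to `--input`.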

Foldseek Search with Profiles

To search with our profiles, the .tsv file must be converted into a Foldseek database. If you do not already have a 3Di FASTA, one can be created from the profiles. The amino acid FASTA and the 3Di FASTA are then converted into a sequence database, and the profiles together with that sequence database are converted into a profile database.

python ./src/protprofilemdanalysis/scripts-data/argmax_profiles.py --in_profile_path <INPUT PROFILE TSV FILE> --out_fasta_path <OUTPUT FASTA FILE> --subtract_background True

python ./src/protprofilemdanalysis/scripts-data/generate_foldseek_db.py <INPUT AA FASTA FILE> <INPUT 3Di FASTA FILE> <OUTPUT SEQUENCE DB> <TEMP DIR>

python ./src/protprofilemdanalysis/scripts-data/build_profiledb.py <INPUT PROFILE TSV> <INPUT SEQUENCE DB> <OUTPUT PROFILE DB>
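The first command above derives a 3Di FASTA from the profiles. To illustrate the idea only, here is a sketch of one plausible reading of "argmax with background subtraction": for each residue, subtract a background frequency from each state probability and keep the highest-scoring 3Di letter. This is an assumption and not a description of what `argmax_profiles.py` actually does. The column order of the profile, the background vector, and all names below are hypothetical.

```python
# Foldseek writes 3Di states with the 20 amino acid letters.
# ASSUMPTION: profile columns follow this alphabetical order.
THREE_DI_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def argmax_3di(profile, background=None):
    """Pick one 3Di letter per residue from rows of 20 state probabilities.

    If `background` (20 frequencies) is given, it is subtracted from each
    row before taking the argmax, so commonly occurring states are penalized.
    """
    letters = []
    for row in profile:
        if background is None:
            scores = list(row)
        else:
            scores = [p - b for p, b in zip(row, background)]
        best = max(range(len(scores)), key=scores.__getitem__)
        letters.append(THREE_DI_ALPHABET[best])
    return "".join(letters)


def make_row(**probs):
    """Build a 20-column row from keyword arguments, for example make_row(A=0.5)."""
    return [probs.get(letter, 0.0) for letter in THREE_DI_ALPHABET]


# Toy data (made up): two residues and a background dominated by A and V.
toy_profile = [make_row(A=0.5, D=0.4, V=0.1), make_row(V=0.9, A=0.1)]
toy_background = make_row(A=0.45, D=0.05, V=0.5)

plain = argmax_3di(toy_profile)
corrected = argmax_3di(toy_profile, toy_background)
```

In the toy data, the first residue switches from the most probable state to a state that is enriched relative to the background once the background is subtracted, while the second residue keeps its letter.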

Now, we can use the profile database with Foldseek search (and other commands) as we would use any other Foldseek database.

foldseek search <PROFILE DATABASE> <SEQUENCE DATABASE> <ALIGNMENT DIR> <TEMP DIR>

Example

This is an easy-to-use example based on an abridged version of the SCOPe database.

python ./scripts/model_inference.py --input ./example/input_aa.fasta --output ./example/output_profile.tsv --resume_from_tsv True --batch_size 8

python ./src/protprofilemdanalysis/scripts-data/argmax_profiles.py --in_profile_path ./example/output_profile.tsv --out_fasta_path ./example/output_3di.fasta --subtract_background True

python ./src/protprofilemdanalysis/scripts-data/generate_foldseek_db.py ./example/input_aa.fasta ./example/output_3di.fasta ./example/alignResults/db_sequence/foldseekDB ./example/alignResults/tmp

python ./src/protprofilemdanalysis/scripts-data/build_profiledb.py ./example/output_profile.tsv ./example/alignResults/db_sequence/foldseekDB ./example/alignResults/db_profile

foldseek search ./example/alignResults/db_profile/foldseekDB_profile ./example/alignResults/db_sequence/foldseekDB ./example/alignResults/aln ./example/alignResults/tmp -s 9.5 --max-seqs 2000 -e 10
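The example commands above can also be chained from Python so that the run stops at the first failing step. The sketch below only assembles the same command lines as in the example and hands them to `subprocess.run`. The helper names `build_commands` and `run_pipeline` are hypothetical, and the output layout, including the `foldseekDB_profile` name, simply mirrors the example paths rather than documented behaviour.

```python
import subprocess
from pathlib import Path

SCRIPTS = "./src/protprofilemdanalysis/scripts-data"


def build_commands(aa_fasta, out_dir, batch_size=8):
    """Return the five example commands as argument lists (nothing is executed)."""
    out = Path(out_dir)
    profile_tsv = str(out / "output_profile.tsv")
    three_di_fasta = str(out / "output_3di.fasta")
    seq_db = str(out / "alignResults" / "db_sequence" / "foldseekDB")
    profile_db_dir = out / "alignResults" / "db_profile"
    aln = str(out / "alignResults" / "aln")
    tmp = str(out / "alignResults" / "tmp")
    return [
        ["python", "./scripts/model_inference.py", "--input", str(aa_fasta),
         "--output", profile_tsv, "--resume_from_tsv", "True",
         "--batch_size", str(batch_size)],
        ["python", f"{SCRIPTS}/argmax_profiles.py", "--in_profile_path", profile_tsv,
         "--out_fasta_path", three_di_fasta, "--subtract_background", "True"],
        ["python", f"{SCRIPTS}/generate_foldseek_db.py", str(aa_fasta),
         three_di_fasta, seq_db, tmp],
        ["python", f"{SCRIPTS}/build_profiledb.py", profile_tsv, seq_db,
         str(profile_db_dir)],
        ["foldseek", "search", str(profile_db_dir / "foldseekDB_profile"), seq_db,
         aln, tmp, "-s", "9.5", "--max-seqs", "2000", "-e", "10"],
    ]


def run_pipeline(commands):
    """Run each command in order; check=True raises on the first non-zero exit."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_commands("./example/input_aa.fasta", "./example")
# To execute (requires the activated environment and foldseek on PATH):
# run_pipeline(commands)
```

Keeping command construction separate from execution makes it easy to print and inspect the commands before anything is run.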