ProtHGT
A Heterogeneous Graph Transformer (HGT)-based model for protein function prediction using biological knowledge graphs and protein language models
ProtHGT: Heterogeneous Graph Transformers for Automated Protein Function Prediction Using Knowledge Graphs and Language Models
The rapid accumulation of protein sequence data, coupled with the slow pace of experimental annotations, creates a critical need for computational methods to predict protein functions. Existing models often rely on limited data types, such as sequence-based features or protein-protein interactions (PPIs), failing to capture the complex molecular relationships in biological systems. To address this, we developed ProtHGT, a heterogeneous graph transformer-based model that integrates diverse biological datasets into a unified framework using knowledge graphs for accurate and interpretable protein function prediction. ProtHGT achieves state-of-the-art performance on benchmark datasets, outperforming current graph-based and sequence-based approaches. By leveraging diverse biological entity types and highly representative protein language model embeddings at the input level, the model effectively learns complex biological relationships, enabling accurate predictions across all Gene Ontology (GO) sub-ontologies. Ablation analyses highlight the critical role of heterogeneous data integration in achieving robust predictions. Finally, our use-case study indicates that ProtHGT's predictions can be interpreted by exploring the related parts of the input biological knowledge graph, offering plausible explanations to build or test new hypotheses.
Figure: Schematic representation of the ProtHGT framework. a) Diverse biological datasets, including proteins, pathways, domains, and GO terms, are integrated into a unified knowledge graph; b) the heterogeneous graph is constructed, capturing multi-relational biological associations; c) feature vectors for each node type are generated using state-of-the-art embedding methods; d) protein function prediction models are trained separately for molecular function, biological process, and cellular component sub-ontologies; e) heterogeneous graph transformer (HGT) layers process and refine node representations through multi-relational message passing. Final protein function predictions are obtained by linking proteins to GO terms based on learned embeddings and attention-weighted relationships.
Content <!-- omit in toc -->
- The Architecture of ProtHGT
- Repository Structure
- Getting Started
- Training the ProtHGT Model
- Making Predictions
- Publication
- License
The Architecture of ProtHGT
ProtHGT builds upon the Heterogeneous Graph Transformer (HGT) architecture, consisting of multiple stacked transformer layers to refine node embeddings while preserving node-type and edge-type diversity.
1. Input Feature Transformation
Each node type (e.g., Protein, GO Term) is projected into a shared hidden space using independent linear transformations:
```python
self.lin_dict = torch.nn.ModuleDict({
    node_type: Linear(data.x_dict[node_type].size(-1), hidden_channels)
    for node_type in data.node_types
})
```
This ensures that different biological entities have their own representation before message passing.
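As a standalone sketch of this projection step, the snippet below maps two node types with different input dimensionalities into a shared hidden space (the feature sizes and node-type names here are illustrative, not the actual graph dimensions):

```python
import torch
from torch.nn import Linear, ModuleDict

# Hypothetical input feature sizes: e.g., 768-d protein language model
# embeddings and 200-d GO-term embeddings.
feature_dims = {"Protein": 768, "GO_term_F": 200}
hidden_channels = 256

# One independent linear layer per node type, as in ProtHGT's lin_dict.
lin_dict = ModuleDict({
    node_type: Linear(in_dim, hidden_channels)
    for node_type, in_dim in feature_dims.items()
})

# Toy feature matrices: 4 proteins and 3 GO terms.
x_dict = {
    "Protein": torch.randn(4, 768),
    "GO_term_F": torch.randn(3, 200),
}

# Each node type is projected separately into the shared hidden space.
h_dict = {nt: lin_dict[nt](x).relu() for nt, x in x_dict.items()}
print({nt: tuple(h.shape) for nt, h in h_dict.items()})
```

After this step, all node types share the same embedding width, which is what allows the subsequent HGT layers to exchange messages across types.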
2. Heterogeneous Graph Transformer Layers
ProtHGT applies multiple HGT layers (HGTConv implementation from PyTorch Geometric) to propagate information across node types using multi-head attention:
```python
self.convs = torch.nn.ModuleList()
for _ in range(num_layers):
    conv = HGTConv(hidden_channels, hidden_channels, data.metadata(), num_heads, group='sum')
    self.convs.append(conv)
```
3. Protein Function Prediction (Link Prediction)
ProtHGT models protein function prediction as a link prediction task between Protein and GO Term nodes. The final embeddings of the two nodes are concatenated and passed through an MLP for classification:
```python
row, col = tr_edge_label_index
z = torch.cat([x_dict["Protein"][row], x_dict[target_type][col]], dim=-1)
return self.mlp(z).view(-1), x_dict
```
The MLP predicts the probability of a functional association between a protein and a GO term.
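A self-contained sketch of this scoring step is shown below; the MLP architecture and hidden sizes here are placeholders (the real ones come from the configuration files), and the embeddings are random stand-ins for the HGT outputs:

```python
import torch

hidden = 16

# Hypothetical final embeddings after the HGT layers: 5 proteins, 7 GO terms.
x_dict = {
    "Protein": torch.randn(5, hidden),
    "GO_term_F": torch.randn(7, hidden),
}

# Candidate (protein, GO term) pairs to score: rows are protein indices,
# columns are GO-term indices.
edge_label_index = torch.tensor([[0, 1, 4],
                                 [2, 2, 6]])

# A small classification head over the concatenated pair embedding.
mlp = torch.nn.Sequential(
    torch.nn.Linear(2 * hidden, hidden),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden, 1),
)

row, col = edge_label_index
z = torch.cat([x_dict["Protein"][row], x_dict["GO_term_F"][col]], dim=-1)
logits = mlp(z).view(-1)
probs = torch.sigmoid(logits)  # one association probability per candidate pair
print(probs.shape)
```

Each candidate edge thus receives an independent probability, which is how the model scores thousands of protein-GO term pairs in a single forward pass.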
Repository Structure
- data/: Contains the knowledge graph data required for training and evaluating ProtHGT. Data files can be downloaded from Hugging Face and must be placed in this directory for `train.py` and `predict.py` to function properly. For more details, please refer to the data/README.md.
- models/: Contains trained models for each Gene Ontology (GO) category. It includes models trained with the default TAPE protein embeddings as well as alternative protein representations (e.g., ESM2, ProtT5). All models are trained on their own corresponding knowledge graph (KG) datasets, so be sure to select the model that matches the protein embedding type you intend to use.
- configs/: Contains configuration files specifying optimized model hyperparameters and training settings. There are configuration files both for TAPE-based KG datasets (optimized for those embeddings) and for datasets using alternative protein embeddings (e.g., ESM2, ProtT5). Select the configuration that matches the embedding type of your chosen model.
- src/: Main source code directory
  - `model.py`: Implementation of the ProtHGT architecture
  - `train.py`: Script for training the model
  - `predict.py`: Script for generating predictions
  - `utils.py`: Helper functions and utilities
  - `data_loader.py`: Helper functions for data loading and preprocessing
- requirements.txt: Lists all Python package dependencies
Getting Started
We highly recommend using conda to install the dependencies properly. After installing the appropriate conda version for your operating system, create and activate a conda environment as below:
```bash
conda create -n prothgt
conda activate prothgt
```
Then, install the dependencies using the requirements.txt file:
```bash
pip install -r requirements.txt
```
Training the ProtHGT Model
For training the ProtHGT model, run the train.py script with the following example command:
```bash
python train.py --train-data ../data/prothgt-train-graph.pt --val-data ../data/prothgt-val-graph.pt --test-data ../data/prothgt-test-graph.pt --target-type GO_term_F --config ../configs/prothgt-config-molecular-function.yaml
```
Arguments:
- `--train-data`: Path to the training data file
- `--val-data`: Path to the validation data file
- `--test-data`: Path to the test data file
- `--target-type`: Target prediction type. One of `GO_term_F` (molecular function), `GO_term_P` (biological process), or `GO_term_C` (cellular component).
- `--config`: Path to the configuration file. You can use your own or select from the optimized hyperparameter configurations in the `configs/` directory.
- `--output-dir`: Path to the output directory. Default is `../outputs`.
- `--checkpoint-dir`: Path to the checkpoint directory. Default is None.
- `--num-workers`: Number of workers for data loading. Default is 2.
Before running the training script, make sure that the data files are correctly placed in the data/ directory.
Making Predictions
To generate function predictions for a given protein list using ProtHGT, you can either use our web service (ProtHGT Web-Service) or run the predict.py script with the following example command. This script uses the pre-trained ProtHGT models available in the models/ directory, currently trained with TAPE embeddings as the default protein representation. Models trained with alternative protein embeddings will be provided in future releases.
```bash
python predict.py --protein_ids ../data/example_protein_ids.txt --protein_embedding tape --go_category all
```
Arguments:
- `--protein_ids`: Either a text file containing a list of protein IDs or a comma-separated string of protein IDs.
- `--protein_embedding`: Protein embedding to use. One of `tape`, `prott5`, or `esm2`.
- `--go_category`: GO category to predict. One of `all`, `molecular_function`, `biological_process`, or `cellular_component`.
- `--output_dir`: Path to the output directory. Default is `../predictions`.
- `--batch_size`: Number of proteins to process in each batch. Default is 100.
- `--threshold`: Threshold for filtering predictions. Default is 0.0.
- `--top_k`: Keep only the top-k GO terms per protein (0 = keep all). Default is 0.
The output file is a CSV file containing the following columns:
- `Protein`: UniProt ID
- `GO_term`: GO term ID
- `GO_category`: GO term category, either `Molecular Function`, `Biological Process`, or `Cellular Component`
- `Probability`: Probability of the prediction
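For downstream analysis, the output can be loaded with pandas and filtered by probability. The snippet below uses an inline example table in the same column format (the rows and probability values are illustrative, not real predictions):

```python
import io
import pandas as pd

# Example rows in the output format described above (values are made up).
csv_text = """Protein,GO_term,GO_category,Probability
P12345,GO:0003677,Molecular Function,0.91
P12345,GO:0006355,Biological Process,0.42
Q67890,GO:0005634,Cellular Component,0.78
"""

df = pd.read_csv(io.StringIO(csv_text))

# Keep only confident predictions, e.g. probability >= 0.5.
confident = df[df["Probability"] >= 0.5]
print(confident["GO_term"].tolist())
```

The same filtering can be done at prediction time with the `--threshold` and `--top_k` arguments instead.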
Note: Currently, ProtHGT can only generate predictions for proteins that exist in our knowledge graph, which includes over 300,000 UniProtKB/Swiss-Prot proteins. To enable predictions for novel proteins from their sequences, we are developing a real-time data retrieval system that dynamically fetches relational data from external sources (e.g., STRING, Reactome) and constructs a customized knowledge graph for inference. This system will allow ProtHGT to predict functions for previously unseen proteins.
