ProtHGT

A Heterogeneous Graph Transformer (HGT)-based model for protein function prediction using biological knowledge graphs and protein language models

ProtHGT: Heterogeneous Graph Transformers for Automated Protein Function Prediction Using Knowledge Graphs and Language Models

The rapid accumulation of protein sequence data, coupled with the slow pace of experimental annotations, creates a critical need for computational methods to predict protein functions. Existing models often rely on limited data types, such as sequence-based features or protein-protein interactions (PPIs), failing to capture the complex molecular relationships in biological systems. To address this, we developed ProtHGT, a heterogeneous graph transformer-based model that integrates diverse biological datasets into a unified framework using knowledge graphs for accurate and interpretable protein function prediction. ProtHGT achieves state-of-the-art performance on benchmark datasets, demonstrating its ability to outperform current graph-based and sequence-based approaches. By leveraging diverse biological entity types and highly representative protein language model embeddings at the input level, the model effectively learns complex biological relationships, enabling accurate predictions across all Gene Ontology (GO) sub-ontologies. Ablation analyses highlight the critical role of heterogeneous data integration in achieving robust predictions. Finally, our use-case study indicates that ProtHGT’s predictions can be interpreted by exploring the related parts of the input biological knowledge graph, offering plausible explanations to build or test new hypotheses.

Figure: Schematic representation of the ProtHGT framework. a) Diverse biological datasets, including proteins, pathways, domains, and GO terms, are integrated into a unified knowledge graph; b) the heterogeneous graph is constructed, capturing multi-relational biological associations; c) feature vectors for each node type are generated using state-of-the-art embedding methods; d) protein function prediction models are trained separately for molecular function, biological process, and cellular component sub-ontologies; e) heterogeneous graph transformer (HGT) layers process and refine node representations through multi-relational message passing. Final protein function predictions are obtained by linking proteins to GO terms based on learned embeddings and attention-weighted relationships.

The Architecture of ProtHGT

ProtHGT builds upon the Heterogeneous Graph Transformer (HGT) architecture, consisting of multiple stacked transformer layers to refine node embeddings while preserving node-type and edge-type diversity.

1. Input Feature Transformation

Each node type (e.g., Protein, GO Term) is projected into a shared hidden space using independent linear transformations:

self.lin_dict = torch.nn.ModuleDict({
    node_type: Linear(data.x_dict[node_type].size(-1), hidden_channels)
    for node_type in data.node_types
})

This ensures that different biological entities have their own representation before message passing.
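
As a minimal, self-contained sketch of this step, the snippet below projects two node types with different feature sizes into one hidden size. The feature sizes (1024 and 200), the node counts, and `hidden_channels = 64` are illustrative assumptions rather than values from the ProtHGT configuration, and plain tensors stand in for `data.x_dict`.

```python
import torch
from torch.nn import Linear

torch.manual_seed(0)

# Toy stand-in for data.x_dict: feature sizes and node counts are assumptions.
x_dict = {
    "Protein": torch.randn(5, 1024),
    "GO_term_F": torch.randn(3, 200),
}
hidden_channels = 64  # illustrative value

# One independent linear layer per node type, mirroring lin_dict above.
lin_dict = torch.nn.ModuleDict({
    node_type: Linear(x.size(-1), hidden_channels)
    for node_type, x in x_dict.items()
})

# After projection, every node type has features of the same width.
h_dict = {node_type: lin_dict[node_type](x) for node_type, x in x_dict.items()}
for node_type, h in h_dict.items():
    print(node_type, tuple(h.shape))
```

Because each node type has its own weights, input features of different dimensionality can coexist in the graph while the later layers operate on a single hidden size.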

2. Heterogeneous Graph Transformer Layers

ProtHGT applies multiple HGT layers (HGTConv implementation from PyTorch Geometric) to propagate information across node types using multi-head attention:

self.convs = torch.nn.ModuleList()
for _ in range(num_layers):
    conv = HGTConv(hidden_channels, hidden_channels, data.metadata(), num_heads, group='sum')
    self.convs.append(conv)

3. Protein Function Prediction (Link Prediction)

ProtHGT models protein function prediction as a link prediction task between Protein and GO Term nodes. The final embeddings of the two nodes are concatenated and passed through an MLP for classification:

row, col = tr_edge_label_index
z = torch.cat([x_dict["Protein"][row], x_dict[target_type][col]], dim=-1)
return self.mlp(z).view(-1), x_dict

The MLP predicts the probability of a functional association between a protein and a GO term.
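
The decoding step can be tried in isolation with random embeddings. The sketch below assumes a two-layer MLP that outputs one logit per candidate pair, followed by a sigmoid; the actual MLP depth, its hidden size, and where the sigmoid is applied in ProtHGT may differ. `LinkDecoder` is a hypothetical name used only for this illustration.

```python
import torch
from torch import nn


class LinkDecoder(nn.Module):
    """Hypothetical decoder that scores (protein, GO term) pairs."""

    def __init__(self, hidden_channels: int):
        super().__init__()
        # The input is the concatenation of two node embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_channels, hidden_channels),
            nn.ReLU(),
            nn.Linear(hidden_channels, 1),
        )

    def forward(self, x_dict, edge_label_index, target_type):
        row, col = edge_label_index  # protein indices, GO term indices
        z = torch.cat([x_dict["Protein"][row], x_dict[target_type][col]], dim=-1)
        return self.mlp(z).view(-1)  # one logit per candidate edge


torch.manual_seed(0)
# Random embeddings stand in for the output of the HGT layers.
x_dict = {"Protein": torch.randn(4, 16), "GO_term_F": torch.randn(3, 16)}
# Three candidate links: (protein 0, term 2), (protein 1, term 0), (protein 3, term 1).
edge_label_index = torch.tensor([[0, 1, 3], [2, 0, 1]])

decoder = LinkDecoder(hidden_channels=16)
logits = decoder(x_dict, edge_label_index, "GO_term_F")
probs = torch.sigmoid(logits)  # assumed mapping from logits to probabilities
print(probs.shape)
```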

Repository Structure

  • data/: Contains the knowledge graph data required for training and evaluating ProtHGT.
    Data files can be downloaded from Hugging Face and must be placed in this directory for train.py and predict.py to function properly. For more details, please refer to the data/README.md.
  • models/: Contains trained models for each Gene Ontology (GO) category. It includes models trained with the default TAPE protein embeddings as well as alternative protein representations (e.g., ESM2, ProtT5). All models are trained on their own corresponding knowledge graph (KG) datasets—be sure to select the appropriate model that matches the protein embedding type you intend to use.
  • configs/: Contains configuration files specifying optimized model hyperparameters and training settings. There are configuration files both for TAPE-based KG datasets (optimized for those embeddings) and for datasets using alternative protein embeddings (e.g., ESM2, ProtT5). Select the configuration that matches the embedding type of your chosen model.
  • src/: Main source code directory
    • model.py: Implementation of the ProtHGT architecture
    • train.py: Script for training the model
    • predict.py: Script for generating predictions
    • utils.py: Helper functions and utilities
    • data_loader.py: Helper functions for data loading and preprocessing
  • requirements.txt: Lists all Python package dependencies

Getting Started

We highly recommend using conda to install the dependencies. After installing the conda version appropriate for your operating system, create and activate a conda environment as follows:

conda create -n prothgt
conda activate prothgt

Then, install the dependencies using the requirements.txt file:

pip install -r requirements.txt

Training the ProtHGT Model

To train the ProtHGT model, run the train.py script from the src/ directory, for example:

python train.py --train-data ../data/prothgt-train-graph.pt --val-data ../data/prothgt-val-graph.pt --test-data ../data/prothgt-test-graph.pt --target-type GO_term_F --config ../configs/prothgt-config-molecular-function.yaml

Arguments:

  • --train-data: Path to the training data file
  • --val-data: Path to the validation data file
  • --test-data: Path to the test data file
  • --target-type: Target prediction type. It can be one of the following: GO_term_F for molecular function, GO_term_P for biological process, and GO_term_C for cellular component.
  • --config: Path to the configuration file. You can use your own or select one of the optimized hyperparameter configurations in the configs/ directory.
  • --output-dir: Path to the output directory. Default is ../outputs.
  • --checkpoint-dir: Path to the checkpoint directory. Default is None.
  • --num-workers: Number of workers for data loading. Default is 2.

Before running the training script, make sure that the data files are correctly placed in the data/ directory.
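
A quick pre-flight check can catch a misplaced file before a training run starts. The sketch below only uses the file names that appear in the example command above; `missing_data_files` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path

# File names taken from the example training command; adjust if yours differ.
REQUIRED_FILES = (
    "prothgt-train-graph.pt",
    "prothgt-val-graph.pt",
    "prothgt-test-graph.pt",
)


def missing_data_files(data_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in required if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # "../data" matches the relative path used when running from src/.
    missing = missing_data_files("../data")
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All data files found.")
```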

Making Predictions

To generate function predictions for a given list of proteins, you can either use the ProtHGT Web-Service or run the predict.py script with the example command below. The script uses the pre-trained ProtHGT models available in the models/ directory, which are currently trained with TAPE embeddings as the default protein representation. Models trained with alternative protein embeddings will be provided in future releases.

python predict.py --protein_ids ../data/example_protein_ids.txt --protein_embedding tape --go_category all

Arguments:

  • --protein_ids: You can either provide a text file containing a list of protein IDs or a comma-separated string of protein IDs.
  • --protein_embedding: Protein embedding to use. It can be one of the following: tape, prott5, or esm2.
  • --go_category: GO category to predict. It can be one of the following: all, molecular_function, biological_process, or cellular_component.
  • --output_dir: Path to the output directory. Default is ../predictions.
  • --batch_size: Number of proteins to process in each batch. Default is 100.
  • --threshold: Threshold for filtering predictions. Default is 0.0.
  • --top_k: Keep only top-k GO terms per protein (0 = keep all). Default is 0.
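
To make the interaction of --threshold and --top_k concrete, here is a small sketch of one plausible filtering order: drop predictions below the threshold, then keep the k highest-scoring GO terms per protein. Whether predict.py treats the threshold as inclusive, and in which order it applies the two filters, is an assumption here. `filter_predictions` is a hypothetical helper, and the protein and GO identifiers are toy placeholders.

```python
from collections import defaultdict


def filter_predictions(rows, threshold=0.0, top_k=0):
    """Filter (protein, go_term, probability) tuples by threshold, then top-k."""
    per_protein = defaultdict(list)
    for protein, go_term, prob in rows:
        if prob >= threshold:  # assumed inclusive threshold
            per_protein[protein].append((protein, go_term, prob))

    kept = []
    for preds in per_protein.values():
        preds.sort(key=lambda r: r[2], reverse=True)  # highest probability first
        kept.extend(preds[:top_k] if top_k > 0 else preds)  # 0 = keep all
    return kept


# Toy placeholder rows, not real predictions.
rows = [
    ("P1", "GO:a", 0.91), ("P1", "GO:b", 0.40), ("P1", "GO:c", 0.75),
    ("P2", "GO:a", 0.10), ("P2", "GO:d", 0.66),
]
print(filter_predictions(rows, threshold=0.5, top_k=1))
# → [('P1', 'GO:a', 0.91), ('P2', 'GO:d', 0.66)]
```

With the documented defaults (threshold 0.0 and top_k 0), this sketch keeps every row and only reorders them by probability within each protein.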

The output is a CSV file with the following columns:

  • Protein: UniProt ID
  • GO_term: GO term ID
  • GO_category: GO term category. Either Molecular Function, Biological Process, or Cellular Component.
  • Probability: Probability of the prediction.
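
A result file with these columns can be post-processed with the standard library alone. The sketch below parses CSV text that uses the documented header and groups each protein's GO terms by category. The two data rows are made-up placeholders rather than real ProtHGT output, `group_by_protein` is a hypothetical helper, and no output file name is assumed.

```python
import csv
import io
from collections import defaultdict

# Placeholder content with the documented header; the rows are not real predictions.
csv_text = """Protein,GO_term,GO_category,Probability
P_EXAMPLE_1,GO:EXAMPLE_A,Molecular Function,0.93
P_EXAMPLE_1,GO:EXAMPLE_B,Biological Process,0.58
"""


def group_by_protein(handle):
    """Map protein -> GO category -> list of (GO term, probability)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle):
        entry = (row["GO_term"], float(row["Probability"]))
        grouped[row["Protein"]][row["GO_category"]].append(entry)
    return grouped


grouped = group_by_protein(io.StringIO(csv_text))
print(dict(grouped["P_EXAMPLE_1"]))
```

For a file on disk, pass a handle opened with `open(path, newline="")` instead of the `io.StringIO` object.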

Note: Currently, ProtHGT can only generate predictions for proteins that exist in our knowledge graph, which includes over 300,000 UniProtKB/Swiss-Prot proteins. To enable predictions for novel proteins from their sequences, we are developing a real-time data retrieval system that dynamically fetches relational data from external sources (e.g., STRING, Reactome) and constructs a customized knowledge graph for inference. This system will allow ProtHGT to predict functions for previou
