NodeCoder

NodeCoder is a general framework based on graph convolutional neural network for protein function prediction.


<!-- # NodeCoder Pipeline --> <img src="/figures/Nodecoder_banner.png" width = "1070">

A PyTorch implementation of the NodeCoder Pipeline, a Graph Convolutional Network (GCN)-based framework for protein residue characterization. This work was presented at NeurIPS MLSB 2021: Residue characterization on AlphaFold2 protein structures using graph neural networks (link to paper).

Link to GitHub repository.

Abstract:

Three-dimensional structure prediction tools offer a rapid means to approximate the topology of a protein structure for any
protein sequence. Recent progress in deep learning-based structure prediction has led to highly accurate predictions that have
recently been used to systematically predict 20 whole proteomes by DeepMind’s AlphaFold and the EMBL-EBI. While highly convenient,
structure prediction tools lack much of the functional context presented by experimental studies, such as binding sites or 
post-translational modifications. Here, we introduce a machine learning framework to rapidly model any residue-based
classification using AlphaFold2 structure-augmented protein representations. Specifically, graphs describing the 3D structure of
each protein in the AlphaFold2 human proteome are generated and used as input representations to a Graph Convolutional Network
(GCN), which annotates specific regions of interest based on the structural attributes of the amino acid residues, including their
local neighbors. We demonstrate the approach using six varied amino acid classification tasks.
<img src="/figures/NodeCoder_Pipeline.png" width = "1070">

Table of Contents

🧬 What does the NodeCoder Pipeline do? <br> ⚙️ Installing NodeCoder <br> 🔌 NodeCoder Usage <br> 🗄️ Graph data files <br> 📂 Output files <br> 🗃 Data availability <br> 🤝 Collaborators <br> 🔐 License <br> 📄 Citing this work

<a name="u1"></a>

🧬 What does the NodeCoder Pipeline do?


NodeCoder is a generalized framework that annotates 3D protein structures with predicted labels such as binding sites. The NodeCoder model is based on a Graph Convolutional Network. NodeCoder generates protein graphs from AlphaFold2-augmented protein structures, where the nodes are the amino acid residues and the edges are inter-residue contacts within a preset distance. The model is then trained on the generated graph data for the user's task of interest, for example Task = ['y_Ligand']. When running inference, NodeCoder takes a protein ID such as EGFR_HUMAN. For proteins that are already in the database, it creates the input graph data files from the AlphaFold2 protein structure and calculates structure-based and sequence-based residue features. The input graph data is then passed to the trained model to predict multiple tasks of interest, such as binding sites or post-translational modifications.

<img src="/figures/NodeCoder_FunctionBlocks.png" width = "950">
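
To make the graph-convolution idea concrete, here is a minimal sketch of a single propagation step in the widely used form ReLU(D^-1/2 (A + I) D^-1/2 X W). It is written in plain Python for readability and is only an illustration: the names `matmul` and `gcn_layer` are hypothetical, and NodeCoder's actual layers (built on PyTorch and torch-geometric) may differ.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, features, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so each residue keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Inverse square root of each node's degree in the self-looped graph.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    # Symmetric normalization keeps feature scales comparable across nodes.
    a_norm = [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
              for i in range(n)]
    # Aggregate neighbor features, apply the weights, then ReLU.
    hidden = matmul(matmul(a_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

# Two residues in contact, one input feature each, identity weight.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weight = [[1.0]]
out = gcn_layer(adj, features, weight)  # each residue receives roughly the mean, 2.0
```

In this toy graph each residue's new feature is a normalized average over itself and its contact, which is how information about local neighbors reaches the per-residue prediction.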

<a name="u2"></a>

⚙️ Installing NodeCoder


Required dependencies

The codebase is implemented in Python 3.8.5. The package versions used for development are:

numpy              1.19.2
pandas             1.2.4
scipy              1.6.3
torch              0.4.1
torchvision        0.9.1
torchaudio         0.8.1
torch_scatter      2.0.6
torch_sparse       0.6.9
torch_cluster      1.5.9
torch_spline_conv  1.2.0
torch-geometric    1.7.0  
scikit-learn       0.24.2
matplotlib         3.3.3
biopython          1.77
freesasa           2.0.5.post2
loguru             0.6.0
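
One way to compare an existing environment against the versions above is a small check built on the standard-library `importlib.metadata`. This is only a sketch: the helper name `find_version_mismatches` is hypothetical, only a few of the pins are copied in, and the distribution names are assumed to match the names in the list.

```python
from importlib import metadata

# A few pins copied from the list above; extend as needed.
EXPECTED = {
    "numpy": "1.19.2",
    "pandas": "1.2.4",
    "scipy": "1.6.3",
    "scikit-learn": "0.24.2",
    "biopython": "1.77",
}

def find_version_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Calling `find_version_mismatches(EXPECTED)` inside the activated environment returns an empty dictionary when every listed package matches its pin.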

Installation steps

Here is the step-by-step NodeCoder installation process:<br> Method 1 - install the test.pypi package<br>

  1. Before installing NodeCoder, we highly recommend creating a virtual Python 3.8.5 environment using the venv command or Anaconda. Assuming you have anaconda3 installed on your computer, run the following command in your terminal:
$ conda create -n <your_python_env> python=3.8.5
  2. Make sure your virtual environment is active. For a conda environment you can use this command:
$ conda activate <your_python_env>
  3. Now you can install NodeCoder with this command:
$ pip install --extra-index-url https://test.pypi.org/simple/ NodeCoder

Method 2 - install from the GitHub repository<br> Follow steps 1-2 above, then continue with the following steps:

  3. Clone the repository:
$ git clone https://github.com/NasimAbdollahi/NodeCoder.git
  4. Make sure you are in the root directory of the NodeCoder package, ~/NodeCoder/ (where setup.py is). Now install the NodeCoder package with the following command, which installs all dependencies in the Python environment you created in step 1:
$ pip install .

<a name="u3"></a>

🔌 NodeCoder Usage


The NodeCoder package can be used for both training and inference. Here we describe how to use it:

🗂️ Preprocessing raw data

NodeCoder uses AlphaFold2-modeled protein structures as input. The AlphaFold Protein Structure Database provides open access to protein structure predictions for the human proteome and other key proteins of interest. Prediction labels can be obtained from the BioLiP and UniProt databases.

Once you have downloaded the protein databases, the first step is to run NodeCoder's featurizer module to process these raw data sets and extract node features and labels. When using the featurizer module, preprocess_raw_data, you will need to specify the directories where the datasets are saved:

alphafold_data_path
uniprot_data_path
biolip_data_path
biolip_data_skip_path

The featurizer module creates two files for every protein in the selected proteome: <font color='#D55E00'> *.features.csv </font> and <font color='#D55E00'> *.tasks.csv </font>. These files are saved in the <font color='#D55E00'> NodeCoder/data/input_data/featurized_data/TAX_ID/ </font> directory, in separate <font color='#D55E00'> features </font> and <font color='#D55E00'> tasks </font> folders. For example, if the user chooses the human proteome, 9606, the following tree structure is generated:

data/input_data/featurized_data/
└── 9606
    ├── features
    └── tasks
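
A quick sanity check after featurization is to confirm that every protein has both files. The sketch below assumes the files are named `<protein_id>.features.csv` and `<protein_id>.tasks.csv` inside the two folders shown above; the helper names `find_unpaired` and `find_unpaired_in_dir` are hypothetical and not part of NodeCoder.

```python
from pathlib import Path

FEATURES_SUFFIX = ".features.csv"
TASKS_SUFFIX = ".tasks.csv"

def find_unpaired(feature_names, task_names):
    """Return (missing_tasks, missing_features) as sorted lists of protein IDs."""
    with_features = {n[: -len(FEATURES_SUFFIX)] for n in feature_names
                     if n.endswith(FEATURES_SUFFIX)}
    with_tasks = {n[: -len(TASKS_SUFFIX)] for n in task_names
                  if n.endswith(TASKS_SUFFIX)}
    return sorted(with_features - with_tasks), sorted(with_tasks - with_features)

def find_unpaired_in_dir(featurized_dir):
    """Same check, reading file names from <featurized_dir>/features and /tasks."""
    root = Path(featurized_dir)
    feature_names = [p.name for p in (root / "features").iterdir()]
    task_names = [p.name for p in (root / "tasks").iterdir()]
    return find_unpaired(feature_names, task_names)
```

For the human proteome this could be called as `find_unpaired_in_dir("data/input_data/featurized_data/9606")`; two empty lists mean every protein has both a features file and a tasks file.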

The command line to run the featurizer module is:

$ python NodeCoder/preprocess_raw_data.py

To use NodeCoder as a Python package, import the preprocessing module:

>>> from NodeCoder import preprocess_raw_data 
>>> preprocess_raw_data.main(alphafold_data_path='.', uniprot_data_path='.', biolip_data_path='.', biolip_data_skip_path='.')

The default species/proteome is HUMAN, but the user can change it with the following parameters:

>>> preprocess_raw_data.main(alphafold_data_path='.', uniprot_data_path='.', biolip_data_path='.', biolip_data_skip_path='.',
                              TAX_ID='9606', PROTEOME_ID='UP000005640')

🗃️ Generate graph data

The next step after running the featurizer is to generate graph data from the features and tasks files. NodeCoder has a graph-generator module that generates protein graph data using a threshold on the distance between amino acid residues. The user must define this threshold distance in Angstroms, for example threshold_dist = 5, to create the graph contact network. Graph data files are saved in the <font color='#D55E00'>./data/input_data/graph_data_*A/</font> directory with the following tree structure (the example here is for an 8A cut-off distance and 5 folds for cross-validation):

data/input_data/graph_data_8A/
└── 9606
    └── 5FoldCV

The command line to run the graph generator module is:

$ python NodeCoder/generate_graph_data.py

To use NodeCoder as a Python package, import the generate_graph_data module:

>>> from NodeCoder import generate_graph_data 
>>> generate_graph_data.main()

The user can specify the following parameters:

>>> generate_graph_data.main(TAX_ID='9606', PROTEOME_ID='UP000005640', threshold_dist=5, cross_validation_fold_number=5)

Note that in the cross-validation setting, separate graphs are created for each fold.
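
How NodeCoder assigns proteins to folds is not described here, so the following is only a generic sketch of a k-fold split by protein ID. The function names, the seeded shuffle, and the round-robin assignment are all assumptions made for illustration.

```python
import random

def split_into_folds(protein_ids, cross_validation_fold_number=5, seed=0):
    """Shuffle protein IDs reproducibly and deal them into k disjoint folds."""
    ids = sorted(protein_ids)
    random.Random(seed).shuffle(ids)
    k = cross_validation_fold_number
    return [ids[i::k] for i in range(k)]

def train_validation_sets(folds, held_out):
    """Use fold `held_out` for validation and the remaining folds for training."""
    validation = list(folds[held_out])
    train = [pid for i, fold in enumerate(folds) if i != held_out for pid in fold]
    return train, validation
```

Splitting by protein rather than by residue keeps all residues of one protein in the same fold, so a separate set of graphs can be written per fold.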

🧠 Train NodeCoder

To train NodeCoder's graph-based model, use the train.py module. The parser.py script holds the model parameters used for training. In the train.py script, set the following parameters to specify the task or tasks of interest and the cutoff distance that defines the protein contact network:

Task = ['y_Ligand']
threshold_dist = 5

The command line to train NodeCoder is:

$ python NodeCoder/train.py

To use NodeCoder as a Python package, import the train module:

>>> from NodeCoder import train 
>>> train.main()

The user can specify the following parameters:

>>> train.main(threshold_dist=5, multi_task_learning=False, Task=['y_Ligand'], centrality_feature=True,
         cross_validation_fold_number=5, epochs=1000)

Here is a list of available training tasks (residue labels/annotations):

'y_CHAIN', 'y_TRANSMEM', 'y_MOD_RES', 'y_ACT_SITE', 'y_NP_BIND', 
'y_LIPID', 'y_CARBOHYD', 'y_DISULFID', 'y_VARIANT', 'y_Artifact', 
'y_Peptide', 'y_Nucleic', 'y_Inorganic', 'y_Cofactor', 'y_Ligand'
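
Because task names are passed as plain strings, a typo is easy to make. A small guard like the one below can catch it before a long training run starts. This helper is hypothetical and not part of NodeCoder; only the task names are taken from the list above.

```python
AVAILABLE_TASKS = [
    'y_CHAIN', 'y_TRANSMEM', 'y_MOD_RES', 'y_ACT_SITE', 'y_NP_BIND',
    'y_LIPID', 'y_CARBOHYD', 'y_DISULFID', 'y_VARIANT', 'y_Artifact',
    'y_Peptide', 'y_Nucleic', 'y_Inorganic', 'y_Cofactor', 'y_Ligand',
]

def validate_tasks(tasks):
    """Return the task list unchanged, or raise ValueError naming unknown tasks."""
    unknown = [task for task in tasks if task not in AVAILABLE_TASKS]
    if unknown:
        raise ValueError(f"Unknown task(s): {unknown}")
    return list(tasks)

Task = validate_tasks(['y_Ligand', 'y_Peptide'])
```

The validated list can then be passed on, for example as the `Task` argument of `train.main`.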

🤖 Inference with NodeCoder

To use a trained NodeCoder model for protein function prediction, the user needs to run predict.py
