GraphAny: Fully-inductive Node Classification on Arbitrary Graphs

</div>

Original PyTorch implementation of GraphAny.

Authored by Jianan Zhao, Zhaocheng Zhu, Mikhail Galkin, Hesham Mostafa, Michael Bronstein, and Jian Tang.

Overview

Fully-Inductive Model on Node Classification

GraphAny is a fully-inductive model for node classification. A single trained GraphAny model performs node classification on any graph with any feature and label space. Averaged over 30+ graphs, a single trained GraphAny model in inference mode outperforms many transductive (supervised) models (e.g., MLP, GCN, and GAT) trained specifically for each graph. Following the pretrain-inference paradigm of foundation models, you can train from scratch and run inference on all 31 datasets, as shown in Training GraphAny from Scratch.

This repository is based on PyTorch 2.1, PyTorch Lightning 2.2, PyG 2.4, DGL 2.1, and Hydra 1.3.

Environment Setup

Our experiments are designed to run on both GPU and CPU platforms. A GPU with 16 GB of memory is sufficient to handle all 31 datasets, and we have also tested the setup on a single CPU (specifically, an M1 MacBook).

To configure your environment, use the following commands based on your setup:

# For setups with a GPU (requires CUDA 11.8):
conda env create -f environment.yaml
# For setups using a CPU (tested on macOS with M1 chip):
conda env create -f environment_cpu.yaml

File Structure

├── README.md
├── checkpoints
├── configs
│   ├── data.yaml
│   ├── main.yaml
│   └── model.yaml
├── environment.yaml
├── environment_cpu.yaml
└── graphany
    ├── __init__.py
    ├── data.py
    ├── model.py
    ├── run.py
    └── utils

Reproduce Our Results

Training GraphAny from Scratch

This section describes how to train GraphAny on one dataset (Cora, Wisconsin, Arxiv, or Product) and evaluate it on all 31 datasets. You can reproduce our results with the commands below. The checkpoints produced by these commands are saved in the checkpoints/ folder.

cd path/to/this/repo
# Reproduce GraphAny-Cora: test_acc=66.98 for seed 0
python graphany/run.py dataset=CoraXAll total_steps=500 n_hidden=64 n_mlp_layer=1 entropy=2 n_per_label_examples=5
# Reproduce GraphAny-Wisconsin: test_acc=67.36 for seed 0
python graphany/run.py dataset=WisXAll total_steps=1000 n_hidden=32 n_mlp_layer=2 entropy=1 n_per_label_examples=5
# Reproduce GraphAny-Arxiv: test_acc=67.58 for seed 0
python graphany/run.py dataset=ArxivXAll total_steps=1000 n_hidden=128 n_mlp_layer=2 entropy=1 n_per_label_examples=3
# Reproduce GraphAny-Product: test_acc=67.77 for seed 0
python graphany/run.py dataset=ProdXAll total_steps=1000 n_hidden=128 n_mlp_layer=2 entropy=1 n_per_label_examples=3
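
The four runs above differ only in their Hydra overrides, so they can be generated from a single table. The following is a minimal sketch and not part of the repository: the names RUNS and build_command are hypothetical, and the override values are copied from the commands above.

```python
import shlex

# Hydra overrides copied from the commands above. RUNS and build_command
# are illustrative names, not part of the GraphAny code base.
RUNS = {
    "CoraXAll": dict(total_steps=500, n_hidden=64, n_mlp_layer=1, entropy=2, n_per_label_examples=5),
    "WisXAll": dict(total_steps=1000, n_hidden=32, n_mlp_layer=2, entropy=1, n_per_label_examples=5),
    "ArxivXAll": dict(total_steps=1000, n_hidden=128, n_mlp_layer=2, entropy=1, n_per_label_examples=3),
    "ProdXAll": dict(total_steps=1000, n_hidden=128, n_mlp_layer=2, entropy=1, n_per_label_examples=3),
}


def build_command(dataset, overrides):
    """Return the argv list for one training run."""
    args = ["python", "graphany/run.py", f"dataset={dataset}"]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


for dataset, overrides in RUNS.items():
    # Print instead of executing, so the sketch has no side effects.
    print(shlex.join(build_command(dataset, overrides)))
```

To actually launch a run, the returned list can be passed to subprocess.run(cmd, check=True) from the repository root.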

Inference Using Pre-trained Checkpoints

Once trained, GraphAny can perform inference on any graph. You can use our trained checkpoints to run inference on your own graph. Here, we show an example of loading a GraphAny model trained on Arxiv and performing inference on Cora and Citeseer.

Step 1: Define your custom combined dataset config in configs/data.yaml:

# configs/data.yaml
_dataset_lookup:
  # Train on Arxiv, inference on Cora and Citeseer
  CoraCiteInference:
    train: [ Arxiv ]
    eval: [ Cora, Citeseer ]

Step 2 (optional): Define your dataset processing logic in graphany/data.py. This step is necessary only if you are not using our pre-processed data. If you use our provided datasets, skip this step and proceed directly to Step 3.

Step 3: Run inference with the pre-trained model using the following command:

python graphany/run.py prev_ckpt=checkpoints/graph_any_arxiv.pt total_steps=0 dataset=CoraCiteInference
# ind/cora_test_acc 79.4 ind/cite_test_acc 68.4

<details> <summary>Example Output Log</summary> <pre><code># Training Logs
CRITICAL {
    'ind/cora_val_acc': 75.4,
    'ind/cite_val_acc': 70.4,
    'val_acc': 72.9,
    'trans_val_acc': nan,  # Not applicable as Arxiv is not included in the evaluation set
    'ind_val_acc': 72.9,
    'heldout_val_acc': 70.4,
    'ind/cora_test_acc': 79.4,
    'ind/cite_test_acc': 68.4,
    'test_acc': 73.9,
    'trans_test_acc': nan,
    'ind_test_acc': 73.9,
    'heldout_test_acc': 68.4
}
INFO Finished main at 06-01 05:07:49, running time = 2.52s.
</code></pre>

Note: The trans_test_acc field is not applicable since Arxiv is not specified in the evaluation datasets. Additionally, the heldout accuracies are calculated by excluding datasets specified as transductive in configs/data.yaml (default settings: _trans_datasets: [Arxiv, Product, Cora, Wisconsin]). To utilize the heldout metrics correctly, please adjust these transductive datasets in your configuration to reflect your specific dataset inductive split settings.

</details>
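
The aggregate metrics in the example log are consistent with plain averages over subsets of the evaluation datasets. The sketch below reproduces the test numbers from the log under that assumption. The function name aggregate and the exact grouping rules are inferred from the log and the note above, not taken from the repository code.

```python
import math


def aggregate(per_dataset_acc, train_datasets, trans_datasets):
    """Average per-dataset accuracies into the groups shown in the log.

    Assumed grouping (inferred from the example log, not from the code):
      - trans:   evaluation datasets that were also used for training
      - ind:     evaluation datasets not used for training
      - heldout: evaluation datasets not listed in _trans_datasets
    """
    def mean(names):
        values = [per_dataset_acc[name] for name in names]
        # An empty group has no meaningful average, so report nan.
        return sum(values) / len(values) if values else math.nan

    names = list(per_dataset_acc)
    return {
        "test_acc": mean(names),
        "trans_test_acc": mean([n for n in names if n in train_datasets]),
        "ind_test_acc": mean([n for n in names if n not in train_datasets]),
        "heldout_test_acc": mean([n for n in names if n not in trans_datasets]),
    }


result = aggregate(
    {"Cora": 79.4, "Citeseer": 68.4},
    train_datasets=["Arxiv"],
    trans_datasets=["Arxiv", "Product", "Cora", "Wisconsin"],
)
print({key: round(value, 1) for key, value in result.items()})
```

With the default _trans_datasets, Cora is excluded from the heldout group, which is why heldout_test_acc equals the Citeseer accuracy alone.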

Configuration Details

We use Hydra to manage the configuration. The configs are organized in three files under the configs/ directory:

`main.yaml`

Settings for experiments, including random seed, wandb, path, hydra, and logging configs.

`data.yaml`

This file contains settings for datasets, including preprocessing specifications, metadata, and lookup configurations. Here’s an overview of the key elements:

Dataset Preprocessing Options

preprocess_device: gpu - Specifies the device for computing propagated features $\boldsymbol{F}$. Set to cpu if your GPU memory is below 32 GB.
add_self_loop: false - Specifies whether to add self-loops to the nodes in the graph.
to_bidirected: true - If set to true, edges are made bidirectional.
n_hops: 2 - Defines the maximum number of hops of message passing. In our experiments, besides Linear, we use LinearSGC1, LinearSGC2, LinearHGC1, and LinearHGC2, which use information within 2 hops of message passing.
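
To make these options concrete, here is a minimal pure-Python sketch of how propagated features could be computed on a tiny graph. It assumes a row-normalized adjacency matrix A, low-pass channels L_k = A^k X, and high-pass channels H_k = (I - A)^k X. These definitions, and all function names, are assumptions made for illustration. The actual preprocessing lives in graphany/data.py and may differ.

```python
def to_bidirected(edges):
    """Add the reverse of every edge (mirrors to_bidirected: true)."""
    return sorted(set(edges) | {(v, u) for u, v in edges})


def row_normalized_adjacency(n, edges, add_self_loop=False):
    """Dense adjacency whose non-empty rows sum to 1 (assumed normalization)."""
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = 1.0
    if add_self_loop:
        for i in range(n):
            adj[i][i] = 1.0
    for row in adj:
        degree = sum(row)
        if degree > 0:
            row[:] = [value / degree for value in row]
    return adj


def matmul(a, b):
    """Plain dense matrix product for small examples."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def propagated_features(adj, x, n_hops=2):
    """Return channels X, L1..Lk (low-pass) and H1..Hk (high-pass)."""
    n = len(adj)
    high = [[(1.0 if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]
    feats = {"X": x}
    low_k, high_k = x, x
    for k in range(1, n_hops + 1):
        low_k = matmul(adj, low_k)
        high_k = matmul(high, high_k)
        feats[f"L{k}"] = low_k
        feats[f"H{k}"] = high_k
    return feats


# Path graph 0 - 1 - 2 with one scalar feature per node.
edges = to_bidirected([(0, 1), (1, 2)])
adj = row_normalized_adjacency(3, edges, add_self_loop=False)
feats = propagated_features(adj, [[1.0], [0.0], [2.0]], n_hops=2)
print(feats["L1"], feats["H1"])
```

With n_hops: 2 this yields the five channels named in the configuration (X, L1, L2, H1, H2); each extra hop adds one more low-pass and one more high-pass channel.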

Train and Evaluation Dataset Lookup

The datasets for training and evaluation are selected dynamically: the dataset command-line argument is looked up in the _dataset_lookup configuration.
Example: Using dataset=CoraXAll sets train_datasets to [Cora] and eval_datasets to all datasets (31 in total).

train_datasets: ${oc.select:_dataset_lookup.${dataset}.train,${dataset}}
eval_datasets: ${oc.select:_dataset_lookup.${dataset}.eval,${dataset}}
_dataset_lookup:
  CoraXAll:
    train: [Cora]
    eval: ${_all_datasets}

Please define your own dataset combinations in _dataset_lookup if desired.
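
The lookup relies on the default argument of OmegaConf's oc.select resolver: if the dotted key does not exist, the default (here the raw dataset argument) is returned instead. The following pure-Python sketch mimics that behavior on a plain dictionary. The helper names select and resolve_datasets are hypothetical, and how graphany/run.py handles the fallback value is not shown here.

```python
def select(config, dotted_key, default):
    """Mimic ${oc.select:key,default}: walk a dotted path, else return default."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def resolve_datasets(config, dataset):
    """Resolve train/eval datasets for one value of the dataset argument."""
    train = select(config, f"_dataset_lookup.{dataset}.train", dataset)
    evaluation = select(config, f"_dataset_lookup.{dataset}.eval", dataset)
    return train, evaluation


config = {
    "_dataset_lookup": {
        "CoraCiteInference": {"train": ["Arxiv"], "eval": ["Cora", "Citeseer"]},
    }
}

# A named combination resolves to its configured lists.
print(resolve_datasets(config, "CoraCiteInference"))
# A name without a lookup entry falls back to the argument itself.
print(resolve_datasets(config, "Cora"))
```

In this sketch, a dataset name with no _dataset_lookup entry is used for both training and evaluation, which matches the fallback written into the two interpolations above.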

Detailed Dataset Configurations

The dataset metadata records, for each dataset, the interface used to load it (DGL, PyG, OGB, or Heterophilous) and its alias within that interface (e.g. Planetoid.Cora). The statistics are given in a trailing comment in the format 'n_nodes, n_edges, n_feat_dim, n_labels'. For example:

_ds_meta_data:
  Arxiv: ogb, ogbn-arxiv # 168,343 1,166,243 100 40
  Cora: pyg, Planetoid.Cora # 2,708 10,556 1,433 7
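
As an illustration of this format, the sketch below splits one raw line of _ds_meta_data into its dataset name, interface, alias, and the statistics from the trailing comment. The function parse_meta_line is hypothetical and only shows how the fields are laid out. It is not how the repository loads datasets.

```python
def parse_meta_line(line):
    """Split 'Name: interface, alias # n_nodes n_edges n_feat_dim n_labels'."""
    entry, _, comment = line.partition("#")
    name, _, value = entry.partition(":")
    interface, alias = [part.strip() for part in value.split(",", 1)]
    keys = ("n_nodes", "n_edges", "n_feat_dim", "n_labels")
    # Statistics use thousands separators, e.g. "2,708".
    stats = {key: int(token.replace(",", ""))
             for key, token in zip(keys, comment.split())}
    return {"name": name.strip(), "interface": interface, "alias": alias, **stats}


cora = parse_meta_line("  Cora: pyg, Planetoid.Cora # 2,708 10,556 1,433 7")
print(cora)
```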

</details>

`model.yaml`

This file contains the settings for models and training.

GraphAny leverages interactions between predictions as input features for an MLP to calculate inductive attention scores. These inputs are termed "feature channels" and are defined in the configuration file as feat_chn. Subsequently, the outputs from LinearGNNs, referred to as "prediction channels", are combined using inductive attention scores and are defined as pred_chn in the configuration file. The default settings are:

feat_chn: X+L1+L2+H1+H2 # X=Linear, L1=LinearSGC1, L2=LinearSGC2, H1=LinearHGC1, H2=LinearHGC2
pred_chn: X+L1+L2 # H1 and H2 channels are masked to enhance convergence speed.

It is important to note that the feature channels and prediction channels do not need to be identical. Empirical observations indicate that masking LinearHGC1 and LinearHGC2 leads to faster convergence and marginally improved results (results in Table 2, Figure 1, and Figure 5). Furthermore, for the attention visualizations in Figure 6, all five channels (pred_chn=X+L1+L2+H1+H2) are employed. This demonstrates GraphAny's capability to learn inductive attention that effectively identifies critical channels for unseen graphs.
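
The difference between feature channels and prediction channels can be sketched for a single node as follows. This is a simplified illustration under stated assumptions: the attention logits are given directly (in GraphAny they come from an MLP over the feat_chn inputs, which is not reproduced here), they are normalized with a softmax, and the function names are hypothetical.

```python
import math


def parse_channels(spec):
    """Turn a spec such as 'X+L1+L2' into ['X', 'L1', 'L2']."""
    return spec.split("+")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def combine(predictions, attention_logits, pred_chn):
    """Attention-weighted sum of per-channel class predictions for one node.

    Channels absent from pred_chn are masked: they get no attention weight
    and do not contribute to the output.
    """
    channels = parse_channels(pred_chn)
    weights = softmax([attention_logits[name] for name in channels])
    n_classes = len(predictions[channels[0]])
    return [sum(weight * predictions[name][k]
                for weight, name in zip(weights, channels))
            for k in range(n_classes)]


# Made-up two-class predictions from the five LinearGNN channels.
predictions = {
    "X": [1.0, 0.0], "L1": [0.0, 1.0], "L2": [0.5, 0.5],
    "H1": [0.9, 0.1], "H2": [0.2, 0.8],
}
logits = {name: 0.0 for name in predictions}  # equal attention everywhere
print(combine(predictions, logits, "X+L1+L2"))        # H1 and H2 masked
print(combine(predictions, logits, "X+L1+L2+H1+H2"))  # all five channels
```

Changing pred_chn only changes which predictions enter the weighted sum. The inputs used to compute the attention scores are controlled separately by feat_chn.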

Other model parameters and default values:

# The entropy to normalize the distance features (conditio
