OpenGraph

[EMNLP'2024] "OpenGraph: Towards Open Graph Foundation Models"

OpenGraph: Towards Open Graph Foundation Models

<a href='https://akaxlh.github.io/'>Lianghao Xia</a>, <a href='https://scholar.google.com/citations?user=TwSParMAAAAJ'>Ben Kao</a>, and <a href='https://sites.google.com/view/chaoh/group-join-us'>Chao Huang*</a> (*Correspondence)

Presenting OpenGraph, a foundation graph model <b><i>distilling zero-shot graph generalizability from LLMs</i></b>.

To achieve this goal, OpenGraph addresses several key technical challenges:

We propose a unified graph tokenizer that helps our graph model generalize to unseen graph data, even when the underlying graph properties differ significantly from those encountered during training.
We develop a scalable graph transformer as the foundational encoder, which effectively and efficiently captures node-wise dependencies within the global topological context.
We introduce a data augmentation mechanism enhanced by a large language model (LLM) to alleviate the limitations of data scarcity in real-world scenarios.

Extensive experiments validate the effectiveness of our framework. By adapting OpenGraph to new graph characteristics and comprehending the nuances of diverse graphs, our approach achieves remarkable zero-shot graph learning performance across various settings and domains.

Environment Setup

Unzip the archived data files in datasets/ (the files ending in .pkl.zip), and download the pre-trained models using the link in Models/readme. Our experiments were conducted with the following package versions:

python==3.10.13
torch==1.13.0
numpy==1.23.4
scipy==1.9.3
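
The unzip step can be scripted. The following is a minimal sketch, assuming the archives follow the `*.pkl.zip` naming used in datasets/ and that each archive should be extracted into the directory that contains it. The function name `unzip_datasets` is hypothetical and not part of the repository.

```python
import zipfile
from pathlib import Path


def unzip_datasets(root="datasets"):
    """Extract every *.pkl.zip found under `root` into the directory holding it.

    Returns the list of archives that were extracted (hypothetical helper,
    not part of the OpenGraph code base).
    """
    extracted = []
    for archive in sorted(Path(root).rglob("*.pkl.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
        extracted.append(archive)
    return extracted
```

Run from the repository root, `unzip_datasets()` walks datasets/ and leaves the original archives in place, so it can be re-run safely.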

Brief Code Structure

Here is a brief overview of the code structure. The explanation for each directory is enclosed in ## ... ## markers. For a more detailed version, please refer to the full listing at the end of this readme.

./
├── README.md
├── History/ ## Training history of pre-trained models ##
├── Models/ ## Pre-trained models ##
├── datasets/
├── graph_generation/ ## Code and examples for graph generation ##
├── imgs/ ## Images used in readme ##
├── link_prediction/ ## Code for link prediction and pre-training ##
│   ├── data_handler.py
│   ├── main.py
│   ├── model.py
│   ├── params.py
│   └── Utils/
│       └── TimeLogger.py
└── node_classification/ ## Code for testing on node classification ##
    ├── data_handler.py
    ├── main.py
    ├── model.py
    ├── params.py
    └── Utils/
        └── TimeLogger.py

Usage

To reproduce the test performance reported in the paper, run the following commands:

cd link_prediction/
python main.py --load pretrn_gen1 --epoch 0 # test on OGBL-Collab, ML-1M, ML-10M
python main.py --load pretrn_gen0 --tstdata amazon-book --epoch 0 # test on Amazon-Book
python main.py --load pretrn_gen2 --tstdata ddi --epoch 0 # test on OGBL-ddi
cd ../node_classification/
python main.py --load pretrn_gen1 --tstdata cora # test on Cora
python main.py --load pretrn_gen1 --tstdata citeseer # test on Citeseer
python main.py --load pretrn_gen1 --tstdata pubmed # test on Pubmed
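
The checkpoint and dataset pairings in the commands above can be kept in one lookup table. This is a small sketch that only rebuilds the command strings listed above; the names `TEST_RUNS` and `build_test_command` are hypothetical, and nothing here is part of the repository.

```python
import shlex

# Pairings copied from the commands above.
# Each entry maps a label to (working directory, arguments for main.py).
TEST_RUNS = {
    "collab-ml1m-ml10m": ("link_prediction",
                          ["--load", "pretrn_gen1", "--epoch", "0"]),
    "amazon-book": ("link_prediction",
                    ["--load", "pretrn_gen0", "--tstdata", "amazon-book", "--epoch", "0"]),
    "ddi": ("link_prediction",
            ["--load", "pretrn_gen2", "--tstdata", "ddi", "--epoch", "0"]),
    "cora": ("node_classification",
             ["--load", "pretrn_gen1", "--tstdata", "cora"]),
    "citeseer": ("node_classification",
                 ["--load", "pretrn_gen1", "--tstdata", "citeseer"]),
    "pubmed": ("node_classification",
               ["--load", "pretrn_gen1", "--tstdata", "pubmed"]),
}


def build_test_command(name):
    """Return (working directory, shell command string) for one test run."""
    workdir, args = TEST_RUNS[name]
    return workdir, shlex.join(["python", "main.py", *args])


for label in TEST_RUNS:
    workdir, command = build_test_command(label)
    print(f"(cd {workdir}/ && {command})")
```

Note that the link prediction runs use the generated-data checkpoints `pretrn_gen0`, `pretrn_gen1`, and `pretrn_gen2` depending on the test set, while all three node classification runs load `pretrn_gen1`.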

To pre-train OpenGraph yourself, run the following commands:

cd ../link_prediction/
python main.py --save pretrn_gen1
python main.py --trndata gen0 --tstdata amazon-book --save pretrn_gen0
python main.py --trndata gen2 --tstdata ddi --save pretrn_gen2

To explore pre-training with other combinations of pre-training and testing datasets, modify `trn_datasets` and `tst_datasets` in line 241 of `link_prediction/main.py`.

Graph Data Generation

The graph generation code is in graph_generation/. A small toy dataset is provided. Before running anything, fill in your OpenAI key in Utils.py and itemCollecting_dfsIterator.py. To generate your own dataset, modify the descs and hyperparams dicts, then run the following steps in order:

cd graph_generation/
python itemCollecting_dfsIterator.py
python instance_number_estimation_hierarchical.py
python embedding_generation.py
python human_item_generation_gibbsSampling_embedEstimation.py
python make_adjs.py
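
Because each step consumes the output of the previous one, it can help to run them through a wrapper that stops at the first failure. This is a minimal sketch, assuming the scripts take no command-line arguments (as in the commands above); `GENERATION_STEPS` and `run_pipeline` are hypothetical names, and the stage comments are taken from the detailed code structure at the end of this readme.

```python
import subprocess
import sys

# Script order copied from the procedure above.
GENERATION_STEPS = [
    "itemCollecting_dfsIterator.py",                           # node generation
    "instance_number_estimation_hierarchical.py",              # per-node amount estimation
    "embedding_generation.py",                                 # node embedding generation
    "human_item_generation_gibbsSampling_embedEstimation.py",  # edge generation
    "make_adjs.py",                                            # dataset construction
]


def run_pipeline(workdir="graph_generation", steps=GENERATION_STEPS, runner=subprocess.run):
    """Run each script in order inside `workdir`; stop at the first failing step.

    `runner` defaults to subprocess.run and can be swapped for a stub in tests.
    With check=True, a non-zero exit code raises CalledProcessError, so later
    steps never run on incomplete outputs.
    """
    finished = []
    for script in steps:
        runner([sys.executable, script], cwd=workdir, check=True)
        finished.append(script)
    return finished
```

Calling `run_pipeline()` from the repository root is equivalent to the five commands above, with the difference that a failed step aborts the run.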

The figure below shows our prompt template, along with example prompt configurations and generated nodes.

Evaluation Results

Overall Generalization Performance

OpenGraph achieves the best performance under the 0-shot setting, compared to baselines trained/tuned with 1-shot and 5-shot data. <img src='imgs/performance.png' />

Pre-training Dataset Study

We studied the influence of using different pre-training datasets. Results below indicate that:

The generation techniques (Norm, Loc, Topo) have positive effects on performance.
Real-world datasets (Yelp2018, Gowalla) may yield worse results compared to our generated ones.
A relevant pre-training dataset (ML-10M for test data ML-1M and ML-10M) results in superior performance.

Graph Tokenizer Study

We varied the configuration of our unified graph tokenizer by adjusting the order of graph smoothing and by replacing our topology-aware projection with alternatives. Our findings include:

Adjacency smoothing is important, as OpenGraph with 0-order smoothing yields inferior performance.
Topology-aware projection is superior in performance. The alternatives are One-hot, which learns one large, unified representation table for all datasets; Random, which makes no assumption about node-wise relations and distributes nodes uniformly; and Degree, which is a widely used method for non-attributed graphs and seems applicable to the cross-graph scenario.
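
To make the two tokenizer ingredients concrete, here is a minimal NumPy sketch on a toy graph. It is illustrative only: the function names, the way smoothing orders are summed, and the use of a truncated SVD as the topology-aware projection are assumptions made for this example and may differ from the code in link_prediction/.

```python
import numpy as np


def normalize_adj(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes stay all-zero."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]


def smooth_adj(adj, order):
    """Add `order` higher powers of the normalized adjacency to itself.

    Assumption for this sketch: order=0 returns the normalized adjacency
    unchanged, i.e. no smoothing.
    """
    norm = normalize_adj(adj)
    smoothed = norm.copy()
    power = norm
    for _ in range(order):
        power = power @ norm
        smoothed += power
    return smoothed


def topology_tokens(adj, order=2, dim=2):
    """Map each node to `dim` numbers via a truncated SVD of the smoothed adjacency."""
    u, s, _ = np.linalg.svd(smooth_adj(adj, order))
    return u[:, :dim] * np.sqrt(s[:dim])


# Toy undirected path graph 0-1-2-3 (hypothetical data).
toy_adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
tokens = topology_tokens(toy_adj, order=2, dim=2)
print(tokens.shape)  # (4, 2)
```

Unlike a one-hot table, a projection of this kind depends only on the graph's own structure, so the output dimension stays fixed regardless of how many nodes a new graph has.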

Sampling Techniques Study

We ablated the two sampling techniques in the graph transformer, and show their positive effects on both memory and time costs below. Surprisingly, token sequence sampling also has a positive effect on model performance.

Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{xia2024opengraph,
  title={OpenGraph: Towards Open Graph Foundation Models},
  author={Xia, Lianghao and Kao, Ben and Huang, Chao},
  booktitle={EMNLP},
  year={2024}
}

Detailed Code Structures

./
│   └── README.md
│   ├── History/ ## Training history of pre-trained models ##
│   │   ├── pretrn_gen0.his
│   │   ├── pretrn_gen2.his
│   │   └── pretrn_gen1.his
│   ├── Models/ ## Pre-trained models ##
│   │   └── readme ## Download pre-trained models using the link inside ##
│   ├── datasets/
│   │   ├── amazon-book/
│   │   │   ├── fewshot_mat_1.pkl
│   │   │   ├── trn_mat.pkl.zip ## Unzip it manually ##
│   │   │   ├── tst_mat.pkl
│   │   │   └── fewshot_mat_5.pkl
│   │   ├── citeseer/
│   │   │   ├── adj_-1.pkl
│   │   │   ├── adj_1.pkl
│   │   │   ├── adj_5.pkl
│   │   │   ├── feats.pkl.zip ## Unzip it manually ##
│   │   │   ├── label.pkl
│   │   │   ├── mask_-1.pkl
│   │   │   ├── mask_1.pkl
│   │   │   └── mask_5.pkl
│   │   ├── collab/
│   │   │   ├── fewshot_mat_5.pkl
│   │   │   ├── trn_mat.pkl.zip ## Unzip it manually ##
│   │   │   ├── tst_mat.pkl
│   │   │   ├── val_mat.pkl
│   │   │   └── fewshot_mat_1.pkl
│   │   ├── cora/
│   │   │   ├── adj_-1.pkl
│   │   │   ├── adj_1.pkl
│   │   │   ├── adj_5.pkl
│   │   │   ├── feats.pkl
│   │   │   ├── label.pkl
│   │   │   ├── mask_-1.pkl
│   │   │   ├── mask_1.pkl
│   │   │   └── mask_5.pkl
│   │   ├── ddi/
│   │   │   ├── fewshot_mat_1.pkl
│   │   │   ├── trn_mat.pkl.zip ## Unzip it manually ##
│   │   │   ├── tst_mat.pkl
│   │   │   ├── val_mat.pkl
│   │   │   └── fewshot_mat_5.pkl
│   │   ├── gen0/
│   │   │   ├── trn_mat.pkl
│   │   │   ├── val_mat.pkl
│   │   │   └── tst_mat.pkl
│   │   ├── gen1/
│   │   │   ├── trn_mat.pkl
│   │   │   ├── tst_mat.pkl
│   │   │   └── val_mat.pkl
│   │   ├── gen2/
│   │   │   ├── trn_mat.pkl
│   │   │   ├── val_mat.pkl
│   │   │   └── tst_mat.pkl
│   │   ├── ml10m/
│   │   │   ├── fewshot_mat_1.pkl
│   │   │   ├── trn_mat.pkl.zip ## Unzip it manually ##
│   │   │   ├── tst_mat.pkl.zip ## Unzip it manually ##
│   │   │   └── fewshot_mat_5.pkl
│   │   ├── ml1m/
│   │   │   ├── fewshot_mat_5.pkl
│   │   │   ├── trn_mat.pkl
│   │   │   ├── tst_mat.pkl
│   │   │   └── fewshot_mat_1.pkl
│   │   ├── pubmed/
│   │   │   ├── adj_-1.pkl
│   │   │   ├── adj_1.pkl
│   │   │   ├── feats.pkl.zip ## Unzip it manually ##
│   │   │   ├── label.pkl
│   │   │   ├── mask_-1.pkl
│   │   │   ├── mask_1.pkl
│   │   │   ├── mask_5.pkl
│   │   │   └── adj_5.pkl
│   ├── graph_generation/ ## Code and examples for graph generation ##
│   │   ├── embedding_generation.py ## Node embedding generation ##
│   │   ├── human_item_generation_gibbsSampling_embedEstimation.py ## Edge generation ##
│   │   ├── instance_number_estimation_hierarchical.py ## Estimate the amount for each node. Not mentioned in the paper. ##
│   │   ├── itemCollecting_dfsIterator.py ## Node generation ##
│   │   ├── make_adjs.py ## Making datasets for generated graphs ##
│   │   └── Utils.py
│   │   ├── Exp_Utils/
│   │   │   ├── Emailer.py ## A tool to send warning email for experiments ##
│   │   │   └── TimeLogger.py
│   │   ├── gen_results/
│   │   │   ├── tree_wInstanceNum_products_e-commerce platform like Amazon.pkl ## T
