TSGFM
NIPS 24: Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights
Code and Datasets for Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights
Note: we found a parameter error in our previous Prodigy evaluation; please see our updated results and new commands. (For evaluation, you must set the task to classification and set eval_only to True; otherwise the results will suffer from leakage.)
This is the code repo accompanying our paper "Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights."
We implement the following graph foundation model building blocks.
- Graph prompt models (OneForAll, Prodigy)
- GraphLLM (LLaGA)
- Graph Self-supervised learning (GraphMAE, BGRL, DGI, and so on)
- Link prediction-specific models, including BUDDY and SEAL
We support the following two scenarios.
- Co-training: Pre-training on a set of datasets and testing on the same ones
- Pre-training: Pre-training on a set of datasets and testing on unseen ones
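The difference between the two scenarios comes down to whether the test datasets overlap with the training datasets. The following is a minimal sketch of that distinction, not code from this repo; the function name `scenario` and the "mixed" label are hypothetical, and it assumes co-training also covers testing on a subset of the training datasets.

```python
def scenario(pretrain_datasets, test_datasets):
    """Classify an experiment by the overlap between training and test datasets."""
    train, test = set(pretrain_datasets), set(test_datasets)
    if test <= train:
        return "co-training"   # every test dataset was seen during training
    if train.isdisjoint(test):
        return "pre-training"  # every test dataset is unseen
    return "mixed"             # partial overlap: neither scenario listed above


if __name__ == "__main__":
    print(scenario(["cora", "citeseer", "pubmed"], ["cora", "citeseer", "pubmed"]))
    print(scenario(["arxiv"], ["bookhis"]))
```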
Install
pip install -r requirements.txt
Datasets
We follow OneForAll's way of managing the datasets. We support the following datasets.

| Name | #Graphs | #Nodes | #Edges | Domains | Tasks | #classes |
|----------------|---------|--------|----------|--------------|------------|----------|
| Cora | 1 | 2708 | 10556 | CS Citation | Node, Link | 7 |
| CiteSeer | 1 | 3186 | 8450 | CS Citation | Node, Link | 6 |
| Arxiv | 1 | 169343 | 2315598 | CS Citation | Node, Link | 40 |
| Arxiv23 | 1 | 46198 | 77726 | CS Citation | Node, Link | 40 |
| History | 1 | 41551 | 503180 | E-commerce | Node, Link | 12 |
| Child | 1 | 76875 | 2325044 | E-commerce | Node, Link | 24 |
| Computers | 1 | 87229 | 1256548 | E-commerce | Node, Link | 10 |
| Photo | 1 | 48362 | 873782 | E-commerce | Node, Link | 12 |
| Sportsfit | 1 | 173055 | 3020134 | E-commerce | Node, Link | 13 |
| Products | 1 | 316513 | 19337722 | E-commerce | Node, Link | 39 |
| Amazon Ratings | 1 | 24492 | 186100 | E-commerce | Node, Link | 5 |
| Pubmed | 1 | 19717 | 88648 | Bio Citation | Node, Link | 3 |
| WikiCS | 1 | 11701 | 431726 | Knowledge | Node, Link | 10 |
| Tolokers | 1 | 11758 | 1038000 | Anomaly | Node, Link | 2 |
| DBLP | 1 | 14376 | 431326 | CS Citation | Node, Link | 4 |
| CheMBL | 365065 | 26 | 112 | Biology | Graph | 1048 |
| PCBA | 437092 | 26 | 56 | Biology | Graph | 128 |
| HIV | 41127 | 26 | 55 | Biology | Graph | 2 |
| Tox21 | 7831 | 19 | 39 | Biology | Graph | 12 |
| Bace | 1513 | 34 | 74 | Biology | Graph | 2 |
| Bbbp | 2039 | 24 | 52 | Biology | Graph | 2 |
| Muv | 93087 | 24 | 53 | Biology | Graph | 17 |
| Toxcast | 8575 | 19 | 39 | Biology | Graph | 588 |
The processed versions of the files can be obtained from the following link.
Structures of the processed files:
- cache_data_{llm encoder name} (for example, minilm)
  - dataset_name
    - processed
      - data.pt
      - geometric_data_processed.pt
      - pre_filter.pt
      - pre_transform.pt
      - texts.pkl
geometric_data_processed.pt is the core storage object, and node_text_feat stores the processed node features.
data.pt contains the index file used to query the attributes stored in geometric_data_processed.pt.
A comprehensive introduction to each column can be found in OneForAll's repo.
To prepare the data, you can generate all raw files yourself (run OneForAll for 1 epoch with all datasets included). We recommend using the preprocessed files directly and unzipping them into the main directory.
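After unzipping, it can help to confirm that the expected files are in place before launching a run. The following is a minimal sketch, not part of this repo: the helper `missing_processed_files` is hypothetical, and it assumes the layout `<root>/cache_data_<encoder>/<dataset>/processed/<file>` with all five files stored under `processed`.

```python
from pathlib import Path

# File names listed in this README for each processed dataset.
EXPECTED_FILES = (
    "data.pt",
    "geometric_data_processed.pt",
    "pre_filter.pt",
    "pre_transform.pt",
    "texts.pkl",
)


def missing_processed_files(root, encoder, dataset):
    """Return the expected processed files that are absent for one dataset.

    Assumes <root>/cache_data_<encoder>/<dataset>/processed/<file>.
    """
    processed_dir = Path(root) / f"cache_data_{encoder}" / dataset / "processed"
    return [name for name in EXPECTED_FILES if not (processed_dir / name).is_file()]


if __name__ == "__main__":
    # "minilm" is the example encoder name; "cora" is an illustrative dataset name.
    missing = missing_processed_files(".", "minilm", "cora")
    print("missing:", missing if missing else "none")
```

An empty list means every expected file was found; any names returned point to an incomplete download or a different directory layout than assumed here.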
Code Structures
Directories
- configs: Directory for setting the task/dataset for OneForAll. Add new datasets here
- data: data utility files/generation files using the OneForAll data interface
- gp: graph utility files from the original OneForAll repo
- graphllm: utility files for LLaGA
- graphmae: utility files for graphmae
- link: utility files for BUDDY
- models: model implementations
- prodigy: prodigy files
- subgcon: utility files/data files for self-supervised learning
Main entries
- eval_pretrain_*, eval_res: main files for LLaGA
- fulllink.py: main files for GCN link prediction
- linkpred.py: main files for BUDDY/SEAL
- run_cdm: main files for OFA
- sslmain: main files for SSL
- simplerlr: main files for simpleSBERT
Reproduce the results
OneForAll
- Co-training setting: just set up a config file similar to demo/e2e_all_config.yaml
- Pre-training setting: when loading the pre-trained model, use gnn_load_path.
LLaGA
- Use llm_train.sh to generate checkpoints
- Use llm_eval.sh or llm_eval_link.sh to generate the answer files for node/link-level tasks. For example, bash llm_eval.sh citeseer nc ./checkpoints/llaga-mistral-7b-hf-sbert-4-hop-token-linear-cora.3-citeseer.4-pubmed.3-nc-lp-projector/ citationcross
- Use llmres.sh to calculate the results
GCN-link
python3 fulllink.py --pre_train_datasets "cora-link" "citeseer-link" "pubmed-link" "arxiv-link" "arxiv23-link" "bookhis-link" "bookchild-link" "sportsfit-link" "products-link" "elecomp-link" "elephoto-link" --encoder gcn --num_layers 3 --num_hidden 128 --batch_size 512
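When trying different dataset combinations, typing out the `-link` names by hand is error-prone. The sketch below assembles the same command from plain dataset names. It is a convenience wrapper, not part of this repo: the function `fulllink_command` is hypothetical, it only uses flags that appear in the command above, and it assumes that link-level datasets are always named by appending `-link`.

```python
import shlex


def fulllink_command(datasets, encoder="gcn", num_layers=3, num_hidden=128, batch_size=512):
    """Build the fulllink.py command line for the given dataset names."""
    args = ["python3", "fulllink.py", "--pre_train_datasets"]
    # Assumption: the link-level variant of a dataset is "<name>-link".
    args += [f"{name}-link" for name in datasets]
    args += [
        "--encoder", encoder,
        "--num_layers", str(num_layers),
        "--num_hidden", str(num_hidden),
        "--batch_size", str(batch_size),
    ]
    # shlex.join quotes arguments only where the shell requires it.
    return shlex.join(args)


if __name__ == "__main__":
    print(fulllink_command(["cora", "citeseer", "pubmed"]))
```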
BUDDY/SEAL
python3 linkpred.py --pre_train_datasets cora citeseer arxiv arxiv23 bookhis bookchild elecomp elephoto sportsfit products pubmed wikics --model BUDDY --cache_subgraph_features --max_hash_hops 3 --epochs 50
python3 linkpred.py --pre_train_datasets cora --model SEALGCN --hidden_channels 256 --num_hops 3
SSL
See the paper for the best hyperparameters. (Passing --cpuinf runs full-batch inference on the CPU, which is faster in our environment.)
python3 sslmain.py --pre_train_datasets arxiv sportsfit products --method graphmae --num_heads 4 --num_out_heads 1 --num_layers 3 --num_hidden 1024 --residual --in_drop 0.5 --attn_drop 0.5 --norm 'batchnorm' --lr 0.01 --weight_decay 1e-5 --activation 'prelu' --mask_rate 0.75 --drop_edge_rate 0 --replace_rate 0.2 --scheduler --lrtype 'cosine' --save_model --max_epoch 5 --subgraph_size 1024 --warmup --cpuinf
Prodigy
Pre-train on Arxiv
python experiments/run_single_experiment.py --dataset arxiv --root <root> --original_features False -ds_cap 24000 -val_cap 100 -test_cap 100 --emb_dim 256 --epochs 1 -ckpt_step 1000 -layers S2,U,M -lr 3e-4 -way 30 -shot 3 -qry 4 -eval_step 5000 -task cls_nm_sb -bs 1 -aug ND0.5,NZ0.5 -aug_test True -attr 1000 --device 0 --prefix MAG_PT_PRODIGY
Test on History
python3 experiments/run_single_experiment.py --dataset bookhis --original_features True -ds_cap 300 -val_cap 300 -test_cap 300 --emb_dim 256 --epochs 1 -ckpt_step 1000 -layers S2,U,M -lr 3e-4 -way 12 -shot 3 -qry 4 -eval_step 50 -task classification -bs 1 -aug ND0.5,NZ0.5 -aug_test True -attr 1000 --device 0 --prefix test --root <root> -pretrained <ckpt> --eval_only True
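Because a wrong task or a missing eval_only setting leads to leakage during evaluation, it may be worth checking a command before running it. The sketch below is a hypothetical pre-flight check, not part of Prodigy or this repo; `prodigy_eval_problems` is an illustrative name, and the check only inspects the `-task` and `--eval_only` flags as they are written in the test command above.

```python
import shlex


def prodigy_eval_problems(command):
    """Return a list of problems found in a Prodigy evaluation command."""
    tokens = shlex.split(command)

    def value_of(flag):
        # The value is the token right after the flag, if the flag is present.
        return tokens[tokens.index(flag) + 1] if flag in tokens[:-1] else None

    problems = []
    if value_of("-task") != "classification":
        problems.append("-task must be classification")
    if value_of("--eval_only") != "True":
        problems.append("--eval_only must be True")
    return problems


if __name__ == "__main__":
    cmd = (
        "python3 experiments/run_single_experiment.py --dataset bookhis "
        "-task classification --eval_only True"
    )
    print(prodigy_eval_problems(cmd))
```

An empty list means both settings are present; the pre-training command, which uses a different task and no eval_only flag, would be reported with both problems.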
Acknowledgements
This code repo is heavily based on OneForAll(✨), BUDDY, LLaGA, GraphMAE, Prodigy, and CSTAG. Thanks to the authors for sharing their code!
