<center><img src="images/graphgpt.png" style="width: 5%"> GraphGPT: Graph Instruction Tuning for Large Language Models</center>

<div align='center'> <a href='https://tjb-tech.github.io/'>Jiabin Tang</a>, <a href='http://yuh-yang.github.io'>Yuhao Yang</a>, <a href='#'>Wei Wei</a>, <a href='#'>Lei Shi</a>, <a href='#'>Suqi Cheng</a>, <a href='https://www.yindawei.com/'>Dawei Yin</a> and <a href='https://sites.google.com/view/chaoh/home'>Chao Huang*</a>. (*Correspondence)

<strong><a href='https://sites.google.com/view/chaoh/home'>Data Intelligence Lab</a>@<a href='https://www.hku.hk/'>University of Hong Kong</a></strong>, Baidu Inc.

<img src='images/GraphGPT_notext.jpeg' />

This repository hosts the code, data, and model weights of GraphGPT (SIGIR'24 full paper track).

</div>

🎉 News

[x] [2024.03.26]🎯🎯📢📢Our GraphGPT has been accepted to SIGIR'24 in the full paper track (20.1% acceptance rate)! Congrats to the whole GraphGPT team! 🎉🎉🎉
[x] [2023.12.26]🎯🎯📢📢We have updated the efficient and lightweight training code. With the updated script, it is possible to perform two-stage instruction tuning on two Nvidia 3090 GPUs (24 GB each). The specific deployment and fine-tuning methods are as follows: 🎄🎄

0. Environment Update:

The lightweight training requires PyTorch 2.1+, so the corresponding libraries need to be updated:

# if you have set up the env for GraphGPT earlier
pip uninstall torch
pip uninstall torchvision
pip uninstall torchaudio
# CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# update pyg for the PyTorch 2.1+
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html

# install lightning
pip install lightning
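After reinstalling, it can help to confirm that the installed PyTorch actually meets the 2.1+ requirement. The following is a minimal sketch, not part of the GraphGPT code base; the helper names `parse_version` and `meets_minimum` are illustrative assumptions.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # A missing patch component (as in '2.1') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(installed, minimum="2.1"):
    """True when `installed` is at least `minimum` (numeric comparison)."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        print("torch: not installed")
    else:
        status = "ok" if meets_minimum(torch_version) else "too old, need 2.1+"
        print(f"torch {torch_version}: {status}")
```

The check compares numeric components rather than raw strings, so a local build suffix such as `+cu118` does not affect the result.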

1. Update the Graph Data

Due to compatibility issues, if you are using the previously released graph data, we recommend downloading and updating it according to the provided link: updated graph data.

2. Run the Scripts

You can run the scripts as follows:

Stage-1:

cd path/to/GraphGPT
sh ./scripts/tune_script/graphgpt_stage1.sh

Stage-2:

cd path/to/GraphGPT
sh ./scripts/tune_script/graphgpt_stage2.sh

[x] [2023.12.14]📢📢Thank you for the support from the research community. We have compiled a list of frequently asked questions (FAQs) regarding running and environment issues in the following FAQ list. Please take a look. Wishing everyone an early Merry Christmas!🎄🎄

If you see the error that 'pretrain_graph_model_path' is not defined, please refer to issue #7.
If flash attention does not work for you, comment out the call to replace_llama_attn_with_flash_attn() on line 8 of https://github.com/HKUDS/GraphGPT/blob/main/graphgpt/train/train_mem.py. For more details, please refer to issue #17.
If you hit a package conflict or environment setup error (especially with fastchat), please refer to issue #9 and issue #11.
If you get the error No module named 'graphgpt', please refer to issue #56.
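As an alternative to commenting the call out by hand, the flash attention patch could be applied only when it imports cleanly. This is a sketch under assumptions, not the repository's code: the import path below is assumed, so check `train_mem.py` in your checkout, and `enable_flash_attention_if_available` is a hypothetical helper name.

```python
def enable_flash_attention_if_available():
    """Apply the flash attention patch when possible; return True if applied."""
    try:
        # Assumed module path; verify it against train_mem.py in your checkout.
        from graphgpt.train.llama_flash_attn_monkey_patch import (
            replace_llama_attn_with_flash_attn,
        )
    except ImportError as err:
        # Covers both a missing graphgpt package and a missing flash-attn install.
        print(f"flash attention unavailable, using default attention: {err}")
        return False
    replace_llama_attn_with_flash_attn()
    return True


if __name__ == "__main__":
    enable_flash_attention_if_available()
```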

</details>

🎯🎯📢📢 We have made significant updates to the models and data used in our GraphGPT on 🤗 Huggingface. We highly recommend referring to the table below for further details:

| 🤗 Huggingface Address | 🎯 Description |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| huggingface.co/Jiabin99/GraphGPT-7B-mix-all | Checkpoint of our GraphGPT, based on Vicuna-7B-v1.5 and tuned on the instruction data Arxiv-PubMed-mix-NC-LP. |
| huggingface.co/Jiabin99/Arxiv-PubMed-GraphCLIP-GT | Checkpoint of the pre-trained graph transformer (GT), trained on Arxiv and PubMed using text-graph grounding. |
| huggingface.co/datasets/Jiabin99/Arxiv-PubMed-mix-NC-LP | Mixed instruction dataset covering node classification (NC) and link prediction (LP) on Arxiv and PubMed. |
| huggingface.co/datasets/Jiabin99/GraphGPT-eval-instruction | All instruction datasets used in our evaluation. |
| huggingface.co/datasets/Jiabin99/All_pyg_graph_data | All of the graph data we use, merged into one dataset. |
| huggingface.co/datasets/Jiabin99/graph-matching | Instruction data used in the graph-matching stage. |
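The addresses above can be fetched programmatically. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the `to_repo` and `download_all` helpers and the `./checkpoints` target directory are illustrative choices, not paths the training scripts are known to expect.

```python
HUB_ADDRESSES = [
    "huggingface.co/Jiabin99/GraphGPT-7B-mix-all",
    "huggingface.co/Jiabin99/Arxiv-PubMed-GraphCLIP-GT",
    "huggingface.co/datasets/Jiabin99/Arxiv-PubMed-mix-NC-LP",
    "huggingface.co/datasets/Jiabin99/GraphGPT-eval-instruction",
    "huggingface.co/datasets/Jiabin99/All_pyg_graph_data",
    "huggingface.co/datasets/Jiabin99/graph-matching",
]


def to_repo(address):
    """Split a hub address into (repo_id, repo_type)."""
    path = address.split("huggingface.co/", 1)[-1]
    if path.startswith("datasets/"):
        return path[len("datasets/"):], "dataset"
    return path, "model"


def download_all(target_dir="./checkpoints"):
    """Download every listed repository into its own subdirectory."""
    # Imported lazily so the address parsing works without the package.
    from huggingface_hub import snapshot_download

    for address in HUB_ADDRESSES:
        repo_id, repo_type = to_repo(address)
        local_dir = f"{target_dir}/{repo_id.split('/')[-1]}"
        snapshot_download(repo_id=repo_id, repo_type=repo_type, local_dir=local_dir)


if __name__ == "__main__":
    for address in HUB_ADDRESSES:
        print(to_repo(address))
```

Calling `download_all()` pulls all six repositories; the model checkpoint is large, so downloading only the entries you need may be preferable.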

[x] [2023.10.28]📢📢For the Chinese version of the explanation, please refer to this article.
[x] [2023.10.26]🔥🔥Release our utilized Instruction data.
[x] [2023.10.26]🔥🔥Release checkpoints of our GraphGPT and pre-trained graph encoder.
[x] [2023.10.23] 🚀🚀 The full paper of our GraphGPT is available at https://arxiv.org/abs/2310.13023. Please check it out and give us more feedback!
[x] [2023.10.15] 🚀🚀 Release the code of GraphGPT.

👉 TODO

[ ] Exploring the potential of our GraphGPT for more graph learning tasks.
[ ] ...

Brief Introduction

We present the GraphGPT framework, which aligns LLMs with graph structural knowledge through a graph instruction tuning paradigm.

Structural Information Encoding with Text-Graph Grounding. To enhance the understanding of graph structural information by large language models, our framework emphasizes aligning the encoding of graph structures with the natural language space. This alignment aims to enable language models to effectively comprehend and interpret the structural elements of the graph, leveraging their inherent language understanding capabilities. To achieve this objective, we introduce a text-graph grounding paradigm that generates prompts designed to preserve the graph’s structural context for language models. This paradigm acts as a bridge, connecting the semantic understanding of textual information with the inherent structural relationships found within the graph.
Dual-Stage Graph Instruction Tuning. The dual-stage graph instruction tuning paradigm proposed in this work builds upon the concept of instruction tuning, which has been recently introduced to enhance the adaptability of language models for specific domains. In this paradigm, we aim to align the language capacity of the model with the nuances of graph learning tasks, enabling the language model to generate more accurate and contextually appropriate responses for graph-structured data.
Chain-of-Thought (CoT) Distillation. When faced with diverse graph data, language models may encounter new or unfamiliar patterns and structures. This distribution shift can pose challenges in generating accurate and coherent responses, especially when the number of node classes varies across different types of graph data. To address this challenge and boost accuracy in the presence of distribution shift, it is essential to equip our GraphGPT with step-by-step reasoning abilities. In this regard, we propose utilizing the Chain-of-Thought (CoT) technique [47], which explicitly models the flow of thoughts and reasoning steps. By incorporating CoT, our language model improves the coherence and consistency of generated text. It enables the model to follow a logical progression of ideas, enhancing its ability to understand and reason about the given graph data.
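To make the idea of a graph instruction more concrete, here is a rough sketch of how a node classification sample could pair a local subgraph with a text prompt. It is an illustration only: the `<graph>` placeholder, the field names, and the prompt wording are assumptions and do not reproduce the released instruction data format.

```python
from collections import deque

GRAPH_TOKEN = "<graph>"  # hypothetical placeholder for the encoded subgraph


def k_hop_nodes(edges, center, hops=1):
    """Collect nodes within `hops` steps of `center` in an undirected edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    depth = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if depth[node] == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return sorted(depth)


def build_instruction(edges, center, title, classes, hops=1):
    """Build one hypothetical instruction record for node classification."""
    nodes = k_hop_nodes(edges, center, hops)
    prompt = (
        f"Given a citation graph: {GRAPH_TOKEN}, where node {center} is the "
        f"target paper titled '{title}'. Which of the following categories "
        f"does it belong to: {', '.join(classes)}? Reason step by step."
    )
    return {"graph": {"center": center, "nodes": nodes}, "prompt": prompt}


if __name__ == "__main__":
    toy_edges = [(0, 1), (1, 2), (2, 3), (0, 4)]
    sample = build_instruction(toy_edges, 0, "A toy paper", ["cs.LG", "cs.IR"])
    print(sample["graph"]["nodes"])
    print(sample["prompt"])
```

In this sketch the subgraph travels alongside the prompt as structured data, and the placeholder marks where graph encodings would be spliced into the language model input. The closing "Reason step by step" request is a stand-in for the step-by-step reasoning that CoT distillation targets.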

For more technical details, kindly refer to the paper and the project website of our Graph.

Getting Started

GraphGPT

Install / Use

README