GraphData
A collection of graph data used for semi-supervised node classification.
Usage of .npz datasets

    import os.path as osp

    import numpy as np


    def load_npz(filepath):
        """Load a .npz dataset and return its contents as a plain dict."""
        filepath = osp.abspath(osp.expanduser(filepath))

        # Allow the extension to be omitted, e.g. load_npz('cora').
        if not filepath.endswith('.npz'):
            filepath = filepath + '.npz'

        if osp.isfile(filepath):
            with np.load(filepath, allow_pickle=True) as loader:
                loader = dict(loader)
                # Unwrap object and string arrays into native Python objects.
                for k, v in loader.items():
                    if v.dtype.kind in {'O', 'U'}:
                        loader[k] = v.tolist()
                return loader
        else:
            raise ValueError(f"{filepath} doesn't exist.")

For example, load_npz('cora') returns a dict, which may contain the following keys:
- adj_matrix: scipy.sparse.csr_matrix, adjacency matrix. NOTE: the adjacency matrix might not be symmetric.
- node_attr: scipy.sparse.csr_matrix or numpy.ndarray, node attribute matrix.
- node_label: scipy.sparse.csr_matrix or numpy.ndarray, node labels.
- metadata: dict, additional metadata such as text.
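
Because the adjacency matrix is not guaranteed to be symmetric, models that expect an undirected graph may need it symmetrized first. The following is a minimal sketch, not part of this repository: the helper names `is_symmetric` and `to_undirected` are hypothetical, edge weights are assumed to be non-negative, and the toy matrix stands in for `loader['adj_matrix']`.

```python
import numpy as np
import scipy.sparse as sp


def is_symmetric(adj):
    """Return True if the sparse matrix equals its transpose."""
    # (adj != adj.T) is a sparse boolean matrix of the mismatching entries.
    return (adj != adj.T).nnz == 0


def to_undirected(adj):
    """Symmetrize by keeping an edge (i, j) if it exists in either direction."""
    # Element-wise maximum of A and A^T; assumes non-negative edge weights.
    return adj.maximum(adj.T).tocsr()


# Toy directed graph standing in for loader['adj_matrix']: 0 -> 1, 1 -> 2.
adj = sp.csr_matrix(np.array([[0, 1, 0],
                              [0, 0, 1],
                              [0, 0, 0]]))

sym_adj = to_undirected(adj)
print(is_symmetric(adj))      # False
print(is_symmetric(sym_adj))  # True
```

Taking the element-wise maximum rather than the sum `adj + adj.T` avoids doubling the weight of edges that are already stored in both directions.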
Overview of graphs
|name|num_nodes|num_edges|num_attrs|density|is_directed|
|:---:|:---:|:---:|:---:|:---:|:---:|
|karate_club|34|78|0|6.7474%|0|
|polblogs|1,490|19,025|0|0.8569%|1|
|cora|2,708|5,429|1,433|0.0740%|1|
|cora_ml|2,995|8,416|2,879|0.0938%|1|
|acm|3,025|13,128|1,870|0.1435%|0|
|uai|3,067|28,314|4,973|0.3010%|0|
|citeseer|3,312|4,715|3,703|0.0430%|1|
|citeseer_full|4,230|5,358|602|0.0299%|1|
|blogcatalog|5,196|171,743|8,189|0.6361%|0|
|flickr|7,575|239,738|12,047|0.4178%|0|
|amazon_photo|7,650|143,663|745|0.2455%|1|
|amazon_cs|13,752|287,209|767|0.1519%|1|
|dblp|17,716|52,867|1,639|0.0168%|0|
|coauthor_cs|18,333|81,894|6,805|0.0244%|0|
|pubmed|19,717|44,324|500|0.0114%|0|
|cora_full|19,793|65,311|8,710|0.0167%|1|
|coauthor_phy|34,493|247,962|8,415|0.0208%|0|
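
The table columns can be recomputed from an adjacency matrix. This is a sketch under assumptions rather than the script that produced the table: `graph_stats` is a hypothetical helper, an undirected edge is assumed to be stored in both directions and counted once, self-loops are assumed absent, and density is taken as `num_edges / num_nodes**2`, which appears consistent with the rows above (for example, 78 / 34² ≈ 6.7474% for karate_club).

```python
import numpy as np
import scipy.sparse as sp


def graph_stats(adj_matrix):
    """Summarize a graph given its (sparse or dense) adjacency matrix."""
    adj = sp.csr_matrix(adj_matrix)
    num_nodes = adj.shape[0]
    # The graph is treated as directed if A differs from its transpose.
    is_directed = (adj != adj.T).nnz > 0
    # Assumption: an undirected edge is stored twice, as (i, j) and (j, i).
    num_edges = adj.nnz if is_directed else adj.nnz // 2
    return {
        'num_nodes': num_nodes,
        'num_edges': num_edges,
        'density': num_edges / num_nodes ** 2,
        'is_directed': int(is_directed),
    }


# Toy undirected triangle; with real data, pass loader['adj_matrix'] instead.
triangle = np.array([[0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 0]])
stats = graph_stats(triangle)
print(stats['num_nodes'], stats['num_edges'], stats['is_directed'])  # 3 3 0
print(f"{stats['density']:.4%}")  # 33.3333%
```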
Single Graph
Planetoid datasets: CORA, CiteSeer, PubMed and NELL
Citation network and knowledge graph (NELL) datasets from https://github.com/kimiyoung/planetoid
In the citation networks, nodes are documents and edges are citation links. The label rate is the number of labeled nodes used for training divided by the total number of nodes in each dataset.
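
The label rate defined above is a simple ratio. As an illustration only, the sketch below draws a fixed number of training nodes per class and reports the resulting label rate; `sample_train_nodes` and `label_rate` are hypothetical helpers, and the labels are synthetic rather than taken from any Planetoid split.

```python
import numpy as np


def sample_train_nodes(labels, num_per_class, seed=0):
    """Pick `num_per_class` random node indices from every class."""
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        candidates = np.flatnonzero(labels == c)
        train_idx.append(rng.choice(candidates, size=num_per_class, replace=False))
    return np.sort(np.concatenate(train_idx))


def label_rate(train_idx, num_nodes):
    """Number of labeled training nodes divided by the total number of nodes."""
    return len(train_idx) / num_nodes


# Synthetic labels: 100 nodes spread evenly over 5 classes.
labels = np.repeat(np.arange(5), 20)
train_idx = sample_train_nodes(labels, num_per_class=4)
print(len(train_idx))                      # 20
print(label_rate(train_idx, len(labels)))  # 0.2
```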
@article{SenNBGGE08,
author = {Prithviraj Sen and
Galileo Namata and
Mustafa Bilgic and
Lise Getoor and
Brian Gallagher and
Tina Eliassi{-}Rad},
title = {Collective Classification in Network Data},
journal = {{AI} Mag.},
volume = {29},
number = {3},
pages = {93--106},
year = {2008}
}
Amazon Computers and Amazon Photo
Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph, where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.
@inproceedings{McAuleyTSH15,
author = {Julian J. McAuley and
Christopher Targett and
Qinfeng Shi and
Anton van den Hengel},
editor = {Ricardo Baeza{-}Yates and
Mounia Lalmas and
Alistair Moffat and
Berthier A. Ribeiro{-}Neto},
title = {Image-Based Recommendations on Styles and Substitutes},
booktitle = {Proceedings of the 38th International {ACM} {SIGIR} Conference on
Research and Development in Information Retrieval, Santiago, Chile,
August 9-13, 2015},
pages = {43--52},
publisher = {{ACM}},
year = {2015}
}
Coauthor CS and Coauthor Physics
Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers, and class labels indicate the most active fields of study for each author.
https://kddcup2016.azurewebsites.net/
The above datasets are collected from
https://github.com/shchur/gnn-benchmark
@article{shchur2018pitfalls,
title={Pitfalls of Graph Neural Network Evaluation},
author={Shchur, Oleksandr and Mumme, Maximilian and Bojchevski, Aleksandar and G{\"u}nnemann, Stephan},
journal={Relational Representation Learning Workshop, NeurIPS 2018},
year={2018}
}
DBLP
@inproceedings{PanWZZW16,
author = {Shirui Pan and
Jia Wu and
Xingquan Zhu and
Chengqi Zhang and
Yang Wang},
editor = {Subbarao Kambhampati},
title = {Tri-Party Deep Network Representation},
booktitle = {Proceedings of the Twenty-Fifth International Joint Conference on
Artificial Intelligence, {IJCAI} 2016, New York, NY, USA, 9-15 July
2016},
pages = {1895--1901},
publisher = {{IJCAI/AAAI} Press},
year = {2016}
}
CiteSeer_Full
@inproceedings{GilesBL98,
author = {C. Lee Giles and
Kurt D. Bollacker and
Steve Lawrence},
title = {CiteSeer: An Automatic Citation Indexing System},
booktitle = {Proceedings of the 3rd {ACM} International Conference on Digital Libraries,
June 23-26, 1998, Pittsburgh, PA, {USA}},
pages = {89--98},
publisher = {{ACM}},
year = {1998}
}
CORA_Full and CORA-ML
CORA_Full: a citation network dataset, an extended version of CORA covering the entire network extracted from the original data.
CORA-ML: the subset of machine learning papers extracted from the same original data.
@article{McCallumNRS00,
author = {Andrew McCallum and
Kamal Nigam and
Jason Rennie and
Kristie Seymore},
title = {Automating the Construction of Internet Portals with Machine Learning},
journal = {Inf. Retr.},
volume = {3},
number = {2},
pages = {127--163},
year = {2000}
}
The above datasets are collected from
https://github.com/abojchevski/graph2gauss
@inproceedings{bojchevski2018deep,
title={Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking},
author={Aleksandar Bojchevski and Stephan Günnemann},
booktitle={International Conference on Learning Representations},
year={2018},
url={https://openreview.net/forum?id=r1ZdKJ-0W},
}
Reddit
233K nodes, 11.6M edges, 602 node features
Source-SNAP
Link: http://snap.stanford.edu/graphsage/
@inproceedings{hamilton2017inductive,
author = {Hamilton, William L. and Ying, Rex and Leskovec, Jure},
title = {Inductive Representation Learning on Large Graphs},
booktitle = {NIPS},
year = {2017}
}
Source-DGL
Link: https://data.dgl.ai/dataset/reddit.zip
Source-TUM
Link: https://ndownloader.figshare.com/files/23742119
NELL
https://github.com/kimiyoung/planetoid
@inproceedings{CarlsonBKSHM10,
author = {Andrew Carlson and
Justin Betteridge and
Bryan Kisiel and
Burr Settles and
Estevam R. Hruschka Jr. and
Tom M. Mitchell},
editor = {Maria Fox and
David Poole},
title = {Toward an Architecture for Never-Ending Language Learning},
booktitle = {Proceedings of the Twenty-Fourth {AAAI} Conference on Artificial Intelligence,
{AAAI} 2010, Atlanta, Georgia, USA, July 11-15, 2010},
publisher = {{AAAI} Press},
year = {2010}
}
@inproceedings{DBLP:conf/icml/YangCS16,
author = {Zhilin Yang and
William W. Cohen and
Ruslan Salakhutdinov},
editor = {Maria{-}Florina Balcan and
Kilian Q. Weinberger},
title = {Revisiting Semi-Supervised Learning with Graph Embeddings},
booktitle = {Proceedings of the 33rd International Conference on Machine Learning,
{ICML} 2016, New York City, NY, USA, June 19-24, 2016},
series = {{JMLR} Workshop and Conference Proceedings},
volume = {48},
pages = {40--48},
publisher = {JMLR.org},
year = {2016}
}
Flickr and BlogCatalog
BlogCatalog: a dataset of a blog community social network, which contains 5,196 users as nodes, 171,743 edges indicating user interactions, and 8,189 attribute categories denoting the keywords of their blogs. Users can register their blogs under six predefined classes, which are used as labels.
Flickr: a benchmark attributed social network dataset containing 7,575 nodes. Each node is a Flickr user and each attribute category is a tag related to the photos shared by users. There are 239,738 undirected edges in this network, which indicate follower relationships among users. The nine groups that users have joined are used as target labels.
@inproceedings{LiHTL15,
author = {Jundong Li and
Xia Hu and
Jiliang Tang and
Huan Liu},
editor = {James Bailey and
Alistair Moffat and
Charu C. Aggarwal and
Maarten de Rijke