Datasets
A repository of pretty cool datasets that I collected for network science and machine learning research.

Datasets collected for network science, graph mining, deep learning and general machine learning research.
<p align="center"> <img width="600" src="/images/field.png"> </p>

Contents
- Twitch Gamers
- LastFM Asia Social Network
- Deezer Europe Social Network
- GitHub StarGazer Graphs
- Twitch Ego Nets
- Reddit Thread Graphs
- Deezer Ego Nets
- GitHub Social Network
- Deezer Social Networks
- Facebook Page-Page Networks
- Wikipedia Article Networks
- Twitch Social Networks
- Facebook Large Page-Page Network
Twitch Gamers
<p align="center"> <img width="200" src="/images/twitch.png"> </p>

Description
<p align="justify"> A social network of Twitch users collected from the public API in Spring 2018. Nodes are Twitch users and edges are mutual follower relationships between them. The graph forms a single connected component and has no missing attributes. The machine learning tasks related to the graph are count data regression and node classification. There are 6 specific tasks:</p>

- Explicit content streamer identification.
- Broadcaster language prediction.
- User lifetime estimation.
- Churn prediction.
- Affiliate status identification.
- View count estimation.
Links
Properties
- Directed: No.
- Node features: No.
- Edge features: No.
- Node labels: Yes.
- Temporal: No.

| | Twitch Gamers |
|---|---|
| Nodes | 168,114 |
| Edges | 6,797,557 |
| Density | 0.0005 |
| Transitivity | 0.0184 |

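The density figure can be reproduced from the node and edge counts. The following is a minimal sketch, assuming the standard definition for an undirected simple graph, 2E / (N(N - 1)); the function name is illustrative and not part of this repository.

```python
def undirected_density(num_nodes: int, num_edges: int) -> float:
    """Density of an undirected simple graph: 2E / (N * (N - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# Node and edge counts taken from the Twitch Gamers table above.
twitch_density = undirected_density(168_114, 6_797_557)
print(round(twitch_density, 4))  # 0.0005
```

The same function applies to the other node-level datasets in this document, since all of them are undirected.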
Possible tasks
- Binary node classification
- Multi-class node classification
- Count data regression
- Link prediction
- Community detection
- Community detection with ground truth
- Network visualization
Citing
@misc{rozemberczki2021twitch,
title = {Twitch Gamers: a Dataset for Evaluating Proximity Preserving and Structural Role-based Node Embeddings},
author = {Benedek Rozemberczki and Rik Sarkar},
year = {2021},
eprint = {2101.03091},
archivePrefix = {arXiv},
primaryClass = {cs.SI}
}
LastFM Asia Social Network
<p align="center"> <img width="200" src="/images/lastfm.png"> </p>

Description
<p align="justify"> A social network of LastFM users which was collected from the public API in March 2020. Nodes are LastFM users from Asian countries and edges are mutual follower relationships between them. The vertex features are extracted based on the artists liked by the users. The task related to the graph is multinomial node classification - one has to predict the location of users. This target feature was derived from the country field for each user. </p>

Links
Properties
- Directed: No.
- Node features: Yes.
- Edge features: No.
- Node labels: Yes. Multinomial.
- Temporal: No.

| | LastFM |
|---|---|
| Nodes | 7,624 |
| Edges | 27,806 |
| Density | 0.001 |
| Transitivity | 0.179 |

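Transitivity is the fraction of connected triples (paths of length two) that are closed into triangles. The sketch below computes it in plain Python on a small hypothetical graph; the adjacency-set representation and the toy data are assumptions for illustration, not the format of the dataset files.

```python
from itertools import combinations


def transitivity(adjacency: dict[int, set[int]]) -> float:
    """Global clustering coefficient: closed triples / all connected triples."""
    closed = 0   # triples whose two end nodes are also linked
    triples = 0  # every pair of neighbours around a centre node
    for centre, neighbours in adjacency.items():
        for u, v in combinations(sorted(neighbours), 2):
            triples += 1
            if v in adjacency[u]:
                closed += 1
    return closed / triples if triples else 0.0


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3 on node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(transitivity(toy))  # 0.6
```

Each triangle is counted once per centre node, so the numerator equals three times the triangle count, which matches the usual definition.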
Possible tasks
- Multi-class node classification
- Link prediction
- Community detection
- Network visualization
Citing
@inproceedings{feather,
title={{Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models}},
author={Benedek Rozemberczki and Rik Sarkar},
year={2020},
pages={1325–1334},
booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
organization={ACM},
}
Deezer Europe Social Network
<p align="center"> <img width="200" src="/images/deezer.jpg"> </p>

Description
<p align="justify"> A social network of Deezer users which was collected from the public API in March 2020. Nodes are Deezer users from European countries and edges are mutual follower relationships between them. The vertex features are extracted based on the artists liked by the users. The task related to the graph is binary node classification - one has to predict the gender of users. This target feature was derived from the name field for each user. </p>

Links
Properties
- Directed: No.
- Node features: Yes.
- Edge features: No.
- Node labels: Yes. Binary.
- Temporal: No.

| | Deezer |
|---|---|
| Nodes | 28,281 |
| Edges | 92,752 |
| Density | 0.0002 |
| Transitivity | 0.0959 |

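The graph is undirected because an edge exists only when two users follow each other. How the original data was processed is not described here, so the following is only a sketch of that reduction, assuming hypothetical `(follower, followed)` records as input.

```python
def mutual_edges(follows: list[tuple[int, int]]) -> set[tuple[int, int]]:
    """Keep an undirected edge only when both users follow each other."""
    directed = set(follows)
    return {
        (min(u, v), max(u, v))
        for u, v in directed
        if u != v and (v, u) in directed
    }


# Hypothetical follow records: (follower, followed).
follows = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2), (3, 0)]
print(sorted(mutual_edges(follows)))  # [(0, 1), (2, 3)]
```

Storing each edge with the smaller node id first removes the duplicate that would otherwise appear for the two directions.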
Possible tasks
- Binary node classification
- Link prediction
- Community detection
- Network visualization
Citing
@inproceedings{feather,
title={{Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models}},
author={Benedek Rozemberczki and Rik Sarkar},
year={2020},
pages={1325–1334},
booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
organization={ACM},
}
GitHub StarGazer Graphs
<p align="center"> <img width="200" src="/images/Octocat.png"> </p>

Description
<p align="justify"> The social networks of developers who starred popular machine learning and web development repositories (with at least 10 stars) until August 2019. Nodes are users and links are follower relationships. The task is to decide whether a social network belongs to a web or machine learning repository. We only included the largest component (with at least 10 users) of each graph.</p>

Link
Properties
- Number of graphs: 12,725
- Directed: No.
- Node features: No.
- Edge features: No.
- Graph labels: Yes. Binary-labeled.
- Temporal: No.

| | Min | Max |
|---|---|---|
| Nodes | 10 | 957 |
| Density | 0.003 | 0.561 |
| Diameter | 2 | 18 |

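The minimum and maximum values in the table are per-graph statistics taken over the whole collection. The sketch below shows one way to compute them with a breadth-first search. It assumes a hypothetical in-memory layout (graph id mapped to an edge list) and that every graph is connected, simple and non-empty, which is consistent with keeping only the largest component.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of [u, v] pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def eccentricity(adjacency, source):
    """Largest shortest-path distance from source, found by BFS."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return max(distance.values())


def graph_stats(edges):
    """Node count, density and diameter of one connected, non-empty graph."""
    adjacency = build_adjacency(edges)
    n = len(adjacency)
    m = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    return {
        "nodes": n,
        "density": 2 * m / (n * (n - 1)),
        "diameter": max(eccentricity(adjacency, node) for node in adjacency),
    }


# Hypothetical toy collection: graph id -> edge list.
graphs = {
    "path": [[0, 1], [1, 2], [2, 3]],
    "triangle": [[0, 1], [0, 2], [1, 2]],
}
stats = {graph_id: graph_stats(edges) for graph_id, edges in graphs.items()}
diameters = [s["diameter"] for s in stats.values()]
print(min(diameters), max(diameters))  # 1 3
```

Running one BFS per node is quadratic in the worst case, which is acceptable for graphs with at most a few hundred nodes.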
Possible Tasks
- Graph classification
Citing
@inproceedings{karateclub,
title = {{Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}},
author = {Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
year = {2020},
pages = {3125–3132},
booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
organization = {ACM},
}
Twitch Ego Nets
<p align="center"> <img width="200" src="/images/twitch.png"> </p>

Description
<p align="justify"> The ego-nets of Twitch users who participated in the partnership program in April 2018. Nodes are users and links are friendships. The binary classification task is to predict from the ego-net whether the ego user plays a single game or multiple games. Users who play a single game usually have a denser ego-net.</p>

Link
Properties
- Number of graphs: 127,094
- Directed: No.
- Node features: No.
- Edge features: No.
- Graph labels: Yes. Binary-labeled.
- Temporal: No.

| | Min | Max |
|---|---|---|
| Nodes | 14 | 52 |
| Density | 0.038 | 0.967 |
| Diameter | 1 | 2 |

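An ego-net is the subgraph induced by a user and that user's direct neighbours. Every member is adjacent to the ego, so any two members are at most two hops apart, which is why the diameter never exceeds 2. The sketch below extracts an ego-net from a hypothetical toy friendship graph and computes its density, the quantity the description links to single-game users. The data and function names are illustrative only.

```python
def ego_net(adjacency, ego):
    """Induced subgraph on the ego and its direct neighbours."""
    members = {ego} | adjacency[ego]
    return {node: adjacency[node] & members for node in members}


def density(adjacency):
    """Density of an undirected simple graph given as adjacency sets."""
    n = len(adjacency)
    m = sum(len(neighbours) for neighbours in adjacency.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


# Hypothetical toy friendship graph; user 0 is the ego.
friends = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0, 4},
    4: {3},
}
net = ego_net(friends, 0)
print(sorted(net))             # [0, 1, 2, 3]
print(round(density(net), 3))  # 0.667
```

Node 4 is dropped because it is two hops from the ego, and the edge between nodes 1 and 2 is kept because both endpoints are neighbours of the ego.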
Possible Tasks
- Graph classification
Citing
@inproceedings{karateclub,
title = {{Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}},
author = {Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
year = {2020},
pages = {3125–3132},
booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
organization = {ACM},
}
Reddit Thread Graphs
<p align="center"> <img width="200" src="/images/reddit-logo-png-transparent.png"> </p>

Description
<p align="justify"> Discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them. The task is to predict whether a thread is discussion based or not (binary classification). </p>

Link
Properties
- Number of graphs: 203,088
- Directed: No.
- Node features: No.
- Edge features: No.
- Graph labels: Yes. Binary-labeled.
- Temporal: No.

| | Min | Max |
|---|---|---|
| Nodes | 11 | 97 |
| Density | 0.021 | 0.382 |
| Diameter | 2 | 27 |

Possible Tasks
- Graph classification
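
Graph classification needs a fixed-length representation of every thread graph before a classifier can be trained. The sketch below uses a normalised degree histogram as a simple hand-crafted descriptor. This is an illustrative baseline and an assumption of this example, not one of the embedding methods from the cited Karate Club paper; the toy edge lists are hypothetical and assumed to contain no duplicate edges.

```python
from collections import Counter


def degree_histogram_features(edges, max_degree=5):
    """Normalised degree histogram; degrees above max_degree share the last bin."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    histogram = [0] * (max_degree + 1)
    for degree in degrees.values():
        histogram[min(degree, max_degree)] += 1
    total = len(degrees)
    if total == 0:
        return [0.0] * (max_degree + 1)
    return [count / total for count in histogram]


# Hypothetical threads: one user answered by three others, and a reply chain.
star_thread = [(0, 1), (0, 2), (0, 3)]
chain_thread = [(0, 1), (1, 2), (2, 3)]
print(degree_histogram_features(star_thread))   # [0.0, 0.75, 0.0, 0.25, 0.0, 0.0]
print(degree_histogram_features(chain_thread))  # [0.0, 0.5, 0.5, 0.0, 0.0, 0.0]
```

The two toy threads have the same number of nodes and edges but different feature vectors, which is the kind of structural signal a downstream binary classifier could use.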
Citing
@inproceedings{karateclub,
title = {{Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}},
author = {Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
year = {2020},
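pages = {3125–3132},
booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
organization = {ACM},
}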