RelationalDatasets
A largely incomplete but hopefully useful list of links to datasets for relational learning and inductive logic programming. No guarantees on availability.
Classic ILP datasets
A list of datasets per source.
-
The CVUT Prague Relational Dataset Repository: A large collection of ILP datasets, stored as MariaDB (SQL) databases.
Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).
-
ACE data mining system data sets: nine ILP datasets in Quinlan's FOIL format, together with scripts to convert them into ACE format (see README.txt in the ZIP). These were used in:
Jan Struyf, Jesse Davis and David Page, An efficient approximation to lookahead in relational learners. In J. Fürnkranz, T. Scheffer and M. Spiliopoulou, editors, Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Proceedings. Lecture Notes in Artificial Intelligence, volume 4212, pages 775-782, Springer, 2006.
- Muta188
- Muta230
- Financial
- Sisyphus A
- Sisyphus B
- UWCSE
- Yeast
- Carcinogenesis
- Bongard
-
- Animals
- CiteSeer
- Cora
- Epinions
- IMDB
- Kinships
- Nations
- Protein Interaction
- Radish Robot Mapping - Tutorial
- UMLS
- UW-CSE
- WebKB
-
ILP Datasets: in SQL format
- Carcinogenesis
- Financial
- Trains
- Mutagenesis
- Imdb
- IMDB Top/Bottom Movies
-
Stephen Muggleton's data set directory:
- Trains
- alzheimers
- carcinogenesis
- chess
- e_coli
- mesh
- more_chess
- mutagenesis
- proteins
- satellite
- suramin
- utube
-
Sriraam's StARLinGLAB data sets:
- Toy Father
- Toy Cancer
- IMDB
- Cora
- UW-CSE
- WebKB
- CiteSeer
- Boston Housing
- Drug-Drug Interactions
-
- alzheimers
- carcinogenesis
- dsstox
- metabolism
- mutagenesis
- pyrimidines
- trains
-
BayesBase: Datasets posted in 3 formats: (i) as a MySQL dump for a relational schema, (ii) in the WILL format, similar to the Aleph ILP input format, (iii) in the .db format of Markov Logic Networks as implemented in the Alchemy system.
- unielwin
- Mutagenesis_std
- MovieLens_std
- MovieLens_TQ(1M)
- Financial_std
- Mondial_std
- UW_std
- imdb_MovieLens
- Hepatitis_std
- Cont_PLG_TM (Continuous database)
-
LINQS - Statistical Relational Learning Group
- Social Spammer
- Drug-Target Interaction
- Stance Classification
- CiteSeer for Document Classification
- CiteSeer for Entity Resolution
- Cora
- ArXiv
- PubMed Diabetes
- WebKB
- Terrorists
- Terrorist Attacks
-
klog Datasets as Prolog files:
- WebKB: Originally developed by M. Craven et al. (1998). The version available here is a direct conversion to Prolog of the data available at the Alchemy website.
- Internet Movie Database: Data extracted from this database has been used in a number of relational learning papers. The version available here was downloaded from the IMDb website, converted into SQL using the procedure described in http://imdbpy.sourceforge.net/docs/README.sqldb.txt, and finally a subset of the tuples was converted into a Prolog file.
- UW-CSE: The data set originally developed at the University of Washington for demonstrating the capabilities of Markov logic networks. The version available here is a direct conversion to Prolog of the data available at the Alchemy website.
- Bursi: This data set contains 4,337 molecules labeled according to mutagenicity (2,401 mutagens and 1,936 nonmutagens). Originally developed by Kazius et al. (2005), it has been used in a number of machine learning papers, especially those studying graph kernels.
- Biodegradability: This is an older data set of chemical structures containing 328 compounds labeled by their half-life for aerobic aqueous biodegradation (a regression task).
-
MLnet
Includes some ILP datasets, among others. Note: the link points to the Internet Archive's Wayback Machine.
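Several of the sources above (BayesBase, and the Alchemy data that the klog versions were converted from) distribute evidence in the `.db` format used by Alchemy. As a rough illustration of what working with such a file might look like, here is a minimal sketch of a reader. It assumes the file holds one ground atom per line, optionally prefixed with `!` (false) or `?` (unknown), with `//` comments, and that constants contain no commas or quoted strings. The function name `parse_db` and the sample facts are hypothetical and not taken from any of the listed datasets.

```python
import re
from collections import defaultdict

# One ground atom per line, e.g. `advisedBy(Bob, Alice)`.
# Assumption: an optional `!` or `?` prefix marks a false or unknown atom.
ATOM_RE = re.compile(r'^([!?]?)\s*(\w+)\((.*)\)\s*$')


def parse_db(text):
    """Group ground atoms by predicate name.

    Returns a dict mapping predicate -> list of (sign, args) pairs,
    where sign is "+", "!" or "?" and args is a tuple of constants.
    """
    facts = defaultdict(list)
    for raw in text.splitlines():
        # Simplification: treat everything after `//` as a comment.
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        match = ATOM_RE.match(line)
        if match is None:
            raise ValueError(f"not a ground atom: {raw!r}")
        sign, predicate, arg_text = match.groups()
        # Simplification: split arguments on commas (no quoted strings).
        args = tuple(arg.strip() for arg in arg_text.split(","))
        facts[predicate].append((sign or "+", args))
    return dict(facts)


# Hypothetical evidence, made up for this example.
sample = """
// toy evidence
professor(Alice)
student(Bob)
advisedBy(Bob, Alice)
!advisedBy(Bob, Carol)
"""

facts = parse_db(sample)
for predicate, atoms in sorted(facts.items()):
    print(predicate, atoms)
```

Grouping atoms by predicate gives roughly one table per relation, which is close to how the SQL versions of the same datasets are laid out. Real files may need a stricter parser, for instance when constants are quoted strings.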
Other links:
- Kaggle
- KDnuggets
- Microsoft Research Open Data
- Registry of Open Data on AWS
- Awesome Public Datasets Collection
- San Francisco open data website
- Restaurants:
- https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i
- https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
- Stanford Large Network Dataset Collection (SNAP)
- metapath2vec: Scalable Representation Learning for Heterogeneous Networks
- Benchmark data sets for graph kernels
