MinoanER

Minoan ER is an Entity Resolution (ER) framework, built by researchers in Crete (the land of the ancient Minoan civilization). Entity resolution aims to identify descriptions that refer to the same entity within or across knowledge bases.

The website of the project is http://www.csd.uoc.gr/~vefthym/minoanER/

The functionality of this framework is described in detail in the following PhD thesis (mostly in Chapter 4): http://csd.uoc.gr/~vefthym/DissertationEfthymiou.pdf

MinoanER is implemented in Java 8+, using Apache Spark. We assume that a Spark cluster is available. Our code has been tested in a Spark cluster with HDFS and Mesos.

The steps followed by MinoanER are Blocking, Meta-blocking and Matching. Currently, the (token) blocking step is taken from https://github.com/vefthym/ERframework/blob/master/src/NewApproaches/ExportDatasets.java but it could easily be incorporated into this repository as a Spark task as well.
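For readers unfamiliar with token blocking, the following is a minimal, generic sketch of the idea: every token that appears in the values of an entity becomes a block key, and a block is kept only if it contains entities from both KBs. It is not the ExportDatasets implementation. The class name, the method names, the tokenization rule, and the choice to represent each entity as a single string of its values are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration of token blocking; not part of MinoanER. */
final class TokenBlockingSketch {

    /** Maps each lowercase token to the ids of the entities whose values contain it. */
    static Map<String, Set<Integer>> index(Map<Integer, String> entityValues) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> entity : entityValues.entrySet()) {
            // Assumed tokenization: split on anything that is not a letter or a digit.
            String[] tokens = entity.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(entity.getKey());
            }
        }
        return index;
    }

    /**
     * Builds one block per token shared by both KBs.
     * Each block is a two-element list: [entity ids from KB1, entity ids from KB2].
     */
    static Map<String, List<Set<Integer>>> blocks(Map<Integer, String> kb1, Map<Integer, String> kb2) {
        Map<String, Set<Integer>> index1 = index(kb1);
        Map<String, Set<Integer>> index2 = index(kb2);
        Map<String, List<Set<Integer>>> blocks = new TreeMap<>();
        for (Map.Entry<String, Set<Integer>> entry : index1.entrySet()) {
            Set<Integer> fromKb2 = index2.get(entry.getKey());
            if (fromKb2 != null) {
                // Tokens seen in only one KB cannot suggest a cross-KB match, so they are dropped.
                blocks.put(entry.getKey(), Arrays.asList(entry.getValue(), fromKb2));
            }
        }
        return blocks;
    }
}
```

With an entity 1 described by "Knossos Palace" in the first KB and an entity 10 described by "Palace of Knossos" in the second, this sketch would produce the blocks "knossos" and "palace", each pairing entity 1 with entity 10.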

Reference

To cite this work, please use the following reference: "Vasilis Efthymiou, George Papadakis, Kostas Stefanidis, Vassilis Christophides: MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities. EDBT 2019: 373-384". A PDF is available here: https://openproceedings.org/2019/conf/edbt/EDBT19_paper_44.pdf

Running MinoanER

The main file is https://github.com/vefthym/MinoanER/blob/master/src/main/java/minoaner/workflow/Main.java. As documented in this file, it assumes 5 input paths and 1 output path, taken as runtime arguments:

inputBlocking: The resulting blocks from token blocking. You can generate such a file from https://github.com/vefthym/ERframework/blob/master/src/NewApproaches/ExportDatasets.java. Each line corresponds to a block and its contents. The formatting should be: blockId TAB entityIdFromD1#entityIdFromD1# ... ;entityIdFromD2#entityIdFromD2# ... All those Ids should be positive integers.
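To make the block format concrete, here is a minimal sketch of a parser for one line of the inputBlocking file. BlockLine is a hypothetical helper, not a class from MinoanER. It assumes that ids fit in a Java int and it tolerates an optional trailing '#' after the last id on each side, since the format description above does not settle that detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of MinoanER) that parses one line of the inputBlocking file. */
final class BlockLine {
    final int blockId;
    final List<Integer> entitiesD1;
    final List<Integer> entitiesD2;

    private BlockLine(int blockId, List<Integer> entitiesD1, List<Integer> entitiesD2) {
        this.blockId = blockId;
        this.entitiesD1 = entitiesD1;
        this.entitiesD2 = entitiesD2;
    }

    /** Parses "blockId TAB id#id#...;id#id#..." into its three parts. */
    static BlockLine parse(String line) {
        String[] idAndContents = line.split("\t", 2);
        if (idAndContents.length != 2) {
            throw new IllegalArgumentException("Expected 'blockId TAB contents': " + line);
        }
        String[] sides = idAndContents[1].split(";", 2);
        if (sides.length != 2) {
            throw new IllegalArgumentException("Expected ';' between the D1 and D2 ids: " + line);
        }
        int blockId = Integer.parseInt(idAndContents[0].trim());
        return new BlockLine(blockId, parseIds(sides[0]), parseIds(sides[1]));
    }

    /** Splits a '#'-separated id list and checks that every id is a positive integer. */
    private static List<Integer> parseIds(String joined) {
        List<Integer> ids = new ArrayList<>();
        for (String token : joined.split("#")) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing '#' or an empty side
            }
            int id = Integer.parseInt(trimmed);
            if (id <= 0) {
                throw new IllegalArgumentException("Entity ids must be positive integers: " + id);
            }
            ids.add(id);
        }
        return ids;
    }
}
```

For instance, BlockLine.parse("7\t1#2#3#;10#11#") would yield block 7 with entities 1, 2, 3 from the first KB and entities 10, 11 from the second.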

inputTriples1/2: The raw RDF triples of the first/second KB in N-triples format (without the trailing " ." part).
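If your N-Triples files still carry the standard statement terminator, a small pre-processing step along the following lines could remove it. This is only a sketch: NTriplesTrimmer and stripTerminator are hypothetical names, and the code assumes well-formed N-Triples, where a line ending in '.' can only be ending in the terminator (a literal such as "Hello." ends with a quote or a datatype/language suffix instead).

```java
/** Hypothetical pre-processing helper; not part of MinoanER. */
final class NTriplesTrimmer {

    /** Removes the trailing " ." terminator from one N-Triples line, if present. */
    static String stripTerminator(String line) {
        String statement = line.trim();
        if (statement.endsWith(".")) {
            // Drop the dot, then any whitespace that separated it from the object.
            statement = statement.substring(0, statement.length() - 1).trim();
        }
        return statement;
    }
}
```

Lines that have already been stripped pass through unchanged, so the step is safe to apply twice.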

entityIds1/2: To save some space, we replace all entity URLs with numeric (positive integer) ids. This file contains this mapping that you should provide. Each line corresponds to one mapping and should be in the form: entityURL TAB numericId The same numericId should not be assigned to two different entityURLs and the entityURLs should be the ones appearing in the raw RDF input (inputTriples1/2).
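Before running the framework, it may be worth checking that a mapping file satisfies the constraints above. The following sketch does so for a list of lines already read into memory. EntityIdMapping is a hypothetical helper, not a MinoanER class, and it assumes that entity URLs contain no TAB characters and that ids fit in a Java int.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical validator for an entityIds file; not part of MinoanER. */
final class EntityIdMapping {

    /** Parses "entityURL TAB numericId" lines and rejects non-positive or reused ids. */
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> urlToId = new HashMap<>();
        Map<Integer, String> idToUrl = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'entityURL TAB numericId': " + line);
            }
            String url = parts[0];
            int id = Integer.parseInt(parts[1].trim());
            if (id <= 0) {
                throw new IllegalArgumentException("Ids must be positive integers: " + line);
            }
            // The same numeric id must never be assigned to two different URLs.
            String previousUrl = idToUrl.put(id, url);
            if (previousUrl != null && !previousUrl.equals(url)) {
                throw new IllegalArgumentException(
                        "Id " + id + " is assigned to both " + previousUrl + " and " + url);
            }
            urlToId.put(url, id);
        }
        return urlToId;
    }
}
```

The returned map can also be used to check that every entity URL appearing in inputTriples1/2 has an id.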

outputPath: The (HDFS) path in which the output mappings will be stored. The format of the generated output is: entityIdFromD1 TAB entityIdFromD2 for each pair of entities that have been found to match. WARNING: the outputPath directory is deleted on each run.

Example datasets: You can find examples of datasets used in MinoanER in our project's website: http://csd.uoc.gr/~vefthym/minoanER/datasets.html. If you use those datasets, here are some helpful tips for pre-processing the data:

You can convert RDF files into classes of the form EntityProfile using this DataReader: https://github.com/vefthym/ERframework/blob/master/src/DataReader/EntityReader/EntityRDFReader.java. There is another reader for the ground-truth files: https://github.com/vefthym/ERframework/blob/master/src/DataReader/GroundTruthReader/GtRDFReader.java.

Both readers have a storeSerializedObject method to store the result on the disk.

Setup and Tuning

You can tune the Spark session parameters (number of workers, executors, memory, parallelism, etc.) in the setUpSpark method in https://github.com/vefthym/MinoanER/blob/master/src/main/java/minoaner/utils/Utils.java. The body of this method should be adjusted to reflect the resources of your Spark cluster.
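As a rough illustration of the kind of settings involved, the sketch below derives a few standard Spark configuration properties from the size of a cluster. It does not reproduce the actual body of setUpSpark. SparkTuningSketch, its parameters, and the factor of 3 tasks per core are assumptions; the property keys themselves are standard Spark configuration names. The properties are collected in a plain map so that the sketch compiles without a Spark dependency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that derives Spark properties from cluster resources; not part of MinoanER. */
final class SparkTuningSketch {

    static Map<String, String> sparkProperties(int totalCores, int coresPerExecutor, String executorMemory) {
        if (totalCores <= 0 || coresPerExecutor <= 0 || coresPerExecutor > totalCores) {
            throw new IllegalArgumentException("Core counts must be positive and consistent");
        }
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("spark.executor.cores", String.valueOf(coresPerExecutor));
        properties.put("spark.executor.memory", executorMemory); // e.g. "8g"
        // Upper bound on the cores requested across the cluster (standalone and coarse-grained Mesos).
        properties.put("spark.cores.max", String.valueOf(totalCores));
        // Assumption: 3 tasks per core, in line with the 2-3 suggested by the Spark tuning guide.
        properties.put("spark.default.parallelism", String.valueOf(3 * totalCores));
        return properties;
    }
}
```

In a Spark application, each entry could then be applied with SparkSession.builder().config(key, value) before calling getOrCreate().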

In the main method, you will find some hardcoded attributes that act as entity names (labels) for the datasets that we have tested. Those attributes have been generated automatically by getting the top attributes of each KB based on the harmonic mean of support and discriminability (see related publications). You can hardcode the corresponding attributes for your KBs, or find them automatically by calling the methods found in the class https://github.com/vefthym/MinoanER/blob/master/src/main/java/minoaner/relationsWeighting/RelationsRank.java.
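The ranking criterion mentioned above can be sketched as follows. The code takes per-attribute support and discriminability values as given (their exact definitions are in the related publications and are not reproduced here) and ranks attributes by the harmonic mean of the two. AttributeRanker and its methods are hypothetical names and do not mirror the API of RelationsRank.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of ranking attributes by harmonic mean; not part of MinoanER. */
final class AttributeRanker {

    /** Harmonic mean of two non-negative scores: 2 * s * d / (s + d), or 0 when both are 0. */
    static double harmonicMean(double support, double discriminability) {
        double sum = support + discriminability;
        return sum == 0.0 ? 0.0 : 2.0 * support * discriminability / sum;
    }

    /**
     * Returns the k attributes with the highest harmonic mean.
     * Each map value is a two-element array: {support, discriminability}.
     */
    static List<String> topAttributes(Map<String, double[]> stats, int k) {
        List<String> attributes = new ArrayList<>(stats.keySet());
        attributes.sort(
                Comparator.comparingDouble(
                                (String a) -> -harmonicMean(stats.get(a)[0], stats.get(a)[1]))
                        .thenComparing(Comparator.<String>naturalOrder())); // deterministic tie-break
        return new ArrayList<>(attributes.subList(0, Math.min(k, attributes.size())));
    }
}
```

The harmonic mean rewards attributes that score well on both measures: an attribute with support 1.0 but discriminability 0.01 gets a score of roughly 0.02, far below one with 0.5 and 0.9 (roughly 0.64).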
