AIDA - Accurate Online Disambiguation of Entities

[AIDA][AIDA] is the named entity disambiguation system created by the Databases and Information Systems Department at the [Max Planck Institute for Informatics in Saarbrücken, Germany][MPID5]. It identifies mentions of named entities (persons, organizations, locations, songs, products, ...) in text and links them to a unique identifier. Most names are ambiguous, especially family names, and AIDA resolves this ambiguity. See the EMNLP 2011 publication [EMNLP2011] for a detailed description of how it works and the VLDB 2011 publication [VLDB2011] for a description of our Web demo.

If you want to be notified about AIDA news or new releases, subscribe to our announcement mailing list by sending a mail to:

aida-news-subscribe@lists.mpi-inf.mpg.de

Introduction to AIDA

AIDA is a framework and online tool for entity detection and disambiguation. Given a natural-language text, it maps mentions of ambiguous names onto canonical entities (e.g., individual people or places) registered in the [YAGO2][YAGO] [YAGO2] knowledge base. This knowledge is useful for multiple tasks, for example:

Build an entity index. This enables one kind of semantic search: retrieving all documents in which a given entity is mentioned.
Extract knowledge about the entities, for example relations between entities mentioned in the text.

YAGO2 entities have a one-to-one correspondence to Wikipedia pages, thus each disambiguated entity also denotes a Wikipedia URL.

Note that AIDA does not annotate common words (like song, musician, idea, ...). AIDA also does not identify mentions that have no entity in the repository. However, once a name is in the dictionary containing all candidates for surface strings, AIDA will map it to the best possible candidate, even if the correct entity is not in the entity repository.

Requirements

AIDA needs a [Postgres][Postgres] database to run. We tested it with version 8.4 and later, but version 9.2 gives better performance for many of the queries AIDA runs, because it can fetch data directly from the indexes.

The machine AIDA runs on should have a reasonable amount of main memory. If you are using graph coherence (see the Advanced Configuration section), the amount of memory grows quadratically with the number of candidate entities and thus with the length of the document. Anything above 10,000 candidates is too much for a regular desktop machine (at the time of writing) and should run on a machine with more than 20GB of main memory. AIDA does the most intensive computations in parallel and thus benefits from a multi-core machine.
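To illustrate the quadratic growth, the following sketch counts the pairwise coherence entries for a given number of candidates. It is a back-of-the-envelope illustration, not a measurement of AIDA: the class name and the assumption of one 8-byte value per candidate pair are hypothetical, and AIDA's actual memory use is presumably much higher, given the 20GB recommendation above.

```java
public class CoherenceMemoryEstimate {
    // Assumption for illustration only: one 8-byte double per unordered pair.
    static final long BYTES_PER_PAIR = 8L;

    /** Number of unordered pairs among n candidates: n * (n - 1) / 2. */
    static long pairCount(long candidates) {
        return candidates * (candidates - 1) / 2;
    }

    /** Lower-bound estimate in bytes for storing one weight per pair. */
    static long estimateBytes(long candidates) {
        return pairCount(candidates) * BYTES_PER_PAIR;
    }

    public static void main(String[] args) {
        for (long n : new long[] {1_000, 10_000, 20_000}) {
            System.out.printf("%d candidates -> %d pairs, about %d MB%n",
                n, pairCount(n), estimateBytes(n) / (1024 * 1024));
        }
    }
}
```

Doubling the number of candidates roughly quadruples the number of pairs, which is why long documents with many ambiguous names become expensive quickly.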

Setting up the Entity Repository

AIDA was developed to disambiguate to the [YAGO2][YAGO] knowledge base, returning the YAGO2 identifier for disambiguated entities. However, you can use AIDA for any entity repository, given that you have keyphrases and weights for all entities. The more common case is to use AIDA with YAGO2. If you want to set it up with your own repository, see the Advanced Configuration section.

To use AIDA with YAGO2, download the repository we provide on our [AIDA website][AIDA] as a Postgres dump and import it into your database server. This will take some time, maybe even a day depending on the speed of the machine Postgres is running on. Once the import is done, you can start using AIDA immediately by adjusting the settings/database_aida.properties to point to the database. AIDA will then use nearly 3 million named entities harvested from Wikipedia for disambiguation.

Get the Entity Repository:

curl -O http://www.mpi-inf.mpg.de/yago-naga/aida/download/entity-repository/AIDA_entity_repository_2010-08-17.sql.bz2

Import it into a Postgres database:

bzcat AIDA_entity_repository_2010-08-17.sql.bz2 | psql <DATABASE>

where <DATABASE> is a database on a PostgreSQL server.

Setting up AIDA

To build AIDA, run ant (see Apache Ant) in the directory of the cloned repository. This will create an aida.jar that includes all dependencies.

The main configuration is done in the files in the settings/ directory. The following files can be adjusted:

aida.properties: Copy sample_settings/aida.properties here and adjust it as needed. The default values are reasonable, so if you don't want to change anything, the file is not needed at all.
database_aida.properties: Copy sample_settings/database_aida.properties here and adjust it as needed. The settings should point to the Postgres database server that holds the entity repository; see the section Setting up the Entity Repository for how to set this up.

Hands-On API Example

The main classes in AIDA are mpi.aida.Preparator for preparing an input document and mpi.aida.Disambiguator for running the disambiguation on the prepared input. A minimal call looks like this:

// Define the input.
String inputText = "When [[Page]] played Kashmir at Knebworth, his Les Paul was uniquely tuned.";

// Prepare the input for disambiguation. The Stanford NER will be run
// to identify names. Strings marked with [[ ]] will also be treated as names.
PreparationSettings prepSettings = new StanfordHybridPreparationSettings();
Preparator p = new Preparator();
PreparedInput input = p.prepare("document_id", inputText, prepSettings);

// Disambiguate the input with the graph coherence algorithm.
DisambiguationSettings disSettings = new CocktailPartyDisambiguationSettings();    
Disambiguator d = new Disambiguator(input, disSettings);
DisambiguationResults results = d.disambiguate();

// Print the disambiguation results.
for (ResultMention rm : results.getResultMentions()) {
  ResultEntity re = results.getBestEntity(rm);
  System.out.println(rm.getMention() + " -> " + re
      + " (" + AidaManager.getWikipediaUrl(re) + ")");
}

The ResultEntity exposes the AIDA ID through its getEntity() method. The ID can be transformed into a Wikipedia URL by calling AidaManager.getWikipediaUrl() with the result entity.

See the mpi.aida.config.settings.disambiguation package for all possible predefined configurations, passed to the Disambiguator:

PriorOnlyDisambiguationSettings: Annotate each mention with the most prominent entity.
LocalDisambiguationSettings: Use the entity prominence and the keyphrase-context similarity to disambiguate.
CocktailPartyDisambiguationSettings: Use a graph algorithm on the entity coherence graph ([MilneWitten] link coherence) to disambiguate.
CocktailPartyKOREDisambiguationSettings: Use a graph algorithm on the entity coherence graph ([KORE] link coherence) to disambiguate.

Hands-On Command Line Call Example

Build AIDA:

ant

Run the CommandLineDisambiguator:

java -Xmx4G -cp aida.jar mpi.aida.CommandLineDisambiguator GRAPH <INPUT-FILE>

<INPUT-FILE> is the path to the text file to be annotated with entities. The file should be plain text in UTF-8 encoding.

Instead of GRAPH, you can put one of the following, corresponding to the settings described above:

PRIOR: PriorOnlyDisambiguationSettings
LOCAL: LocalDisambiguationSettings
GRAPH: CocktailPartyDisambiguationSettings
GRAPH-KORE: CocktailPartyKOREDisambiguationSettings
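The correspondence between mode names and settings classes can be written down as a simple lookup table. The sketch below is only an illustration of the mapping listed above; the class ModeLookup and its method are hypothetical and are not how CommandLineDisambiguator is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ModeLookup {
    // Mode name on the command line -> name of the predefined settings class.
    static final Map<String, String> MODE_TO_SETTINGS = new LinkedHashMap<>();
    static {
        MODE_TO_SETTINGS.put("PRIOR", "PriorOnlyDisambiguationSettings");
        MODE_TO_SETTINGS.put("LOCAL", "LocalDisambiguationSettings");
        MODE_TO_SETTINGS.put("GRAPH", "CocktailPartyDisambiguationSettings");
        MODE_TO_SETTINGS.put("GRAPH-KORE", "CocktailPartyKOREDisambiguationSettings");
    }

    /** Returns the settings class name for a mode, or fails with the valid choices. */
    static String settingsClassFor(String mode) {
        String name = MODE_TO_SETTINGS.get(mode);
        if (name == null) {
            throw new IllegalArgumentException(
                "Unknown mode '" + mode + "', expected one of " + MODE_TO_SETTINGS.keySet());
        }
        return name;
    }

    public static void main(String[] args) {
        for (String mode : MODE_TO_SETTINGS.keySet()) {
            System.out.println(mode + " -> " + settingsClassFor(mode));
        }
    }
}
```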

The output will be an HTML file with annotated mentions, linking to the corresponding Wikipedia page.

Input Format

The input to AIDA is a text (as a Java String) or a file in UTF-8 encoding. By default, named entities are recognized by the Stanford NER component of the [CoreNLP][CoreNLP] tool suite. In addition, mentions can be marked up manually with double square brackets, as with "Page" in this example:

When [[Page]] played Kashmir at Knebworth, his Les Paul was uniquely tuned.
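To make the markup concrete, the following sketch pulls the manually marked mentions out of such a string with a regular expression. It only illustrates the [[...]] convention; the class MentionMarkup and its methods are hypothetical and do not reflect how AIDA's Preparator parses its input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MentionMarkup {
    // Non-greedy match of the text between "[[" and "]]".
    private static final Pattern MARKED = Pattern.compile("\\[\\[(.+?)\\]\\]");

    /** Returns the surface strings of all mentions marked with [[...]]. */
    static List<String> extractMarkedMentions(String text) {
        List<String> mentions = new ArrayList<>();
        Matcher m = MARKED.matcher(text);
        while (m.find()) {
            mentions.add(m.group(1));
        }
        return mentions;
    }

    /** Returns the text with the [[ ]] markers removed. */
    static String stripMarkup(String text) {
        return MARKED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "When [[Page]] played Kashmir at Knebworth, his Les Paul was uniquely tuned.";
        System.out.println(extractMarkedMentions(input));
        System.out.println(stripMarkup(input));
    }
}
```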

The mention recognition can be configured by using different PreparationSettings in the mpi.aida.config.settings.preparation package:

StanfordHybridPreparationSettings: Use Stanford CoreNLP NER and allow manual markup using [[...]]
StanfordManualPreparationSettings: Use Stanford CoreNLP only for tokenization and sentence splitting. Mentions need to be marked up by [[...]].

The PreparationSettings are passed to the Preparator, as shown in the Hands-On API Example.

Advanced Configuration

Configuring the DisambiguationSettings

The mpi.aida.config.settings.DisambiguationSettings class contains all the configuration for the weight computation of the disambiguation graph. The best way to configure it is to use one of the predefined settings objects in the mpi.aida.config.settings.disambiguation package, listed below.

Pre-configured DisambiguationSettings

These pre-configured DisambiguationSettings objects can be passed to the Disambiguator:

PriorOnlyDisambiguationSettings: Annotate each mention with the most prominent entity.
LocalDisambiguationSettings: Use the entity prominence and the keyphrase-context similarity to disambiguate.
CocktailPartyDisambiguationSettings: Use a graph algorithm on the entity coherence graph ([MilneWitten] link coherence) to disambiguate.
CocktailPartyKOREDisambiguationSettings: Use a graph algorithm on the entity coherence graph ([KORE] link coherence) to disambiguate.

DisambiguationSettings Parameters

The principal parameters, which correspond to the instance variables of the DisambiguationSettings object, are:

alpha: Balances the mention-entity edge weights (alpha) and the entity-entity edge weights (1-alpha).
disambiguationTechnique: Technique to solve the disambiguation graph with. Most commonly this is LOCAL for mention-entity similarity edges only and GRAPH to include the entity coherence.
disambiguationAlgorithm: If TECHNIQUE.GRAPH is chosen above, this specifies the algorithm to solve the disambiguation graph. Can be COCKTAIL_PARTY for the full disambiguation graph and COCKTAIL_PARTY_SIZE_CONSTRAINED for a heuristically pruned graph.
useExhaustiveSearch: Set to true to use exhaustive search.
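The role of alpha can be illustrated with a small sketch that scales the two kinds of edge weights as described above: mention-entity weights by alpha and entity-entity weights by (1 - alpha). The class AlphaWeighting and its methods are hypothetical and only show the arithmetic. How AIDA normalizes and combines the weights internally is not covered here.

```java
public class AlphaWeighting {
    private static void checkAlpha(double alpha) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in [0, 1], got " + alpha);
        }
    }

    /** Scales a mention-entity edge weight by alpha. */
    static double scaleMentionEntity(double weight, double alpha) {
        checkAlpha(alpha);
        return alpha * weight;
    }

    /** Scales an entity-entity (coherence) edge weight by (1 - alpha). */
    static double scaleEntityEntity(double weight, double alpha) {
        checkAlpha(alpha);
        return (1.0 - alpha) * weight;
    }

    public static void main(String[] args) {
        double alpha = 0.5; // illustrative value, not a recommended default
        System.out.println("mention-entity: " + scaleMentionEntity(0.5, alpha));
        System.out.println("entity-entity:  " + scaleEntityEntity(0.5, alpha));
    }
}
```

With alpha set to 1 the entity-entity edges contribute nothing and only the mention-entity weights remain; with alpha set to 0 only the coherence edges remain.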
