YAGO

YAGO is a large semantic knowledge base, derived from Wikipedia, WordNet, Wikidata, GeoNames, and other data sources. Currently, YAGO knows more than 17 million entities (such as persons, organizations, and cities) and contains more than 150 million facts about these entities.

YAGO is special in several ways:

The accuracy of YAGO has been manually evaluated, with a confirmed accuracy of 95% (*). Every relation is annotated with its confidence value.
YAGO combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.
YAGO is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities.
In addition to taxonomy, YAGO has thematic domains such as "music" or "science" from WordNet Domains.
YAGO extracts and combines entities and facts from 10 Wikipedias in different languages.

YAGO is jointly developed at the DBWeb group at Télécom ParisTech University, the Databases and Information Systems group at the Max Planck Institute for Informatics, and Ambiverse.

(*) Not every version of YAGO is manually evaluated. Most notably, the version generated by this code may not be the one that we evaluated. Check the versions on the YAGO download page.

YAGO Code Repository

Target audience

If you are just interested in the data of YAGO, there is no need to use the present code repository. You can download data of YAGO from the YAGO homepage.

If you are interested in using the source code of YAGO, or in contributing to it, read on. The source code of YAGO is a Java project that extracts facts from Wikipedia and the other data sources, and stores these facts in files. These files make up the YAGO knowledge base.

If you run the code yourself, you can define (a) what Wikipedia languages to cover, and (b) which specific Wikipedia, Wikidata, and Wikimedia Commons snapshots should be used during the build.

Project components

The following Java projects belong to YAGO:

Javatools: These classes are Java utilities. They are shared with other projects.
Basics: These classes are used to represent facts, TSV files, etc. The files in "data" describe the schema of YAGO.
YAGO: This project contains
- all main YAGO extractors
- some hand-crafted data
- scripts that run YAGO

Prerequisites

To run YAGO, you need the following:

Java 8
Maven
for the automated downloading of data resources:
- Python 2.7
- the Python module requests (you can use pip install requests to install this module)
- a Unix machine
a machine with at least 256 GB of RAM and 1 TB of disk space

The YAGO configuration file

YAGO is configured with a configuration file. Use this template to generate your own copy of that file. It should contain the following lines:

reuse = true|false: Specifies whether a new run of YAGO should overwrite or re-use the facts that have already been generated in a previous run.
yagoFolder = ...: Specifies the folder where the YAGO facts shall be stored.
languages = en, de, fr, nl, it, es, ro, pl, ar, fa: Specifies the Wikipedia languages from which YAGO shall extract the facts. Use ISO 639-1 language codes.
extractors: List of extractors to run. By default, just use the list from the template.
subgraphClasses: Specify a single class (e.g. <wordnet_person_100007846>), or list of classes (e.g. <wikicat_Rock_musicians>,<wikicat_American_singers>). The final YAGO output will contain only entities of the specified classes, and entities connected to them. Additionally, the final YAGO output will contain entities specified in subgraphEntities.
subgraphEntities: Specify a single entity (e.g. <Jimmy_Page>), or list of entities (e.g. <Kashmir_(song)>,<Knebworth_Festival_1979>). The final YAGO output will contain only these entities and entities connected to them. Additionally, the final YAGO output will contain entities specified in subgraphClasses.
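
To illustrate the shape of such a file, the following Python sketch parses a small configuration in the key = value form described above. The example values (the folder path, the two-language list) and the helpers parse_config and split_list are hypothetical. YAGO itself reads the file from Java, and its actual parser may treat comments or whitespace differently.

```python
def parse_config(text):
    """Parse 'key = value' lines into a dict.

    Skipping blank lines and '#' comments is an assumption of this sketch,
    not a documented property of the YAGO configuration format.
    """
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def split_list(value):
    """Split a comma-separated value such as the 'languages' entry."""
    return [item.strip() for item in value.split(",") if item.strip()]


# Hypothetical example values; adapt them to your own setup.
EXAMPLE = """
reuse = false
yagoFolder = /data/yago/output
languages = en, de
subgraphEntities = <Jimmy_Page>
"""

config = parse_config(EXAMPLE)
languages = split_list(config["languages"])
print(languages)
```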

Downloading the data sources

YAGO needs the following data sources:

Wikipedia: the latest version of pages-articles.
Wikidata: the latest version of wikidata-DATE-all-ttl.
Wikimedia Commons: the latest version (https://dumps.wikimedia.org/commonswiki/latest/commonswiki-latest-pages-articles.xml.bz2) of pages-articles.
Geonames: the files countryInfo, hierarchy, alternateNames, userTags, featureCodes_en, allCountries

If you want to download the latest versions of the data sources automatically, add the following line to your YAGO configuration file:

dumpsFolder = ...: points to a folder where the data sources live.

Then run the following code (works on Linux or Mac):

python scripts/dumps/downloadDumps.py -y <PATH_TO_CONFIGURATION_FILE>

This code will create a new configuration file, which you will have to use in all following steps.

Alternatively, you can download the required data sources manually. Then add the following lines to your configuration file:

wikipedias = ...: a comma-separated list of the Wikipedia dumps, in the order of the languages specified with the languages parameter.
wikidata = ...: Points to the Wikidata file.
commons_wiki = ...: Points to the Wikimedia Commons file.
geonames = ...: Points to the folder where Geonames is stored.
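
Because the wikipedias entry is matched to the languages entry by position, a mismatch between the two lists is easy to introduce. The following Python sketch pairs the two lists and fails early if their lengths differ. The function name and the dump paths are hypothetical placeholders, and the check is not part of the YAGO code.

```python
def pair_languages_with_dumps(languages, wikipedias):
    """Pair each language code with the dump at the same list position."""
    langs = [code.strip() for code in languages.split(",") if code.strip()]
    dumps = [path.strip() for path in wikipedias.split(",") if path.strip()]
    if len(langs) != len(dumps):
        raise ValueError(
            "expected %d Wikipedia dumps, found %d" % (len(langs), len(dumps))
        )
    return dict(zip(langs, dumps))


# Hypothetical paths, for illustration only.
mapping = pair_languages_with_dumps(
    "en, de",
    "/dumps/enwiki.xml.bz2, /dumps/dewiki.xml.bz2",
)
print(mapping)
```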

Running YAGO

Once the configuration file has been prepared and all required resources have been downloaded, a YAGO build can be started like this:

cd <PATH_TO_YAGO3>
export MAVEN_OPTS=-Xmx220G
mvn clean verify exec:java -Dexec.args=<PATH_TO_CONFIGURATION_FILE>

Make sure to use the new configuration file if you used the Python script to download the data resources. Allocating 220 GB of main memory to YAGO is a reasonable estimate that typically works, but the actual requirement depends heavily on the number of languages in the build. Increase this value if necessary.

Once the processing has finished, all output can be found in the directory given by the yagoFolder parameter in your configuration file.

Code Architecture

The overall goal of the YAGO architecture is to enable cooperation of several contributors, facilitate debugging and maintenance, and allow users to download only particular pieces of YAGO ("YAGO a la carte"). In short: YAGO is modular, both in code and in data.

The current architecture pursues the goal of modularity at the expense of longer running times and inefficiency. The rationale is that we do not care if the extraction runs a few hours longer, if we can save a few hours of human work in return.

Themes

The YAGO data is split into "themes". Each theme corresponds to a file on disk. A theme contains facts (either in RDF or in TSV, see the section on data formats below). Nothing prevents themes from overlapping, but they should not. The class basics.Theme implements a theme.

Themes that are free of duplicates and ready for export are called "final themes". These live in the same folder as the other themes, but start with yago.... The final themes make up the YAGO knowledge base.
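
Since final themes are recognizable by their file name prefix, they can be picked out of the output folder with a few lines of code. The following Python sketch does this for a temporary folder. The three file names are hypothetical examples, and the helper final_themes is not part of the YAGO code.

```python
import tempfile
from pathlib import Path


def final_themes(yago_folder):
    """Return the sorted names of all files in the folder that start with 'yago'."""
    return sorted(
        path.name
        for path in Path(yago_folder).iterdir()
        if path.is_file() and path.name.startswith("yago")
    )


# Demonstration on a throwaway folder with hypothetical theme files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("yagoFacts.tsv", "yagoTypes.tsv", "wikipediaInfoboxes.tsv"):
        Path(folder, name).touch()
    found = final_themes(folder)

print(found)
```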

Extractors

An extractor is a unit of Java code that takes as input (1) one or more themes and/or (2) a raw data file, and that produces as output one or more themes.

Extractors implement extractors.Extractor. Common postprocessing steps (such as translating entities) implement the class FollowUpExtractor. This defines a dependency graph of extractors. Extractors are scheduled in the right order and called by main.ParallelCaller.
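
The idea of deriving an execution order from the input and output themes can be sketched as a topological sort. The Python example below is a simplified, sequential illustration. The extractor and theme names are hypothetical, and the real scheduler in main.ParallelCaller is written in Java and may work differently, for example by running independent extractors in parallel.

```python
from collections import deque

# Hypothetical extractors: name -> (input themes, output themes).
EXTRACTORS = {
    "InfoboxExtractor": (set(), {"infoboxFacts"}),
    "TypeExtractor": ({"infoboxFacts"}, {"types"}),
    "Deduplicator": ({"infoboxFacts", "types"}, {"yagoFacts"}),
}


def schedule(extractors):
    """Order extractors so that each runs after the producers of its inputs."""
    # Map every theme to the extractor that produces it.
    producer = {}
    for name, (_, outputs) in extractors.items():
        for theme in outputs:
            producer[theme] = name

    # Input themes without a producer are treated as already available.
    depends_on = {
        name: {producer[theme] for theme in inputs if theme in producer}
        for name, (inputs, _) in extractors.items()
    }

    ready = deque(sorted(name for name, deps in depends_on.items() if not deps))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        # Release every extractor that was only waiting for `current`.
        for name in sorted(depends_on):
            deps = depends_on[name]
            if current in deps:
                deps.remove(current)
                if not deps:
                    ready.append(name)

    if len(order) != len(extractors):
        raise ValueError("cyclic dependency between extractors")
    return order


print(schedule(EXTRACTORS))
```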

Techniques

Facts can have a meta-fact extractionSource. This meta-fact can have a meta-fact extractionTechnique. There should be a finite set of techniques that does not grow with the data.

Facts that do not have such an annotation are assumed to be trivially clean.

Packages

Extractors are split into the following packages:

deduplicators: extractors aggregating results from previous ones, removing duplicate facts
extractors: abstract classes specifying the interfaces for extractors
followUp: classes implementing filtering and mapping postprocessing steps
fromGeonames: extractors working on Geonames
fromOtherSources: extractors working on Wordnet, Wikidata, etc.
fromThemes: extractors depending on other extractors
fromWikipedia: extractors working on Wikipedia dumps
main: Contains the scheduler that starts the extractors

Data Format

In YAGO (as in RDF), each fact consists of a subject, a predicate, and an object. Every fact can have a fact id. This allows facts to talk about other facts. The fact id is simply computed as a hash from the subject, predicate, and object of the fact.
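
As an illustration of hash-based fact ids, the following Python sketch derives an id from the three components of a fact and then uses that id as the subject of a meta-fact. The hash function (SHA-1), the id layout, and the example entities and relation are assumptions made for this sketch. The ids produced by the YAGO Java code will differ.

```python
import hashlib


def fact_id(subject, predicate, obj):
    """Derive a deterministic id from subject, predicate, and object.

    SHA-1 and the '<id_...>' layout are assumptions of this sketch,
    not the scheme used by YAGO.
    """
    key = "\t".join((subject, predicate, obj)).encode("utf-8")
    return "<id_" + hashlib.sha1(key).hexdigest()[:12] + ">"


# Hypothetical fact, for illustration only.
fact = ("<Jimmy_Page>", "<playsInstrument>", "<Guitar>")
fid = fact_id(*fact)

# A meta-fact uses the id of another fact as its subject.
meta_fact = (fid, "<extractionSource>", "<hypothetical_source>")
print(fid, meta_fact)
```

Because the id depends only on the three components, the same fact always receives the same id, no matter which extractor produced it.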
