LinkedHypernymsDataset

The LinkedHypernymsDataset (LHD) extraction framework produces an RDF dataset consisting of DBpedia resources (as subjects) and the types of these resources (as objects). The framework returns several datasets:

Core - types of resources are DBpedia ontology classes, built on the Extension dataset and hypernym pattern matching (most accurate, most specific)
Inference - types of resources are DBpedia ontology classes, built on the Extension dataset and a statistical type inference algorithm (less accurate, less specific)
Extension - types of resources are other DBpedia resources, built on the Raw dataset and the first hypernym word hit from the Wikipedia API (highest type specificity)
Raw - all hypernyms are string literals extracted from the first sentence of a Wikipedia resource abstract.

The extraction process tries to find a hypernym for each DBpedia resource (HypernymExtractor module). The hypernym is transformed into another DBpedia resource, which is then mapped to a DBpedia ontology class (OntologyCleanup and TypeInferrer modules). Supported languages are English, German and Dutch.

Requirements

GATE 8.0
Maven 2+
Java 8
Current DBpedia datasets for the set language (they can be fetched with the Downloader module):
- Mapping-based Types (for English and the set language)
- Mapping-based Types transitive (for English and the set language)
- Short Abstracts (for the set language)
- Disambiguations (for the set language)
- Inter-Language Links (only the English dataset is required)
- DBpedia Ontology (owl)
Memcached endpoint
4GB RAM or more
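
The command-line prerequisites can be checked before a long run. The following is a minimal sketch, not part of the framework: the function name missing_tools is hypothetical, and it only tests whether a tool is on the PATH, not its version or the available RAM.

```shell
#!/bin/sh
# Report which of the listed command-line tools are missing from PATH.
# Prints one missing tool name per line; prints nothing if all are found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: java and mvn come from the requirements list; "memcached" is the
# usual binary name, but the endpoint may also run on another host.
# missing_tools java mvn memcached
```

An empty output means every listed tool was found.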

Preparation

First download the current version of the LHD extraction framework:

git clone https://github.com/KIZI/LinkedHypernymsDataset.git
cd LinkedHypernymsDataset
git fetch

Recommended file structure in the root directory:

* Core
* HypernymExtractor
* OntologyCleanup
* TypeInferrer
* Downloader
* data
  * datasets
    * dbpedia_2015.owl                      // DBpedia ontology
    * instance_types_LANG.nt                // DBpedia Mapping-based Types dataset for the set language
    * instance_types_en.nt                  // DBpedia Mapping-based Types dataset for the English language
    * instance_types_transitive_LANG.nt     // DBpedia Mapping-based Types transitive dataset for the set language
    * instance_types_transitive_en.nt       // DBpedia Mapping-based Types transitive dataset for the English language
    * interlanguage_links_en.nt             // DBpedia Inter-Language Links dataset for English
    * disambiguations_LANG.nt               // DBpedia Disambiguations dataset for the set language
    * short_abstracts_LANG.nt               // DBpedia Short Abstracts dataset for the set language
    * exclude-types                         // Handwritten rules - excluded types (optional)
    * override-types                        // Handwritten rules - mappings of types to another one (optional)
  * grammar
    * de_hearst.jape                        // JAPE grammar for German
    * en_hearst.jape                        // JAPE grammar for English
    * nl_hearst.jape                        // JAPE grammar for Dutch
  * index                            
  * logs
  * output
* utils
  * gate-8.0                                // GATE software - binary package
  * treetagger                              // Treetagger - POS tagger for German and Dutch
* application.LANG.conf                     // settings of all modules for the set language
* run-all.sh                                // main launcher
* pom.xml
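
The module directories come from the git clone, while the data and utils directories may have to be created by hand. A minimal sketch for that step follows; the function name create_lhd_dirs is hypothetical, and the directory names are taken from the structure above.

```shell
#!/bin/sh
# Create the data and utils directories from the recommended layout
# under the given root directory (default: current directory).
create_lhd_dirs() {
  root=${1:-.}
  for dir in data/datasets data/grammar data/index data/logs data/output utils; do
    mkdir -p "$root/$dir" || return 1
  done
}

# Usage, from the root of the cloned repository:
# create_lhd_dirs .
```

Because mkdir -p is used, directories that already exist are left untouched.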

Download GATE 8 from https://gate.ac.uk/download/ (binary-only package).

Install memcached (Debian: apt-get install memcached).

You can download the required datasets manually or using the Downloader module (see installation steps). If you want to download datasets manually, you will find them at the DBpedia homepage:

Download the DBpedia Mapping-based Types dataset, Mapping-based Types transitive dataset, Disambiguations dataset and Short Abstracts dataset for the set language from http://wiki.dbpedia.org/Downloads to the dataset directory. The datasets must be unzipped and have the .nt suffix.
Download English Inter-Language Links dataset, English Mapping-based Types dataset and English Mapping-based Types transitive dataset from http://wiki.dbpedia.org/Downloads to the dataset directory (the datasets must be unzipped).
Download DBpedia Ontology (owl) from http://wiki.dbpedia.org/Downloads and unzip it to the dataset directory.
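
After a manual download it can help to verify that nothing is missing. The sketch below is an illustration with hypothetical function names. It assumes the file names from the recommended file structure, with the English transitive file named instance_types_transitive_en.nt in line with the LANG variant. The ontology file is left out because its name depends on the DBpedia release.

```shell
#!/bin/sh
# Print the .nt dataset file names expected for a language code (en, de, nl).
expected_datasets() {
  lang=$1
  for name in \
    "instance_types_$lang.nt" \
    "instance_types_transitive_$lang.nt" \
    "disambiguations_$lang.nt" \
    "short_abstracts_$lang.nt" \
    "instance_types_en.nt" \
    "instance_types_transitive_en.nt" \
    "interlanguage_links_en.nt"
  do
    echo "$name"
  done | sort -u   # for lang=en the English files would otherwise repeat
}

# Print every expected file that is missing from the dataset directory.
missing_datasets() {
  dir=$1
  lang=$2
  expected_datasets "$lang" | while read -r name; do
    [ -f "$dir/$name" ] || echo "$name"
  done
}

# Usage:
# missing_datasets data/datasets de
```

An empty output means all expected .nt files are present and unzipped.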

For languages other than English, download TreeTagger from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ and install it. The GATE directory contains a wrapper file, plugins/Tagger_Framework/resources/TreeTagger/tree-tagger-LANG-gate, which must be set up to point to the installed TreeTagger application (this file is generated in the cmd/ directory during the TreeTagger installation step).

tree-tagger-german-gate (for German)
tree-tagger-dutch-gate (for Dutch)

Example for German (tree-tagger-german-gate):

#!/bin/sh

SCRIPT_DIR="$( cd "$( dirname "$0" )" && pwd )"

OPTIONS="-token -lemma -sgml"

BIN=$SCRIPT_DIR/../../../../../treetagger/bin
CMD=$SCRIPT_DIR/../../../../../treetagger/cmd
LIB=$SCRIPT_DIR/../../../../../treetagger/lib

TOKENIZER=${CMD}/utf8-tokenize.perl
TAGGER=${BIN}/tree-tagger
ABBR_LIST=${LIB}/german-abbreviations-utf8
PARFILE=${LIB}/german-utf8.par
LEXFILE=${LIB}/german-lexicon-utf8.txt
FILTER=${CMD}/filter-german-tags

$TOKENIZER -a $ABBR_LIST $* |
# external lexicon lookup
perl $CMD/lookup.perl $LEXFILE |
# tagging
$TAGGER $OPTIONS $PARFILE | 
# error correction
$FILTER

Example for Dutch (tree-tagger-dutch-gate):

#!/bin/sh

SCRIPT_DIR="$( cd "$( dirname "$0" )" && pwd )"

OPTIONS="-token -lemma -sgml"

BIN=$SCRIPT_DIR/../../../../../treetagger/bin
CMD=$SCRIPT_DIR/../../../../../treetagger/cmd
LIB=$SCRIPT_DIR/../../../../../treetagger/lib

TOKENIZER=${CMD}/utf8-tokenize.perl
TAGGER=${BIN}/tree-tagger
ABBR_LIST=${LIB}/dutch-abbreviations
PARFILE=${LIB}/dutch-utf8.par

$TOKENIZER -a $ABBR_LIST $* |
# tagging
$TAGGER $OPTIONS $PARFILE
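
Both wrapper scripts rely on relative paths into the TreeTagger installation, so a wrong directory depth fails only at run time. The following sketch checks the files that both wrappers reference. The function name is hypothetical, and the language-specific abbreviation, lexicon and filter files are not covered.

```shell
#!/bin/sh
# Print the TreeTagger files referenced by the wrapper scripts that are
# missing under the given TreeTagger root. lang_name is the TreeTagger
# language name used in file names, e.g. "german" or "dutch".
missing_treetagger_files() {
  root=$1
  lang_name=$2
  for path in bin/tree-tagger cmd/utf8-tokenize.perl "lib/$lang_name-utf8.par"; do
    [ -f "$root/$path" ] || echo "$path"
  done
}

# Usage, with TreeTagger installed as in the recommended file structure:
# missing_treetagger_files utils/treetagger german
```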

Docker

You can also use the Docker build script to create a Docker image containing the current LHD extraction framework with all required dependencies, and then run the extraction process with a single docker run command. See the docker directory.

Getting started

Before starting the extraction process, specify the config file; see the Installation and Modules sections.

You can use the shell script "run-all.sh" to start all processes needed to generate the LHD dataset. The script fetches the current version of the LHD extraction framework with git, installs it with Maven, downloads the required datasets, removes old output files and launches the extraction process. The process can take several days, so it should be run in the background:

./run-all.sh ../application.LANG.conf > output.log 2>&1 &
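
For a run that lasts several days, it may be safer to detach the process from the terminal so that it survives a logout. The sketch below wraps the command in nohup and records the PID. The function name run_detached and the pid file name are arbitrary choices, not part of the framework.

```shell
#!/bin/sh
# Start a command detached from the terminal, send all output to a log file,
# and print the PID of the background process.
run_detached() {
  log_file=$1
  shift
  nohup "$@" > "$log_file" 2>&1 &
  echo $!
}

# Usage:
# run_detached output.log ./run-all.sh ../application.en.conf > lhd.pid
# tail -f output.log                            # follow progress
# kill -0 "$(cat lhd.pid)" && echo "running"    # check that it is still alive
```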

Alternatively, you can use the Pipeline module, which integrates all the computational processes. Go to the Pipeline directory and run everything with one Maven command (this module does not include the dataset download step, so download the datasets manually or with the Downloader module):

mvn scala:run -DaddArgs="../application.LANG.conf|<skipped-tasks>|<remove-all>" > output.log 2>&1 &

This command accepts some optional parameters:

remove-all: if you pass the string "remove-all" as the second or third parameter, the output directory is completely cleaned before the extraction process runs.

mvn scala:run -DaddArgs="../application.LANG.conf|remove-all"

skipped-tasks: special flags for skipping individual extraction tasks. Any combination of these flags can be used.
- x: Skip the indexing task of the hypernym extraction process
- e: Skip the hypernym extraction process
- y: Skip the indexing task of the ontology cleanup
- c: Skip the ontology cleanup task
- z: Skip the indexing task of the STI algorithm
- y: Skip the STI processing (Statistical Type Inference)
- f: Skip the final dataset-building task (this task aggregates outputs from all modules)

mvn scala:run -DaddArgs="../application.LANG.conf|-xe"    # run all tasks except the hypernym extraction process with indexing
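
The pipe-separated argument string is easy to mistype, so a small helper can assemble it. This is a sketch under the format shown above (config path, then optional skip flags prefixed with "-", then the optional "remove-all" string). The function name build_add_args is hypothetical.

```shell
#!/bin/sh
# Build the value of -DaddArgs from a config path, optional skip flags
# and an optional "remove-all" switch.
build_add_args() {
  conf=$1
  skip_flags=$2      # e.g. "xe"; an empty string means skip nothing
  remove_all=$3      # pass "remove-all" to clean the output directory first
  args=$conf
  if [ -n "$skip_flags" ]; then
    args="$args|-$skip_flags"
  fi
  if [ "$remove_all" = "remove-all" ]; then
    args="$args|remove-all"
  fi
  echo "$args"
}

# Usage:
# mvn scala:run -DaddArgs="$(build_add_args ../application.en.conf xe remove-all)"
```

The helper does not validate the flag letters; it only joins the parts in the expected order.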

If a task fails, the process continues where it left off after a restart, unless you use the "remove-all" parameter.

You can also launch the extraction process step by step; see the following sections.

Installation

Go to the root directory and type these Maven commands:

mvn clean
mvn install

After that, check the main config file. You have to enter absolute or relative paths to the key directories. A relative path is resolved from the directory of the module being run, so the prefix ../ is needed to reach the root LHD directory.

Example of the main config file (for EN):

LHD {
  output.dir = "../data/output"                                            # the output directory where all output files will be saved
  datasets.dir = "../data/datasets"                                        # the dataset directory
  lang = "en"                                                              # a set language (en|de|nl)
  dbpedia.version = "2015-10"                                              # DBpedia version
  HypernymExtractor {
      index-dir = "../data/index"                                          # path to the directory where indexed datasets are stored
      wiki-api = "https://en.wikipedia.org/w/"                             # Wiki Search API URL. You can use your own mirror located in your localhost which is not limited, or
