# DEREK – Domain Entities and Relations Extraction Kit
## Goals and Tasks

This project's main focus is to provide semi-supervised models for solving information extraction tasks. Given a collection of unlabeled texts and a set of texts labeled with entities and relations, the tool is expected to extract the corresponding entities and relations from any in-domain text automatically. These tasks are critical for learning-facilitation activities such as knowledge base construction, question answering, etc.
### NER – Named Entity Recognition

This task's main concern is to find mentions of entities of predefined types. For example, the sentence:

> Moscow is the major political, economic, cultural, and scientific center of Russia and Eastern Europe, as well as the largest city entirely on the European continent.

mentions 4 entities of type Location.
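A common way to represent such NER output is as typed token spans over the tokenized sentence. The sketch below is an illustration of that representation only, using plain tuples; it is not DEREK's actual API (the helper `mention_texts` is hypothetical):

```python
# Tokenized example sentence from above
tokens = ["Moscow", "is", "the", "major", "political", ",", "economic", ",",
          "cultural", ",", "and", "scientific", "center", "of", "Russia",
          "and", "Eastern", "Europe", ",", "as", "well", "as", "the",
          "largest", "city", "entirely", "on", "the", "European",
          "continent", "."]

# Half-open token spans [start, end) with an entity type attached
entities = [
    (0, 1, "Location"),    # Moscow
    (14, 15, "Location"),  # Russia
    (16, 18, "Location"),  # Eastern Europe
    (28, 29, "Location"),  # European
]

def mention_texts(tokens, entities):
    """Recover the surface strings of entity mentions from token spans."""
    return [" ".join(tokens[start:end]) for start, end, _ in entities]

print(mention_texts(tokens, entities))
# → ['Moscow', 'Russia', 'Eastern Europe', 'European']
```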
### NET – Named Entity Typing

This task is focused on determining a more detailed type for an already typed entity. For example, in the sentence:

> Moscow is the capital and most populous city of Russia

the mention Moscow of type Location could be assigned a more fine-grained type City or even Capital.
### RelExt – Relation Extraction

This task revolves around relations between several entities and their types. For example, considering the sentence:

> Daniel ruled Moscow as Grand Duke until 1303 and established it as a prosperous city.

we could determine that the entities Daniel and Moscow, of types Person and Location respectively, are related with relation type Leader of.
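An extracted relation like this is essentially a typed pair of typed entity mentions. The minimal sketch below shows that shape with plain tuples; the names are illustrative only, not DEREK's API:

```python
# Each entity mention carries its surface string and its type
daniel = ("Daniel", "Person")
moscow = ("Moscow", "Location")

# A relation is a typed, ordered pair of entity mentions
relation = (daniel, "Leader of", moscow)

head, rel_type, tail = relation
print(f"{head[0]} ({head[1]}) --[{rel_type}]--> {tail[0]} ({tail[1]})")
# → Daniel (Person) --[Leader of]--> Moscow (Location)
```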
## Technical Overview

### Structure

The DEREK project is written in Python and consists of:

- the `derek` runtime library, which includes:
  - the data model (`data/model` module and some other modules in the `data` package)
  - Tensorflow-based implementations of neural networks for all tasks (`ner`, `net` and `rel_ext` packages)
  - task-agnostic helper code to simplify model development (`common` package)
  - integrations with external preprocessors (tokenizers, POS taggers, syntactic parsers) such as NLTK, UDPipe and TDozat, and internal ones such as Texterra and BabylonDigger (some modules in the `data` package)
  - readers for widely used corpora formats such as BRAT, BioNLP and BioCreative (`data/readers` module)
- a unit and integration `tests` suite for the runtime library
- a `tools` suite, which consists of:
  - preprocessing scripts (`generate_dataset`, `segment` and many other modules)
  - evaluation scripts (`param_search`, `evaluate` and many other modules)
  - a simple HTTP `server` based on Flask for model exploitation use cases
- the `derek-demo` application based on MongoDB
### Abstractions

- `Document` (`data/model`): tokenized text with some data attached to it:
  - `TokenSpan`: a span (range) of tokens
  - `Sentence` (`Document.sentences`): a `TokenSpan` describing sentence boundaries
  - `Entity` (`Document.entities`): a `TokenSpan` with a `type` attached, describing an entity mention
  - `Relation` (`Document.relations`): a pair of `Entity` instances with a `type` attached
  - `Document.token_features`: each key in the `dict` represents some additional information in the form of a list with a value for each token
  - `Document.extras`: each key in the `dict` represents some additional information in a free form
- `DocumentTransformer` (`data/transformers`): an interface for classes which somehow transform a `Document` to get another `Document`
- `AbstractVectorizer` (`data/vectorizers`): the parent for classes which provide feature vectors (neural-network-driven transfer learning) for each token in a `Document` (e.g., `FastTextVectorizer`)
- FE (feature extractor): classes which extract features from objects (e.g., an `Entity` in the context of a `Document`) in a format suitable for machine learning
- graph factory, or simply graph: the computational graph of a neural network, containing the mathematical operations that convert the provided features into predictions; the most research-intensive part of the system
- features meta: classes which contain meta information (e.g., dimensionality) about the features extracted by some FE; they form the bridge between the FE and graph parts of an algorithm
- classifier and trainer: classes which wire an FE with a graph factory and perform all required integration actions
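To make the data model above concrete, here is a simplified, self-contained sketch of how such abstractions might look. The class and field names mirror the description above, but the actual implementation in the `data/model` module differs; treat this as illustration only:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TokenSpan:
    start: int  # index of the first token in the span
    end: int    # index one past the last token (half-open range)

@dataclass(frozen=True)
class Entity(TokenSpan):
    type: str   # e.g., "Person", "Location"

@dataclass(frozen=True)
class Relation:
    first: Entity
    second: Entity
    type: str   # e.g., "Leader of"

@dataclass
class Document:
    tokens: list
    sentences: list = field(default_factory=list)       # list of TokenSpan
    entities: list = field(default_factory=list)        # list of Entity
    relations: list = field(default_factory=list)       # list of Relation
    token_features: dict = field(default_factory=dict)  # key -> per-token list
    extras: dict = field(default_factory=dict)          # free-form data

# Build a tiny document for the RelExt example sentence prefix
doc = Document(tokens="Daniel ruled Moscow".split())
doc.sentences.append(TokenSpan(0, 3))
daniel = Entity(0, 1, "Person")
moscow = Entity(2, 3, "Location")
doc.entities += [daniel, moscow]
doc.relations.append(Relation(daniel, moscow, "Leader of"))
doc.token_features["pos"] = ["PROPN", "VERB", "PROPN"]  # one value per token
```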
## How to install DEREK library and tools?

To install the DEREK library and tools you should:

- clone this Git repository:

  ```
  git clone ...
  cd derek
  ```

- fetch submodules of the repo: `git submodule update --init --recursive`
- install Python 3.6
- install DEREK library dependencies: `pip3 install -r requirements.txt`
- install DEREK tools dependencies: `pip3 install -r tools/requirements.txt`
## Pretrained models and resources

All images contain code, resources (if the licence allows it) and training properties for reproducing the results. The default entrypoint runs a server for the provided model; you only need to bind the container's port 5000 to a local port to get things done.
### CoNLL-03 NER

Reproducing Ma and Hovy (2016) results.

Docker image (on Docker Hub) with the best model: `trifonovispras/derek-images:derek-ner-conll03`

| model | dev F1 | test F1 |
|:---:|:---:|:---:|
| DEREK (10 seeds) | 0.9488 | 0.9101 |
| SOTA without LM [Wu et al., 2018] (5 seeds) | 0.9487 | 0.9189 |
| SOTA with LM [Baevski et al., 2019] (1 seed) | 0.9690 | 0.9350 |

Image note: the UDPipe english-ud-2.0-170801.udpipe model is used for segmentation on the server (you can change it by storing another model and specifying other segmenter arguments when running the image).
### Ontonotes v5 NER

A standard BiLSTM-CNN-CRF architecture with GloVe embeddings.

Docker image (on Docker Hub) with the best model: `trifonovispras/derek-images:derek-ner-ontonotes`

| model | dev F1 | test F1 |
|:---:|:---:|:---:|
| DEREK (5 seeds) | 0.8679 | 0.8506 |
| SOTA without LM [Ghaddar and Langlais, 2018] (5 seeds) | 0.8644 | 0.8795 |
| SOTA with LM [Akbik et al., 2018] (2 seeds) | - | 0.8971 |

Image note: the UDPipe english-ud-2.0-170801.udpipe model is used for segmentation on the server (you can change it by storing another model and specifying other segmenter arguments when running the image).
### FactRuEval-2016 NER

A standard BiLSTM-CNN-CRF architecture with word2vec embeddings.

Docker image (on Docker Hub) with the best model: `trifonovispras/derek-images:derek-ner-factrueval`

| model | test F1 |
|:---:|:---:|
| DEREK (10 seeds) | 0.8114 (with LocOrg), 0.8464 (without LocOrg) |
| SOTA without LM [FactRuEval 2016: Evaluation] (1 seed) | 0.809 (with LocOrg), 0.87 (without LocOrg) |

Image note: NLTK is used for segmentation on the server (you can change it by storing another model and specifying other segmenter arguments when running the image).
### ChemProt RelExt

A 2-layered encoder model with postprocessed NLTK segmentation, with/without SDP multitask learning on the GENIA corpus.

Docker image (on Docker Hub) with the best model: `trifonovispras/derek-images:derek-relext-chemprot`

| model | dev F1 | test F1 |
|:---:|:---:|:---:|
| DEREK without multitask (10 seeds) | 0.6346 | 0.6245 |
| DEREK with multitask (10 seeds) | 0.6582 | 0.6411 |
| SOTA without LM [Peng et al., 2018] (1 seed, train+dev) | - | 0.6410 |
| SOTA with LM [Peng et al., 2019] (1 seed) | - | 0.7440 |

Image note: NLTK with postprocessing is used for segmentation on the server (you can change it by storing another model and specifying other segmenter arguments when running the image).
### BB3 RelExt

A 2-layered encoder model.

Docker image (on Docker Hub) with the best model: `trifonovispras/derek-images:derek-relext-bb3`
| model | dev F1 | test F1 |
