# DEREK – Domain Entities and Relations Extraction Kit
## Goals and Tasks

This project's main focus is to provide semi-supervised models for solving information extraction tasks. Given a collection of unlabeled texts and a set of texts labeled with entities and relations, the tool is expected to extract the corresponding entities and relations from any in-domain text automatically. These tasks are critical for learning-facilitation activities such as knowledge base construction, question answering, etc.
### NER – Named Entity Recognition

This task's main concern is to find mentions of entities of predefined types. For example, the sentence:

> Moscow is the major political, economic, cultural, and scientific center of Russia and Eastern Europe, as well as the largest city entirely on the European continent.

mentions 4 entities of type Location.
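A common way to represent such NER output is as typed token spans over the tokenized sentence. The sketch below is an illustration of that representation only, using plain tuples; it is not DEREK's actual API (the helper `mention_texts` is hypothetical):

```python
# Tokenized example sentence from above
tokens = ["Moscow", "is", "the", "major", "political", ",", "economic", ",",
          "cultural", ",", "and", "scientific", "center", "of", "Russia",
          "and", "Eastern", "Europe", ",", "as", "well", "as", "the",
          "largest", "city", "entirely", "on", "the", "European",
          "continent", "."]

# Half-open token spans [start, end) with an entity type attached
entities = [
    (0, 1, "Location"),    # Moscow
    (14, 15, "Location"),  # Russia
    (16, 18, "Location"),  # Eastern Europe
    (28, 29, "Location"),  # European
]

def mention_texts(tokens, entities):
    """Recover the surface strings of entity mentions from token spans."""
    return [" ".join(tokens[start:end]) for start, end, _ in entities]

print(mention_texts(tokens, entities))
# → ['Moscow', 'Russia', 'Eastern Europe', 'European']
```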
### NET – Named Entity Typing

This task is focused on determining a more detailed type for an already typed entity. For example, in the sentence:

> Moscow is the capital and most populous city of Russia

the mention Moscow of type Location could be assigned a more fine-grained type City or even Capital.
### RelExt – Relation Extraction

This task revolves around relations between several entities and their types. For example, considering the sentence:

> Daniel ruled Moscow as Grand Duke until 1303 and established it as a prosperous city.

we could determine that the entities Daniel and Moscow, of types Person and Location respectively, are related with relation type Leader of.
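An extracted relation like this is essentially a typed pair of typed entity mentions. The minimal sketch below shows that shape with plain tuples; the names are illustrative only, not DEREK's API:

```python
# Each entity mention carries its surface string and its type
daniel = ("Daniel", "Person")
moscow = ("Moscow", "Location")

# A relation is a typed, ordered pair of entity mentions
relation = (daniel, "Leader of", moscow)

head, rel_type, tail = relation
print(f"{head[0]} ({head[1]}) --[{rel_type}]--> {tail[0]} ({tail[1]})")
# → Daniel (Person) --[Leader of]--> Moscow (Location)
```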
## Technical Overview

### Structure

The DEREK project is written in Python and consists of:

- the `derek` runtime library, which includes:
  - the data model (`data/model` module and some other modules in the `data` package)
  - Tensorflow-based implementations of neural networks for all tasks (`ner`, `net` and `rel_ext` packages)
  - task-agnostic helper code to simplify model development (`common` package)
  - integrations with external preprocessors (tokenizers, POS taggers, syntactic parsers) such as NLTK, UDPipe and TDozat, and internal ones such as Texterra and BabylonDigger (some modules in the `data` package)
  - readers for widely used corpora formats such as BRAT, BioNLP and BioCreative (`data/readers` module)
- a unit and integration `tests` suite for the runtime library
- a `tools` suite, which consists of:
  - preprocessing scripts (`generate_dataset`, `segment` and many other modules)
  - evaluation scripts (`param_search`, `evaluate` and many other modules)
  - a simple HTTP `server` based on Flask for model exploitation use cases
- the `derek-demo` application based on MongoDB
### Abstractions

- `Document` (`data/model`): tokenized text with some data attached to it:
  - `TokenSpan`: a span (range) of tokens
  - `Sentence` (`Document.sentences`): a `TokenSpan` describing sentence boundaries
  - `Entity` (`Document.entities`): a `TokenSpan` with a `type` attached, describing an entity mention
  - `Relation` (`Document.relations`): a pair of `Entity` instances with a `type` attached
  - `Document.token_features`: each key in the `dict` represents some additional information in the form of a list with a value for each token
  - `Document.extras`: each key in the `dict` represents some additional information in a free form
- `DocumentTransformer` (`data/transformers`): an interface for classes which somehow transform a `Document` to get another `Document`
- `AbstractVectorizer` (`data/vectorizers`): the parent for classes which provide feature vectors (neural-network-driven transfer learning) for each token in a `Document` (e.g., `FastTextVectorizer`)
- FE (feature extractor): classes which extract features from objects (e.g., an `Entity` in the context of a `Document`) in a format suitable for machine learning
- graph factory, or simply graph: the computational graph of a neural network, containing the mathematical operations that convert the provided features into predictions; the most research-intensive part of the system
- features meta: classes which contain meta information (e.g., dimensionality) about the features extracted by some FE; they form the bridge between the FE and graph parts of an algorithm
- classifier and trainer: classes which wire an FE with a graph factory and perform all required integration actions
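To make the data model above concrete, here is a simplified, self-contained sketch of how such abstractions might look. The class and field names mirror the description above, but the actual implementation in the `data/model` module differs; treat this as illustration only:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TokenSpan:
    start: int  # index of the first token in the span
    end: int    # index one past the last token (half-open range)

@dataclass(frozen=True)
class Entity(TokenSpan):
    type: str   # e.g., "Person", "Location"

@dataclass(frozen=True)
class Relation:
    first: Entity
    second: Entity
    type: str   # e.g., "Leader of"

@dataclass
class Document:
    tokens: list
    sentences: list = field(default_factory=list)       # list of TokenSpan
    entities: list = field(default_factory=list)        # list of Entity
    relations: list = field(default_factory=list)       # list of Relation
    token_features: dict = field(default_factory=dict)  # key -> per-token list
    extras: dict = field(default_factory=dict)          # free-form data

# Build a tiny document for the RelExt example sentence prefix
doc = Document(tokens="Daniel ruled Moscow".split())
doc.sentences.append(TokenSpan(0, 3))
daniel = Entity(0, 1, "Person")
moscow = Entity(2, 3, "Location")
doc.entities += [daniel, moscow]
doc.relations.append(Relation(daniel, moscow, "Leader of"))
doc.token_features["pos"] = ["PROPN", "VERB", "PROPN"]  # one value per token
```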
## How to install DEREK library and tools?

To install the DEREK library and tools you should:

- clone this Git repository:

  ```
  git clone ...
  cd derek
  ```

- fetch submodules of the repo: `git submodule update --init --recursive`
- install Python 3.6
- install DEREK library dependencies: `pip3 install -r requirements.txt`
- install DEREK tools dependencies: `pip3 install -r tools/requirements.txt`
## Pretrained models and resources

All images contain code, resources (if the licence allows it) and training properties for reproducing the results. The default entrypoint runs a server for the provided model; you only need to bind the container's port 5000 to a local port to get things done.
### CoNLL-03 NER

Reproducing Ma and Hovy (2016) results.

Docker image (on Docker Hub) with the best model: `trifonovispras/derek-images:derek-ner-conll03`

| model | dev F1 | test F1 |
|:---:|:---:|:---:|
| DEREK (10 seeds) | 0.9488 | 0.9101 |
| SOTA without LM [Wu et al., 2018] (5 seeds) | 0.9487 | 0.9189 |
| SOTA with LM [Baevski et al., 2019] (1 seed) | 0.9690 | 0.9350 |

Image note: the UDPipe english-ud-2.0-170801.udpipe model is used for segmentation on the server (you can change it by storing another model and specifying other segmenter arguments when running the image).
### Ontonotes v5 NER

A standard BiLSTM-CNN-CRF architecture with GloVe embeddings.

Docker image (on Docker Hub) with the best model: `trifonovispras/derek-images:derek-ner-ontonotes`

| model | dev F1 | test F1 |
|:---:|:---:|:---:|
| DEREK (5 seeds) | 0.8679 | 0.8506 |
| SOTA without LM [Ghaddar and Langlais, 2018] (5 seeds) | 0.8644 | 0.8795 |
| SOTA with LM [Akbik et al., 2018] (2 seeds) | - | 0.8971 |

Image note: the UDPipe english-ud-2.0-170801.udpipe model is used for segmentation on the server (you can change it by storing another model and specifying other segmenter arguments when running the image).
### FactRuEval-2016 NER

A standard BiLSTM-CNN-CRF architecture with word2vec embeddings.

Docker image (on Docker Hub) with the best model: `trifonovispras/derek-images:derek-ner-factrueval`

| model | test F1 |
|:---:|:---:|
| DEREK (10 seeds) | 0.8114 (with LocOrg), 0.8464 (without LocOrg) |
| SOTA without LM [FactRuEval 2016: Evaluation] (1 seed) | 0.809 (with LocOrg), 0.87 (without LocOrg) |

Image note: NLTK is used for segmentation on the server (you can change it by storing another model and specifying other segmenter arguments when running the image).
### ChemProt RelExt

A 2-layered encoder model with postprocessed NLTK segmentation, with/without SDP multitask learning on the GENIA corpus.

Docker image (on Docker Hub) with the best model: `trifonovispras/derek-images:derek-relext-chemprot`

| model | dev F1 | test F1 |
|:---:|:---:|:---:|
| DEREK without multitask (10 seeds) | 0.6346 | 0.6245 |
| DEREK with multitask (10 seeds) | 0.6582 | 0.6411 |
| SOTA without LM [Peng et al., 2018] (1 seed, train+dev) | - | 0.6410 |
| SOTA with LM [Peng et al., 2019] (1 seed) | - | 0.7440 |

Image note: NLTK with postprocessing is used for segmentation on the server (you can change it by storing another model and specifying other segmenter arguments when running the image).
### BB3 RelExt

A 2-layered encoder model.

Docker image (on Docker Hub) with the best model: `trifonovispras/derek-images:derek-relext-bb3`
| model | dev F1 | test F1 |
