PolDeepNer2
An improved tool for named entity recognition for Polish based on deep learning.
PolDeepNer2 is an improved version of PolDeepNer. The tool recognizes and categorizes named entities using neural networks and transformer-based language models.
The tool is provided with a list of pre-trained models for Polish and other languages.
It contains a pre-trained model trained on the NKJP corpus which recognizes nested annotations of the following types:
Contributors
- Michał Marcińczuk marcinczuk@gmail.com
- Jarema Radom
Notebooks
<table> <tr> <td><pre>notebooks/pdn2_cpu.py</pre></td> <td>This notebook presents how to install the module and use its API to process raw text on a CPU.</td> <td><a href="https://colab.research.google.com/github/CLARIN-PL/PolDeepNer2/blob/master/notebooks/pdn2_cpu.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" title="Open In Colab"/></a></td> </tr> </table>

Models
PolEval 2018 (NKJP NER model)
PolDeepNer2 achieves state-of-the-art (SOTA) results on the PolEval 2018 dataset.

[1] The model is not available. Only the evaluation results were published.
Comparison of loading and processing times
| Model | Library | Tokenizer | Model loading [s] | Preprocessing [s] | NE recognition [s] | Total [s] |
|:--------------------|:------------|----------------------|-------------------:|------------------:|-------------------:|-------:|
| Polish RoBERTa base | fairseq | - | 12.28 | 50.90 | 65.23 | 128.4 |
| HerBERT large | HuggingFace | HerbertTokenizerFast | 18.44 | 50.83 | 103.70 | 173.0 |
| HerBERT large | HuggingFace | XLMTokenizer | 18.33 | 51.42 | 177.50 | 247.3 |
- Dataset size: 1828 documents (3M characters).
- GPU: RTX Titan (24 GB, 4608 CUDA cores).
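
To make the table easier to compare, the following sketch sums the three stages and derives an end-to-end throughput in characters per second. The timings are copied from the table; the dataset size of 3.072M characters is taken from the `Data size` line printed by the processing script in the Evaluation section. The helper name `summarize` is illustrative only and not part of PolDeepNer2.

```python
# Stage timings in seconds, copied from the table above:
# (model loading, preprocessing, NE recognition)
TIMINGS = {
    "Polish RoBERTa base (fairseq)": (12.28, 50.90, 65.23),
    "HerBERT large (HerbertTokenizerFast)": (18.44, 50.83, 103.70),
    "HerBERT large (XLMTokenizer)": (18.33, 51.42, 177.50),
}

# 3.072M characters, as printed in the "Data size" output line.
DATASET_CHARS = 3_072_000


def summarize(loading, preprocessing, recognition, n_chars=DATASET_CHARS):
    """Return (total time in seconds, end-to-end throughput in chars/s)."""
    total = loading + preprocessing + recognition
    return total, n_chars / total


for name, stages in TIMINGS.items():
    total, throughput = summarize(*stages)
    print(f"{name}: total {total:.1f} s, {throughput:,.0f} chars/s")
```

The totals computed this way agree with the `Total [s]` column after rounding, which suggests the column is simply the sum of the three stages.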

Comparison of named entity recognition times for different datasets
| Dataset | Size [million chars] | NER time [minutes] |
|-----------------------------------------|---------:|---------:|
| PolEval 2018 NER test dataset | 3 | 2.6 |
| Monthly volume of news from Polish news portals [70 sources] | 160 | 136.9 |
| Polish Wikipedia (2013 dump) | 1000 | 855.6 |
| Annual volume of news from Polish news portals [70 sources] | 1920 | 1642.7 |
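
The rows above all work out to roughly the same time per million characters, so a linear estimate is a reasonable first guess for a corpus of another size. The sketch below computes that rate for each row and applies it to a new size. The README does not say whether the larger figures were measured or extrapolated, so treat the result as an estimate only; the function names and the 500M-character example are hypothetical.

```python
# (dataset, size in millions of characters, NER time in minutes), from the table above
ROWS = [
    ("PolEval 2018 NER test dataset", 3, 2.6),
    ("Monthly volume of news (70 sources)", 160, 136.9),
    ("Polish Wikipedia (2013 dump)", 1000, 855.6),
    ("Annual volume of news (70 sources)", 1920, 1642.7),
]


def minutes_per_million(size_mchars, minutes):
    """NER time needed per one million characters."""
    return minutes / size_mchars


def estimate_minutes(size_mchars, rate):
    """Linear estimate of the NER time for a corpus of the given size."""
    return size_mchars * rate


for name, size, minutes in ROWS:
    print(f"{name}: {minutes_per_million(size, minutes):.4f} min per million chars")

# Hypothetical example: a 500M-character corpus at the rate of the Wikipedia row.
wiki_rate = minutes_per_million(1000, 855.6)
print(f"Estimated time for 500M chars: {estimate_minutes(500, wiki_rate):.1f} min")
```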

N82 (KPWr and CEN)
Inner-corpora evaluation
| Model | Eval | Precision | Recall | F-measure | Support | Source |
|--------------------------------|--------|----------:|-------:|----------:|--------:|--------|
| PolDeepNer2 (kpwr_n82_base) | KPWr | 75.02 | 77.67 | 76.32 | 4430 | |
| PolDeepNer2 (kpwr_n82_large) | KPWr | 77.05 | 78.79 | 77.91 | 4430 | |
| PolDeepNer (n82-elmo-kgr10) | KPWr | 73.97 | 75.49 | 74.72 | 4430 | link |
| --- |
| PolDeepNer2 (cen_n82_base) | CEN | 84.64 | 85.95 | 85.29 | 1423 | |
| PolDeepNer2 (cen_n82_large) | CEN | 86.94 | 88.40 | 87.67 | 1423 | |
Cross-corpora evaluation
| Model | Eval | Precision | Recall | F-measure | Support |
|--------------------------------|--------|----------:|-------:|----------:|--------:|
| PolDeepNer2 (kpwr_n82_base) | CEN | 80.90 | 81.87 | 81.38 | 1423 |
| PolDeepNer2 (kpwr_n82_large) | CEN | 80.16 | 82.08 | 81.11 | 1423 |
| --- |
| PolDeepNer2 (cen_n82_base) | KPWr | 58.58 | 64.79 | 61.53 | 4430 |
| PolDeepNer2 (cen_n82_large) | KPWr | 61.38 | 66.66 | 63.91 | 4430 |
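
The F-measure columns in the N82 tables can be cross-checked against precision and recall. The sketch below assumes the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall); the README does not state this explicitly, but the reported figures are consistent with it. The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, evaluation corpus, precision, recall, reported F-measure), from the tables above
ROWS = [
    ("kpwr_n82_base", "KPWr", 75.02, 77.67, 76.32),
    ("kpwr_n82_base", "CEN", 80.90, 81.87, 81.38),
    ("cen_n82_large", "KPWr", 61.38, 66.66, 63.91),
]

for model, corpus, p, r, reported in ROWS:
    print(f"{model} on {corpus}: computed {f_measure(p, r):.2f}, reported {reported:.2f}")
```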
Installation (with Conda)
Create and activate a conda environment:
conda create -n pdn2 python=3.6
conda activate pdn2
Install CUDA, cuDNN and Torch:
conda install -c anaconda cudatoolkit=10.1
conda install -c anaconda cudnn
Install PolDeepNer2:
pip install https://pypi.clarin-pl.eu/packages/poldeepner2-0.5.0-py3-none-any.whl#md5=6a6131d1b3d104f0bbed87ec6969a841
Install the spaCy model:
python -m spacy download pl_core_news_sm
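
After the installation steps, a quick sanity check can confirm that the expected packages are visible to the interpreter in the `pdn2` environment. This is a minimal sketch using only the standard library; the module names in `REQUIRED` are assumptions inferred from the install commands above (for example, `poldeepner2` from the wheel name) and may need adjusting.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed top-level module names; adjust if the actual package layout differs.
REQUIRED = ["torch", "spacy", "poldeepner2", "pl_core_news_sm"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only locates the modules and does not import them, so it runs quickly and does not need a GPU.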
Evaluation
Download the evaluation dataset:
wget http://mozart.ipipan.waw.pl/~axw/poleval2018/POLEVAL-NER_GOLD.json -O POLEVAL-NER_GOLD.json
Polish RoBERTa
Process the dataset:
python process_poleval.py \
--input POLEVAL-NER_GOLD.json \
--output pdn2_nkjp_roberta_base_sq.json \
--model nkjp-base-sq \
--device cuda:0
Output:
Model loading time : 12.28 second(s)
Data preprocessing time : 50.9 second(s)
Data NE recognition time : 65.23 second(s)
Total time : 128.4 second(s)
Data size: : 3.072M characters
Evaluate:
python poleval_ner_test.py \
--goldfile POLEVAL-NER_GOLD.json \
--userfile pdn2_nkjp_roberta_base_sq.json
Output:
OVERLAP precision: 0.927 recall: 0.912 F1: 0.919
EXACT precision: 0.899 recall: 0.884 F1: 0.891
Final score: 0.914
Exact TP=32971 ; FP=3709; FN=4335
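
The EXACT precision, recall and F1 values follow from the TP, FP and FN counts on the last output line. The sketch below recomputes them with the standard definitions; it is not taken from `poleval_ner_test.py`, and the function name `prf` is illustrative.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw match counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the "Exact TP=...; FP=...; FN=..." line of the Polish RoBERTa run.
p, r, f1 = prf(tp=32971, fp=3709, fn=4335)
print(f"EXACT precision: {p:.3f} recall: {r:.3f} F1: {f1:.3f}")
```

Rounded to three decimals, these match the EXACT line reported above. Applying the same function to the counts of the HerBERT run (TP=33433, FP=3596, FN=3873) reproduces its EXACT line as well.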
HerBERT
Process the dataset:
python process_poleval.py \
--input POLEVAL-NER_GOLD.json \
--output pdn2_nkjp_herbert_large_sq.json \
--model nkjp-herbert-large-sq \
--device cuda:0
Output:
Model loading time : 18.44 second(s)
Data preprocessing time : 50.83 second(s)
Data NE recognition time : 103.7 second(s)
Total time : 173.0 second(s)
Data size: : 3.072M characters
Evaluate:
python poleval_ner_test.py \
--goldfile POLEVAL-NER_GOLD.json \
--userfile pdn2_nkjp_herbert_large_sq.json
Output:
OVERLAP precision: 0.929 recall: 0.922 F1: 0.926
EXACT precision: 0.903 recall: 0.896 F1: 0.900
Final score: 0.921
Exact TP=33433 ; FP=3596; FN=3873
Credits
- This code is based on [xlm-roberta-ner](https://github.com/mohammadKhalifa/xlm-roberta-ner).
