# ntagger

reference pytorch code for named entity tagging
## Description
- embedding
  - word : GloVe, BERT, DistilBERT, mDistilBERT, MiniLM, feature-based BERT using DSA(Dynamic Self-Attention) pooling, SpanBERT, ALBERT, RoBERTa, XLM-RoBERTa, BART, ELECTRA, DeBERTa, ELMo
  - character : CNN
  - pos : lookup
- encoding
  - BiLSTM
  - DenseNet
    - 'Dynamic Self-Attention: Computing Attention over Words Dynamically for Sentence Embedding'
    - a slightly modified DenseNet for longer dependency
  - Multi-Head Attention
- decoding
  - Softmax, CRF
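For CRF decoding, the best tag sequence is typically found with the Viterbi algorithm. As a quick illustration, here is a dependency-free sketch with made-up scores (not the repo's implementation):

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence.

    emissions: per-token lists of tag scores, shape [seq_len][num_tags].
    transitions: transitions[i][j] = score of moving from tag i to tag j.
    """
    num_tags = len(emissions[0])
    # scores[j]: best score of any path ending in tag j at the current step
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step_back, new_scores = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
            step_back.append(best_i)
        scores = new_scores
        backpointers.append(step_back)
    # follow backpointers from the best final tag
    best = max(range(num_tags), key=lambda j: scores[j])
    path = [best]
    for step_back in reversed(backpointers):
        best = step_back[best]
        path.append(best)
    path.reverse()
    return path

# toy example with 3 tags (0=O, 1=B, 2=I); the O->I transition is penalized
emissions = [[2, 1, 0], [0, 2, 1], [0, 1, 2]]
transitions = [[0, 0, -10], [0, 0, 1], [0, 0, 1]]
print(viterbi_decode(emissions, transitions))  # [0, 1, 2]
```

In a real CRF layer the emission scores come from the encoder and the transition matrix is learned; the decoding step is the same dynamic program.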
- related : reference pytorch code for intent(sentence) classification
- document context for BERT paper reproduction
  - see : https://github.com/dsindex/ntagger/issues/4#issuecomment-810304253
- joint learning of sequence and token classification
## Requirements

- python >= 3.6
- `pip install -r requirements.txt`
## Data

### CoNLL 2003 (English)

- from etagger, CrossWeigh
- `data/conll2003`
- `data/conll++`
  - since CoNLL++/test.txt has incorrect chunk tags, combine it with the original CoNLL2003/test.txt

  ```shell
  $ python combine.py --conll2003 ../conll2003/test.txt --conllpp test.txt > t
  $ mv t test.txt
  ```
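combine.py itself isn't reproduced here, but the merge it performs could look roughly like the sketch below: take the chunk tags from CoNLL2003 and the corrected NER labels from CoNLL++. The function name and the 4-column 'token POS chunk NER' layout are assumptions, not the script's actual code:

```python
def combine(conll2003_lines, conllpp_lines):
    """Merge two aligned CoNLL test files: keep the (correct) chunk tags
    from CoNLL2003 and the (corrected) NER labels from CoNLL++.

    Hypothetical sketch; assumes 4-column 'token POS chunk NER' lines
    and aligned sentence-separating blank lines in both files.
    """
    out = []
    for a, b in zip(conll2003_lines, conllpp_lines):
        a, b = a.rstrip("\n"), b.rstrip("\n")
        if not a.strip():          # sentence separator
            out.append("")
            continue
        tok, pos, chunk, _ = a.split()   # chunk tag from CoNLL2003
        ner = b.split()[-1]              # NER label from CoNLL++
        out.append(f"{tok} {pos} {chunk} {ner}")
    return out

merged = combine(["EU NNP B-NP B-ORG", ""], ["EU NNP I-NP B-MISC", ""])
print(merged)  # ['EU NNP B-NP B-MISC', '']
```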
- `data/conll2003_truecase`, `data/conll++_truecase`
  <details><summary>details</summary>
  <p>

  ```shell
  $ cd data/conll2003_truecase
  $ python to-truecase.py --input_path ../conll2003/train.txt > train.txt
  $ python to-truecase.py --input_path ../conll2003/valid.txt > valid.txt
  $ python to-truecase.py --input_path ../conll2003/test.txt > test.txt
  ```

  * same work for data/conll++

  </p>
  </details>
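to-truecase.py's internals aren't shown here; a common approach is frequency-based truecasing, sketched below. The helper names and method are illustrative, not the actual script:

```python
from collections import Counter, defaultdict

def build_case_map(cased_tokens):
    """Map each lowercased token to its most frequent cased surface form."""
    counts = defaultdict(Counter)
    for tok in cased_tokens:
        counts[tok.lower()][tok] += 1
    return {low: c.most_common(1)[0][0] for low, c in counts.items()}

def truecase(tokens, case_map):
    """Restore case from the map, leaving unknown tokens unchanged."""
    return [case_map.get(t.lower(), t) for t in tokens]

case_map = build_case_map(["Germany", "Germany", "the", "EU"])
print(truecase(["GERMANY", "the", "eu"], case_map))  # ['Germany', 'the', 'EU']
```

Truecasing matters for CoNLL2003 because headline-style all-caps text hurts cased pretrained models.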
### Kaggle NER (English)

- from entity-annotated-corpus
- `data/kaggle`
  - converting to CoNLL data format
  <details><summary>details</summary>
  <p>

  * download : https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus?select=ner_dataset.csv
  * remove illegal characters

  ```shell
  $ sed -e 's/5 storm/storm/' ner_dataset.csv > t ; mv t ner_dataset.csv
  $ iconv -f ISO-8859-1 -t UTF-8 ner_dataset.csv > ner_dataset.csv.utf
  $ python to-conll.py
  $ cp -rf valid.txt test.txt
  ```

  </p>
  </details>
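Conceptually, the conversion regroups the CSV rows into sentences, roughly as in this sketch. It assumes the dataset's 'Sentence #', 'Word', 'POS', 'Tag' columns (where 'Sentence #' is filled only on each sentence's first token); the '-' chunk placeholder mirrors the CoNLL-style output shown elsewhere in this README, and the function name is hypothetical:

```python
import csv
import io

def kaggle_to_conll(fileobj):
    """Group entity-annotated-corpus rows into CoNLL-style sentences."""
    sentences, current = [], []
    for row in csv.DictReader(fileobj):
        # a filled 'Sentence #' cell starts a new sentence
        if row["Sentence #"].strip() and current:
            sentences.append(current)
            current = []
        current.append(f'{row["Word"]} {row["POS"]} - {row["Tag"]}')
    if current:
        sentences.append(current)
    return "\n\n".join("\n".join(s) for s in sentences)

sample = io.StringIO(
    "Sentence #,Word,POS,Tag\n"
    "Sentence: 1,Thousands,NNS,O\n"
    ",of,IN,O\n"
    "Sentence: 2,London,NNP,B-geo\n"
)
print(kaggle_to_conll(sample))
```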
### GUM (English)

- from entity-recognition-datasets
- `data/gum`
  - converting to CoNLL data format
  <details><summary>details</summary>
  <p>

  * remove '*-object', '*-abstract'

  ```shell
  $ python to-conll.py --input_train=gum-train.conll --input_test=gum-test.conll --train=train.txt --valid=valid.txt --test=test.txt
  ```

  </p>
  </details>
### Naver NER 2019 (Korean)

- from HanBert-NER
- `data/clova2019`
  - converted to CoNLL data format
  <details><summary>details</summary>
  <p>

  ```
  이기범 eoj - B-PER
  한두 eoj - O
  쪽을 eoj - O
  먹고 eoj - O
  10분 eoj - B-TIM
  후쯤 eoj - I-TIM
  화제인을 eoj - B-CVL
  먹는 eoj - O
  것이 eoj - O
  좋다고 eoj - O
  한다 eoj - O
  . eoj - O
  ```

  </p>
  </details>
- `data/clova2019_morph`
  - tokenized by a morphological analyzer and converted to CoNLL data format
  <details><summary>details</summary>
  <p>

  ```
  이기범 NNP - B-PER
  한두 NNP - O
  쪽 NNB - O
  을 X-JKO - O
  먹다 VV - O
  고 X-EC - O
  10 SN - B-TIM
  분 X-NNB - I-TIM
  후 NNG - I-TIM
  쯤 X-XSN - I-TIM
  화제 NNG - B-CVL
  인 X-NNG - I-CVL
  을 X-JKO - I-CVL
  먹다 VV - O
  ...
  ```

  </p>
  </details>
  - the 'X-' prefix is prepended to the POS(Part of Speech) tags of non-initial morphs, marking them as continuations of the preceding morph.
  - we can evaluate the predicted results either morph-by-morph or in an eojeol-by-eojeol manner (for the latter, every line having an 'X-' POS tag is removed).
  <details><summary>details</summary>
  <p>

  ```
  이기범 NNP - B-PER
  한두 NNP - O
  쪽 NNB - O
  먹다 VV - O
  10 SN - B-TIM
  후 NNG - I-TIM
  화제 NNG - B-CVL
  먹다 VV - O
  ...
  ```

  </p>
  </details>
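The eojeol-level view above is obtained by simply dropping the 'X-' lines; a minimal sketch of that filtering (helper name is hypothetical):

```python
def to_eojeol_level(conll_lines):
    """Drop continuation morphs (POS tag prefixed with 'X-') so only the
    first morph of each eojeol remains, as in eojeol-level evaluation.
    Assumes 4-column 'morph POS chunk NER' lines; blank lines are kept.
    """
    kept = []
    for line in conll_lines:
        if line.strip() and line.split()[1].startswith("X-"):
            continue
        kept.append(line)
    return kept

lines = ["쪽 NNB - O", "을 X-JKO - O", "먹다 VV - O"]
print(to_eojeol_level(lines))  # ['쪽 NNB - O', '먹다 VV - O']
```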
- `data/clova2019_morph_space`
  - this data is identical to `data/clova2019_morph` except that it treats spaces as tokens.
  <details><summary>details</summary>
  <p>

  ```
  이기범 NNP - B-PER
  _ _ - O
  한두 NNP - O
  _ _ - O
  쪽 NNB - O
  을 X-JKO - O
  _ _ - O
  먹다 VV - O
  고 X-EC - O
  _ _ - O
  10 SN - B-TIM
  분 X-NNB - I-TIM
  _ _ - I-TIM
  후 NNG - I-TIM
  쯤 X-XSN - I-TIM
  _ _ - O
  화제 NNG - B-CVL
  인 X-NNG - I-CVL
  을 X-JKO - I-CVL
  _ _ - O
  먹다 VV - O
  ...
  ```

  </p>
  </details>
### KMOU NER (Korean)

- from KMOU NER
- `data/kmou2019`
  - build train.raw, valid.raw
    - data version : https://github.com/kmounlp/NER/commit/0b32e066870bda9f65cc190f5e89c2edc6cf8f6d
    - same as pytorch-bert-crf-ner
    - train.raw : 00002_NER.txt, ..., EXOBRAIN_NE_CORPUS_007.txt (1,425 files)
    - valid.raw : EXOBRAIN_NE_CORPUS_009.txt, EXOBRAIN_NE_CORPUS_010.txt (2 files)
  - apply correction rules and convert to CoNLL data format
  <details><summary>details</summary>
  <p>

  ```shell
  $ cd data/kmou2019
  $ python correction.py -g train.raw > t
  $ python to-conll.py -g t > train.txt
  ```

  ex)

  ```
  마음 마음 NNG B-POH
  ’ ’ SS O
  에 에 JKB O
  _ _ _ O
  담긴 담기+ㄴ VV+ETM O
  ```
  ->
  ```
  마음 NNG - B-POH
  ’ SS - O
  에 JKB - O
  _ _ - O
  담기다 VV - O
  ㄴ ETM - O
  ```

  ```shell
  $ python correction.py -g valid.raw > t
  $ python to-conll.py -g t > valid.txt
  $ cp -rf valid.txt test.txt
  ```

  </p>
  </details>
<br>

- `data/kmou2021`
  - data source
    - https://github.com/kmounlp/NER/tree/master/%EA%B0%9C%EC%B2%B4%EB%AA%85%20%EC%9D%B8%EC%8B%9D/data/example/KMOU_BIO

    ```
    testa.pos.txt testa.tags.txt testa.words.txt
    testb.pos.txt testb.tags.txt testb.words.txt
    train.pos.txt train.tags.txt train.words.txt
    ```
  - convert to CoNLL format

  ```shell
  $ python to-conll.py --words=train.words.txt --tags=train.tags.txt --pos=train.pos.txt > train.txt
  $ python to-conll.py --words=testa.words.txt --tags=testa.tags.txt --pos=testa.pos.txt > valid.txt
  $ python to-conll.py --words=testb.words.txt --tags=testb.tags.txt --pos=testb.pos.txt > test.txt
  ```
### KLUE NER (Korean)

- from KLUE-benchmark
- `data/klue`
  - convert to CoNLL format, original, character-based

  ```shell
  $ python to-conll.py --file klue-ner-v1.1_train.tsv > train.txt
  $ python to-conll.py --file klue-ner-v1.1_dev.tsv > valid.txt
  $ python to-conll.py --file klue-ner-v1.1_dev.tsv > test.txt
  ```
  - convert to CoNLL format, segmented by khaiii

  ```shell
  $ python to-conll.py --file klue-ner-v1.1_train.tsv --segmentation > train.txt
  $ python to-conll.py --file klue-ner-v1.1_dev.tsv --segmentation > valid.txt
  $ python to-conll.py --file klue-ner-v1.1_dev.tsv --segmentation > test.txt
  ```
  <details><summary>details</summary>
  <p>

  ```
  특히 pos pt O
  _ _ pt O
  영동고속도 pos pt B-LC
  로 pos pt I-LC
  _ _ pt O
  강릉 pos pt B-LC
  _ _ pt O
  방향 pos pt O
  _ _ pt O
  문막 pos pt B-LC
  휴게소 pos pt I-LC
  에서 pos pt O
  _ _ pt O
  만종분기점 pos pt B-LC
  까지 pos pt O
  _ _ pt O
  5 pos pt B-QT
  km pos pt I-QT
  _ _ pt O
  구간 pos pt O
  에 pos pt O
  는 pos pt O
  _ _ pt O
  승용차 pos pt O
  _ _ pt O
  전용 pos pt O
  _ _ pt O
  임시 pos pt O
  _ _ pt O
  갓길차로제 pos pt O
  를 pos pt O
  _ _ pt O
  운영 pos pt O
  하 pos pt O
  기 pos pt O
  로 pos pt O
  _ _ pt O
  했다. pos pt O
  ...
  ```

  </p>
  </details>
## Pretrained models

### English

- GloVe
  - download GloVe6B and unzip to the 'embeddings' dir

  ```shell
  $ mkdir embeddings
  $ ls embeddings
  glove.6B.zip
  $ unzip glove.6B.zip
  ```
- BERT-like models (huggingface's transformers)
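Once unzipped, the GloVe vectors are plain text, one word per line. Loading them might look like this sketch (a hypothetical helper, not the repo's actual loader):

```python
def load_glove(path, dim=300):
    """Parse a GloVe .txt file into {word: [float, ...]}.

    Each line is 'word v1 v2 ... vdim'; lines whose field count
    doesn't match dim + 1 are skipped.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors
```

glove.6B ships 50/100/200/300-dimensional files; pass the `dim` matching the file you load.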
- SpanBERT
  - pretrained SpanBERT models are compatible with huggingface's BERT models except for 'bert.pooler.dense.weight' and 'bert.pooler.dense.bias'.
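Given that note, one way to reuse a SpanBERT checkpoint with a huggingface BERT model is to drop the two pooler entries from the state dict before loading with `strict=False`. A minimal sketch (the helper name is hypothetical; the key names are the ones listed above):

```python
def strip_pooler(state_dict):
    """Remove the pooler parameters that differ in SpanBERT checkpoints,
    so the remaining weights can be loaded into a huggingface BERT model
    via load_state_dict(..., strict=False)."""
    skipped = {"bert.pooler.dense.weight", "bert.pooler.dense.bias"}
    return {k: v for k, v in state_dict.items() if k not in skipped}

sd = {
    "bert.pooler.dense.weight": 1,
    "bert.pooler.dense.bias": 2,
    "bert.encoder.layer.0.attention.self.query.weight": 3,
}
print(sorted(strip_pooler(sd)))  # ['bert.encoder.layer.0.attention.self.query.weight']
```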
- ELMo (allennlp)

  ```shell
  $ cd embeddings
  $ curl -OL https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5
  $ curl -OL https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json
  ```
### Korean

- description of Korean GloVe, BERT, DistilBERT, ELECTRA
  - GloVe : kor.glove.300k.300d.txt (inhouse)
  - bpe BERT : kor-bert-base-bpe.v1, kor-bert-large-bpe.v1, v3 (inhouse)
  - dha-bpe BERT : kor-bert-base-dha_bpe.v1, v3, kor-bert-large-dha_bpe.v1, v3 (inhouse)
  - dha BERT : kor-bert-base-dha.v1, v2 (inhouse)
  - MA-BERT-base : kor-bert-base-morpheme-aware (inhouse)
  - KcBERT : kcbert-base, kcbert-large
  - DistilBERT : kor-distil-bpe-bert.v1, kor-distil-dha-bert.v1, kor-distil-wp-bert.v1 (inhouse)
  - mDistilBERT : distilbert-base-multilingual-cased
  - KoELECTRA-Base : koelectra-base-v1-discriminator, koelectra-base-v3-discriminator
  - LM-KOR-ELECTRA : electra-kor-base
  - ELECTRA-base : kor-electra-bpe.v1, kor-electra-base-dhaToken1.large, kor-electra-base-dhaSyllable (inhouse)
  - RoBERTa-base : kor-roberta-base-bbpe (inhouse)
  - MA-RoBERTa-base : kor-roberta-base-morpheme-aware (inhouse)
  - XLM-RoBERTa : xlm-roberta-base, xlm-roberta-large
  - KLUE-RoBERTa : klue-roberta-base, klue-roberta-large
  - Funnel-base : funnel-kor-base
- ELMo description
  - kor_elmo_2x4096_512_2048cnn_2xhighway_1000k_weights.hdf5, kor_elmo_2x4096_512_2048cnn_2xhighway_1000k_options.json (inhouse)
## CoNLL 2003 (English)

- experiments summary
  - ntagger, measured by conlleval.pl (micro F1)

|       | F1 (%) | (truecase) F1 (%) | Features | GPU / CPU | ONNX | Dynamic | Etc |
| ----- | ------ | ----------------- | -------- | --------- | ---- | ------- | --- |