# ntagger

reference pytorch code for named entity tagging
## Description
- embedding
  - word : GloVe, BERT, DistilBERT, mDistilBERT, MiniLM, feature-based BERT using DSA(Dynamic Self-Attention) pooling, SpanBERT, ALBERT, RoBERTa, XLM-RoBERTa, BART, ELECTRA, DeBERTa, ELMo
  - character : CNN
  - pos : lookup
- encoding
  - BiLSTM
  - DenseNet
    - 'Dynamic Self-Attention: Computing Attention over Words Dynamically for Sentence Embedding'
    - a slightly modified DenseNet for longer dependency
  - Multi-Head Attention
- decoding
  - Softmax, CRF
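For CRF decoding, the best tag sequence is typically found with the Viterbi algorithm. As a quick illustration, here is a dependency-free sketch with made-up scores (not the repo's implementation):

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence.

    emissions: per-token lists of tag scores, shape [seq_len][num_tags].
    transitions: transitions[i][j] = score of moving from tag i to tag j.
    """
    num_tags = len(emissions[0])
    # scores[j]: best score of any path ending in tag j at the current step
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step_back, new_scores = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
            step_back.append(best_i)
        scores = new_scores
        backpointers.append(step_back)
    # follow backpointers from the best final tag
    best = max(range(num_tags), key=lambda j: scores[j])
    path = [best]
    for step_back in reversed(backpointers):
        best = step_back[best]
        path.append(best)
    path.reverse()
    return path

# toy example with 3 tags (0=O, 1=B, 2=I); the O->I transition is penalized
emissions = [[2, 1, 0], [0, 2, 1], [0, 1, 2]]
transitions = [[0, 0, -10], [0, 0, 1], [0, 0, 1]]
print(viterbi_decode(emissions, transitions))  # [0, 1, 2]
```

In a real CRF layer the emission scores come from the encoder and the transition matrix is learned; the decoding step is the same dynamic program.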
- related : reference pytorch code for intent(sentence) classification
- document context for BERT paper reproduction
  - see : https://github.com/dsindex/ntagger/issues/4#issuecomment-810304253
- joint learning of sequence and token classification
## Requirements

- python >= 3.6
- `pip install -r requirements.txt`
## Data

### CoNLL 2003 (English)

- from etagger, CrossWeigh
- `data/conll2003`
- `data/conll++`
  - since CoNLL++/test.txt has incorrect chunk tags, combine it with the original CoNLL2003/test.txt

  ```shell
  $ python combine.py --conll2003 ../conll2003/test.txt --conllpp test.txt > t
  $ mv t test.txt
  ```
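combine.py itself isn't reproduced here, but the merge it performs could look roughly like the sketch below: take the chunk tags from CoNLL2003 and the corrected NER labels from CoNLL++. The function name and the 4-column 'token POS chunk NER' layout are assumptions, not the script's actual code:

```python
def combine(conll2003_lines, conllpp_lines):
    """Merge two aligned CoNLL test files: keep the (correct) chunk tags
    from CoNLL2003 and the (corrected) NER labels from CoNLL++.

    Hypothetical sketch; assumes 4-column 'token POS chunk NER' lines
    and aligned sentence-separating blank lines in both files.
    """
    out = []
    for a, b in zip(conll2003_lines, conllpp_lines):
        a, b = a.rstrip("\n"), b.rstrip("\n")
        if not a.strip():          # sentence separator
            out.append("")
            continue
        tok, pos, chunk, _ = a.split()   # chunk tag from CoNLL2003
        ner = b.split()[-1]              # NER label from CoNLL++
        out.append(f"{tok} {pos} {chunk} {ner}")
    return out

merged = combine(["EU NNP B-NP B-ORG", ""], ["EU NNP I-NP B-MISC", ""])
print(merged)  # ['EU NNP B-NP B-MISC', '']
```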
- `data/conll2003_truecase`, `data/conll++_truecase`
  <details><summary>details</summary>
  <p>

  ```shell
  $ cd data/conll2003_truecase
  $ python to-truecase.py --input_path ../conll2003/train.txt > train.txt
  $ python to-truecase.py --input_path ../conll2003/valid.txt > valid.txt
  $ python to-truecase.py --input_path ../conll2003/test.txt > test.txt
  ```

  * same work for data/conll++

  </p>
  </details>
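to-truecase.py's internals aren't shown here; a common approach is frequency-based truecasing, sketched below. The helper names and method are illustrative, not the actual script:

```python
from collections import Counter, defaultdict

def build_case_map(cased_tokens):
    """Map each lowercased token to its most frequent cased surface form."""
    counts = defaultdict(Counter)
    for tok in cased_tokens:
        counts[tok.lower()][tok] += 1
    return {low: c.most_common(1)[0][0] for low, c in counts.items()}

def truecase(tokens, case_map):
    """Restore case from the map, leaving unknown tokens unchanged."""
    return [case_map.get(t.lower(), t) for t in tokens]

case_map = build_case_map(["Germany", "Germany", "the", "EU"])
print(truecase(["GERMANY", "the", "eu"], case_map))  # ['Germany', 'the', 'EU']
```

Truecasing matters for CoNLL2003 because headline-style all-caps text hurts cased pretrained models.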
### Kaggle NER (English)

- from entity-annotated-corpus
- `data/kaggle`
  - converting to CoNLL data format
  <details><summary>details</summary>
  <p>

  * download : https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus?select=ner_dataset.csv
  * remove illegal characters

  ```shell
  $ sed -e 's/5 storm/storm/' ner_dataset.csv > t ; mv t ner_dataset.csv
  $ iconv -f ISO-8859-1 -t UTF-8 ner_dataset.csv > ner_dataset.csv.utf
  $ python to-conll.py
  $ cp -rf valid.txt test.txt
  ```

  </p>
  </details>
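Conceptually, the conversion regroups the CSV rows into sentences, roughly as in this sketch. It assumes the dataset's 'Sentence #', 'Word', 'POS', 'Tag' columns (where 'Sentence #' is filled only on each sentence's first token); the '-' chunk placeholder mirrors the CoNLL-style output shown elsewhere in this README, and the function name is hypothetical:

```python
import csv
import io

def kaggle_to_conll(fileobj):
    """Group entity-annotated-corpus rows into CoNLL-style sentences."""
    sentences, current = [], []
    for row in csv.DictReader(fileobj):
        # a filled 'Sentence #' cell starts a new sentence
        if row["Sentence #"].strip() and current:
            sentences.append(current)
            current = []
        current.append(f'{row["Word"]} {row["POS"]} - {row["Tag"]}')
    if current:
        sentences.append(current)
    return "\n\n".join("\n".join(s) for s in sentences)

sample = io.StringIO(
    "Sentence #,Word,POS,Tag\n"
    "Sentence: 1,Thousands,NNS,O\n"
    ",of,IN,O\n"
    "Sentence: 2,London,NNP,B-geo\n"
)
print(kaggle_to_conll(sample))
```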
### GUM (English)

- from entity-recognition-datasets
- `data/gum`
  - converting to CoNLL data format
  <details><summary>details</summary>
  <p>

  * remove '*-object', '*-abstract'

  ```shell
  $ python to-conll.py --input_train=gum-train.conll --input_test=gum-test.conll --train=train.txt --valid=valid.txt --test=test.txt
  ```

  </p>
  </details>
### Naver NER 2019 (Korean)

- from HanBert-NER
- `data/clova2019`
  - converted to CoNLL data format
  <details><summary>details</summary>
  <p>

  ```
  이기범 eoj - B-PER
  한두 eoj - O
  쪽을 eoj - O
  먹고 eoj - O
  10분 eoj - B-TIM
  후쯤 eoj - I-TIM
  화제인을 eoj - B-CVL
  먹는 eoj - O
  것이 eoj - O
  좋다고 eoj - O
  한다 eoj - O
  . eoj - O
  ```

  </p>
  </details>
- `data/clova2019_morph`
  - tokenized by a morphological analyzer and converted to CoNLL data format
  <details><summary>details</summary>
  <p>

  ```
  이기범 NNP - B-PER
  한두 NNP - O
  쪽 NNB - O
  을 X-JKO - O
  먹다 VV - O
  고 X-EC - O
  10 SN - B-TIM
  분 X-NNB - I-TIM
  후 NNG - I-TIM
  쯤 X-XSN - I-TIM
  화제 NNG - B-CVL
  인 X-NNG - I-CVL
  을 X-JKO - I-CVL
  먹다 VV - O
  ...
  ```

  </p>
  </details>
  - the 'X-' prefix is prepended to the POS(Part of Speech) tags of non-initial morphs, marking them as continuations of the preceding morph.
  - we can evaluate the predicted results either morph-by-morph or in an eojeol-by-eojeol manner (for the latter, every line having an 'X-' POS tag is removed).
  <details><summary>details</summary>
  <p>

  ```
  이기범 NNP - B-PER
  한두 NNP - O
  쪽 NNB - O
  먹다 VV - O
  10 SN - B-TIM
  후 NNG - I-TIM
  화제 NNG - B-CVL
  먹다 VV - O
  ...
  ```

  </p>
  </details>
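The eojeol-level view above is obtained by simply dropping the 'X-' lines; a minimal sketch of that filtering (helper name is hypothetical):

```python
def to_eojeol_level(conll_lines):
    """Drop continuation morphs (POS tag prefixed with 'X-') so only the
    first morph of each eojeol remains, as in eojeol-level evaluation.
    Assumes 4-column 'morph POS chunk NER' lines; blank lines are kept.
    """
    kept = []
    for line in conll_lines:
        if line.strip() and line.split()[1].startswith("X-"):
            continue
        kept.append(line)
    return kept

lines = ["쪽 NNB - O", "을 X-JKO - O", "먹다 VV - O"]
print(to_eojeol_level(lines))  # ['쪽 NNB - O', '먹다 VV - O']
```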
- `data/clova2019_morph_space`
  - this data is identical to `data/clova2019_morph` except that it treats spaces as tokens.
  <details><summary>details</summary>
  <p>

  ```
  이기범 NNP - B-PER
  _ _ - O
  한두 NNP - O
  _ _ - O
  쪽 NNB - O
  을 X-JKO - O
  _ _ - O
  먹다 VV - O
  고 X-EC - O
  _ _ - O
  10 SN - B-TIM
  분 X-NNB - I-TIM
  _ _ - I-TIM
  후 NNG - I-TIM
  쯤 X-XSN - I-TIM
  _ _ - O
  화제 NNG - B-CVL
  인 X-NNG - I-CVL
  을 X-JKO - I-CVL
  _ _ - O
  먹다 VV - O
  ...
  ```

  </p>
  </details>
### KMOU NER (Korean)

- from KMOU NER
- `data/kmou2019`
  - build train.raw, valid.raw
    - data version : https://github.com/kmounlp/NER/commit/0b32e066870bda9f65cc190f5e89c2edc6cf8f6d
    - same as pytorch-bert-crf-ner
    - train.raw : 00002_NER.txt, ..., EXOBRAIN_NE_CORPUS_007.txt (1,425 files)
    - valid.raw : EXOBRAIN_NE_CORPUS_009.txt, EXOBRAIN_NE_CORPUS_010.txt (2 files)
  - apply correction rules and convert to CoNLL data format
  <details><summary>details</summary>
  <p>

  ```shell
  $ cd data/kmou2019
  $ python correction.py -g train.raw > t
  $ python to-conll.py -g t > train.txt
  ```

  ex)

  ```
  마음 마음 NNG B-POH
  ’ ’ SS O
  에 에 JKB O
  _ _ _ O
  담긴 담기+ㄴ VV+ETM O
  ```
  ->
  ```
  마음 NNG - B-POH
  ’ SS - O
  에 JKB - O
  _ _ - O
  담기다 VV - O
  ㄴ ETM - O
  ```

  ```shell
  $ python correction.py -g valid.raw > t
  $ python to-conll.py -g t > valid.txt
  $ cp -rf valid.txt test.txt
  ```

  </p>
  </details>
<br>

- `data/kmou2021`
  - data source
    - https://github.com/kmounlp/NER/tree/master/%EA%B0%9C%EC%B2%B4%EB%AA%85%20%EC%9D%B8%EC%8B%9D/data/example/KMOU_BIO

    ```
    testa.pos.txt testa.tags.txt testa.words.txt
    testb.pos.txt testb.tags.txt testb.words.txt
    train.pos.txt train.tags.txt train.words.txt
    ```
  - convert to CoNLL format

  ```shell
  $ python to-conll.py --words=train.words.txt --tags=train.tags.txt --pos=train.pos.txt > train.txt
  $ python to-conll.py --words=testa.words.txt --tags=testa.tags.txt --pos=testa.pos.txt > valid.txt
  $ python to-conll.py --words=testb.words.txt --tags=testb.tags.txt --pos=testb.pos.txt > test.txt
  ```
### KLUE NER (Korean)

- from KLUE-benchmark
- `data/klue`
  - convert to CoNLL format, original, character-based

  ```shell
  $ python to-conll.py --file klue-ner-v1.1_train.tsv > train.txt
  $ python to-conll.py --file klue-ner-v1.1_dev.tsv > valid.txt
  $ python to-conll.py --file klue-ner-v1.1_dev.tsv > test.txt
  ```
  - convert to CoNLL format, segmented by khaiii

  ```shell
  $ python to-conll.py --file klue-ner-v1.1_train.tsv --segmentation > train.txt
  $ python to-conll.py --file klue-ner-v1.1_dev.tsv --segmentation > valid.txt
  $ python to-conll.py --file klue-ner-v1.1_dev.tsv --segmentation > test.txt
  ```
  <details><summary>details</summary>
  <p>

  ```
  특히 pos pt O
  _ _ pt O
  영동고속도 pos pt B-LC
  로 pos pt I-LC
  _ _ pt O
  강릉 pos pt B-LC
  _ _ pt O
  방향 pos pt O
  _ _ pt O
  문막 pos pt B-LC
  휴게소 pos pt I-LC
  에서 pos pt O
  _ _ pt O
  만종분기점 pos pt B-LC
  까지 pos pt O
  _ _ pt O
  5 pos pt B-QT
  km pos pt I-QT
  _ _ pt O
  구간 pos pt O
  에 pos pt O
  는 pos pt O
  _ _ pt O
  승용차 pos pt O
  _ _ pt O
  전용 pos pt O
  _ _ pt O
  임시 pos pt O
  _ _ pt O
  갓길차로제 pos pt O
  를 pos pt O
  _ _ pt O
  운영 pos pt O
  하 pos pt O
  기 pos pt O
  로 pos pt O
  _ _ pt O
  했다. pos pt O
  ...
  ```

  </p>
  </details>
## Pretrained models

### English

- GloVe
  - download GloVe6B and unzip to the 'embeddings' dir

  ```shell
  $ mkdir embeddings
  $ ls embeddings
  glove.6B.zip
  $ unzip glove.6B.zip
  ```
- BERT-like models (huggingface's transformers)
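Once unzipped, the GloVe vectors are plain text, one word per line. Loading them might look like this sketch (a hypothetical helper, not the repo's actual loader):

```python
def load_glove(path, dim=300):
    """Parse a GloVe .txt file into {word: [float, ...]}.

    Each line is 'word v1 v2 ... vdim'; lines whose field count
    doesn't match dim + 1 are skipped.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors
```

glove.6B ships 50/100/200/300-dimensional files; pass the `dim` matching the file you load.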
- SpanBERT
  - pretrained SpanBERT models are compatible with huggingface's BERT models except for 'bert.pooler.dense.weight' and 'bert.pooler.dense.bias'.
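Given that note, one way to reuse a SpanBERT checkpoint with a huggingface BERT model is to drop the two pooler entries from the state dict before loading with `strict=False`. A minimal sketch (the helper name is hypothetical; the key names are the ones listed above):

```python
def strip_pooler(state_dict):
    """Remove the pooler parameters that differ in SpanBERT checkpoints,
    so the remaining weights can be loaded into a huggingface BERT model
    via load_state_dict(..., strict=False)."""
    skipped = {"bert.pooler.dense.weight", "bert.pooler.dense.bias"}
    return {k: v for k, v in state_dict.items() if k not in skipped}

sd = {
    "bert.pooler.dense.weight": 1,
    "bert.pooler.dense.bias": 2,
    "bert.encoder.layer.0.attention.self.query.weight": 3,
}
print(sorted(strip_pooler(sd)))  # ['bert.encoder.layer.0.attention.self.query.weight']
```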
- ELMo (allennlp)

  ```shell
  $ cd embeddings
  $ curl -OL https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5
  $ curl -OL https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json
  ```
### Korean

- description of Korean GloVe, BERT, DistilBERT, ELECTRA
  - GloVe : kor.glove.300k.300d.txt (inhouse)
  - bpe BERT : kor-bert-base-bpe.v1, kor-bert-large-bpe.v1, v3 (inhouse)
  - dha-bpe BERT : kor-bert-base-dha_bpe.v1, v3, kor-bert-large-dha_bpe.v1, v3 (inhouse)
  - dha BERT : kor-bert-base-dha.v1, v2 (inhouse)
  - MA-BERT-base : kor-bert-base-morpheme-aware (inhouse)
  - KcBERT : kcbert-base, kcbert-large
  - DistilBERT : kor-distil-bpe-bert.v1, kor-distil-dha-bert.v1, kor-distil-wp-bert.v1 (inhouse)
  - mDistilBERT : distilbert-base-multilingual-cased
  - KoELECTRA-Base : koelectra-base-v1-discriminator, koelectra-base-v3-discriminator
  - LM-KOR-ELECTRA : electra-kor-base
  - ELECTRA-base : kor-electra-bpe.v1, kor-electra-base-dhaToken1.large, kor-electra-base-dhaSyllable (inhouse)
  - RoBERTa-base : kor-roberta-base-bbpe (inhouse)
  - MA-RoBERTa-base : kor-roberta-base-morpheme-aware (inhouse)
  - XLM-RoBERTa : xlm-roberta-base, xlm-roberta-large
  - KLUE-RoBERTa : klue-roberta-base, klue-roberta-large
  - Funnel-base : funnel-kor-base
- ELMo description
  - kor_elmo_2x4096_512_2048cnn_2xhighway_1000k_weights.hdf5, kor_elmo_2x4096_512_2048cnn_2xhighway_1000k_options.json (inhouse)
## CoNLL 2003 (English)

- experiments summary
  - ntagger, measured by conlleval.pl (micro F1)

|       | F1 (%) | (truecase) F1 (%) | Features | GPU / CPU | ONNX | Dynamic | Etc |
| ----- | ------ | ----------------- | -------- | --------- | ---- | ------- | --- |