
Ntagger



Description

Reference PyTorch code for named entity tagging.

<br>

Requirements

  • python >= 3.6

  • pip install -r requirements.txt

<br>

Data

CoNLL 2003 (English)

from etagger, CrossWeigh

data/conll2003
data/conll++
  • since CoNLL++/test.txt has incorrect chunk tags, combine it with the original CoNLL2003/test.txt:
$ python combine.py --conll2003 ../conll2003/test.txt --conllpp test.txt > t
$ mv t test.txt
data/conll2003_truecase, data/conll++_truecase
<details><summary>details</summary> <p>
$ cd data/conll2003_truecase
$ python to-truecase.py --input_path ../conll2003/train.txt > train.txt
$ python to-truecase.py --input_path ../conll2003/valid.txt > valid.txt
$ python to-truecase.py --input_path ../conll2003/test.txt > test.txt

* do the same for data/conll++
</p> </details>
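The files above follow the standard CoNLL 2003 layout: one `token POS chunk tag` line per token, with a blank line between sentences. A minimal reader sketch (the `read_conll` helper is hypothetical, not a script from this repo):

```python
def read_conll(text):
    """Group CoNLL-formatted text into sentences of (token, tag) pairs.

    Expects one 'token POS chunk tag' line per token and a blank line
    between sentences; keeps only the token and the final (NER) column.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))  # token and NER tag
    if current:                              # flush the last sentence
        sentences.append(current)
    return sentences

conll_text = "EU NNP B-NP B-ORG\nrejects VBZ B-VP O\n\nPeter NNP B-NP B-PER\n"
print(read_conll(conll_text))
```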

Kaggle NER (English)

from entity-annotated-corpus

data/kaggle

  • convert to CoNLL data format.
<details><summary>details</summary> <p>
* download : https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus?select=ner_dataset.csv
* remove illegal characters
$ sed -e 's/5 storm/storm/' ner_dataset.csv > t ; mv t ner_dataset.csv
$ iconv -f ISO-8859-1 -t UTF-8 ner_dataset.csv > ner_dataset.csv.utf

$ python to-conll.py
$ cp -rf valid.txt test.txt
</p> </details>
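As a sketch of what this conversion involves: ner_dataset.csv has columns `Sentence #`, `Word`, `POS`, `Tag`, with the sentence id filled only on each sentence's first token. The `kaggle_to_conll` helper below is hypothetical (the repo's to-conll.py may differ), and the `-` chunk placeholder is an assumption matching the four-column examples elsewhere in this README:

```python
import csv
import io

def kaggle_to_conll(csv_text):
    """Turn Kaggle ner_dataset.csv rows into four-column CoNLL lines.

    Assumes columns 'Sentence #', 'Word', 'POS', 'Tag', where the
    sentence id is present only on the first token of each sentence.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row['Sentence #'] and lines:
            lines.append('')                 # blank line between sentences
        lines.append(f"{row['Word']} {row['POS']} - {row['Tag']}")
    return '\n'.join(lines)

csv_text = (
    "Sentence #,Word,POS,Tag\n"
    "Sentence: 1,Thousands,NNS,O\n"
    ",of,IN,O\n"
    "Sentence: 2,London,NNP,B-geo\n"
)
print(kaggle_to_conll(csv_text))
```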

GUM (English)

from entity-recognition-datasets

data/gum

  • convert to CoNLL data format.
<details><summary>details</summary> <p>
* remove '*-object', '*-abstract'
$ python to-conll.py --input_train=gum-train.conll --input_test=gum-test.conll --train=train.txt --valid=valid.txt --test=test.txt
</p> </details>

Naver NER 2019 (Korean)

from HanBert-NER

data/clova2019
  • converted to CoNLL data format.
<details><summary>details</summary> <p>
이기범 eoj - B-PER
한두 eoj - O
쪽을 eoj - O
먹고 eoj - O
10분 eoj - B-TIM
후쯤 eoj - I-TIM
화제인을 eoj - B-CVL
먹는 eoj - O
것이 eoj - O
좋다고 eoj - O
한다 eoj - O
. eoj - O
</p> </details>
data/clova2019_morph
  • tokenized by morphological analyzer and converted to CoNLL data format.
<details><summary>details</summary> <p>
이기범 NNP - B-PER
한두 NNP - O
쪽 NNB - O
을 X-JKO - O
먹다 VV - O
고 X-EC - O
10 SN - B-TIM
분 X-NNB - I-TIM
후 NNG - I-TIM
쯤 X-XSN - I-TIM
화제 NNG - B-CVL
인 X-NNG - I-CVL
을 X-JKO - I-CVL
먹다 VV - O
...
</p> </details>
  • an 'X-' prefix is prepended to the POS (part-of-speech) tag of every non-leading morph, to distinguish it from the first morph of its eojeol.

  • the predicted results can be evaluated either morph-by-morph or eojeol-by-eojeol (for the latter, every line carrying an 'X-' POS tag is removed).

<details><summary>details</summary> <p>
이기범 NNP - B-PER
한두 NNP - O
쪽 NNB - O
먹다 VV - O
10 SN - B-TIM
후 NNG - I-TIM
화제 NNG - B-CVL
먹다 VV - O
...
</p> </details>
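The eojeol-level view above can be produced with a simple filter: drop every line whose POS tag starts with 'X-'. A minimal sketch (the `to_eojeol_level` helper is hypothetical, not a script from this repo):

```python
def to_eojeol_level(lines):
    """Keep only leading-morph lines for eojeol-by-eojeol scoring.

    Each line is 'morph POS - TAG'; lines whose POS tag starts with
    'X-' mark non-leading morphs within an eojeol and are dropped.
    """
    kept = []
    for line in lines:
        cols = line.split()
        if len(cols) == 4 and cols[1].startswith('X-'):
            continue                         # skip non-leading morphs
        kept.append(line)
    return kept

morph_lines = [
    "이기범 NNP - B-PER",
    "쪽 NNB - O",
    "을 X-JKO - O",
    "먹다 VV - O",
]
print(to_eojeol_level(morph_lines))
```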
data/clova2019_morph_space
  • this data is identical to data/clova2019_morph except it treats spaces as tokens.
<details><summary>details</summary> <p>
이기범 NNP - B-PER
_ _ - O
한두 NNP - O
_ _ - O
쪽 NNB - O
을 X-JKO - O
_ _ - O
먹다 VV - O
고 X-EC - O
_ _ - O
10 SN - B-TIM
분 X-NNB - I-TIM
_ _ - I-TIM
후 NNG - I-TIM
쯤 X-XSN - I-TIM
_ _ - O
화제 NNG - B-CVL
인 X-NNG - I-CVL
을 X-JKO - I-CVL
_ _ - O
먹다 VV - O
...
</p> </details>

KMOU NER (Korean)

from KMOU NER

data/kmou2019
  • build train.raw, valid.raw
    • data version : https://github.com/kmounlp/NER/commit/0b32e066870bda9f65cc190f5e89c2edc6cf8f6d
    • same as pytorch-bert-crf-ner
    • train.raw : 00002_NER.txt, ..., EXOBRAIN_NE_CORPUS_007.txt (1,425 files)
    • valid.raw : EXOBRAIN_NE_CORPUS_009.txt, EXOBRAIN_NE_CORPUS_010.txt (2 files)
    • apply correction rules and convert to CoNLL data format
<details><summary>details</summary> <p>
$ cd data/kmou2019
$ python correction.py -g train.raw > t
$ python to-conll.py -g t > train.txt
  
ex)
마음	마음	NNG	B-POH
’	’	SS	O
에	에	JKB	O
_	_	_	O
담긴	담기+ㄴ	VV+ETM	O
->
마음 NNG - B-POH
’ SS - O
에 JKB - O
_ _ - O
담기다 VV - O
ㄴ ETM - O

$ python correction.py -g valid.raw > t
$ python to-conll.py -g t > valid.txt
$ cp -rf valid.txt test.txt
</p> </details> <br>
data/kmou2021
  • data source

    • https://github.com/kmounlp/NER/tree/master/%EA%B0%9C%EC%B2%B4%EB%AA%85%20%EC%9D%B8%EC%8B%9D/data/example/KMOU_BIO
    testa.pos.txt  testa.tags.txt  testa.words.txt  testb.pos.txt  testb.tags.txt  testb.words.txt  train.pos.txt  train.tags.txt  train.words.txt 
    
  • convert to CoNLL format

$ python to-conll.py --words=train.words.txt --tags=train.tags.txt --pos=train.pos.txt > train.txt
$ python to-conll.py --words=testa.words.txt --tags=testa.tags.txt --pos=testa.pos.txt > valid.txt
$ python to-conll.py --words=testb.words.txt --tags=testb.tags.txt --pos=testb.pos.txt > test.txt
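A rough sketch of what such a merge does with the three parallel files, assuming (hypothetically; check the actual KMOU_BIO files) one sentence per line with space-separated, token-aligned fields across the words, tags, and pos files:

```python
def merge_to_conll(words, tags, pos):
    """Zip three parallel sentence-per-line inputs into CoNLL lines.

    Assumes each input holds one sentence per line, space-separated
    and token-aligned across the three inputs (an assumption).
    """
    out = []
    for w_line, t_line, p_line in zip(words, tags, pos):
        for w, t, p in zip(w_line.split(), t_line.split(), p_line.split()):
            out.append(f"{w} {p} - {t}")
        out.append('')                       # blank line ends the sentence
    return out

words = ["마음 에 담기다"]
pos_tags = ["NNG JKB VV"]
ner_tags = ["B-POH O O"]
print(merge_to_conll(words, ner_tags, pos_tags))
```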

KLUE NER (Korean)

from KLUE-benchmark

data/klue
  • convert to CoNLL format, original, character-based
$ python to-conll.py --file klue-ner-v1.1_train.tsv > train.txt
$ python to-conll.py --file klue-ner-v1.1_dev.tsv > valid.txt
$ python to-conll.py --file klue-ner-v1.1_dev.tsv > test.txt
  • convert to CoNLL format, segmented by khaiii
<details><summary>details</summary> <p>
$ python to-conll.py --file klue-ner-v1.1_train.tsv --segmentation > train.txt
$ python to-conll.py --file klue-ner-v1.1_dev.tsv --segmentation > valid.txt
$ python to-conll.py --file klue-ner-v1.1_dev.tsv --segmentation > test.txt

특히 pos pt O
_ _ pt O
영동고속도 pos pt B-LC
로 pos pt I-LC
_ _ pt O
강릉 pos pt B-LC
_ _ pt O
방향 pos pt O
_ _ pt O
문막 pos pt B-LC
휴게소 pos pt I-LC
에서 pos pt O
_ _ pt O
만종분기점 pos pt B-LC
까지 pos pt O
_ _ pt O
5 pos pt B-QT
km pos pt I-QT
_ _ pt O
구간 pos pt O
에 pos pt O
는 pos pt O
_ _ pt O
승용차 pos pt O
_ _ pt O
전용 pos pt O
_ _ pt O
임시 pos pt O
_ _ pt O
갓길차로제 pos pt O
를 pos pt O
_ _ pt O
운영 pos pt O
하 pos pt O
기 pos pt O
로 pos pt O
_ _ pt O
했다. pos pt O
...
</p> </details>

Pretrained models

  • English

    • glove
      $ mkdir embeddings
      $ ls embeddings
      glove.6B.zip
      $ unzip glove.6B.zip 
      
    • BERT-like models (huggingface's transformers)
    • SpanBERT
      • pretrained SpanBERT models are compatible with huggingface's BERT model except for 'bert.pooler.dense.weight' and 'bert.pooler.dense.bias'.
    • ELMo(allennlp)
    $ cd embeddings
    $ curl -OL https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5
    $ curl -OL https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json
    
  • Korean

    • description of Korean GloVe, BERT, DistilBERT, ELECTRA
      • GloVe : kor.glove.300k.300d.txt (inhouse)
      • bpe BERT : kor-bert-base-bpe.v1, kor-bert-large-bpe.v1, v3 (inhouse)
      • dha-bpe BERT : kor-bert-base-dha_bpe.v1, v3, kor-bert-large-dha_bpe.v1, v3 (inhouse)
      • dha BERT : kor-bert-base-dha.v1, v2 (inhouse)
      • MA-BERT-base : kor-bert-base-morpheme-aware (inhouse)
      • KcBERT : kcbert-base, kcbert-large
      • DistilBERT : kor-distil-bpe-bert.v1, kor-distil-dha-bert.v1, kor-distil-wp-bert.v1 (inhouse)
      • mDistilBERT : distilbert-base-multilingual-cased
      • KoELECTRA-Base : koelectra-base-v1-discriminator, koelectra-base-v3-discriminator
      • LM-KOR-ELECTRA : electra-kor-base
      • ELECTRA-base : kor-electra-bpe.v1, kor-electra-base-dhaToken1.large, kor-electra-base-dhaSyllable (inhouse)
      • RoBERTa-base : kor-roberta-base-bbpe (inhouse)
      • MA-RoBERTa-base : kor-roberta-base-morpheme-aware (inhouse)
      • XLM-RoBERTa : xlm-roberta-base, xlm-roberta-large
      • KLUE-RoBERTa : klue-roberta-base, klue-roberta-large
      • Funnel-base : funnel-kor-base
    • ELMo description
      • kor_elmo_2x4096_512_2048cnn_2xhighway_1000k_weights.hdf5, kor_elmo_2x4096_512_2048cnn_2xhighway_1000k_options.json (inhouse)
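The unzipped GloVe files listed above are plain text, one `token v1 v2 ...` line per word. A minimal loader sketch (the `load_glove` helper is hypothetical; it takes an iterable of lines so it can run without the real file):

```python
def load_glove(lines):
    """Parse GloVe text lines ('token v1 v2 ...') into a dict of vectors."""
    vectors = {}
    for line in lines:
        token, *values = line.rstrip().split(' ')
        vectors[token] = [float(v) for v in values]
    return vectors

# tiny 3-dimensional stand-in for a file like glove.6B.300d.txt
glove_lines = ["the 0.1 0.2 0.3", "cat -0.4 0.5 0.6"]
emb = load_glove(glove_lines)
print(len(emb), emb["cat"])
```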
<br>

CoNLL 2003 (English)

experiments summary

  • ntagger, measured by conlleval.pl (micro F1)

| | F1 (%) | (truecase) F1 (%) | Features | GPU / CPU | ONNX | Dynamic | Etc |
| --- | --- | --- | --- | --- | --- | --- | --- |
