T-NER: An All-Round Python Library for Transformer-based Named Entity Recognition

T-NER is a Python tool for fine-tuning language models on named entity recognition (NER), implemented in PyTorch and available via pip. It provides a simple interface for fine-tuning models and evaluating them on cross-domain and multilingual datasets. T-NER currently integrates a wide range of publicly available NER datasets and makes it easy to add custom datasets. All models fine-tuned with T-NER can be deployed on our web app for visualization. Our paper demonstrating T-NER was accepted to EACL 2021. All models and datasets are shared via the T-NER HuggingFace group.

NEW (September 2022): We released a new NER dataset based on Twitter, tweetner7, and the accompanying paper was accepted to the AACL 2022 main conference! We release the dataset along with fine-tuned models; more details can be found in the paper, repository and dataset page. The Twitter NER model has also been integrated into TweetNLP, and a demo is available here.

Resources: MODEL_CARD, DATASET_CARD, Gradio Online DEMO
HuggingFace: https://huggingface.co/tner
GitHub: https://github.com/asahi417/tner
Papers
- T-NER (EACL2021): acl anthology, arxiv
- TweetNER7 (AACL 2022): arxiv

Install tner via pip to get started!

pip install tner

Dataset
1.1 Preset Dataset
1.2 Custom Dataset
Model
Fine-Tuning Language Model on NER
Evaluating NER Model
Web API
Colab Examples
Reference

Google Colab Examples

| Description                   | Link |
|-------------------------------|------|
| Model Finetuning & Evaluation |      |
| Model Prediction              |      |
| Multilingual NER Workflow     |      |

Dataset

An NER dataset contains a list of token sequences and the corresponding tag sequences for each split (usually train/validation/test),

{
    'train': {
        'tokens': [
            ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.'],
            ['From', 'Green', 'Newsfeed', ':', 'AHFA', 'extends', 'deadline', 'for', 'Sage', 'Award', 'to', 'Nov', '.', '5', 'http://tinyurl.com/24agj38'], ...
        ],
        'tags': [
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...
        ]
    },
    'validation': ...,
    'test': ...,
}

with a dictionary to map a label to its index (label2id) as below.

{"O": 0, "B-ORG": 1, "B-MISC": 2, "B-PER": 3, "I-PER": 4, "B-LOC": 5, "I-ORG": 6, "I-MISC": 7, "I-LOC": 8}
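To read integer tags as labels, the label2id dictionary can simply be inverted. The snippet below is a minimal sketch in plain Python; `id2label`, `decode_tags` and the example sentence are illustrative names and data, not part of the tner API.

```python
# label2id copied from the example above
label2id = {"O": 0, "B-ORG": 1, "B-MISC": 2, "B-PER": 3, "I-PER": 4,
            "B-LOC": 5, "I-ORG": 6, "I-MISC": 7, "I-LOC": 8}

# Invert the mapping so that integer tags can be turned back into labels
id2label = {index: label for label, index in label2id.items()}


def decode_tags(tag_ids):
    """Convert a list of integer tags into their string labels."""
    return [id2label[i] for i in tag_ids]


# Hypothetical sentence and tags, not taken from any dataset
tokens = ["United", "Nations", "meets", "in", "Geneva"]
tags = [1, 6, 0, 0, 5]
decoded = decode_tags(tags)

for token, label in zip(tokens, decoded):
    print(f"{token}\t{label}")
```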

Preset Dataset

A variety of public NER datasets are available on our HuggingFace group and can be loaded as below (see DATASET CARD for the full list of datasets).

from tner import get_dataset
data, label2id = get_dataset(dataset="tner/wnut2017")

Users can specify multiple datasets to get a concatenated dataset.

data, label2id = get_dataset(dataset=["tner/conll2003", "tner/ontonotes5"])

In concatenated datasets, a unified label set is used so that entity labels are consistent across the datasets. The idea is to share all available NER datasets on HuggingFace in a unified format, so let us know if you want any NER dataset to be added there!
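The general idea of re-indexing tags onto a shared label set can be sketched as follows. This is only an illustration under simple assumptions: the library's own unified label set is not reproduced here, label names are assumed to already match across datasets, and `unify_label2id`, `remap_tags` and the two toy mappings are hypothetical.

```python
def unify_label2id(label2id_list):
    """Build one shared label2id from several per-dataset mappings.

    "O" is kept at index 0; the remaining labels are sorted for determinism.
    """
    labels = set()
    for label2id in label2id_list:
        labels.update(label2id)
    labels.discard("O")
    unified = {"O": 0}
    for label in sorted(labels):
        unified[label] = len(unified)
    return unified


def remap_tags(tags, source_label2id, unified_label2id):
    """Re-index integer tags from a dataset-specific mapping to the shared one."""
    source_id2label = {index: label for label, index in source_label2id.items()}
    return [[unified_label2id[source_id2label[t]] for t in sentence]
            for sentence in tags]


# Two hypothetical datasets that index their labels differently
label2id_a = {"O": 0, "B-PER": 1, "I-PER": 2}
label2id_b = {"O": 0, "B-LOC": 1, "I-LOC": 2, "B-PER": 3, "I-PER": 4}

unified = unify_label2id([label2id_a, label2id_b])
tags_b = [[0, 3, 4, 0, 1]]  # O, B-PER, I-PER, O, B-LOC in dataset b
remapped = remap_tags(tags_b, label2id_b, unified)
```

After remapping, the same integer refers to the same label in every dataset, which is what makes simple concatenation of the splits safe.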

Custom Dataset

To go beyond the public datasets, users can use their own datasets by formatting them into the IOB format described in CoNLL 2003 NER shared task paper, where all data files contain one word per line with empty lines representing sentence boundaries. At the end of each line there is a tag which states whether the current word is inside a named entity or not. The tag also encodes the type of named entity. Here is an example sentence:

EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O

Words tagged with O are outside of named entities and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity will be tagged B-XXX in order to show that it starts another entity. Please take a look at the sample custom data. Custom files can be loaded in the same way as a HuggingFace dataset, as below.

from tner import get_dataset
data, label2id = get_dataset(local_dataset={
    "train": "examples/local_dataset_sample/train.txt",
    "valid": "examples/local_dataset_sample/train.txt",
    "test": "examples/local_dataset_sample/test.txt"
})

As with HuggingFace datasets, multiple local datasets can be concatenated.

data, label2id = get_dataset(local_dataset=[
   {"train": "...", "valid": "...", "test": "..."},
   {"train": "...", "valid": "...", "test": "..."}
   ]
)
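For reference, the one-word-per-line format described in this section can be read with a few lines of plain Python. The sketch below is not the loader used by tner; `read_iob` is a hypothetical helper, and the second sentence in `sample` is made up to show the blank-line sentence boundary.

```python
def read_iob(text):
    """Parse one-word-per-line IOB text into parallel token and label lists.

    Blank lines mark sentence boundaries; the tag is taken to be the last
    whitespace-separated field on each line.
    """
    tokens, labels = [], []
    cur_tokens, cur_labels = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any
            if cur_tokens:
                tokens.append(cur_tokens)
                labels.append(cur_labels)
                cur_tokens, cur_labels = [], []
            continue
        fields = line.split()
        cur_tokens.append(fields[0])
        cur_labels.append(fields[-1])
    if cur_tokens:  # flush the last sentence when the text has no trailing blank line
        tokens.append(cur_tokens)
        labels.append(cur_labels)
    return tokens, labels


sample = """EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O

She O
visited O
London B-LOC
"""

tokens, labels = read_iob(sample)

# Assign indices in order of first appearance, keeping "O" at 0
label2id = {"O": 0}
for sentence in labels:
    for label in sentence:
        label2id.setdefault(label, len(label2id))
```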

Model

T-NER has currently shared more than 100 NER models on the HuggingFace group. The table above reports the major models only; see MODEL_CARD for the full list of models. All the models can be used with tner as below.

from tner import TransformersNER
model = TransformersNER("tner/roberta-large-wnut2017")  # provide model alias on huggingface
output = model.predict(["Jacob Collier is a Grammy awarded English artist from London"])  # give a list of sentences (or tokenized sentence) 
print(output)
{
   'prediction': [['B-person', 'I-person', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-location']],
   'probability': [[0.9967652559280396, 0.9994561076164246, 0.9986955523490906, 0.9947081804275513, 0.6129112243652344, 0.9984312653541565, 0.9868122935295105, 0.9983410835266113, 0.9995284080505371, 0.9838910698890686]],
   'input': [['Jacob', 'Collier', 'is', 'a', 'Grammy', 'awarded', 'English', 'artist', 'from', 'London']],
   'entity_prediction': [[
       {'type': 'person', 'entity': ['Jacob', 'Collier'], 'position': [0, 1], 'probability': [0.9967652559280396, 0.9994561076164246]},
       {'type': 'location', 'entity': ['London'], 'position': [9], 'probability': [0.9838910698890686]}
    ]]
}

model.predict takes a list of sentences and, optionally, a batch size (batch_size). Each sentence is split into tokens on a half-width space or on the symbol specified by separator, and the resulting tokens are returned as input in the output object. Alternatively, users can tokenize the inputs beforehand with any tokenizer (spaCy, NLTK, etc.), and the prediction will follow that tokenization.

output = model.predict([["Jacob Collier", "is", "a", "Grammy awarded", "English artist", "from", "London"]])
print(output)
{
    'prediction': [['B-person', 'O', 'O', 'O', 'O', 'O', 'B-location']],
    'probability': [[0.9967652559280396, 0.9986955523490906, 0.9947081804275513, 0.6129112243652344, 0.9868122935295105, 0.9995284080505371, 0.9838910698890686]],
    'input': [['Jacob Collier', 'is', 'a', 'Grammy awarded', 'English artist', 'from', 'London']],
    'entity_prediction': [[
        {'type': 'person', 'entity': ['Jacob Collier'], 'position': [0], 'probability': [0.9967652559280396]},
        {'type': 'location', 'entity': ['London'], 'position': [6], 'probability': [0.9838910698890686]}
    ]]
}
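The relation between the token-level prediction and the span-level entity_prediction in the outputs above can be illustrated with a small grouping routine. This is a sketch of one common way to collect IOB labels into spans, not necessarily how tner computes entity_prediction; `extract_entities` is a hypothetical helper and probabilities are omitted.

```python
def extract_entities(tokens, labels):
    """Group IOB labels into entity spans with type, tokens and positions."""
    entities = []
    current = None  # the entity currently being extended, if any
    for position, (token, label) in enumerate(zip(tokens, labels)):
        starts_new = label.startswith("B-") or (
            label.startswith("I-")
            and (current is None or current["type"] != label[2:])
        )
        if starts_new:
            # A B- tag, or an I- tag that does not continue the open entity
            current = {"type": label[2:], "entity": [token], "position": [position]}
            entities.append(current)
        elif label.startswith("I-"):
            current["entity"].append(token)
            current["position"].append(position)
        else:
            current = None  # "O" closes any open entity
    return entities


# Tokens and labels copied from the first prediction example above
tokens = ['Jacob', 'Collier', 'is', 'a', 'Grammy', 'awarded', 'English', 'artist', 'from', 'London']
labels = ['B-person', 'I-person', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-location']
entities = extract_entities(tokens, labels)
```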

A local model checkpoint can be specified instead of a model alias, as in TransformersNER("path-to-checkpoint"). The script to reproduce the released models is here.

command-line tool

Following command-line tool is available for model p
