Extend

Entity Disambiguation as text extraction (ACL 2022)

Generate Convert Improve

Install / Use

/learn @SapienzaNLP/Extend

About this skill

Quality Score

0/100

README

<h1 align ="center"> ExtEnD: Extractive Entity Disambiguation </h1> <p align="center"> <a href="https://sunglasses-ai.github.io/classy/"> <img alt="Python" src="https://img.shields.io/badge/-classy%200.2.1-black?style=for-the-badge&logoColor=white&logo=data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz4NCjwhLS0gR2VuZXJhdG9yOiBBZG9iZSBJbGx1c3RyYXRvciAxNy4wLjAsIFNWRyBFeHBvcnQgUGx1Zy1JbiAuIFNWRyBWZXJzaW9uOiA2LjAwIEJ1aWxkIDApICAtLT4NCjwhRE9DVFlQRSBzdmcgUFVCTElDICItLy9XM0MvL0RURCBTVkcgMS4xLy9FTiIgImh0dHA6Ly93d3cudzMub3JnL0dyYXBoaWNzL1NWRy8xLjEvRFREL3N2ZzExLmR0ZCI+DQo8c3ZnIHZlcnNpb249IjEuMSIgaWQ9IkxpdmVsbG9fMSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxuczp4bGluaz0iaHR0cDovL3d3dy53My5vcmcvMTk5OS94bGluayIgeD0iMHB4IiB5PSIwcHgiDQoJIHdpZHRoPSI0MzcuNnB4IiBoZWlnaHQ9IjQxMy45NzhweCIgdmlld0JveD0iMCAwIDQzNy42IDQxMy45NzgiIGVuYWJsZS1iYWNrZ3JvdW5kPSJuZXcgMCAwIDQzNy42IDQxMy45NzgiIHhtbDpzcGFjZT0icHJlc2VydmUiPg0KPGc+DQoJPHBhdGggZmlsbD0iI0ZGQ0MwMCIgZD0iTTM5NC42NjgsMTczLjAzOGMtMS4wMTgsMC0yLjAyLDAuMDY1LTMuMDE1LDAuMTUyQzM3NS42NjUsODIuODExLDI5Ni43OSwxNC4xNDYsMjAxLjgyMywxNC4xNDYNCgkJQzk1LjMxOSwxNC4xNDYsOC45OCwxMDAuNDg1LDguOTgsMjA2Ljk4OWMwLDEwNi41MDUsODYuMzM5LDE5Mi44NDMsMTkyLjg0MywxOTIuODQzYzk0Ljk2NywwLDE3My44NDItNjguNjY1LDE4OS44MjktMTU5LjA0NA0KCQljMC45OTUsMC4wODcsMS45OTcsMC4xNTIsMy4wMTUsMC4xNTJjMTguNzUxLDAsMzMuOTUyLTE1LjIsMzMuOTUyLTMzLjk1MkM0MjguNjIsMTg4LjIzOSw0MTMuNDE5LDE3My4wMzgsMzk0LjY2OCwxNzMuMDM4eg0KCQkgTTIwMS44MjMsMzQ2Ljg2OWMtNzcuMTMsMC0xMzkuODgtNjIuNzUtMTM5Ljg4LTEzOS44OGMwLTc3LjEyOSw2Mi43NS0xMzkuODc5LDEzOS44OC0xMzkuODc5czEzOS44OCw2Mi43NSwxMzkuODgsMTM5Ljg3OQ0KCQlDMzQxLjcwMywyODQuMTE5LDI3OC45NTMsMzQ2Ljg2OSwyMDEuODIzLDM0Ni44Njl6Ii8+DQoJPGc+DQoJCTxwYXRoIGZpbGw9IiNGRkNDMDAiIGQ9Ik0xMTQuOTA3LDIzMy40NzNjLTE0LjYyNiwwLTI2LjQ4My0xMS44NTYtMjYuNDgzLTI2LjQ4M2MwLTYyLjUyOCw1MC44NzEtMTEzLjQwMiwxMTMuMzk4LTExMy40MDINCgkJCWMxNC42MjYsMCwyNi40ODMsMTEuODU2LDI2LjQ4MywyNi40ODNzLTExLjg1NiwyNi40ODMtMjYuNDgzLDI2LjQ4M2MtMzMuMzI0LDAtNjAuNDMzLDI3LjExMi02MC40MzMsNjAuNDM2DQoJCQlDMTQxLjM5LDIyMS42MTcsMTI5LjUzNCwyMzMuNDczLDExNC45MDcsMjMzLjQ3M3oiLz4NCgk8L2c+DQo8L2c+DQo8L3N2Zz4NCg=="> </a> <a href=""> <img alt="Python" src="https://img.shields.io/badge/Python 3.8--3.9-blue?style=for-the-badge&logo=python&logoColor=white"> </a> <a href="https://pytorch.org/get-started/locally/"> <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch 1.9-ee4c2c?style=for-the-badge&logo=pytorch&logoColor=white"> </a> <a href="https://spacy.io/"> <img alt="plugin: spacy" src="https://img.shields.io/badge/plugin%20for-spaCy%203.2-09A3D5.svg?style=for-the-badge&labelColor=gray"> </a> <a href="https://black.readthedocs.io/en/stable/"> <img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-black.svg?style=for-the-badge&labelColor=gray"> </a> </p>

This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Disambiguation (i.e. the task of linking a mention in context with its most suitable entity in a reference knowledge base) where we reformulate this task as a text extraction problem. This work was accepted at ACL 2022.

If you find our paper, code or framework useful, please reference this work in your paper:

@inproceedings{barba-etal-2021-extend,
    title = "{E}xt{E}n{D}: Extractive Entity Disambiguation",
    author = "Barba, Edoardo  and
      Procopio, Luigi  and
      Navigli, Roberto",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    address = "Online and Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
}

ExtEnD Image

ExtEnD is built on top of the classy library. If you are interested in using this project, we recommend checking first its introduction, although it is not strictly required to train and use the models.

Finally, we also developed a few additional tools that make it simple to use and test ExtEnD models:

a very simple custom component for spaCy
a demo on HuggingFace Spaces
a docker image running two services, a streamlit demo and a REST service

Setup the environment

Requirements:

Debian-based (e.g. Debian, Ubuntu, ...) system
conda installed

To quickly setup the environment to use ExtEnd/replicate our experiments, you can use the bash script setup.sh. The only requirements needed here is to have a Debian-based system (Debian, Ubuntu, ...) and conda installed.

bash setup.sh

Checkpoints

We release the following checkpoints:

| Model | Training Dataset | Avg Score | |:-----------------------------------------------------------------------------------------------:|:--:|:------------:| | Longformer Large | AIDA | 85.8 |

Once you have downloaded the files, untar them inside the experiments/ folder.

# move file to experiments folder
mv ~/Downloads/extend-longformer-large.tar.gz experiments/
# untar
tar -xf experiments/extend-longformer-large.tar.gz -C experiments/
rm experiments/extend-longformer-large.tar.gz

Data

All the datasets used to train and evaluate ExtEnD can be downloaded using the following script from the facebook GENRE repository.

We strongly recommend you organize them in the following structure under the data folder as it is used by several scripts in the project.

data
├── aida
│   ├── test.aida
│   ├── train.aida
│   └── validation.aida
└── out_of_domain
    ├── ace2004-test-kilt.ed
    ├── aquaint-test-kilt.ed
    ├── clueweb-test-kilt.ed
    ├── msnbc-test-kilt.ed
    └── wiki-test-kilt.ed

Training

To train a model from scratch, you just have to use the following command:

classy train qa <folder> -n my-model-name --profile aida-longformer-large-gam -pd extend

<folder> can be any folder containing exactly 3 files:

train.aida
validation.aida
test.aida

This is required to let classy automatically discover the dataset splits. For instance, to re-train our AIDA-only model:

classy train data/aida -n my-model-name --profile aida-longformer-large-gam -pd extend

Note that <folder> can be any folder, as long as:

it contains these 3 files
they are in the same format as the files in data/aida

So if you want to train on these different datasets, just create the corresponding directory and you are ready to go!

In case you want to modify some training hyperparameter, you just have to edit the aida-longformer-large-gam profile in the configurations/ folder. You can take a look to the modifiable parameters by adding the parameter --print to the training command. You can find more on this in classy official documentation.

Predict

You can use classy syntax to perform file prediction:

classy predict -pd extend file \
    experiments/extend-longformer-large \
    data/aida/test.aida \
    -o data/aida_test_predictions.aida

Evaluation

To evaluate a checkpoint, you can run the bash script scripts/full_evaluation.sh, passing its path as an input argument. This will evaluate the model provided against both AIDA and OOD resources.

# syntax: bash scripts/full_evaluation.sh <ckpt-path>
bash scripts/full_evaluation.sh experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt

If you are interested in AIDA-only evaluation, you can use scripts/aida_evaluation.sh instead (same syntax).

Furthermore, you can evaluate the model on any dataset that respects the same format of the original ones with the following command:

classy evaluate \
    experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt \
    data/aida/test.aida \
    -o data/aida_test_evaluation.txt \
    -pd extend

spaCy

You can also use ExtEnD with spaCy, allowing you to use our system with a seamless interface that tackles full end-to-end entity linking. To do so, you just need to have cloned the repo and run setup.sh to configure the environment. Then, you will be able to add extend as a custom component in the following way:

import spacy
from extend import spacy_component

nlp = spacy.load("en_core_web_sm")

extend_config = dict(
    checkpoint_path="<ckpt-path>",
    mentions_inventory_path="<inventory-path>",
    device=0,
    tokens_per_batch=4000,
)

nlp.add_pipe("extend", after="ner", config=extend_config)

input_sentence = "Japan began the defence of their title " \
                 "with a lucky 2-1 win against Syria " \
                 "in a championship match on Friday."

doc = nlp(input_sentence)

# [(Japan, Japan National Footbal Team), (Syria, Syria National Footbal Team)]
disambiguated_entities = [(ent.text, ent._.disambiguated_entity) for ent in doc.ents]

Where:

<ckpt-path> is the path to a pretrained checkpoint of extend that you can find in the Checkpoints section, and
<inventory-path> is the path to a file containing the mapping from mentions to the corresponding candidates.

We support two formats f

Related Skills

node-connect

352.0k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

111.1k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

352.0k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

352.0k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。