DyGIE++

Implements the model described in the paper Entity, Relation, and Event Extraction with Contextualized Span Representations.

Table of Contents

See the doc folder for documentation with more details on the data, model implementation and debugging, and model configuration.

Updates

October 2023: Unfortunately, AllenNLP (on which DyGIE++ is built) has been archived and is not actively maintained. Due to changes to various software packages, the unavailability of older versions, following the instructions under dependencies now raises errors when trying to install DyGIE++. I don't have bandwidth to get things updated. I'd welcome a PR to update the relevant dependencies and get things working again! See the dependencies section for more info.

December 2021: A couple of nice additions thanks to PRs from contributors:

  • There is now a script to convert BRAT-formatted annotations to DyGIE. See here for more details. Thanks to @serenalotreck for this feature.
  • There are spaCy bindings for DyGIE entity and relation extraction; see the section on spaCy bindings. Thanks to @e3oroush for this feature.

April 2021: We've added data and models for the MECHANIC dataset, presented in the NAACL 2021 paper Extracting a Knowledge Base of Mechanisms from COVID-19 Papers.

You can also get the data by running bash scripts/data/get_mechanic.sh, which will put the data in data/mechanic.

After moving the models to the pretrained folder, you can make predictions like this:

allennlp predict \
  pretrained/mechanic-coarse.tar.gz \
  data/mechanic/coarse/test.json \
  --predictor dygie \
  --include-package dygie \
  --use-dataset-reader \
  --output-file predictions/covid-coarse.jsonl \
  --cuda-device 0 \
  --silent
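Each line of the resulting predictions file is one JSON document. As a minimal sketch for inspecting the output (this assumes the DyGIE output convention of a `predicted_ner` field holding per-sentence span lists of the form `[start, end, label, ...]`; check your model's output if the format differs):

```python
import json

def iter_predicted_entities(path):
    """Yield (doc_key, start_token, end_token, label) for each predicted entity.

    Assumes one JSON document per line, with a "predicted_ner" list containing
    one span list per sentence; each span is [start, end, label, ...] with
    inclusive, document-level token indices.
    """
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            for sentence_spans in doc.get("predicted_ner", []):
                for span in sentence_spans:
                    start, end, label = span[0], span[1], span[2]
                    yield doc["doc_key"], start, end, label
```

For example, `iter_predicted_entities("predictions/covid-coarse.jsonl")` would walk every predicted entity span in the file above.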

Project status

This branch used to be named allennlp-v1, and it has been made the new master. It's compatible with the new version of AllenNLP, and the model configuration process has been simplified. I'd recommend using this branch for all future work. If for some reason you need the older version of the code, it's on the branch emnlp-2019.

Unfortunately, I don't have the bandwidth at this point to add additional features. But please create a new issue if you have problems with:

  • Reproducing the results reported in the README.
  • Making predictions on a new dataset using pre-trained models.
  • Training your own model on a new dataset.

See below for guidelines on creating an issue.

There are a number of ways this code could be improved, and I'd definitely welcome pull requests. If you're interested, see contributions.md for a list of ideas.

Submit a model!

If you have a DyGIE model that you've trained on a new dataset, feel free to upload it here and I'll add it to the collection of pre-trained models.

Issues

If you're unable to run the code, feel free to create an issue. Please do the following:

  • Confirm that you've set up a Conda environment exactly as in the Dependencies section below. I can only offer support if you're running code within this environment.
  • Specify any commands you used to download pretrained models or to download / preprocess data. Please enclose the code in code blocks, for instance:
    # Download pretrained models.
    
    bash scripts/pretrained/get_dygiepp_pretrained.sh
    
  • Share the command that you ran to cause the issue, for instance:
    allennlp evaluate \
    pretrained/scierc.tar.gz \
    data/scierc/normalized_data/json/test.json \
    --cuda-device 2 \
    --include-package dygie
    
  • If you're using your own dataset, attach a minimal example of the data which, when given as input, causes the error you're seeing. This could be, for instance, a single line from a .jsonl file.
  • Include the full error message that you're getting.
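For reference, a minimal DyGIE-formatted document can be built like this. The field names follow the input format described in the doc folder; the dataset name and labels here are purely illustrative:

```python
import json

# A minimal DyGIE input document. The "ner" and "relations" lists are
# parallel to "sentences" (one entry per sentence), and span indices are
# inclusive, document-level token offsets (cumulative across sentences).
doc = {
    "doc_key": "example-doc",
    "dataset": "scierc",  # should match the dataset the model was trained on
    "sentences": [
        ["DyGIE", "extracts", "entities", "."],
        ["It", "also", "extracts", "relations", "."],
    ],
    "ner": [
        [[0, 0, "Method"]],  # "DyGIE" labeled as a Method (illustrative)
        [],
    ],
    "relations": [
        [],
        [],
    ],
}

line = json.dumps(doc)  # one document per line in a .jsonl file
```

Writing one such line per document produces a .jsonl file suitable for attaching to an issue or feeding to the model.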

Dependencies

Update (October 2023): These directions no longer work. Python 3.7 is no longer available from conda, and AllenNLP is no longer actively maintained, causing some dependencies to break. I'd welcome a PR to get things working again.

Clone this repository and navigate to the root of the repo on your system. Then execute:

conda create --name dygiepp python=3.7
pip install -r requirements.txt
conda develop .   # Adds DyGIE to your PYTHONPATH

This library relies on AllenNLP and uses AllenNLP shell commands to kick off training, evaluation, and testing.

If you run into an issue installing jsonnet, this issue may prove helpful.

Docker build

A Dockerfile is provided with the PyTorch + CUDA + CUDNN base image for a full-stack GPU install. It will create the conda environments dygiepp (for modeling) and ace-event-preprocess (for ACE05-Event preprocessing).

By default the build downloads datasets and dependencies for all tasks. This takes a long time and produces a large image, so you will want to comment out unneeded datasets/tasks in the Dockerfile.

  • Comment out unneeded task sections in Dockerfile.
  • Build container: docker build --tag dygiepp:dev <dygiepp-repo-dirpath>
  • Run the container interactively, mount this project dir to /dygiepp/: docker run --gpus all -it --ipc=host -v <dygiepp-repo-dirpath>:/dygiepp/ --name dygiepp dygiepp:dev

NOTE: This Dockerfile was added in a PR from a contributor. I haven't tested it, so it's not "officially supported". More PRs are welcome, though.

Training a model

Warning about coreference resolution: The coreference code will break on sentences with only a single token. If you have these in your dataset, either get rid of them or deactivate the coreference resolution part of the model.
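As a minimal sketch for screening a .jsonl dataset, the following drops whole documents containing a single-token sentence rather than removing individual sentences (filtering sentences in place would require shifting the document-level token offsets in the annotation fields):

```python
import json

def drop_single_token_docs(in_path, out_path):
    """Copy a DyGIE .jsonl dataset, skipping any document containing a
    sentence with only one token (which breaks the coreference code).

    Returns the number of documents dropped.
    """
    dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            if any(len(sent) == 1 for sent in doc["sentences"]):
                dropped += 1
                continue
            fout.write(line)
    return dropped
```

If you'd rather keep those documents, the alternative is to disable the coreference component in your training config.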

We rely on the AllenNLP train command to handle model training. The train command takes a configuration file as an argument, initializes a model based on the configuration, and serializes the trained model. More details on the configuration process for DyGIE can be found in doc/config.md.

To train a model, enter bash scripts/train.sh [config_name] at the command line, where the config_name is the name of a file in the training_config directory. For instance, to train a model using the scierc.jsonnet config, you'd enter

bash scripts/train.sh scierc

The resulting model will go in models/scierc. For more information on how to modify training configs (e.g. to change the GPU used for training), see config.md.

Information on preparing specific training datasets is below. For more information on how to create training batches that utilize GPU resources efficiently, see model.md. Hyperparameter search is implemented using Optuna; see model.md for details.

SciERC

To train a model for named entity recognition, relation extraction, and coreference resolution on the SciERC dataset:

  • Download the data. From the top-level folder for this repo, enter bash ./scripts/data/get_scierc.sh. This will download the SciERC dataset into the folder ./data/scierc.
  • Train the model. Enter bash scripts/train.sh scierc.
  • To train a "lightweight" version of the model that doesn't do coreference propagation and uses a context width of 1, do bash scripts/train.sh scierc_lightweight instead. More info on why you'd want to do this in the section on making predictions.

GENIA

The steps are similar to SciERC.

  • Download the data. From the top-level folder for this repo, enter bash ./scripts/data/get_genia.sh.
  • Train the model. Enter bash scripts/train.sh genia.
  • As with SciERC, we also offer a "lightweight" version with a context width of 1 and no coreference propagation.

ChemProt

The ChemProt corpus contains entity and relation annotations for drug / protein interactions. The ChemProt preprocessing requires a separate environment:

conda deactivate
conda create --name chemprot-preprocess python=3.7
conda activate chemprot-preprocess
pip install -r scripts/data/chemprot/requirements.txt

Then, follow these steps:

  • Get the data.
    • Run bash ./scripts/data/get_chemprot.sh. This will download the data and process it into the DyGIE input format.
      • NOTE: This is a quick-and-dirty script that skips entities whose character offsets don't align exactly with the tokenization produced by SciS