LREBench

[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study

LREBench: A low-resource relation extraction benchmark.

This repo is the official implementation of the EMNLP 2022 (Findings) paper LREBench: Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [poster].

This paper presents an empirical study to build relation extraction systems in low-resource settings. Based upon recent PLMs, three schemes are comprehensively investigated to evaluate the performance in low-resource settings: $(i)$ different types of prompt-based methods with few-shot labeled data; $(ii)$ diverse balancing methods to address the long-tailed distribution issue; $(iii)$ data augmentation technologies and self-training to generate more labeled in-domain data.

<div align=center> <img src="figs/intro.png" alt="intro" width=70% height=70% /> </div>

Environment

To install requirements:

>> conda create -n LREBench python=3.9
>> conda activate LREBench
>> pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113

Datasets

We provide 8 benchmark datasets and prompts used in our experiments.

All processed full-shot datasets can be downloaded and need to be placed in the dataset folder. Each dataset is expected to contain rel2id.json, train.json and test.json.
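
As a quick sanity check before training, the expected layout can be verified with a few lines of Python. This is a minimal sketch: the three file names come from the description above, while the helper name `missing_dataset_files` is hypothetical and not part of this repo.

```python
from pathlib import Path

# File names taken from the README; the helper itself is hypothetical.
EXPECTED_FILES = ("rel2id.json", "train.json", "test.json")


def missing_dataset_files(dataset_dir):
    """Return the expected files that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Example: check the SemEval folder used elsewhere in this README.
    missing = missing_dataset_files("dataset/semeval")
    print("missing files:", missing or "none")
```

An empty result means the folder has all three files and can be passed to the scripts below.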

Normal Prompt-based Tuning

<div align=center> <img src="figs/prompt.png" alt="prompt" width=70% height=70% /> </div>

1 Initialize Answer Words

Use the command below to get answer words first.

>> python get_label_word.py --modelpath roberta-large --dataset semeval

The {modelpath}_{dataset}.pt file will be saved in the dataset folder. Set modelpath and dataset to the names of the pre-trained language model and the dataset you want to use.

2 Split Datasets

We provide sampling code for obtaining the 8-shot (sample_8shot.py) and 10% (sample_10.py) datasets; the remaining instances are used as unlabeled data for self-training. When sampling 8-shot datasets, classes with fewer than 8 instances are removed from the training and test sets, and new_test.json and new_rel2id.json are produced.

>> python sample_8shot.py -h
    usage: sample_8shot.py [-h] --input_dir INPUT_DIR --output_dir OUTPUT_DIR

    optional arguments:
      -h, --help            show this help message and exit
      --input_dir INPUT_DIR, -i INPUT_DIR
                            The directory of the training file.
      --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                            The directory of the sampled files.
>> python sample_10.py -h
    usage: sample_10.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR

    optional arguments:
      -h, --help            show this help message and exit
      --input_file INPUT_FILE, -i INPUT_FILE
                            The directory of the training file.
      --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                            The directory of the sampled files.

For example:

>> python sample_8shot.py -i dataset/semeval -o dataset/semeval/8-shot
>> cd dataset/semeval
>> mkdir 8-1
>> cp 8-shot/new_rel2id.json 8-1/rel2id.json
>> cp 8-shot/new_test.json 8-1/test.json
>> cp 8-shot/train_8_1.json 8-1/train.json
>> cp 8-shot/unlabel_8_1.json 8-1/label.json
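
The sampling rule described in this section can be sketched as follows. This is an illustration of the idea only, not the code in sample_8shot.py: it assumes each instance is a dict whose relation label is stored under a `relation` key, and the function name `sample_k_shot` is hypothetical.

```python
import random
from collections import defaultdict


def sample_k_shot(instances, k=8, seed=1, label_key="relation"):
    """Split instances into a k-shot training set and an unlabeled remainder.

    Classes with fewer than k instances are dropped, mirroring the rule
    described above. `label_key` is an assumed field name.
    """
    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst[label_key]].append(inst)

    rng = random.Random(seed)
    train, rest, kept_labels = [], [], []
    for label, group in by_label.items():
        if len(group) < k:
            continue  # too few instances: remove this class entirely
        kept_labels.append(label)
        rng.shuffle(group)
        train.extend(group[:k])
        rest.extend(group[k:])  # left over as unlabeled data for self-training

    # Re-index the surviving relations (the role new_rel2id.json plays).
    new_rel2id = {label: i for i, label in enumerate(sorted(kept_labels))}
    return train, rest, new_rel2id
```

The test set would then be filtered to the relations in `new_rel2id`, which corresponds to the new_test.json file mentioned above.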

3 Prompt-based Tuning

All running scripts for each dataset are in the scripts folder. For example, train KnowPrompt on SemEval, CMeIE and ChemProt with the following commands:

>> bash scripts/semeval.sh  # RoBERTa-large
>> bash scripts/CMeIE.sh    # Chinese RoBERTa-large
>> bash scripts/ChemProt.sh # BioBERT-large

4 Different prompts

<div align=center> <img src="figs/prompts.png" alt="prompts" width=70% height=70% /> </div>

Simply add parameters to the scripts.

Template Prompt: --use_template_words 0

Schema Prompt: --use_template_words 0 --use_schema_prompt True

PTR: refer to PTR

Balancing

<div align=center> <img src="figs/balance.png" alt="balance" width=40% height=40% /> </div>

1 Re-sampling

  • Create the re-sampled training file from the 10% training set with resample.py.

    >> python resample.py -h
        usage: resample.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --rel_file REL_FILE
    
        optional arguments:
          -h, --help            show this help message and exit
          --input_file INPUT_FILE, -i INPUT_FILE
                                The path of the training file.
          --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                                The directory of the sampled files.
          --rel_file REL_FILE, -r REL_FILE
                                the path of the relation file
    

    For example,

    >> mkdir dataset/semeval/10sa-1
    >> python resample.py -i dataset/semeval/10/train10per_1.json -r dataset/semeval/rel2id.json -o dataset/semeval/sa
    >> cd dataset/semeval
    >> cp rel2id.json test.json 10sa-1/
    >> cp sa/sa_1.json 10sa-1/train.json
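
For intuition, one common re-sampling strategy is random oversampling, where minority classes are duplicated until they match the largest class. The sketch below illustrates that strategy under stated assumptions; resample.py may use a different scheme. The `relation` key and the function name `oversample` are hypothetical.

```python
import random
from collections import defaultdict


def oversample(instances, seed=1, label_key="relation"):
    """Duplicate minority-class instances until every class matches the largest one."""
    if not instances:
        return []

    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst[label_key]].append(inst)

    target = max(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies with replacement to reach the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced
```

Oversampling keeps every original instance, at the cost of a larger training file and repeated examples for rare relations.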
    

2 Re-weighting Loss

Simply add the useloss parameter to the script to choose a re-weighting loss.

For example: --useloss MultiFocalLoss (choices: MultiDSCLoss, MultiFocalLoss, GHMC_Loss, LDAMLoss).
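
To show what a re-weighting loss does, here is a dependency-free sketch of the standard focal loss for a single example, $-(1 - p_t)^\gamma \log p_t$, where $p_t$ is the softmax probability of the gold class. It is not the repo's MultiFocalLoss implementation; the function names and the default `gamma=2.0` are assumptions for illustration.

```python
import math


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    # Numerically stable softmax probability of the gold class.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(logits, target):
    """Plain cross-entropy, i.e. focal loss with gamma = 0."""
    return focal_loss(logits, target, gamma=0.0)


if __name__ == "__main__":
    easy = [5.0, 0.0, 0.0]  # gold class 0 is predicted confidently
    hard = [0.0, 5.0, 0.0]  # gold class 0 is predicted poorly
    for name, logits in (("easy", easy), ("hard", hard)):
        print(name, cross_entropy(logits, 0), focal_loss(logits, 0))
```

The factor $(1 - p_t)^\gamma$ shrinks the loss of well-classified examples far more than that of misclassified ones, so hard examples, which in long-tailed data often come from rare relations, contribute relatively more to the gradient.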

Data Augmentation

<div align=center> <img src="figs/DA.png" alt="DA" width=70% height=70% /> </div>

1 Prepare the environment

>> pip install nlpaug nlpcda

Please follow the instructions from nlpaug and nlpcda for more information (Thanks a lot!).

2 Try different DA methods

We provide many data augmentation methods:

  • English (nlpaug): TF-IDF, contextual word embedding (BERT and RoBERTa), and WordNet synonym (-lan en, method selected with -d).

  • Chinese (nlpcda): synonym (-lan cn).

  • All DA methods can be applied to contexts, entities, or both (--locations).

  • Generate augmented data

    >> python DA.py -h
        usage: DA.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --language {en,cn}
                      [--locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]]
                      [--DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}]
                      [--model_dir MODEL_DIR] [--model_name MODEL_NAME] [--create_num CREATE_NUM] [--change_rate CHANGE_RATE]
    
        optional arguments:
          -h, --help            show this help message and exit
          --input_file INPUT_FILE, -i INPUT_FILE
                                the training set file
          --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                                The directory of the sampled files.
          --language {en,cn}, -lan {en,cn}
                                DA for English or Chinese
          --locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...], -l {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]
                                List of positions that you want to manipulate
          --DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}, -d {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}
                                Data augmentation method
          --model_dir MODEL_DIR, -m MODEL_DIR
                                the path of pretrained models used in DA methods
          --model_name MODEL_NAME, -mn MODEL_NAME
                                model from huggingface
          --create_num CREATE_NUM, -cn CREATE_NUM
                                The number of samples augmented from one instance.
          --change_rate CHANGE_RATE, -cr CHANGE_RATE
                                the changing rate of text
    

    For example, context-level DA with contextual word embeddings on ChemProt:

    python DA.py \
        -i dataset/ChemProt/10/train10per_1.json \
        -o dataset/ChemProt/aug \
        -lan en \
        -d word_embedding_bert \
        -mn dmis-lab/biobert-large-cased-v1.1 \
        -l sent1 sent2 sent3
    
  • Delete repeated instances and get the final augmented data

    >> python merge_dataset.py -h
    usage: merge_dataset.py [-h] [--input_files INPUT_FILES [INPUT_FILES ...]] [--output_file OUTPUT_FILE]
    
    optional arguments:
      -h, --help            show this help message and exit
      --input_files INPUT_FILES [INPUT_FILES ...], -i INPUT_FIL
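
The deduplication step can be pictured as follows. This is a minimal sketch of merging several augmented sets and dropping exact duplicates, not the code in merge_dataset.py. It works on in-memory lists of dicts, so it makes no assumption about the on-disk file format; the function name `merge_unique` and the field names in the usage example are hypothetical.

```python
import json


def merge_unique(*datasets):
    """Concatenate datasets and drop exact duplicate instances, keeping first occurrences."""
    seen = set()
    merged = []
    for dataset in datasets:
        for inst in dataset:
            # Canonical JSON string so that dict key order does not matter.
            key = json.dumps(inst, sort_keys=True, ensure_ascii=False)
            if key not in seen:
                seen.add(key)
                merged.append(inst)
    return merged


if __name__ == "__main__":
    original = [{"token": ["x"], "relation": "A"}]
    augmented = [{"relation": "A", "token": ["x"]}, {"token": ["y"], "relation": "B"}]
    print(len(merge_unique(original, augmented)))
```

Duplicates arise when an augmenter leaves an instance unchanged, for example when no token has a usable substitute, so merging without this step would silently over-weight those instances.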
    