MultiTACRED

[ACL23] This repository contains the code for our paper "MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset"

MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset

This repository contains the code for our paper "MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset" by Leonhard Hennig, Philippe Thomas, and Sebastian Möller.

We machine-translate the TAC relation extraction dataset [1] to 12 typologically diverse languages from different language families, analyze translation and annotation projection quality, and evaluate fine-tuned mono- and multilingual PLMs in common transfer learning scenarios.

  • HF dataset reader: https://huggingface.co/datasets/DFKI-SLT/multitacred
  • Papers With Code: https://paperswithcode.com/dataset/multitacred
  • LDC: https://catalog.ldc.upenn.edu/LDC2024T09

Access

To respect the copyright of the underlying TACRED and KBP corpora, MultiTACRED is released via the Linguistic Data Consortium (LDC). You can therefore download MultiTACRED from the LDC MultiTACRED webpage. Access is free for LDC members; non-members pay an access fee of $25.

Installation


🔭  Overview

✅  Requirements

MultiTACRED is tested with:

  • Python >= 3.8
  • Torch >= 1.10.2; <= 1.12.1
  • AllenNLP >= 2.8.0; <= 2.10.1
  • Transformers >= 4.12.5; <= 4.20.1

🚀  Installation

From source

git clone https://github.com/DFKI-NLP/MultiTACRED
cd MultiTACRED
pip install .

🔧  Usage

Preparing the TACRED dataset

In order to run the translation scripts, we need to convert the files from the LDC-provided JSON format to a simpler JSONL format:

python src/translate/convert_to_jsonl.py --dataset_dir [/path/to/tacred/data/json] --output_dir ./data/en

Translation

Translation uses the DeepL or Google APIs.

translate_deepl.py and translate_google.py translate a .jsonl dataset into a different language. The dataset is expected to be in the following format (i.e. the JSONL format created above):

{"id": original_id, "tokens": [original_tokens], "label": [original_label], "entities": [original_entities], "grammar": [original_grammar], "type": [original_type]}

The translated result in [output_file.jsonl] appears in the following form:

{"id": original_id, "tokens": [original_tokens], "label": [original_label], "entities": [original_entities], "grammar": [original_grammar], "type": [original_type], "language": [original_language], "tokens_translated": [translated_tokens], "entities_translated": [translated_entities], "language_translated": translated_language, "text_raw": [original_text], "translation_raw": [raw_translation_text]}
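The output format above can be checked with a few lines of Python. The following is a minimal sketch, not part of the repository: the helper names iter_jsonl and missing_keys are hypothetical, and the only thing taken from the README is the list of field names.

```python
import json
from typing import Iterable, Iterator

# Field names copied from the translated-output format described above.
TRANSLATED_KEYS = {
    "id", "tokens", "label", "entities", "grammar", "type", "language",
    "tokens_translated", "entities_translated", "language_translated",
    "text_raw", "translation_raw",
}


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL source."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def missing_keys(record: dict) -> set:
    """Return the expected fields that are absent from a record."""
    return TRANSLATED_KEYS - set(record)


if __name__ == "__main__":
    # Hypothetical, incomplete record used only to exercise the check.
    sample = ['{"id": "example-1", "tokens": ["Hello", "world"]}']
    for record in iter_jsonl(sample):
        print(record["id"], sorted(missing_keys(record)))
```

To check a real file, pass an open file handle instead of the sample list, e.g. `iter_jsonl(open(path, encoding="utf-8"))`.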

The scripts additionally create a file [output_file.jsonl].manual. This file contains all examples for which the script failed to extract the entities from the translation, judged by the number and ordering of the entities. If the log level is set to 'warning', the logger emits a warning for each such example.

The scripts skip translation for all examples that are already present in [output_file.jsonl] or [output_file.jsonl].manual to avoid costly, unnecessary translation. Set --overwrite to re-translate those examples.
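The skip behaviour can be pictured as a filter over example ids. This is a sketch under assumptions, not the scripts' actual code: load_done_ids and pending_examples are hypothetical names, and matching examples by their "id" field is an assumption based on the record format above.

```python
import json
from pathlib import Path


def load_done_ids(output_file: Path) -> set:
    """Collect ids found in the output file and its .manual companion."""
    done = set()
    for path in (output_file, Path(str(output_file) + ".manual")):
        if not path.exists():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    done.add(json.loads(line)["id"])
    return done


def pending_examples(examples, output_file, overwrite=False):
    """Return the examples that still need to be translated."""
    # With overwrite=True nothing counts as done, so everything is re-translated.
    done = set() if overwrite else load_done_ids(Path(output_file))
    return [example for example in examples if example["id"] not in done]
```

Because ids from both files are treated the same way, examples that previously failed entity extraction are not sent to the paid API again unless overwriting is requested.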

DeepL

The script translate_deepl.py translates the dataset into the target language. You need a valid API key. The following example shows how to translate the sample file data/en/train_sample.jsonl to German:

python src/translate/translate_deepl.py --api_key [API_KEY] --api_address "https://api.deepl.com/v2/translate" \
  -i ./data/en/train_sample.jsonl -o ./data/de/train_de_deepl.jsonl -T spacy_de -s EN -t DE \
  --log_file translate.log

Your output should be similar to data/de/train_de_sample.jsonl. Note that the output file contains the original English tokens (field tokens) as well as the translated tokens (field tokens_translated) and the raw translation (field translation_raw).

For testing purposes, you can use the character-limited free API endpoint https://api-free.deepl.com/v2/translate (but you still need an API key).

Call python src/translate/translate_deepl.py --help for usage and argument information.

For a list of available languages use the flag --show_languages.

Google

The script translate_google.py translates the dataset into the target language. You need a valid API key. Follow this setup guide and place the private key in a secure location.

The following example shows how to translate the sample file data/en/train_sample.jsonl to German:

python src/translate/translate_google.py --private_key [path/to/PRIVATE_KEY] \
  -i ./data/en/train_sample.jsonl -o ./data/de/train_de_google.jsonl -T spacy_de -s EN -t DE \
  --log_file translate.log

Call python src/translate/translate_google.py --help for usage and argument information.

For a list of available languages use the flag --show_languages.

Using an .env file

For convenience, you can create a .env file with the following content:

INPUT_FILE='/path/to/dataset.jsonl'
OUTPUT_FILE='dataset_translated.jsonl'
# For DeepL
API_KEY="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:xx"
API_ADDRESS="https://api.deepl.com/v2/translate"
# or use: API_ADDRESS="https://api-free.deepl.com/v2/translate" (limited to 500K chars/month)

# For Google
PRIVATE_KEY='/path/to/private_key.json'
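How the scripts read this file is not stated here, so the following is only an illustrative sketch of the KEY=value format, using the standard library. The function parse_env is hypothetical; it skips blank lines and comment lines and strips one pair of matching quotes, and it does not handle inline comments or variable expansion.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines into a dict (illustrative only)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Strip one pair of matching single or double quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


if __name__ == "__main__":
    example = "INPUT_FILE='/path/to/dataset.jsonl'\n# For Google\nPRIVATE_KEY='/path/to/private_key.json'\n"
    print(parse_env(example))
```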

Tokenizer

The argument -T or --tokenizer selects the tokenizer that is applied to the raw translation text:

  • split (default): Python whitespace tokenization.
  • spacy (see the spaCy website): the required language model is downloaded automatically (currently the statistical models, not the neural ones).
  • trankit (see the Trankit website): the required neural model is downloaded automatically. Requires a GPU to run at reasonable speed!

You can add your own tokenizer by adding its name to tokenizer_choices and its initialization to the init_tokenizer() function in utils.py.
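The general shape of such an extension might look like the sketch below. It is a stand-in, not the contents of utils.py: the names tokenizer_choices and init_tokenizer come from the text above, but their real signatures are not shown here, and the "punct_split" tokenizer is a hypothetical example.

```python
import re
from typing import Callable, List

# Stand-in for the list of names accepted by -T / --tokenizer.
tokenizer_choices = ["split", "punct_split"]  # "punct_split" is the hypothetical addition


def init_tokenizer(name: str) -> Callable[[str], List[str]]:
    """Return a function that maps raw text to a list of tokens."""
    if name == "split":
        # Default described above: plain Python whitespace tokenization.
        return str.split
    if name == "punct_split":
        # Hypothetical custom tokenizer: separate punctuation from words.
        pattern = re.compile(r"\w+|[^\w\s]")
        return pattern.findall
    raise ValueError(f"Unknown tokenizer {name!r}; choose one of {tokenizer_choices}")


if __name__ == "__main__":
    tokenize = init_tokenizer("punct_split")
    print(tokenize("Er wohnt in Berlin."))
```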

Logging

The scripts use the logging module from the Python standard library.

  • Set -v to display logging messages on the console.
  • Set --log_level [debug,info,warning,error,critical] to determine which kinds of messages are logged.
  • Set --log_file FILE to log to a file.
  • Alternatively, pass a logger object as an argument to the translate function.
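A logger with the behaviour of these options can be built with the standard library alone. This is a sketch under assumptions: setup_logger and its parameters are hypothetical and only mirror the flags listed above; the name of the logger argument expected by the translate function is not shown in this README.

```python
import logging
from typing import Optional


def setup_logger(name: str, verbose: bool = False,
                 log_level: str = "warning",
                 log_file: Optional[str] = None) -> logging.Logger:
    """Build a logger that mirrors the -v, --log_level and --log_file options."""
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, log_level.upper()))
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep messages out of the root logger
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    if verbose:
        console = logging.StreamHandler()
        console.setFormatter(formatter)
        logger.addHandler(console)
    if log_file:
        file_handler = logging.FileHandler(log_file, encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    if not logger.handlers:
        # Nothing requested: swallow messages silently.
        logger.addHandler(logging.NullHandler())
    return logger


if __name__ == "__main__":
    log = setup_logger("translate", verbose=True, log_level="warning")
    log.warning("entity extraction failed for one example")
```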

Backtranslation

The script backtranslate.py translates a translated dataset back into its original language. It accepts .jsonl files in the output format of translate_deepl.py and translate_google.py. You need to specify an input file, an output file, and a translation service:

python src/translate/backtranslate.py [translated.jsonl] [backtranslated.jsonl] [google|deepl]

Additionally, the script accepts the same arguments (and .env file) as translate_deepl.py and translate_google.py, because it calls one of the two.

The script wraps a sequence of script calls. It calls the following scripts in this order:

  1. prepare_backtranslation.py creates a temporary file (default=.temp_[input_file]) in the format expected by the translation script (by switching field names).
  2. translate_deepl.py or translate_google.py is called on the temporary file; the outputs are written to two temporary files (default=.temp_[output_file] and .temp_[output_file].manual). Be careful about deleting them, as this is where all the raw backtranslation results accumulate.
  3. postpare_backtranslation.py converts all entries in the temporary file back into an acceptable format.
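The "switching field names" idea in step 1 can be sketched as follows. Which fields the real prepare_backtranslation.py swaps is not documented here, so the pairs below are an assumption derived from the output format, and swap_fields is a hypothetical helper. The sample record values are invented for illustration.

```python
# Assumed original/translated field pairs, taken from the output format above.
SWAP_PAIRS = [
    ("tokens", "tokens_translated"),
    ("entities", "entities_translated"),
    ("language", "language_translated"),
]


def swap_fields(record: dict) -> dict:
    """Return a copy in which translated fields take the place of the originals."""
    swapped = dict(record)
    for original, translated in SWAP_PAIRS:
        swapped[original], swapped[translated] = record[translated], record[original]
    return swapped


if __name__ == "__main__":
    record = {
        "id": "example-1",
        "tokens": ["He", "lives", "in", "Berlin"],
        "tokens_translated": ["Er", "wohnt", "in", "Berlin"],
        "entities": [[0, 1], [3, 4]],
        "entities_translated": [[0, 1], [3, 4]],
        "language": "en",
        "language_translated": "de",
    }
    prepared = swap_fields(record)
    print(prepared["language"], prepared["tokens"])
```

Applying the swap twice restores the original record, which is one way to think about the role of the final conversion step.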

Converting JSONL to TACRED JSON

If you require the translations in the original TACRED JSON format, e.g. when using the Hugging Face TACRED DatasetReader, you can call the script:

python src/translate/convert_to_json.py --dataset_dir [/path/to/translated/jsonl] --output_dir [json-output-dir] --language [lang_code]

Scripts to wrap Translation, Backtranslation and Conversion to JSON

scripts/translate_deepl.sh and scripts/translate_google.sh wrap translation, backtranslation, and conversion to JSON for a single language. You still need to perform the one-time step of preparing the JSONL version of the original TACRED beforehand!

Relation Extraction Experiments

All experiments are configured with Hydra. Experiment configs are stored in config/.

Preparing the data

You need to obtain the MultiTACRED dataset from the LDC (https://catalog.ldc.upenn.edu/LDC2024T09) and unzip it into the ./data folder. You also need to download the original English TACRED dataset and place the content of its data/json folder in ./data/en.

The file structure should look like this:

data
  |-- ar/
       |--- train_ar.json
       |--- dev_ar.json
       |--- test_ar.json
       |--- test_en_ar_bt.json
  |-- de/
       |--- ...
  |-- en/
       |--- train.json
       |--- dev.json
       |--- test.json
  |-- ...
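Before starting experiments, the layout can be verified with a short check. This is a sketch, not a script from the repository: expected_files and missing_files are hypothetical helpers, and the file names for languages other than ar are assumed to follow the same pattern as the ar folder shown above.

```python
from pathlib import Path
from typing import List


def expected_files(lang: str) -> List[str]:
    """File names expected for one language folder, per the layout above."""
    if lang == "en":
        return ["train.json", "dev.json", "test.json"]
    # Assumption: every translated language mirrors the ar/ example.
    return [
        f"train_{lang}.json",
        f"dev_{lang}.json",
        f"test_{lang}.json",
        f"test_en_{lang}_bt.json",
    ]


def missing_files(data_dir: str, lang: str) -> List[str]:
    """Return the expected files that are not present in data_dir/lang."""
    root = Path(data_dir) / lang
    return [name for name in expected_files(lang) if not (root / name).is_file()]


if __name__ == "__main__":
    for lang in ("en", "ar", "de"):
        print(lang, missing_files("./data", lang))
```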

To reproduce our results, you should apply the TACRED Revisited patch to the TACRED JSON files. We provide a slightly modified version of the original apply_tacred_patch.py script that accounts for non-ASCII characters (JSON dump with ensure_ascii=False) and for the reduced number of instances in dev / test due to translation errors (an id check assertion is removed).

git clone https://github.com/DFKI-NLP/tacrev

Then, for any language, run:

# Dev split
python ./scripts/apply_tacrev_patch.py \
  --dataset-file ./data/[lang]/dev_[lang].json \
  --patch-file [path/to/tacrev/patch/dev_patch.json] \
  --output-file ./data/[l
