MultiTACRED
[ACL23] This repository contains the code for our paper "MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset"
MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset
This repository contains the code of our paper: MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset. Leonhard Hennig, Philippe Thomas, Sebastian Möller
We machine-translate the TAC relation extraction dataset [1] to 12 typologically diverse languages from different language families, analyze translation and annotation projection quality, and evaluate fine-tuned mono- and multilingual PLMs in common transfer learning scenarios.
- HF dataset reader: https://huggingface.co/datasets/DFKI-SLT/multitacred
- Papers With Code: https://paperswithcode.com/dataset/multitacred
- LDC: https://catalog.ldc.upenn.edu/LDC2024T09
Access
To respect the copyright of the underlying TACRED and KBP corpora, MultiTACRED is released via the Linguistic Data Consortium (LDC). You can download it from the LDC MultiTACRED webpage. Access is free for LDC members; non-members pay an access fee of $25.
Installation
🔭 Overview
✅ Requirements
MultiTACRED is tested with:
- Python >= 3.8
- Torch >= 1.10.2; <= 1.12.1
- AllenNLP >= 2.8.0; <= 2.10.1
- Transformers >= 4.12.5; <= 4.20.1
🚀 Installation
From source
git clone https://github.com/DFKI-NLP/MultiTACRED
cd MultiTACRED
pip install .
🔧 Usage
Preparing the TACRED dataset
In order to run the translation scripts, we need to convert the files from the LDC-provided JSON format to a simpler JSONL format:
python src/translate/convert_to_jsonl.py --dataset_dir [/path/to/tacred/data/json] --output_dir ./data/en
Translation
Translation uses the DeepL or Google APIs.
translate_deepl.py and translate_google.py translate a .jsonl dataset into a different language.
The dataset is expected to be in the following format (i.e., the JSONL format created above):
{"id": original_id, "tokens": [original_tokens], "label": [original_label], "entities": [original_entities], "grammar": [original_grammar], "type": [original_type]}
The translated result in [output_file.jsonl] appears in the following form:
{"id": original_id, "tokens": [original_tokens], "label": [original_label], "entities": [original_entities], "grammar": [original_grammar], "type": [original_type], "language": [original_language], "tokens_translated": [translated_tokens], "entities_translated": [translated_entities], "language_translated": translated_language, "text_raw": [original_text], "translation_raw": [raw_translation_text]}
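To inspect a translated file, the records can be read line by line and checked for the fields listed above. The following is a minimal sketch using only the standard library; the helper names `read_translated` and `missing_fields`, and the choice of which fields to check, are assumptions for illustration and are not part of the repository.

```python
import json
from typing import Dict, Iterator, List

# Fields that the translation scripts add, according to the format above.
TRANSLATED_FIELDS = [
    "tokens_translated",
    "entities_translated",
    "language_translated",
    "translation_raw",
]


def missing_fields(record: Dict) -> List[str]:
    """Return the translation fields that are absent from a record."""
    return [name for name in TRANSLATED_FIELDS if name not in record]


def read_translated(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSONL file (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Calling `missing_fields(record)` on each record from `read_translated("./data/de/train_de_deepl.jsonl")` would then flag incomplete entries.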
The scripts additionally create a file [output_file.jsonl].manual. This file contains all examples for which the
script fails to extract the entities, as determined by the number and ordering of the entities. If the log level is set to 'warning',
the logger emits a warning for each such example.
The scripts skip all examples that are already in [output_file.jsonl]
or [output_file.jsonl].manual to avoid costly, unnecessary translation. Set --overwrite to
re-translate those examples.
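The skip behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the function names `collect_done_ids` and `pending` are hypothetical, and the sketch assumes that examples are matched by their `id` field.

```python
import json
import os
from typing import Dict, Iterable, Iterator, Set


def collect_done_ids(*paths: str) -> Set[str]:
    """Collect the ids of all records found in the given JSONL files."""
    done: Set[str] = set()
    for path in paths:
        if not os.path.exists(path):
            continue  # nothing has been written to this file yet
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    done.add(json.loads(line)["id"])
    return done


def pending(
    examples: Iterable[Dict], output_file: str, overwrite: bool = False
) -> Iterator[Dict]:
    """Yield only the examples that still need to be translated."""
    if overwrite:
        done: Set[str] = set()  # re-translate everything
    else:
        done = collect_done_ids(output_file, output_file + ".manual")
    for example in examples:
        if example["id"] not in done:
            yield example
```

With `overwrite=False`, an interrupted run can be restarted without paying again for examples that were already sent to the translation API.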
DeepL
The script translate_deepl.py translates the dataset into the
target language. You need a valid API key. The following example shows how to translate the
sample file data/en/train_sample.jsonl to German:
python src/translate/translate_deepl.py --api_key [API_KEY] --api_address "https://api.deepl.com/v2/translate" \
  -i ./data/en/train_sample.jsonl -o ./data/de/train_de_deepl.jsonl -T spacy_de -s EN -t DE \
  --log_file translate.log
Your output should be similar to data/de/train_de_sample.jsonl. Note that the output file contains
the original English tokens (field tokens) as well as the translated tokens (field tokens_translated) and the raw translation (field translation_raw).
For testing purposes, you can use the character-limited free API endpoint https://api-free.deepl.com/v2/translate (but you still need an API key).
Call python src/translate/translate_deepl.py --help for usage and argument information.
For a list of available languages use the flag --show_languages.
Google
The script translate_google.py translates the dataset into the target
language. You need a valid API key. Follow this setup guide and
place the private key in a secure location.
The following example shows how to translate the sample file data/en/train_sample.jsonl to German:
python src/translate/translate_google.py --private_key [path/to/PRIVATE_KEY] \
  -i ./data/en/train_sample.jsonl -o ./data/de/train_de_google.jsonl -T spacy_de -s EN -t DE \
  --log_file translate.log
Call python src/translate/translate_google.py --help for usage and argument information.
For a list of available languages use the flag --show_languages.
Using an .env file
For convenience, you can create a .env file with the following content:
INPUT_FILE='/path/to/dataset.jsonl'
OUTPUT_FILE='dataset_translated.jsonl'
# For DeepL
API_KEY="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:xx"
API_ADDRESS="https://api.deepl.com/v2/translate"
# or use: API_ADDRESS="https://api-free.deepl.com/v2/translate" (limited to 500K chars/month)
# For Google
PRIVATE_KEY='/path/to/private_key.json'
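For readers who want to see how such a file maps to key/value pairs, here is a minimal standard-library sketch. It is an assumption-laden illustration: the function `parse_env` is hypothetical, and how the scripts actually load the .env file is not shown in this README.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values
```

Reading the file with `parse_env(open(".env", encoding="utf-8").read())` returns a plain dict whose entries can be passed on as script arguments.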
Tokenizer
The argument -T or --tokenizer selects the tokenizer used to
tokenize the raw translation text.
- split (default): Python whitespace tokenization
- spacy: the required language model is downloaded automatically (currently, the statistical models, not the neural ones).
- trankit: the required neural model is downloaded automatically. Requires a GPU to run at reasonable speed!
To add your own tokenizer, add its name to
tokenizer_choices and its initialization to the init_tokenizer() function in the
utils.py script.
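The extension pattern described above might look roughly like this. The sketch is self-contained and does not reproduce the repository's utils.py: the signature of `init_tokenizer`, the `Tokenizer` type alias, and the example tokenizer name `lowercase_split` are all assumptions made for illustration.

```python
from typing import Callable, List

# A tokenizer maps raw text to a list of tokens (assumed interface).
Tokenizer = Callable[[str], List[str]]

# Stand-in for the `tokenizer_choices` list mentioned above;
# "lowercase_split" is a hypothetical custom tokenizer.
tokenizer_choices = ["split", "lowercase_split"]


def init_tokenizer(name: str) -> Tokenizer:
    """Return the tokenizer callable registered under `name`."""
    if name == "split":
        return str.split  # plain whitespace tokenization
    if name == "lowercase_split":
        return lambda text: text.lower().split()
    raise ValueError(f"Unknown tokenizer {name!r}; choose from {tokenizer_choices}")
```

Registering the name in `tokenizer_choices` keeps the command-line choices and the initialization logic in sync, which is presumably why both places need to be edited.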
Logging
The scripts use the logging module from the Python standard library.
- Set -v to display logging messages on the console.
- Set --log_level [debug,info,warning,error,critical] to determine which kinds of messages are logged.
- Set --log_file FILE to log to a file.
- Alternatively, pass a logger object as an argument to the translate function.
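The options above correspond to standard `logging` handlers and levels. The following sketch shows one way such flags could be wired up; the function name `build_logger`, the logger name, and the default level of `warning` are assumptions, and the actual scripts may configure logging differently.

```python
import argparse
import logging
from typing import List, Optional

LEVELS = ["debug", "info", "warning", "error", "critical"]


def build_logger(argv: Optional[List[str]] = None) -> logging.Logger:
    """Create a logger from -v, --log_level and --log_file (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", action="store_true", help="log to the console")
    parser.add_argument("--log_level", default="warning", choices=LEVELS)
    parser.add_argument("--log_file", default=None, help="log to this file")
    args = parser.parse_args(argv)

    logger = logging.getLogger("translate_sketch")
    logger.setLevel(getattr(logging, args.log_level.upper()))
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    if args.v:
        logger.addHandler(logging.StreamHandler())
    if args.log_file:
        logger.addHandler(logging.FileHandler(args.log_file, encoding="utf-8"))
    if not logger.handlers:
        logger.addHandler(logging.NullHandler())  # stay silent by default
    return logger
```

A logger built this way could also be handed to a function that accepts a logger object, as the last bullet suggests.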
Backtranslation
The script backtranslate.py translates a translated dataset back to its
original language. It accepts output .jsonl files in the format produced by
translate_deepl.py and translate_google.py. You must
specify an input file, an output file, and a translation service:
python src/translate/backtranslate.py [translated.jsonl] [backtranslated.jsonl] [google|deepl]
Additionally, the script accepts the same arguments (and .env file) as
translate_deepl.py and translate_google.py, because it calls one of the two.
The script is a wrapper around a sequence of script calls. It calls the following scripts in this order:
- prepare_backtranslation.py creates a temporary file (default=.temp_[input_file]) in the format expected by the translation script (switching field names).
- translate_deepl.py or translate_google.py is called on the temporary file. The outputs are written to two temporary files (default=.temp_[output_file] and .temp_[output_file].manual). Be careful about deleting them, as this is where all the raw backtranslation results accumulate.
- postpare_backtranslation.py converts all entries in the temporary file back into an acceptable format.
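The "switching field names" step can be pictured as swapping each original field with its translated counterpart, so that the translation script treats the translated text as its input. The sketch below is an assumption about what that step involves, based only on the field names in the output format described earlier; the function `swap_fields` is hypothetical and prepare_backtranslation.py may handle more or different fields.

```python
from typing import Dict, List, Tuple

# Pairs of (original field, translated field) taken from the output format
# described in the Translation section. Which fields the real script
# switches is an assumption.
SWAPPED_FIELDS: List[Tuple[str, str]] = [
    ("tokens", "tokens_translated"),
    ("entities", "entities_translated"),
    ("language", "language_translated"),
]


def swap_fields(record: Dict) -> Dict:
    """Return a copy of `record` with original and translated fields exchanged."""
    swapped = dict(record)  # shallow copy; the input record is not modified
    for original, translated in SWAPPED_FIELDS:
        swapped[original] = record[translated]
        swapped[translated] = record[original]
    return swapped
```

Because the operation is its own inverse, applying it a second time restores the original layout, which is the kind of symmetry a prepare/postpare pair can rely on.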
Converting JSONL to TACRED JSON
If you require the translations in the original TACRED JSON format, e.g., when using the Hugging Face TACRED dataset reader, you can call the script:
python src/translate/convert_to_json.py --dataset_dir [/path/to/translated/jsonl] --output_dir [json-output-dir] --language [lang_code]
Scripts to wrap Translation, Backtranslation and Conversion to JSON
scripts/translate_deepl.sh and scripts/translate_google.sh wrap translation,
backtranslation, and conversion to JSON for a single language. You do still need
to do the one-time-only step of preparing the JSONL version of the original TACRED!
Relation Extraction Experiments
All experiments are configured with Hydra. Experiment configs are
stored in config/.
Preparing the data
You need to obtain the MultiTACRED dataset from this URL,
and unzip it into the ./data folder.
You also need to download the original, English TACRED dataset,
and place the content of its data/json folder in ./data/en
The file structure should look like this:
data
|-- ar/
|--- train_ar.json
|--- dev_ar.json
|--- test_ar.json
|--- test_en_ar_bt.json
|-- de
|--- ...
|-- en
|--- train.json
|--- dev.json
|--- test.json
|...
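Before launching experiments, it can help to verify that the layout matches the tree above. The following sketch lists the files expected for a language and reports those that are missing. The helper names are hypothetical, and the assumption that every non-English language folder contains a `test_en_[lang]_bt.json` file is generalized from the `ar` example.

```python
from pathlib import Path
from typing import List

SPLITS = ("train", "dev", "test")


def expected_files(data_dir: str, lang: str) -> List[Path]:
    """List the files expected for `lang`, following the tree shown above."""
    root = Path(data_dir) / lang
    if lang == "en":
        # The original TACRED files carry no language suffix.
        return [root / f"{split}.json" for split in SPLITS]
    files = [root / f"{split}_{lang}.json" for split in SPLITS]
    # Backtranslated test split (assumed to exist for every language).
    files.append(root / f"test_en_{lang}_bt.json")
    return files


def missing_files(data_dir: str, lang: str) -> List[Path]:
    """Return the expected files that do not exist on disk."""
    return [path for path in expected_files(data_dir, lang) if not path.is_file()]
```

For example, `missing_files("./data", "de")` returns an empty list once the German files are unzipped into place.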
To reproduce our results, you should apply the TACRED Revisited patch to
the TACRED JSON files. We provide a slightly modified version of the original apply_tacred_patch.py script
that accounts for non-ASCII characters (JSON dump with ensure_ascii=False) and for the reduced number of instances in
dev / test due to translation errors (an id check assertion is removed).
git clone https://github.com/DFKI-NLP/tacrev
Then, for any language, run:
# Dev split
python ./scripts/apply_tacrev_patch.py \
--dataset-file ./data/[lang]/dev_[lang].json \
--patch-file [path/to/tacrev/patch/dev_patch.json] \
--output-file ./data/[l