# GoLLIE
**Guideline-following Large Language Model for Information Extraction**
- 📒 Blog Post: GoLLIE: Guideline-following Large Language Model for Information Extraction
- 📖 Paper: GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction
- <img src="assets/GoLLIE.png" width="20">GoLLIE in the 🤗HuggingFace Hub: HiTZ/gollie
- 🚀 Example Jupyter Notebooks: GoLLIE Notebooks
## Schema definition and inference example
The labels are represented as Python classes, and the guidelines or instructions are introduced as docstrings. The model starts generating after the `result = [` line.
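As a rough sketch of this input format (the class names, docstrings, and the `build_prompt` helper below are illustrative, not the official GoLLIE task definitions), a prompt can be assembled like this:

```python
from dataclasses import dataclass

# Illustrative guideline classes -- the docstrings play the role of
# annotation guidelines. These are examples, not official GoLLIE schemas.
@dataclass
class Person:
    """Refers to a named human being mentioned in the text."""
    span: str  # exact text span of the mention

@dataclass
class Location:
    """Refers to a geographical place, such as a city or country."""
    span: str  # exact text span of the mention

def build_prompt(text: str, schemas) -> str:
    """Render the guideline classes as Python source, append the input
    sentence, and end with the 'result = [' line after which the model
    starts generating."""
    defs = "\n\n".join(
        f'@dataclass\nclass {s.__name__}:\n    """{s.__doc__}"""\n    span: str'
        for s in schemas
    )
    return f"{defs}\n\ntext = {text!r}\n\nresult = ["

prompt = build_prompt("Obama visited Paris.", [Person, Location])
print(prompt.splitlines()[-1])  # prints: result = [
```

The model then completes the list with instantiations of the guideline classes for each mention it finds in `text`.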
## Installation
You will need to install the following dependencies to run the GoLLIE codebase:
- `PyTorch >= 2.0.0` | https://pytorch.org/get-started

  We recommend installing version 2.1.0 or newer, as it includes important bug fixes.

- `transformers >= 4.33.1`

  ```bash
  pip install --upgrade transformers
  ```

- `PEFT >= 0.4.0`

  ```bash
  pip install --upgrade peft
  ```

- `bitsandbytes >= 0.40.0`

  ```bash
  pip install --upgrade bitsandbytes
  ```

- Flash Attention 2.0

  ```bash
  pip install flash-attn --no-build-isolation
  pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
  ```

You will also need these additional dependencies:

```bash
pip install numpy black Jinja2 tqdm rich psutil datasets ruff wandb fschat
```
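A quick way to check the version pins above against your environment (a small helper for convenience, not part of the GoLLIE codebase) is to query the installed package metadata:

```python
from importlib.metadata import version, PackageNotFoundError

# Minimum versions listed in this README
REQUIREMENTS = {
    "torch": "2.0.0",
    "transformers": "4.33.1",
    "peft": "0.4.0",
    "bitsandbytes": "0.40.0",
}

def check_versions(requirements=REQUIREMENTS):
    """Return {package: installed version or None} for each requirement."""
    installed = {}
    for pkg in requirements:
        try:
            installed[pkg] = version(pkg)
        except PackageNotFoundError:
            installed[pkg] = None  # package not installed
    return installed

for pkg, ver in check_versions().items():
    print(f"{pkg}: {ver or 'NOT INSTALLED'}")
```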
## Pretrained models
We release three GoLLIE models based on Code Llama (7B, 13B, and 34B). The models are available in the 🤗HuggingFace Hub.
| Model | Supervised average F1 | Zero-shot average F1 | 🤗HuggingFace Hub |
|---|:---:|:---:|:---:|
| GoLLIE-7B | 73.0 | 55.3 | [HiTZ/GoLLIE-7B](https://huggingface.co/HiTZ/GoLLIE-7B) |
| GoLLIE-13B | 73.9 | 56.0 | [HiTZ/GoLLIE-13B](https://huggingface.co/HiTZ/GoLLIE-13B) |
| GoLLIE-34B | 75.0 | 57.2 | [HiTZ/GoLLIE-34B](https://huggingface.co/HiTZ/GoLLIE-34B) |
## How to use GoLLIE
Please take a look at our 🚀 Example Jupyter Notebooks to learn how to use GoLLIE: GoLLIE Notebooks
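Since the model's completion continues the `result = [` line with instantiations of the guideline classes, one minimal, hypothetical way to recover Python objects from such a completion is to evaluate it in a namespace restricted to the schema classes (the official notebooks ship a more robust dedicated parser; this sketch is only illustrative):

```python
from dataclasses import dataclass

# Illustrative guideline classes, matching the hypothetical schema above.
@dataclass
class Person:
    """Refers to a named human being mentioned in the text."""
    span: str

@dataclass
class Location:
    """Refers to a geographical place, such as a city or country."""
    span: str

def parse_completion(completion: str, schemas) -> list:
    """Evaluate a completion of the 'result = [' line, exposing only the
    guideline classes. NOTE: exec runs arbitrary code; the official
    GoLLIE notebooks use a safer dedicated parser."""
    namespace = {s.__name__: s for s in schemas}
    exec("result = [" + completion, {"__builtins__": {}}, namespace)
    return namespace["result"]

# Hypothetical model completion for "Obama visited Paris."
completion = 'Person(span="Obama"), Location(span="Paris")]'
annotations = parse_completion(completion, [Person, Location])
print(annotations)  # [Person(span='Obama'), Location(span='Paris')]
```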
## Currently supported tasks
This is the list of tasks used for training and evaluating GoLLIE. However, as demonstrated in the 🚀 Create Custom Task notebook, GoLLIE can perform a wide range of unseen tasks. For more info, read our 📖 Paper.
<p align="center"> <img src="assets/datasets.png"> </p>

We plan to continue adding more tasks to the list. If you want to contribute, please feel free to open a PR or contact us. You can use the tasks already implemented in the `src/tasks` folder as examples.
## Generate the GoLLIE dataset
The configuration files used to generate the GoLLIE dataset are available in the `configs/data_configs/` folder. You can generate the dataset by running the following command (see `bash_scripts/generate_data.sh` for more info):
```bash
CONFIG_DIR="configs/data_configs"
OUTPUT_DIR="data/processed_w_examples"

python -m src.generate_data \
    --configs \
    ${CONFIG_DIR}/ace_config.json \
    ${CONFIG_DIR}/bc5cdr_config.json \
    ${CONFIG_DIR}/broadtwitter_config.json \
    ${CONFIG_DIR}/casie_config.json \
    ${CONFIG_DIR}/conll03_config.json \
    ${CONFIG_DIR}/crossner_ai_config.json \
    ${CONFIG_DIR}/crossner_literature_config.json \
    ${CONFIG_DIR}/crossner_music_config.json \
    ${CONFIG_DIR}/crossner_politics_config.json \
    ${CONFIG_DIR}/crossner_science_config.json \
    ${CONFIG_DIR}/diann_config.json \
    ${CONFIG_DIR}/e3c_config.json \
    ${CONFIG_DIR}/europarl_config.json \
    ${CONFIG_DIR}/fabner_config.json \
    ${CONFIG_DIR}/harveyner_config.json \
    ${CONFIG_DIR}/mitmovie_config.json \
    ${CONFIG_DIR}/mitrestaurant_config.json \
    ${CONFIG_DIR}/multinerd_config.json \
    ${CONFIG_DIR}/ncbidisease_config.json \
    ${CONFIG_DIR}/ontonotes_config.json \
    ${CONFIG_DIR}/rams_config.json \
    ${CONFIG_DIR}/tacred_config.json \
    ${CONFIG_DIR}/wikievents_config.json \
    ${CONFIG_DIR}/wnut17_config.json \
    --output ${OUTPUT_DIR} \
    --overwrite_output_dir \
    --include_examples
```
We do not redistribute the datasets used to train and evaluate GoLLIE. Not all of them are publicly available; some require a license to access them.
For the datasets available in the HuggingFace Datasets library, the script will download them automatically.
For the following datasets, you must provide the path to the dataset by modifying the corresponding `configs/data_configs/` file: ACE05 (preprocessing script), CASIE, CrossNER, DIANN, E3C, HarveyNER, MitMovie, MitRestaurant, RAMS, TACRED, WikiEvents.
Regarding the ACE05 dataset, you can obtain the splits from the code of the OneIE paper: http://blender.cs.illinois.edu/software/oneie/
