CRAFT: Corpus Retrieval and Augmentation for Fine-Tuning

The CRAFT Pipeline

This repository contains the code, datasets, and additional files required for the paper CRAFT Your Dataset: Task-Specific Synthetic Data Generation Through Corpus Retrieval and Augmentation.

Synthetic Datasets

We make all size variations of our crafted datasets available on Hugging Face:

BioQA: https://huggingface.co/datasets/ingoziegler/CRAFT-BioQA
CommonSenseQA (CSQA): https://huggingface.co/datasets/ingoziegler/CRAFT-CommonSenseQA
MedQA: https://huggingface.co/datasets/ingoziegler/CRAFT-MedQA
RecipeGen: https://huggingface.co/datasets/ingoziegler/CRAFT-RecipeGen
Summarization: https://huggingface.co/datasets/ingoziegler/CRAFT-Summarization

To use our human-written few-shots, simply filter the dataset for is_few_shot == 1, or load the .jsonl from assets/{task}/few-shot/corpus-task-32.jsonl. Our 8 few-shot-based experiments simply use the first 8 lines of each file.
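As a minimal sketch of both options, the helpers below read the first k few-shots from a local .jsonl file and filter already-loaded rows by the is_few_shot flag. The function names and the example path in the comment are our own illustration, not part of the repository code, and we assume each line of the file is one JSON object.

```python
import json
from typing import Dict, List


def load_few_shots(path: str, k: int = 8) -> List[Dict]:
    """Return the first k non-empty JSON lines of a few-shot file."""
    shots: List[Dict] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if len(shots) >= k:
                break
            line = line.strip()
            if line:
                shots.append(json.loads(line))
    return shots


def filter_few_shots(rows: List[Dict]) -> List[Dict]:
    """Keep only rows marked as human-written few-shots (is_few_shot == 1)."""
    return [row for row in rows if row.get("is_few_shot") == 1]


# Hypothetical usage; replace {task} with the task folder you are working on:
# shots = load_few_shots("assets/{task}/few-shot/corpus-task-32.jsonl", k=8)
```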

Performance

Models trained on our synthetic datasets match the performance of general instruction-tuned LLMs and can even outperform models trained on human-curated data for tasks like summarization.

CRAFT Performance

Our synthesized data is also more robust against distribution shifts because we do not generate data for a specific test set, but for an overall task. This can be seen by comparing the 5-gram overlap between our crafted datasets and the test sets with the overlap between the in-domain train sets and the test sets. Our synthetic datasets have at most 0.4% overlap, while in-domain train sets reach up to 17.9% 5-gram overlap (see table below).

| | BioQA | CSQA | MedQA | Summarization |
|-----------------------------------------------|-------|------|-------|---------------|
| CRAFT-XS | 0.0% | 0.0% | 0.0% | 0.0% |
| CRAFT-S | 0.0% | 0.1% | 0.1% | 0.0% |
| CRAFT-M | 0.0% | 0.2% | 0.1% | 0.1% |
| CRAFT-L | 0.0% | 0.4% | 0.3% | 0.2% |
| CRAFT-XL | 0.0% | 0.2% | 0.2% | 0.2% |
| Baseline (In-domain Train Set) | 17.9% | 4.4% | 1.1% | 0.3% |
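To make the overlap metric concrete, here is a small sketch of one way to compute an n-gram overlap between a training set and a test set. It is an assumption on our side that overlap is measured as the share of distinct test-set 5-grams that also occur in the training set, and that texts are lowercased and split on whitespace; the exact tokenization and normalization used for the table may differ.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Distinct word n-grams of a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(train_texts: Iterable[str], test_texts: Iterable[str], n: int = 5) -> float:
    """Fraction of distinct test n-grams that also appear in the training texts."""
    train: Set[Tuple[str, ...]] = set()
    for text in train_texts:
        train |= ngrams(text, n)
    test: Set[Tuple[str, ...]] = set()
    for text in test_texts:
        test |= ngrams(text, n)
    if not test:
        return 0.0
    return len(test & train) / len(test)
```

Multiply the result by 100 to obtain a percentage comparable to the table entries.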

Consequently, CRAFT's performance is stronger and more consistent across other test sets, e.g., MMLU.

| Dataset | Baseline | CRAFT-XL |
|------------------------------------|----------|--------------------|
| In-domain | 89.9 | 78.1 |
| MMLU Medical Genetics | 60.0 | 69.0 |
| MMLU Anatomy | 55.6 | 57.0 |
| MMLU High School Biology | 69.3 | 67.4 |
| MMLU College Biology | 66.7 | 74.3 |
| MMLU-Avg | 62.9 | 66.9 |

Adapter Checkpoints:

Here, we provide the download links for the adapter checkpoints resulting from our fine-tuning on the CRAFT-XL versions.

BioQA: https://huggingface.co/ingoziegler/CRAFT-BioQA-XL
CommonSenseQA (CSQA): https://huggingface.co/ingoziegler/CRAFT-CommonSenseQA-XL
MedQA: https://huggingface.co/ingoziegler/CRAFT-MedQA-XL
RecipeGen: https://huggingface.co/ingoziegler/CRAFT-RecipeGen-XL
Summarization: https://huggingface.co/ingoziegler/CRAFT-Summarization-XL

Running CRAFT

Our experiments are based on Python 3.10.9, PyTorch 2.2.1, and vLLM 0.4.1. For details, check requirements.txt.

The pipeline consists of 5 steps that have to be performed sequentially.

If you want to CRAFT your own dataset:

In general, you have to follow the same steps as if you were reproducing our experiments as described below.

Embedding Database

You have three options: use our embedding database and corpora mentioned under Step 0 below, extend our embedding database with your own public or private corpora, or create your own specialized embedding database with corresponding corpora from your private or other public datasets.

Currently, our code only supports the experiments from our paper out of the box, so you will need to adapt the scripts and run configs a bit. You can still use large parts of our run configs in code/run_configs/ as a baseline, but you will need to change the paths pointing to our database and corpus files. Additionally, for fine-tuning and evaluation, you will need to write code for your task sample design, provide your evaluation dataset, and format it accordingly. Nonetheless, the general structure stays the same.

To create an embedding database, we provide the files we used to embed our corpora and create the database.

Run python3 code/create_embeddings.py $(cat code/run_configs/embed/stackexchange.cfg)
This will create an .h5 database with 16-bit precision NumPy arrays using multi-qa-MiniLM-L6-cos-v1 from the SentenceTransformer suite as the embedding model.
The embedding database is set up so that each document's embedding corresponds to one 'row' in the H5 database.
Therefore, the i-th document in your corpus corresponds to the i-th row of the embedding database: you can enumerate the documents in your corpus and retrieve the matching array by index, or vice versa.
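The row alignment can be illustrated with a dependency-free sketch. Here, plain Python lists stand in for the rows you would read from the .h5 file, and the ranking by cosine similarity is our assumption (suggested by the embedding model's name), not a description of the repository's retrieval code. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_documents(
    query_vec: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    corpus: Sequence[str],
    k: int = 2,
) -> List[Tuple[int, str]]:
    """Rank rows by similarity to the query and map row indices back to documents."""
    if len(embeddings) != len(corpus):
        raise ValueError("embeddings and corpus must be aligned row by row")
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: cosine(query_vec, embeddings[i]),
        reverse=True,
    )
    # Row i of the embedding matrix belongs to document i of the corpus.
    return [(i, corpus[i]) for i in ranked[:k]]
```

Because the mapping is purely positional, the corpus must be enumerated in the same order that was used when the embeddings were created.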

New tasks

You need to create 8 to 32 few-shots with the content and design of your task. See our provided few-shots and the image below as examples for the different tasks.
- Place them under assets/{task}/few-shot/corpus-task-32.jsonl
The rest of the pipeline stays the same. Continue with Step 1 below.
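A minimal sketch for writing your own few-shots to the expected location is shown below. The helper name is hypothetical, and the fields of each example are up to your task design; check our provided few-shot files for the fields the pipeline expects. The sketch only enforces the 8 to 32 range and the file path mentioned above.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_few_shots(task: str, examples: List[Dict], root: str = "assets") -> Path:
    """Write 8 to 32 few-shot examples as JSON lines for the given task."""
    if not 8 <= len(examples) <= 32:
        raise ValueError(f"expected 8 to 32 few-shots, got {len(examples)}")
    out_path = Path(root) / task / "few-shot" / "corpus-task-32.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for example in examples:
            # One JSON object per line; keep non-ASCII characters readable.
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
    return out_path
```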

Few-Shot Design

If you have any questions, feel free to open a GitHub issue.

Reproducing our experiments

Have a look at code/utils/args.py for all available runtime arguments. We provide our pre-filled argparse run configs for all experiments under code/run_configs/. The few-shots for all tasks are also available in assets/{task}/few-shot/corpus-task-32.jsonl, so you can start running/reproducing our experiments.

Step 0: Download required files and set up the directory structure

Embedding database: Download our embedding database from the link below and place it under datasets/embeddings.h5
- http://data.cis.lmu.de/data/craft/embeddings.h5
- Please note that we host the files on an http address. If your browser autocompletes to https, you may need to manually adjust the link. Sometimes, you may also have to paste the address into the search bar directly.
- We provide the sha256 checksum for the file in this repository in checksum
C4: Download the 305GB en version of C4 from Hugging Face. We used the Git download version. Note that, although it is not mentioned there, you have to run git lfs checkout after everything is downloaded so that the lazy files are actually linked to the downloaded files.
Wikipedia: Download our cleaned Wikipedia corpus samples from the link below and place them under datasets/wikipedia/cleaned/
- http://data.cis.lmu.de/data/craft/wikipedia_cleaned.tar.gz
- Please note that we host the files on an http address. If your browser autocompletes to https, you may need to manually adjust the link. Sometimes, you may also have to paste the address into the search bar directly.
- We provide the sha256 checksum for the file in this repository in checksum
WikiHow: Download our cleaned WikiHow corpus samples from the link below and place them under datasets/wikihow/cleaned/
- http://data.cis.lmu.de/data/craft/wikihow_cleaned.tar.gz
- Please note that we host the files on an http address. If your browser autocompletes to https, you may need to manually adjust the link. Sometimes, you may also have to paste the address into the search bar directly.
- We provide the sha256 checksum for the file in this repository in checksum
StackExchange: Download our cleaned StackExchange corpus samples from the link below and place them under datasets/stackexchange/cleaned/
- http://data.cis.lmu.de/data/craft/stackexchange_cleaned.tar.gz
- Please note that we host the files on an http address. If your browser autocompletes to https, you may need to manually adjust the link. Sometimes, you may also have to paste the address into the search bar directly.
- We provide the sha256 checksum for the file in this repository in checksum
Make sure that each task folder under assets/ has the following subfolders available: assets/{task}/corpus_samples/, assets/{task}/outputs/, assets/{task}/results/, assets/{task}/task_samples/
Create a model_ckpts directory. All LoRA adapters will be saved here.
Create a models/hf_models directory and place the model you want to use for task sample creation in there (e.g. Mistral 7B Instruct v0.2), as well as the model you want to fine-tune (e.g. Mistral 7B v0.2), and the model you want to evaluate again (e.g. Mistral 7B Instruct
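The directory setup and the checksum verification from Step 0 can be sketched as follows. Both helper names are our own illustration and not part of the repository. We assume the directories are created relative to the repository root, and that you compare the computed digest manually against the value provided in this repository.

```python
import hashlib
from pathlib import Path
from typing import List, Sequence

# Subfolders each task folder under assets/ needs, as listed in Step 0.
TASK_SUBFOLDERS = ("corpus_samples", "outputs", "results", "task_samples")


def ensure_layout(root: str, tasks: Sequence[str]) -> List[Path]:
    """Create the per-task subfolders plus model_ckpts and models/hf_models."""
    base = Path(root)
    targets = [base / "assets" / task / sub for task in tasks for sub in TASK_SUBFOLDERS]
    targets += [base / "model_ckpts", base / "models" / "hf_models"]
    for directory in targets:
        directory.mkdir(parents=True, exist_ok=True)
    return targets


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage after downloading the embedding database:
# print(sha256_of("datasets/embeddings.h5"))
```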
