# [NeurIPS 2024] MATES<img src="assets/avatar.png" alt="drawing" style="height: 1em;">: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
<p align="center">
  <a href='https://huggingface.co/yuzc19/pythia-410m-mates'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Main Model-blue'></a>
  <a href='https://huggingface.co/yuzc19/bert-base-uncased-data-influence-model-lambada'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data Influence Model-blue'></a>
</p>

This is the official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models. The implementation is mainly based on LitGPT, which is easy to get started with, use, and modify.
<br>
<p align="center">
  <img src="assets/MATES.png" width="600">
</p>
<br>
## 1 Environment

**Python version**

The code is tested on Python 3.9.17.

**Install basic dependencies**

```bash
pip install -r requirements.txt
```
## 2 Dataset

We use a tokenized version of the C4 dataset in our code. Please ensure your disk has at least 500 GB of free space for this dataset. To get the training data for the initial 10k warmup steps, please run:

```bash
python src/select_data/select_data.py
```

- The selected data will be saved in `data/c4/pythia-410m/random/0`.

To preprocess our reference task, LAMBADA, please run:

```bash
python src/select_data/prepare_lambada.py
```

- The processed data will be saved in `data/lambada_openai`.
## 3 Experiments

Our main experiments use 8 GPUs for parallelization.
### 3.1 Pretraining

We run pretraining stage by stage to facilitate model-aware data selection. Each stage consists of 10k steps. For the initial 10k warmup steps, run:

```bash
model_name=pythia-410m \
method=random \
ckpt=0 \
decay=false \
bash scripts/pretrain.sh
```

`ckpt=0` denotes training from scratch.
To resume pretraining from previous steps (e.g., 10k), run:

```bash
model_name=pythia-410m \
method=random \
ckpt=40000 \
decay=false \
bash scripts/pretrain.sh
```

`ckpt=40000` corresponds to the 10k-step checkpoint because our gradient accumulation step is 4. `method=random` denotes random data selection; replace it with `mates` to use MATES after the first 10k steps.
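Assuming each stage is 10k optimizer steps and the `ckpt` index counts micro-batch iterations (steps times gradient accumulation), the checkpoint values used throughout this README follow from:

```python
# Assumed relation between optimizer steps and the ckpt index:
# with gradient accumulation of 4, each 10k-step stage advances ckpt by 40000.
steps_per_stage = 10_000
grad_accum = 4
ckpt_per_stage = steps_per_stage * grad_accum
print(ckpt_per_stage)  # 40000
print([i * ckpt_per_stage for i in range(1, 3)])  # [40000, 80000]
```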
### 3.2 Data Selection

After the first 10k steps, we run the MATES data selection process every 10k steps. One data selection process consists of four steps:

1️⃣ Get oracle data influence:

```bash
model_name=pythia-410m \
method=random \
ckpt=40000 \
bash scripts/probe_oracle_data_influence.sh
```

- For the 10k checkpoint, use `method=random`; for the following checkpoints, use `method=mates`.
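Conceptually, the oracle influence of a candidate example is the change in the reference-task loss after taking one training step on that example. Below is a toy illustration of that idea with a one-parameter model and squared loss; it is a didactic sketch only, not the repo's implementation:

```python
# Toy illustration of "oracle" data influence: the change in reference loss
# after one SGD step on a single candidate example (not the repo's code).

def sq_loss(w, x, y):
    """Squared loss of a one-parameter linear model."""
    return float((w * x - y) ** 2)

def oracle_influence(w, candidate, reference, lr=0.1):
    x_c, y_c = candidate
    # one SGD step on the candidate example
    grad = 2 * (w * x_c - y_c) * x_c
    w_new = w - lr * grad
    x_r, y_r = reference
    # influence = reference loss after the step minus before
    return sq_loss(w_new, x_r, y_r) - sq_loss(w, x_r, y_r)

# A candidate aligned with the reference lowers reference loss (negative
# influence); a conflicting candidate raises it (positive influence).
helpful = oracle_influence(w=0.0, candidate=(1.0, 1.0), reference=(1.0, 1.0))
harmful = oracle_influence(w=0.0, candidate=(1.0, -1.0), reference=(1.0, 1.0))
print(helpful < 0 < harmful)  # True
```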
2️⃣ Train the data influence model:

```bash
model_name=pythia-410m \
ckpt=40000 \
bash scripts/train_data_influence_model.sh
```
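The data influence model generalizes the probed oracle scores to unseen data. The repo fine-tunes a BERT-based regressor; the sketch below substitutes linear least squares on toy features purely to illustrate the fit-then-predict pattern (all names and shapes are hypothetical):

```python
import numpy as np

# Didactic sketch of the data-influence-model idea: fit a regressor from
# example features to observed oracle influence scores, then score new data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 8))   # features of probed examples
w_true = rng.normal(size=8)
y_train = X_train @ w_true            # their oracle influence scores

# fit the toy "data influence model"
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# predict influence for new (unprobed) examples
X_new = rng.normal(size=(5, 8))
pred = X_new @ w_hat
print(np.allclose(pred, X_new @ w_true))  # True
```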
3️⃣ Predict data influence:

```bash
model_name=pythia-410m \
ckpt=40000 \
bash scripts/predict_data_influence.sh
```
4️⃣ Select the training data for the next 10k steps:

```bash
python src/select_data/select_data.py --model_name pythia-410m --method mates --ckpt 40000
```

- The selected data will be saved in `data/c4/pythia-410m/mates/40000`.
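The selection step keeps the examples with the highest predicted influence. A minimal sketch of that top-k selection (hypothetical function name, illustrative scores):

```python
import numpy as np

def select_top_k(scores, k):
    """Return indices of the k examples with the highest predicted influence."""
    scores = np.asarray(scores)
    # argsort ascending, take the last k, reverse for descending order
    return np.argsort(scores)[-k:][::-1]

scores = [0.1, 0.9, -0.3, 0.5]
print(select_top_k(scores, 2).tolist())  # [1, 3]
```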
### 3.3 Evaluation

1️⃣ For better stability, we advise evaluating intermediate checkpoints after the decay stage:

```bash
model_name=pythia-410m \
method=mates \
ckpt=80000 \
decay=true \
bash scripts/pretrain.sh
```
2️⃣ We provide a simple evaluation example here; you can modify the parameters based on your needs:

```bash
model_name=pythia-410m \
method=mates \
ckpt=80800 \
bash scripts/eval.sh
```

- After running the evaluation script, you can find the results in `results/c4/$model/$method/iter-$ckpt-ckpt/results.json`.
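To compare runs, you can pull a single metric out of each `results.json`. A hypothetical sketch (the `lambada_openai`/`acc` field names below are illustrative assumptions, not the repo's documented schema):

```python
import json
import tempfile

def read_metric(path, task, metric):
    """Read one metric from a results.json-style file."""
    with open(path) as f:
        return json.load(f)[task][metric]

# toy stand-in for results/c4/$model/$method/iter-$ckpt-ckpt/results.json
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"lambada_openai": {"acc": 0.42}}, f)

print(read_metric(f.name, "lambada_openai", "acc"))  # 0.42
```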
## 4 Citation

Please cite our paper if you use MATES in your work:

```bibtex
@inproceedings{yu2024mates,
  title={MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models},
  author={Yu, Zichun and Das, Spandan and Xiong, Chenyan},
  booktitle={NeurIPS},
  year={2024}
}
```
