# ProX
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
<p align="center"> <img src="./static/images/prox-logo.png"> </p> <a href="https://huggingface.co/gair-prox" target="_blank"> <img alt="Models" src="https://img.shields.io/badge/🤗-HuggingFace Repo-blue" /> </a> <a href="https://arxiv.org/abs/2409.17115" target="_blank"> <img alt="Paper" src="https://img.shields.io/badge/📑-Paper-blue" /> </a> <a href="https://gair-nlp.github.io/ProX/" target="_blank"> <img alt="Project Page" src="https://img.shields.io/badge/🧪-Project Page-blue" /> </a> <a href="https://opensource.org/license/apache-2-0" target="_blank"> <img alt="License: apache-2-0" src="https://img.shields.io/github/license/saltstack/salt" /> </a> <a href="https://github.com/GAIR-NLP/ProX" target="_blank"> <img alt="GitHub Stars" src="https://img.shields.io/github/stars/GAIR-NLP/ProX?style=social" /> </a> <a href="https://github.com/GAIR-NLP/ProX/issues" target="_blank"> <img alt="Open Issues" src="https://img.shields.io/github/issues-raw/GAIR-NLP/ProX" /> </a>🔥 News
- [17 February, 2025]: 🎉 We release DCLM-pro, a further cleaned version of DCLM-baseline containing over 500B tokens ready for pre-training. Preliminary experiments show that models trained on DCLM-pro achieve over 1.5% average performance gain within 50B training tokens.
- [10 October, 2024]: 🎉 We release the codebase for large-scale data refining, together with the refining models on 🤗 Huggingface: Prox-Refining-LMs.
- [19 September, 2024]: 🎉 We open-sourced the pre-training corpora curated by our ProX framework, containing over 100B tokens of high-quality general-domain data and ~5B tokens of high-quality math data, together with models (ProX and ProXMath) trained on these data.
## Table of Contents
- 🚀 Introduction
- Setup
- Large Scale Data Refining
- Training on ProX curated data
- Evaluation
## 🚀 Introduction
🫐 ProX is an LM-based data refinement framework that improves the quality of data used for pre-training large language models. Instead of relying on human experts to hand-craft rules, ProX treats data refinement as a programming task, which allows models to automatically clean and improve each data example at scale.

Currently, 🫐 ProX-curated data have gone through two levels of programming + execution: doc-level and chunk-level.
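For intuition, here is a small hypothetical sketch of what doc-level and chunk-level refining programs could look like and how they would be executed. The operator names and calling convention below are illustrative assumptions, not the exact interface used by 🫐 ProX; the real prompt formats and executors live in the paper and in the scripts under scripts/data_gen/.

```python
# Hypothetical illustration of ProX-style doc-level / chunk-level refining.
# Operator names (keep_doc, drop_doc, remove_lines) are assumptions for
# intuition only; see the paper and scripts/data_gen/ for the real interface.

doc_lines = [
    "BUY CHEAP WATCHES!!!",                                   # spam line
    "The Pythagorean theorem states that a^2 + b^2 = c^2.",   # useful content
    "click here to subscribe",                                # boilerplate
]

# Doc-level program: the refining LM decides whether to keep the document at all.
doc_program = "keep_doc()"          # or "drop_doc()" for pure spam pages

# Chunk-level program: fine-grained cleaning inside kept documents.
chunk_program = "remove_lines(start=0, end=0)\nremove_lines(start=2, end=2)"

def execute(program, lines):
    """Tiny executor for the two hypothetical operators above."""
    drop = set()
    for stmt in program.splitlines():
        if stmt.startswith("drop_doc"):
            return None                                  # discard the whole document
        if stmt.startswith("remove_lines"):
            args = dict(kv.split("=") for kv in stmt[len("remove_lines("):-1].split(", "))
            drop.update(range(int(args["start"]), int(args["end"]) + 1))
    return [l for i, l in enumerate(lines) if i not in drop]

if execute(doc_program, doc_lines) is not None:          # doc survives the doc-level pass
    print("\n".join(execute(chunk_program, doc_lines)))
# -> The Pythagorean theorem states that a^2 + b^2 = c^2.
```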
Key Features:
- Better Performance: Models trained on ProX-refined data perform over 2% better than those trained on raw data or data filtered by rule-based methods.
- Domain Flexibility: 🫐 ProX works well across different domains, boosting accuracy by up to 20% on tasks like math without needing special manual adjustments.
- Efficient and Scalable: Even small models (as small as 0.3B parameters) can refine data effectively, comparable to human experts, saving resources compared to LLM-based data synthesis.
- Cost-Effective: In general, 🫐 ProX can significantly reduce training compute while maintaining strong results.
## Setup
First, install all the libraries listed in requirements.txt:
```bash
git clone https://github.com/GAIR-NLP/ProX.git prox
cd prox
conda create -n prox python=3.10
conda activate prox
pip install -r requirements.txt
```
For acceleration, we need to install flash-attention with some fused kernels:
<details>
<summary>Click me</summary>

```bash
pip install flash-attn --no-build-isolation

# this part is quite similar to the TinyLlama repo
# you can also refer to its detailed guide at: https://github.com/jzhang38/TinyLlama/blob/main/PRETRAIN.md
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../.. && rm -rf flash-attention
```

</details>
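Optionally, a quick sanity check that the install succeeded. The three fused-kernel module names below are assumptions about what the csrc/ extensions typically build as; adjust them if your flash-attention version differs.

```python
# Quick sanity check for the flash-attention install.
# The fused-kernel module names are assumptions based on the csrc/ extensions
# built above; they may differ across flash-attention versions.
import importlib

import flash_attn
print("flash-attn version:", flash_attn.__version__)

for mod in ("rotary_emb", "dropout_layer_norm", "xentropy_cuda_lib"):
    try:
        importlib.import_module(mod)
        print(f"{mod}: ok")
    except ImportError as err:
        print(f"{mod}: missing ({err})")
```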
Then, we can install lighteval & math-eval for evaluation
<details>
<summary><b>lighteval</b></summary>

```bash
conda create -n lmeval python=3.10
conda activate lmeval
git clone https://github.com/huggingface/lighteval.git
cd lighteval
pip install -e .
```

</details>
<details>
<summary><b>math-eval</b></summary>

```bash
git clone https://github.com/GAIR-NLP/math-evaluation-harness.git
cd math-evaluation-harness
conda create -n math_eval python=3.10
conda activate math_eval
pip install -r requirements.txt
```

</details>
<details>
<summary><b>Pre-training Corpora Download</b></summary>

To facilitate straightforward, apples-to-apples comparisons in follow-up work, we provide the download details for the pre-training corpora we used:

- C4: We downloaded the full set from the following link: https://huggingface.co/datasets/allenai/c4/tree/main/en
- FineWeb: We employed a 350B-token sample from https://huggingface.co/datasets/HuggingFaceFW/fineweb/tree/main/sample/350BT, randomly shuffled the data, split it into 7 dumps, and conducted experiments using the first two dumps.
- DCLM-baseline: We conducted experiments on a global shard, i.e., https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0/tree/main/global-shard_01_of_10. We applied ProX to the first five global shards for the open-source release, i.e., DCLM-Pro.
- RedPajama-V2: Please refer to the scripts auto_download_redpajama.sh and reshuffle_data.py located in the scripts/data_download folder for instructions on downloading and shuffling the data.

</details>
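For the FineWeb sample above, one possible way to pull just the 350BT subset is via `huggingface_hub.snapshot_download`. This is a minimal sketch; the `allow_patterns` filter and `local_dir` are illustrative assumptions, not the exact commands we used (see scripts/data_download/ for our own tooling).

```python
# Minimal sketch: download only the FineWeb sample/350BT subset.
# allow_patterns and local_dir are illustrative assumptions;
# adapt them (or use scripts/data_download/*) for your own setup.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="HuggingFaceFW/fineweb",
    repo_type="dataset",
    allow_patterns=["sample/350BT/*"],   # only the 350B-token sample
    local_dir="./raw_data/fineweb-350BT",
)
```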
## Large Scale Data Refining
If you want to refine your own data with ProX, please make sure you set up a new environment:
```bash
# create a new conda env
conda create -n refining python=3.10
conda activate refining

# install requirements
pip install -r refining_requirements.txt
```
We have released two families of refining models:
- WebRefining-LM: for general web domain, including web-doc-refining-lm and web-chunk-refining-lm
- MathRefining-LM: for math domain, including math-doc-refining-lm and math-chunk-refining-lm
You can refer to the following example Slurm scripts to refine large-scale pre-training data.
```bash
# 1. doc-level refining
sbatch scripts/data_gen/example_doc_refining.sh

# 2. chunk-level refining
sbatch scripts/data_gen/example_chunk_refining.sh
```
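If you are not on a Slurm cluster, the refining models can also be driven directly as ordinary causal LMs, e.g., with vLLM. The sketch below is only illustrative: the repo id is inferred from the model names above, and the raw-document prompt and greedy decoding are assumptions; the actual prompt construction and program parsing live in the scripts under scripts/data_gen/.

```python
# Minimal sketch: run a ProX refining model on one document with vLLM.
# The repo id, prompt format, and sampling settings are assumptions;
# see scripts/data_gen/ for the prompting and parsing actually used.
from vllm import LLM, SamplingParams

llm = LLM(model="gair-prox/web-doc-refining-lm")
params = SamplingParams(temperature=0.0, max_tokens=64)

document = "BUY CHEAP WATCHES!!!\nThe Pythagorean theorem states a^2 + b^2 = c^2."
outputs = llm.generate([document], params)   # model emits a refining program for the doc

print(outputs[0].outputs[0].text)            # e.g. a doc-level program such as keep_doc()/drop_doc()
```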
## Training on ProX curated data
We provide over 100B tokens of high-quality general-domain data and ~5B tokens of high-quality math data. You can directly train your own model on these data.
Here we provide an example of downloading and tokenizing 🫐 ProX data, training a model with litgpt, and finally running a thorough evaluation. Feel free to modify the scripts to fit your own needs.
The first step is to set up your environment variables:
```bash
# 1. using setup_personal_env and setup_common_env
source setup_personal_env.sh
source setup_common_env.sh
```
Then you can download and tokenize the data:
```bash
# 2. download the data, e.g., RedPajama-pro
python scripts/data_download/hf_download.py \
    --dataset_name gair-prox/RedPajama-pro

# 3. tokenize the data
export PYTHONPATH=$PYTHONPATH:$TINYLM_WORK_DIR/train
python -m train.data_tokenize.prepare_web \
    --source_path $RAW_DATA_DIR/gair-prox/RedPajama-pro \
    --tokenizer_path $TINYLM_WORK_DIR/vocab_files/llama_hf \
    --destination_path $TOKENIZE_DATA_DIR/RedPajama-pro/llama \
    --split train \
    --percentage 1.0
```
You should see many ".bin" files in the destination path. Then you can train a model using the tokenized data.
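As a quick sanity check, you can list the produced ".bin" shards and their total size. This is a stdlib-only sketch; the directory layout is assumed from the --destination_path used above.

```python
# Quick sanity check: list the tokenized .bin shards produced above.
# The destination path is an assumption matching --destination_path earlier.
import os
from pathlib import Path

dest = Path(os.environ["TOKENIZE_DATA_DIR"]) / "RedPajama-pro" / "llama"
bins = sorted(dest.rglob("*.bin"))

total_gb = sum(p.stat().st_size for p in bins) / 1e9
print(f"found {len(bins)} .bin shards, {total_gb:.1f} GB total")
```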
We run the training script using slurm:
```bash
# 4. train / convert / evaluate using slurm + multiple nodes
sbatch scripts/train/tlm/pt_tlm_xs_redpj_prox.sh
```
You can also run the training script on a single local node 👇
<details>
<summary>click me</summary>

```bash
# 4.1 train locally
cd train
export PYTHONPATH=$PYTHONPATH:$TINYLM_WORK_DIR/train
python -m pretrain.tinyllama \
    --config_path $TINYLM_WORK_DIR/configs/general/<your_config>.yaml

# 4.2 convert to HF model
# --litgpt_model_dir: the model dir you want to convert under ${PT_MODEL_OUTPUT_DIR}
# --hf_model_dir: the model dir you want to save under ${HF_MODEL_OUTPUT_DIR}
# --save_token_interval: the interval to save checkpoints; e.g., 1024 * 2048 * 500 steps is approx. 1B tokens
# --arch_name: the model architecture name in train/lit_gpt/config.py
python -m scripts.weight_conversion.batch_model_conversion \
    --litgpt_model_dir pt_llama_0_3b_redpj_25B_prox \
    --hf_model_dir pt_llama_0_3b_redpj_25B_prox \
    --save_token_interval 1 \
    --arch_name tiny_LLaMA_0_3b
```

</details>
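Once converted, the checkpoint should load like any standard HF causal LM. This is a minimal sketch; the exact path under $HF_MODEL_OUTPUT_DIR depends on the conversion arguments above, and it assumes tokenizer files were written alongside the weights (otherwise point AutoTokenizer at the original llama_hf vocab).

```python
# Minimal sketch: load a converted checkpoint as a standard HF model.
# The checkpoint path is an assumption based on --hf_model_dir above.
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = os.path.join(os.environ["HF_MODEL_OUTPUT_DIR"], "pt_llama_0_3b_redpj_25B_prox")
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

inputs = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```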
## Evaluation

### General Evaluation
We evaluate the model using lighteval across 10 standard tasks:
- ARC (ARC-Easy, ARC-Challenge)
- CommonsenseQA
- Hellaswag
- MMLU
- OpenbookQA
- PIQA
- SocialIQA
- WinoGrande
- SciQ
Actually, in sbatch script, we hav