</div> <p align="center"> <img src="./images/first_table2.jpg" width="500" /> </p> <p align="center"> | <a href="https://arxiv.org/abs/2403.00799">ArXiv</a> | <a href="https://pan.quark.cn/s/2d16e640ed07">Models</a> | <a href="https://huggingface.co/datasets/cyzhh/MMOS">Data</a> | <a href="https://github.com/cyzhh/MMOS">Code</a> | </p>

🔥 News

[2024/6/22] Revised the paper and added experiments on automatic theorem proving tasks. The code is in MMOS-F2F.
[2024/3/30] Updated results for MMOS-Code 34B and MMOS-LLEMMA 34B. Note the vllm and transformers versions.
[2024/3/8] 🔥🔥🔥 MMOS-DeepSeekMath 7B performs well with self-consistency and k=50!
[2024/2/28] 🔥 MMOS-DeepSeekMath 7B performs well and is released at MMOS-DeepSeekMath 7B!
[2024/2/27] 🔥 MMOS-LLEMMA 7B performs well and is released at MMOS-LLEMMA 7B!
[2024/2/27] 🔥 MMOS-CODE 13B and MMOS-CODE 34B are released at MMOS-CODE 13B and MMOS-CODE 34B!
[2024/2/27] 🔥 MMOS-CODE 7B is released at MMOS-CODE 7B!
[2024/2/26] 🔥🔥🔥 The MMOS dataset is released at 😊 HuggingFace!
[2024/2/23] 🔥🔥🔥 The paper is released on arXiv: An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning.

💡 Introductions & Performances

The Mix of Minimal Optimal Sets (MMOS) dataset has two advantages for math reasoning: higher performance and lower construction cost.

| Model | Size | GSM8K | SVAMP | ASDiv | MATH | Size | GSM8K | SVAMP | ASDiv | MATH | Size | GSM8K | SVAMP | ASDiv | MATH |
|------------------|------|-------|-------|-------|------|------|-------|-------|-------|------|------|-------|-------|-------|------|
| WizardMath | 7B | 54.9 | 57.3 | 59.1 | 10.7 | 13B | 63.9 | 64.3 | 65.8 | 14.0 | 34B | - | - | - | - |
| MAMMOTH | 7B | 53.6 | 67.7 | 31.5 | - | 13B | 62.0 | 72.4 | - | 34.2 | 34B | - | - | - | - |
| MetaMath | 7B | 66.5 | - | - | 19.8 | 13B | 72.3 | - | - | 22.4 | 34B | - | - | - | - |
| MathCoder-L | 7B | 64.2 | 71.5 | - | 23.3 | 13B | 72.6 | 76.9 | - | 29.9 | 34B | - | - | - | - |
| MathCoder-CL | 7B | 67.8 | 70.7 | - | 30.2 | 13B | 74.1 | 78.0 | - | 35.9 | 34B | - | - | - | - |
| TORA | 7B | 68.8 | 68.2 | 73.9 | 40.1 | 13B | 72.7 | 72.9 | 77.2 | 43.0 | 34B | - | - | - | - |
| TORA-CODE | 7B | 72.6 | 70.4 | 78.7 | 44.6 | 13B | 75.8 | 75.7 | 81.4 | 48.1 | 34B | 80.7 | 80.5 | 84.2 | 50.8 |
| MMOS | 7B | 69.9 | 73.4 | 76.8 | 40.2 | 13B | 74.8 | 77.0 | 80.0 | 43.2 | 34B | - | - | - | - |
| MMOS-CODE | 7B | 73.9 | 76.4 | 78.6 | 44.3 | 13B | 77.1 | 77.5 | 81.9 | 48.1 | 34B | 81.7 | 81.9 | 82.8 | 48.8 |
| MMOS-MinCODE | 7B | 70.3 | 72.5 | 76.7 | 44.6 | 13B | - | - | - | - | 34B | - | - | - | - |
| MMOS-LLEMMA | 7B | 76.5 | 77.7 | 81.4 | 48.8 | 13B | - | - | - | - | 34B | 82.8 | 81.8 | 84.8 | 51.3 |
| MMOS-DeepSeekMath | 7B | 80.5 | 79.3 | 87.6 | 55.0 | 13B | - | - | - | - | 34B | - | - | - | - |
| MMOS-DeepSeekMath(SC,k=50) | 7B | 87.2 | - | - | 63.7 | 13B | - | - | - | - | 34B | - | - | - | - |
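
The last row of the table reports self-consistency (SC) with k=50, i.e. sampling k reasoning paths per question and taking a majority vote over their final answers. A minimal sketch of that voting step is shown here, assuming the final answer of each sampled path has already been extracted as a string. The function name `majority_vote` and the whitespace normalization are illustrative assumptions, not code from this repository.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer among the sampled paths.

    Ties are broken in favor of the answer that was seen first.
    """
    # Drop failed samples (None or blank) and strip surrounding whitespace.
    cleaned = [a.strip() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    counts = Counter(cleaned)
    # Counter preserves insertion order, and max() returns the first maximum,
    # so the earliest answer wins a tie.
    return max(counts, key=counts.get)


if __name__ == "__main__":
    # Hypothetical final answers extracted from five sampled paths.
    print(majority_vote(["18", "18 ", "20", None, ""]))
```

In practice, answers would usually be normalized further (for example, comparing `4` and `4.0` numerically) before voting.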

💾 Install

git clone https://github.com/cyzhh/MMOS.git
cd MMOS
conda create -n MMOS python=3.10 
conda activate MMOS
pip install -r requirements.txt

📚 Dataset

To identify the minimal optimal set, we follow these steps:

Sample a sufficient number of correct reasoning paths to form an initial set.
Apply a deduplication algorithm to obtain its deduplicated subset.
Conduct a statistical analysis of the relationship between the upper limit of reasoning paths per question, k, and the subset data amount, N.
Perform SFT on several subsets to analyze the impact of removing duplicates and keeping varied reasoning paths.
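
The deduplication and counting steps above can be sketched as follows. This is only an illustration under stated assumptions, not the repository's deduplication algorithm: it treats two reasoning paths as duplicates when they are identical after collapsing whitespace, and it computes N for a cap k by letting each question contribute at most k unique paths. The names `normalize_path`, `dedup_paths`, and `subset_size` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def normalize_path(path: str) -> str:
    """Canonical form for comparison: trimmed, with whitespace runs collapsed."""
    return re.sub(r"\s+", " ", path.strip())


def dedup_paths(samples: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (question, path) pairs by question and drop duplicate paths.

    The first occurrence of each distinct path is kept, in input order.
    """
    seen = defaultdict(set)
    unique = defaultdict(list)
    for question, path in samples:
        key = normalize_path(path)
        if key not in seen[question]:
            seen[question].add(key)
            unique[question].append(path)
    return dict(unique)


def subset_size(unique: Dict[str, List[str]], k: int) -> int:
    """Data amount N when each question keeps at most k unique paths."""
    return sum(min(k, len(paths)) for paths in unique.values())


if __name__ == "__main__":
    # Hypothetical sampled paths; the second is a whitespace variant of the first.
    samples = [
        ("q1", "x = 1+1"),
        ("q1", "x =  1+1"),
        ("q1", "x = 2"),
        ("q2", "y = 3"),
    ]
    unique = dedup_paths(samples)
    for k in (1, 2, 5):
        print(k, subset_size(unique, k))
```

Plotting N against k in this way shows where raising the cap stops adding data, which is the kind of relationship the statistical analysis step examines.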

We use the ToRA series of models to generate QA pairs from the open-source datasets GSM8K, MATH, and TAL-SCQ. The QA pairs are processed by our deduplication algorithm, resulting in the MMOS dataset. The total number of QA pairs is 135K.

The data, which we publish at 😊 HuggingFace, needs to be placed under the relative path ./train_data/MMOS/.

If you are interested in our work, we will publish details about the data processing aspects after the paper is published.

Create your own MMOS dataset

Follow the steps in scripts/generate.sh:

Prepare your sampling results.
Combine the results.
Extract the correct cases.
Deduplicate the cases.
Optionally filter, then rerank.
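
The combine and extract steps can be sketched like this. It is a minimal illustration, not the logic of scripts/generate.sh: it assumes the sampling results are JSON Lines files whose records carry hypothetical `prediction` and `answer` fields, and it keeps a record when the two match numerically (within a tolerance) or, failing that, as exact strings.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def read_jsonl(paths: Iterable[Path]) -> Iterator[Dict]:
    """Yield records from several JSON Lines files, skipping blank lines."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def is_correct(prediction, answer, tol: float = 1e-6) -> bool:
    """Compare numerically when both sides parse as floats, else as strings."""
    if prediction is None or answer is None:
        return False
    try:
        return abs(float(prediction) - float(answer)) <= tol
    except (TypeError, ValueError):
        return str(prediction).strip() == str(answer).strip()


def extract_true_cases(paths: Iterable[Path]) -> List[Dict]:
    """Combine all result files and keep only the correctly answered records."""
    return [
        record
        for record in read_jsonl(paths)
        if is_correct(record.get("prediction"), record.get("answer"))
    ]
```

A call such as `extract_true_cases(sorted(Path("outputs").glob("*.jsonl")))` would then feed the deduplication step. The `outputs` directory name is a placeholder.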

⚙️ Auto Problem Generator

You can generate a dataset for testing the numerical robustness of a model by running the following scripts:

bash scripts/generate.sh
bash scripts/attack.sh
bash scripts/rerank.sh
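
To illustrate what a numerical-robustness variant of a problem might look like, the sketch below replaces each standalone integer in a problem statement with a random integer of similar magnitude. This is an assumption about the general idea, not a description of what scripts/attack.sh does. The function `perturb_numbers` and its parameters are hypothetical, and the reference answer would still have to be recomputed for each variant.

```python
import random
import re

# Standalone integers only: not part of a word or a decimal such as "3.5".
INTEGER = re.compile(r"(?<![\w.])\d+(?!\.\d|\w)")


def perturb_numbers(problem: str, seed: int = 0,
                    low: float = 0.5, high: float = 2.0) -> str:
    """Replace each standalone integer with a random one in [low*v, high*v].

    The same seed always yields the same variant. Zero is left unchanged,
    and replacements are clamped to be at least 1.
    """
    rng = random.Random(seed)

    def replace(match):
        value = int(match.group())
        if value == 0:
            return "0"
        lo, hi = sorted((max(1, int(value * low)), max(1, int(value * high))))
        return str(rng.randint(lo, hi))

    return INTEGER.sub(replace, problem)


if __name__ == "__main__":
    text = "Tom has 12 apples and buys 3.5 kg of 7 pears."
    for seed in range(3):
        print(perturb_numbers(text, seed=seed))
```

Comparing a model's accuracy on the original problems and on such variants gives a rough signal of how much it depends on the specific numbers it saw during training.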

🚀 Training

Due to resource constraints, we performed supervised fine-tuning only on CodeLLaMA 7B, CodeLLaMA 13B, and CodeLLaMA 34B with our dataset, using A100 40G GPUs. To reproduce our work from CodeLLaMA 7B/13B, train with the commands below. The 34B model can also be trained with the DDP script.

bash scripts/train_single.sh codellama 7b
bash scripts/train.sh codellama 34b

💻 Inference

bash scripts/infer.sh

📜 Citations

If you find this repository helpful, please consider citing our paper:

@misc{chen2024empirical,
      title={An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning}, 
      author={Zui Chen and Yezeng Chen and Jiaqi Han and Zhijie Huang and Ji Qi and Yi Zhou},
      year={2024},
      eprint={2403.00799},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

😇 Acknowledgements

ToRA
