IneqMath
Solving Inequality Proofs with Large Language Models.
Welcome to the official repository for the paper "Solving Inequality Proofs with Large Language Models". With this toolkit you can evaluate your own models on our benchmark, plug in our improvement strategies to boost performance, reproduce the paper's experiments end-to-end, and train more effectively with our data.
Dive in and push the frontier of inequality proving and mathematical reasoning with us! 🚀
📑 Table of Contents
- 💥 News
- 📖 Introduction
- 📊 Dataset Examples
- 🏆 Leaderboard
- 📝 Evaluations on IneqMath
- 🎯 Strategies on IneqMath
- 🧪 Other experiments on IneqMath
- 🤖 Supported LLM Engines
- 📂 Data Curation
- 🤗 Dataset Overview
- 🧐 Fine-grained Informal Judges
- 📈 Evaluation Results
- 🔍 In-depth Study
- 📜 License
- 📚 Citation
<a id="news"></a>
💥 News
- [2025.09.18] 🎉 Excited to announce that our IneqMath paper has been accepted at NeurIPS 2025 as a Spotlight (Top 3% of all submissions)!
- [2025.08.18] 🎯 We've just released our novel reformulation pipeline that transforms inequality proofs into informal yet verifiable subtasks. Plus, our brand-new training data enhancement scripts are here to supercharge your model training. Check out our Data Curation section and start building better data and models today! ✨
- [2025.08.17] 🚀 Supercharge your IneqMath runs with our new improvement strategies—Frequent Theorems as Hints, Frequent Training Problems & Solutions as Hints, and Few-shot Evaluation. Give them a try and let us know if there are other strategies you'd like to see!
- [2025.08.16] 🚀 Updated example scripts across all supported model families—explore them here!
- [2025.08.14] ✨ Released the dev set evaluation scripts and our Final Answer Judge—try them now! Dev set evaluation | Final Answer Judge.
- [2025.08.08] 💥 GPT-5 (medium, 30K) 🥇 Sets New SOTA on IneqMath with overall accuracy 47.0%! Read more on the OpenAI Platform.
- [2025.07.25] ✨ Released our dev set evaluation platform to help you fine-tune models more effectively! Get started: Dev set evaluation platform.
- [2025.07.19] 📄 Added support for vLLM. You can now use any vLLM-supported models and your local checkpoint models. Check out the example scripts for more details.
- [2025.07.18] 🎉 Our paper has been accepted to the 2nd AI for Math Workshop at ICML 2025—read the paper and view the poster!
- [2025.06.16] 💥 o3-pro (40K) 🥇 Sets New SOTA on IneqMath with overall accuracy 46.0%! Read more on the OpenAI Platform.
- [2025.06.11] Our work is featured by Pan Lu on Twitter!
- [2025.06.09] Our paper is now accessible at https://arxiv.org/abs/2506.07927.
<a id="introduction"></a>
📖 Introduction
Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and applying theorems strategically. LLMs' progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an <b>informal yet verifiable</b> task formulation, recasting inequality proving into two automatically checkable subtasks: <b>bound estimation</b> and <b>relation prediction</b>.
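To make the two subtasks concrete, here is a minimal sketch of what such problem instances and a final-answer check might look like. The field names and the normalization logic are illustrative assumptions, not the official IneqMath schema.

```python
# Hypothetical illustration of the two verifiable subtask formats.
# Field names below are assumptions for exposition, not the dataset schema.

bound_estimation = {
    "type": "bound",
    "problem": "Find the largest constant C such that x^2 + y^2 >= C*x*y "
               "for all real x, y.",
    "answer": "C = 2",  # checkable by comparing the extracted constant
}

relation_prediction = {
    "type": "relation",
    "problem": "For positive reals a, b, compare (a + b)/2 and sqrt(a*b).",
    "answer": ">=",  # drawn from a small closed set, hence auto-checkable
}

def check_final_answer(prediction: str, ground_truth: str) -> bool:
    """Minimal final-answer check: exact match after whitespace/case
    normalization (real graders would parse the math expression)."""
    norm = lambda s: s.replace(" ", "").lower()
    return norm(prediction) == norm(ground_truth)
```

Because both answer types reduce to a constant or a relation symbol, correctness of the *final answer* can be verified automatically without a formal proof checker.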
Building on this, we propose <b><span style="color:#103756;">Ineq</span><span style="color:#D03C36;">Math</span></b>, an expert-curated dataset of Olympiad-level inequalities, including a test set and a training corpus enriched with stepwise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four stepwise judges designed to detect common reasoning flaws. Thus, we can assess LLMs’ informal reasoning abilities, where they generally perform well, even though they consistently struggle with formal problem solving.
A systematic evaluation of 29 leading LLMs on <b><span style="color:#103756;">Ineq</span><span style="color:#D03C36;">Math</span></b> reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under stepwise scrutiny, a drop of up to 65.5% from their accuracy when judged on final-answer equivalence alone. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement.
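The gap between the two accuracy notions above follows from how they are scored: a solution counts toward *overall* accuracy only if the final answer is correct **and** every stepwise judge passes. A minimal sketch of this scoring rule (names are hypothetical; the paper uses one final-answer judge plus four stepwise judges for common reasoning flaws):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class JudgeResult:
    """Hypothetical container for the judges' verdicts on one solution."""
    final_answer_correct: bool
    stepwise_verdicts: List[bool] = field(default_factory=list)  # 4 judges

def answer_accuracy(r: JudgeResult) -> bool:
    # Lenient metric: counts a solution correct by final answer alone.
    return r.final_answer_correct

def overall_accuracy(r: JudgeResult) -> bool:
    # Strict metric: the answer AND every reasoning step must pass.
    return r.final_answer_correct and all(r.stepwise_verdicts)
```

A solution with a correct answer but one flawed step passes `answer_accuracy` yet fails `overall_accuracy`, which is exactly the pattern behind the reported drop.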
<a id="dataset-examples"></a>
📊 Dataset Examples
Below are training and testing examples from <b><span style="color:#103756;">Ineq</span><span style="color:#D03C36;">Math</span></b>. Each problem belongs to one of the two automatically checkable subtasks: bound estimation or relation prediction. Each training problem includes up to four stepwise solutions, and 76.8% (962 problems) are annotated with relevant theorems. Each test problem is crafted and reviewed by IMO-level medalists to ensure both originality and difficulty.
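As a quick sanity check on the coverage figure quoted above, the stated 962 annotated problems at 76.8% imply a training corpus of roughly 1252 problems (962 / 0.768 ≈ 1252; the exact corpus size is an assumption inferred from these figures):

```python
# Verify the theorem-annotation coverage quoted in the text.
# `total` is an assumption inferred from 962 / 76.8%, not an official count.
annotated = 962
total = 1252
coverage = annotated / total
print(f"{coverage:.1%}")  # → 76.8%
```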
Training examples of <b><span style="color:#103756;">Ineq</span><span style="color:#D03C36;">Math</span></b>:
