<div align="center"> <h1><img height="150px" src="./images/matharena_icon.png" alt="MathArena"><br>MathArena</h1> <a href="https://www.python.org/"> <img alt="Build" src="https://img.shields.io/badge/Python-3.12-1f425f.svg?color=blue"> </a> <a href="https://opensource.org/licenses/MIT"> <img alt="License: MIT" src="https://img.shields.io/badge/License-MIT-green.svg"> </a> <a href="https://huggingface.co/MathArena"> <img alt="MathArena Datasets" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Matharena-ffc107?color=ffc107&logoColor=white"> </a> </div>

👋 Overview

MathArena (NeurIPS D&B '25) is a platform for evaluating LLMs on the latest math competitions and olympiads. It is hosted at matharena.ai. This repository contains all code used for model evaluation, and this README explains how to run your own models or add a new competition. Logs from our evaluations, including full reasoning traces (where available) and the solutions produced by the models, are on our HuggingFace page: https://huggingface.co/MathArena.

🚀 Installation

MathArena uses UV to manage dependencies. If you want to run local models, uncomment the vllm installation in pyproject.toml.

Install UV

  • macOS and Linux:
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
  • Windows:
    powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
    

Alternative installation

As an alternative to UV, you can also create a conda environment and install the package as follows:

conda create -n matharena python=3.12
conda activate matharena
python -m pip install -e .

If you choose this option, disregard uv run in all instructions and use python directly instead.


🏃 Running an Eval

Execute the following command to evaluate a model on a competition:

uv run python scripts/run.py --comp path/to/competition --models path/to/model1
  • path/to/competition: Relative path from the configs/competition folder to the competition config file (excluding the .yaml extension).
  • path/to/model1: One or more relative paths from the configs/models folder to model config files (excluding the .yaml extension). See Adding a New Model/Agent below for the model config file structure.

Example:

uv run python scripts/run.py --comp aime/aime_2025 --models openai/gpt-4o 

Additional Flags:

  • --n: Number of runs per problem (default: 4).
  • --redo-all: Ignore existing runs for this model and rerun everything (default: false, continues from existing runs found in outputs/).
  • --problems: One-based indices of problems to run (default: runs all problems).
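
If you launch many evaluations from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper names `config_file` and `build_run_command` are hypothetical, and it assumes that several models are passed as space-separated values after `--models` and that `--redo-all` is a bare switch.

```python
from pathlib import Path


def config_file(kind: str, name: str, root: Path = Path("configs")) -> Path:
    """Map a config name such as 'aime/aime_2025' to its YAML file.

    kind is 'competition' or 'models', matching the folders named above.
    """
    return root / kind / f"{name}.yaml"


def build_run_command(comp, models, n=4, redo_all=False):
    """Assemble the argv list for scripts/run.py using only the documented flags."""
    # Assumption: multiple models are given as separate values after --models.
    cmd = ["uv", "run", "python", "scripts/run.py", "--comp", comp, "--models", *models]
    cmd += ["--n", str(n)]
    if redo_all:
        # Assumption: --redo-all is a switch that takes no value.
        cmd.append("--redo-all")
    return cmd


if __name__ == "__main__":
    cmd = build_run_command("aime/aime_2025", ["openai/gpt-4o"], n=2)
    print(" ".join(cmd))
    # To launch the evaluation from the repository root:
    # import subprocess; subprocess.run(cmd, check=True)
```

Checking that `config_file("competition", comp)` exists before launching is a cheap way to catch a mistyped competition name.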

What Does This Do?

This instantiates a Runner (runner.py) which loads competition problems (from HuggingFace or locally) and instantiates a <b>Solver</b> corresponding to either a pure model (solvers/pure_model_solver.py) or an agent (solvers/agent_pool.py). See Adding a New Model/Agent for more details on agents.

The runner prompts the LLM API (api_client.py) to solve each problem n times. Each run is then parsed (parser.py) and graded against the gold solution (grader.py). Finally, all data from runs (runs.py) is normalized into a common API-independent format and saved under outputs/.
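
The query, parse, grade loop described above can be pictured with a toy version. This is only an illustrative sketch and not the code in runner.py, parser.py, or grader.py: the `Run` dataclass, the `\boxed{...}` answer convention, and exact-match grading are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Run:
    """One attempt at one problem, in a provider-independent shape (hypothetical)."""
    problem_idx: int
    response: str
    parsed_answer: Optional[str]
    correct: bool


def parse_answer(response: str) -> Optional[str]:
    """Toy parser: return the contents of the last \\boxed{...}, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def grade(parsed: Optional[str], gold: str) -> bool:
    """Toy grader: exact string match against the gold answer."""
    return parsed is not None and parsed == gold.strip()


def evaluate(problems, query_model: Callable[[str], str], n: int = 4):
    """Query the model n times per problem, then parse and grade each response."""
    runs = []
    for idx, (statement, gold) in enumerate(problems, start=1):
        for _ in range(n):
            response = query_model(statement)
            parsed = parse_answer(response)
            runs.append(Run(idx, response, parsed, grade(parsed, gold)))
    return runs


if __name__ == "__main__":
    # A stand-in "model" that always gives the same reply.
    fake_model = lambda statement: "So the answer is \\boxed{42}."
    for run in evaluate([("What is 6*7?", "42")], fake_model, n=2):
        print(run.problem_idx, run.parsed_answer, run.correct)
```

Keeping the raw response next to the parsed answer is what makes regrading possible later without querying the model again.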

Note: There are several layers of retries during one run, accounting for rate limiting and other API errors. Runs are not accepted if the model fails to report an answer; to make this less common, we reprompt the model one last time if no answer was reported (solver.last_chance). Still, run.py might finish without producing n runs for each problem. In this case, repeat the run; by default it will not repeat the successful runs found in outputs/.
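
To make the retry behaviour concrete, here is a small sketch of the three ideas in the note: retrying failed API calls, reprompting once when no answer is found, and topping up missing runs on a repeated invocation. All function names and the reprompt wording are hypothetical; the repository's own logic lives in api_client.py and the solvers.

```python
import time
from typing import Callable, Optional


def call_with_retries(request: Callable[[], str], max_retries: int = 3,
                      base_delay: float = 0.0) -> str:
    """Retry a flaky API call, doubling the wait after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:  # a real client would catch specific API errors
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt)


def solve_with_last_chance(ask: Callable[[str], str],
                           extract: Callable[[str], Optional[str]],
                           problem: str) -> Optional[str]:
    """Ask once; if no answer can be extracted, reprompt one final time."""
    answer = extract(ask(problem))
    if answer is None:
        # Hypothetical reprompt wording; the real prompt is not shown here.
        answer = extract(ask(problem + "\nState only your final answer."))
    return answer


def runs_still_needed(existing: int, n: int = 4) -> int:
    """Number of runs to add so that a problem ends up with n accepted runs."""
    return max(0, n - existing)
```

With this shape, rerunning after a partial failure only schedules `runs_still_needed(...)` new attempts per problem instead of starting over.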

Updating Runs

The script scripts/regrade.py (run with uv run python scripts/regrade.py) updates saved runs in several ways:

  • Update formatting inconsistencies in serialized runs, most importantly model interactions.
  • Rerun parsing and grading on existing model interactions (useful if parser/grader have been patched after the run).
  • Recompute costs based on token usage (useful if API costs have been updated after the run).
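
As an illustration of the cost recomputation, the sketch below derives a run's cost from token counts and the per-million-token prices (read_cost, write_cost, cache_read_cost) that a model config defines. The function name and the assumption that cached tokens are counted inside the input tokens are ours, not taken from the repository.

```python
def run_cost_usd(input_tokens, output_tokens, read_cost=1.0, write_cost=1.0,
                 cached_input_tokens=0, cache_read_cost=None):
    """Cost of one run in USD, with all prices given per million tokens."""
    if cache_read_cost is None:
        # Documented default: cached input is priced like regular input.
        cache_read_cost = read_cost
    # Assumption: cached tokens are a subset of input_tokens.
    uncached = input_tokens - cached_input_tokens
    total = (uncached * read_cost
             + cached_input_tokens * cache_read_cost
             + output_tokens * write_cost)
    return total / 1_000_000


if __name__ == "__main__":
    # 1M input tokens at $2/M plus 0.5M output tokens at $8/M.
    print(run_cost_usd(1_000_000, 500_000, read_cost=2.0, write_cost=8.0))
```

Because only token counts and prices enter the formula, costs can be refreshed after a price change without touching the stored model interactions.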

For example, to regrade all of euler/euler with default parameters (N=4, all updates), run uv run python scripts/regrade.py --comps euler/euler.

Another useful script is scripts/nuke_single_run.py, which, given a path to a runs file in outputs/, removes the run at a given index.

Tracking Progress and Debugging Runs

There are several ways to track progress and debug runs:

  1. Track files under logs/status which show an updated overview of the progress of all current runs.
  2. Inspect logs/requests, which logs verbatim each request made to an API in api_client.py. As final outputs are postprocessed to a common format, this can be useful for identifying API-specific errors.
  3. Inspect logs/broken_runs for runs which unexpectedly could not be saved.
  4. Launch a local web server that inspects all successful runs saved under outputs/: uv run python scripts/app.py --comp path/to/competition, and access it at http://localhost:5001/. This shows the final answers as well as the full interactions with the model or all steps that an agent took (see for example the runs of GPT-5 Agent on apex/apex_2025). Warning signs next to runs indicate potential problems and should be manually verified. Each warning is caused by one of the following problems:
     • 💀: The parser threw an error or encountered something unexpected.
     • ⚠️: The correct answer might be present in the model answer, but it was not extracted.
     • ❕: The model likely hit the max token limit.

If issues are found, delete all runs for that problem by deleting the corresponding output file or use runs.py:drop_runs for selective removal. After that, call run.py again or only repeat the grading using scripts/regrade.py as described above. If the parser requires a manual overwrite, you can do so in the app by clicking on the run, which will show the model answer and allow you to overwrite the correctness of the parsed final answer.

Uploading Answers to HuggingFace

You can upload the model answers to HuggingFace as follows:

uv run python scripts/curation/upload_outputs.py --org your_org --repo-name your_repo_name --comp path/to/competition

This will upload all model answers to a private repository named your_org/your_repo_name. path/to/competition is the relative path from the configs/competition folder to the competition config file (excluding the .yaml extension).

Project Euler

For Project Euler, several additional steps need to be taken. Please check README_euler.md for full details.


🤖 Adding a New Model/Agent

To add a new model, add a config file in the configs/models folder. Each config must include:

  • Required:
    • model: Model name. Reasoning effort of OpenAI models can be set by appending --[low/medium/high] to the model name, e.g., o3-mini--high.
    • api: API provider. The API key should be defined as an environment variable when using the specified API. The supported options with their corresponding API keys are:
      • xai: XAI_API_KEY
      • openai: OPENAI_API_KEY
      • together: TOGETHER_API_KEY
      • google: GOOGLE_API_KEY
      • anthropic: ANTHROPIC_API_KEY
      • glm: GLM_API_KEY
      • deepseek: DEEPSEEK_API_KEY
      • openrouter: OPENROUTER_API_KEY
      • vllm: (runs locally; no API key required)
    • human_readable_id: A unique, descriptive identifier.
  • Optional Parameters:
    • API settings like temperature, top_p, and top_k.
    • max_tokens: Max number of tokens for the model.
    • concurrent_requests: Number of parallel requests to API (default: 30).
    • timeout: Request timeout in seconds (default: 2000).
    • max_retries: Retry attempts to API (default: 50).
    • read_cost & write_cost: Cost per million tokens in USD for input and output tokens (default: 1 each).
    • cache_read_cost: Cost per million cached input tokens in USD (default: same as read_cost).
    • date: Release date of the model in the format "yyyy-mm-dd".
    • batch_processing: If set to true, the model will be queried using batch processing. Only available for OpenAI and Anthropic models.
    • use_openai_responses_api: If set to true, will use the OpenAI responses API (instead of chat completions).
    • Other model/provider specific parameters (config, provider, reasoning, etc.).
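
Before starting a long evaluation, it can help to sanity-check a model config against the rules above. The sketch below works on a plain dict that mirrors the YAML keys. It is not part of the repository: the helper names are hypothetical, and treating only low, medium, and high as valid effort suffixes is an assumption based on the description of the model field.

```python
import os

# Provider -> environment variable, as listed above (vllm runs locally, no key).
API_KEY_ENV = {
    "xai": "XAI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "together": "TOGETHER_API_KEY",
    "google": "GOOGLE_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "glm": "GLM_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "vllm": None,
}
REQUIRED_KEYS = ("model", "api", "human_readable_id")


def split_reasoning_effort(model: str):
    """Split 'o3-mini--high' into ('o3-mini', 'high'); effort is None if absent."""
    name, sep, effort = model.rpartition("--")
    if sep and effort in ("low", "medium", "high"):
        return name, effort
    return model, None


def check_model_config(config: dict, env=os.environ) -> list:
    """Return a list of problems found; an empty list means the config looks usable."""
    problems = [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in config]
    api = config.get("api")
    if api is not None and api not in API_KEY_ENV:
        problems.append(f"unsupported api: {api}")
    elif api is not None and API_KEY_ENV[api] and API_KEY_ENV[api] not in env:
        problems.append(f"environment variable {API_KEY_ENV[api]} is not set")
    return problems


if __name__ == "__main__":
    cfg = {"model": "o3-mini--high", "api": "openai", "human_readable_id": "o3-mini (high)"}
    print(split_reasoning_effort(cfg["model"]))
    print(check_model_config(cfg))
```

Passing `env` explicitly keeps the check easy to test; in normal use it falls back to the process environment.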

Agents

Agents are defined via top-level config files (see e.g., configs/models/openai/gpt-5-agent.yaml) that point to a pure model config, indicating the underlying LLM API used by the agent, and an agent s
