SLD
[ICLR26] AI-based scaling law discovery
Can Language Models Discover Scaling Laws?
Official repository for the paper: "Can Language Models Discover Scaling Laws?"
📰 News
- [2026.01.26] 🎉 Our paper has been accepted at ICLR 2026!
- [2026.01.20] 📝 Check out our main blog post for an accessible overview of our work!
SLDAgent is an evolution-based AI agent that autonomously discovers scaling laws for large language models. This work introduces SLDBench, the first comprehensive benchmark for scaling law discovery, and demonstrates that AI agents can uncover laws that are more accurate and conceptually sound than their human-derived counterparts.
The agent co-optimizes both the symbolic formula of a scaling law and the parameter-fitting algorithm, enabling it to explore complex relationships and achieve superhuman performance in predicting model behavior at scale.
🔗 Quick Links
| Resource | Link |
| :--- | :--- |
| 📄 Paper | arXiv:2507.21184 |
| 📊 Dataset | SLDBench on Hugging Face |
| 🏆 Leaderboard | linhaowei1.github.io/scaling_law_discovery |
| 🚢 Harbor Adapter | harbor-datasets/sldbench |
| 🔧 OpenEvolve Framework | github.com/codelion/openevolve |
🔬 Overview
Scaling laws are fundamental to understanding and predicting the behavior of large language models as they scale in size, data, and compute. However, discovering these laws has traditionally been a manual, labor-intensive process requiring significant domain expertise.
Key Contributions:
- SLDAgent: An AI agent that autonomously discovers scaling laws through evolutionary search
- SLDBench: A comprehensive benchmark containing 8 diverse scaling law discovery tasks
- Superhuman Performance: Agent-discovered laws outperform human expert baselines on multiple tasks
- Open-Ended Discovery: Agents can discover novel scaling law formulations not present in existing literature
🚢 Running SLDBench on General Code Agents
SLDBench has been integrated as an adapter in Terminal-Bench Harbor, enabling evaluation of general-purpose code agents on scaling law discovery tasks.
To run SLDBench on your own agent:
- Follow the Terminal-Bench documentation: Visit tbench.ai to learn about the Harbor evaluation framework
- Use the SLDBench adapter: The adapter is available at harbor-datasets/sldbench
- Submit to the leaderboard: View results and rankings at our Leaderboard
📦 SLDBench: The Benchmark
SLDBench is the first comprehensive benchmark for scaling law discovery, curated from over 5,000 LLM training experiments from existing research literature. The benchmark evaluates an agent's ability to:
- Analyze experimental data from LLM training runs
- Hypothesize functional forms (power laws, mixture models, etc.)
- Optimize parameters to fit the observed data
- Extrapolate accurately to unseen regimes (larger models, more data, etc.)
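As a toy illustration of this hypothesize → fit → extrapolate loop, here is a minimal sketch on synthetic data. The functional form and all constants (A, alpha, E) are illustrative, not taken from SLDBench:

```python
import numpy as np

# Synthetic "training" data following a saturating power law (illustrative).
A, alpha, E = 400.0, 0.34, 1.7
N = np.logspace(7, 9, 12)             # model sizes seen during fitting
loss = A * N ** -alpha + E            # ground-truth scaling behavior

# Hypothesis: L(N) = A * N^-alpha + E. Grid-search the irreducible loss E;
# for each candidate, (log A, alpha) have a closed-form least-squares fit
# in log space.
best = None
for E_hat in np.linspace(0.0, 3.0, 301):
    resid = loss - E_hat
    if np.any(resid <= 0):
        continue                      # log of non-positive residuals undefined
    slope, intercept = np.polyfit(np.log(N), np.log(resid), 1)
    pred = np.exp(intercept) * N ** slope + E_hat
    sse = float(np.sum((pred - loss) ** 2))
    if best is None or sse < best[0]:
        best = (sse, np.exp(intercept), -slope, E_hat)

_, A_hat, alpha_hat, E_fit = best

# Extrapolate to an unseen, 10x larger model size.
predicted_loss = A_hat * 1e10 ** -alpha_hat + E_fit
```

On clean synthetic data this recovers the generating parameters almost exactly; the benchmark's real task data is noisy and multivariate, which is what makes the search hard.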
Tasks
| Task | Description | Config File |
| :--- | :--- | :--- |
| Parallel Scaling Law | Models the effect of parallelism P and model size N on loss | configs/parallel_scaling_law.yaml |
| Vocabulary Scaling Law | Models unigram-normalized loss as a function of non-vocabulary model size N, vocabulary size V, and dataset size D | configs/vocab_scaling_law.yaml |
| SFT Scaling Law | Models supervised fine-tuning loss based on dataset size D across various base models | configs/sft_scaling_law.yaml |
| Domain Mixture Scaling Law | Models pre-training loss for domains based on their proportion in the training mixture | configs/domain_mixture_scaling_law.yaml |
| MoE Scaling Law | Models loss in relation to network size N and number of experts E in Mixture-of-Experts architectures | configs/moe_scaling_law.yaml |
| Data Constrained Scaling Law | Models pre-training loss as a function of model size N, dataset size D, and unique tokens U | configs/data_constrained_scaling_law.yaml |
| Learning Rate & Batch Size Scaling Law | Models pre-training loss based on learning rate η, batch size b, dataset size D, and network size N | configs/lr_bsz_scaling_law.yaml |
| U-Shaped Scaling Law | An adversarial extrapolation regime probing non-monotonic (U-shaped or double-descent) scaling behaviors | configs/easy_question_scaling_law.yaml |
Dataset: All experimental data is centrally hosted on Hugging Face Hub at pkuHaowei/sldbench.
Evaluation Metrics:
- R² (Coefficient of Determination): Primary metric measuring extrapolation accuracy (1.0 = perfect)
- NMSE (Normalized Mean Squared Error): Secondary error metric
- NMAE (Normalized Mean Absolute Error): Secondary error metric
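For reference, a minimal sketch of these metrics. The normalization convention here (dividing by the variance / mean absolute deviation of the targets) is a common choice but an assumption; evaluator.py defines the authoritative versions:

```python
import numpy as np

def extrapolation_metrics(y_true, y_pred):
    """R^2, NMSE, and NMAE on held-out extrapolation points.

    Normalization is an assumption: NMSE divides squared error by target
    variance, NMAE divides absolute error by mean absolute deviation.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred                 # prediction residuals
    dev = y_true - y_true.mean()          # deviation from the target mean
    r2 = 1.0 - np.sum(err ** 2) / np.sum(dev ** 2)
    nmse = np.mean(err ** 2) / np.mean(dev ** 2)
    nmae = np.mean(np.abs(err)) / np.mean(np.abs(dev))
    return r2, nmse, nmae
```

Under this convention a perfect fit scores (1.0, 0.0, 0.0), while predicting the target mean everywhere scores (0.0, 1.0, 1.0).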
📋 Requirements
- Python 3.13+
- uv package manager (recommended) or pip
- An OpenAI-compatible LLM API key (set OPENAI_API_KEY)
- macOS / Linux / Windows

Note: uv run guarantees commands execute inside a synchronized project environment. If you prefer plain pip, you can adapt the commands accordingly.
🛠️ Installation
Option 1: Using uv (Recommended)
```bash
# Clone the repository
git clone https://github.com/linhaowei1/SLD.git
cd SLD

# Install dependencies
uv sync

# Set your LLM API key
export OPENAI_API_KEY="your_key_here"

# Optional: configure a non-default API endpoint
# export OPENAI_BASE_URL="https://your.openai.compatible.endpoint/v1"
```
Windows (PowerShell):
```powershell
$env:OPENAI_API_KEY="your_key_here"
# $env:OPENAI_BASE_URL="https://your.openai.compatible.endpoint/v1"
```
Option 2: Using pip
```bash
# Clone the repository
git clone https://github.com/linhaowei1/SLD.git
cd SLD

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -U pip
pip install -e .

# Set your API key
export OPENAI_API_KEY="your_key_here"
```
🚀 Quick Start
Run a Single Task
```bash
# Example: Data-Constrained Scaling Law discovery
EVAL_TASK_NAME="data_constrained_scaling_law" \
uv run openevolve-run \
  --config configs/data_constrained_scaling_law.yaml \
  init_program.py evaluator.py \
  --output results/data_constrained_scaling_law/run_1
```
Run All Tasks in Batch
```bash
# Execute all 8 tasks across multiple models
bash scripts/run.sh
```
This will:
- Run each task 5 times per model with different random seeds
- Save outputs to results/{task_name}/{model}/run_{1,2,3,4,5}/
- Store intermediate checkpoints during evolution
- Evaluate and save the best program from each run
📂 Project Structure
```
SLD/
├── configs/                  # Task configuration files
│   ├── data_constrained_scaling_law.yaml
│   ├── domain_mixture_scaling_law.yaml
│   ├── easy_question_scaling_law.yaml
│   ├── lr_bsz_scaling_law.yaml
│   ├── moe_scaling_law.yaml
│   ├── parallel_scaling_law.yaml
│   ├── sft_scaling_law.yaml
│   └── vocab_scaling_law.yaml
├── data_loader.py            # Unified data loading from Hugging Face
├── evaluator.py              # Evaluation system with R², NMSE, NMAE metrics
├── init_program.py           # Initial scaling law template for evolution
├── results/                  # Experiment outputs (auto-generated)
│   └── {task_name}/
│       └── {model}/
│           └── run_{1,2,3,4,5}/
│               ├── checkpoints/  # Evolution checkpoints
│               └── best/         # Best discovered program
├── scripts/
│   └── run.sh                # Batch execution script
├── pyproject.toml            # Python dependencies
├── CONTRIBUTING.md           # Guide for contributing new tasks
└── README.md
```
🏃 Usage
Single Task Execution
```bash
export EVAL_TASK_NAME="data_constrained_scaling_law"

uv run openevolve-run \
  --config configs/data_constrained_scaling_law.yaml \
  init_program.py evaluator.py \
  --output results/data_constrained_scaling_law/run_1
```
Batch Execution
```bash
bash scripts/run.sh
```
Evaluating a Discovered Program
```bash
EVAL_TASK_NAME="data_constrained_scaling_law" \
uv run python evaluator.py \
  results/data_constrained_scaling_law/gpt-5/run_1/best/best_program.py
```
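The artifact being evolved couples a symbolic scaling law with its own parameter-fitting routine (the co-optimization described above). The actual contract is defined by init_program.py; the sketch below only illustrates the idea, with hypothetical names and a deliberately simple random-search fitter standing in for an evolved one:

```python
import numpy as np

# Hypothetical shape of a candidate program: a symbolic law plus the
# algorithm that fits its parameters. Names and signatures here are
# illustrative; the real interface lives in init_program.py.

def scaling_law(X, params):
    """Chinchilla-style form: L(N, D) = A*N^-alpha + B*D^-beta + E."""
    N, D = X
    A, alpha, B, beta, E = params
    return A * N ** -alpha + B * D ** -beta + E

def fit_params(X, y, iters=2000, seed=0):
    """Fit by random search over bounded params (placeholder fitter)."""
    rng = np.random.default_rng(seed)
    lo = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
    hi = np.array([500.0, 1.0, 500.0, 1.0, 3.0])
    best_p, best_err = None, np.inf
    for _ in range(iters):
        p = rng.uniform(lo, hi)
        err = float(np.mean((scaling_law(X, p) - y) ** 2))
        if err < best_err:
            best_p, best_err = p, err
    return best_p
```

Because the formula and the fitter evolve together, a candidate with an elegant form but a brittle fitting routine scores poorly, and vice versa.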
⚙️ Configuration Guide
Customize task behavior by editing YAML config files in configs/. Here's the actual structure used by SLDBench:
Configuration File Structure
```yaml
# Root-level settings
max_iterations: 50        # Number of evolution generations
checkpoint_interval: 1    # Save checkpoint every N iterations
log_level: "INFO"         # Logging verbosity
rand
```
