SLD
[ICLR26] AI-based scaling law discovery
Can Language Models Discover Scaling Laws?
Official repository for the paper: "Can Language Models Discover Scaling Laws?"
📰 News
- [2026.01.26] 🎉 Our paper has been accepted at ICLR 2026!
- [2026.01.20] 📝 Check out our main blog post for an accessible overview of our work!
SLDAgent is an evolution-based AI agent that autonomously discovers scaling laws for large language models. This work introduces SLDBench, the first comprehensive benchmark for scaling law discovery, and demonstrates that AI agents can uncover laws that are more accurate and conceptually sound than their human-derived counterparts.
The agent co-optimizes both the symbolic formula of a scaling law and the parameter-fitting algorithm, enabling it to explore complex relationships and achieve superhuman performance in predicting model behavior at scale.
🔗 Quick Links
| Resource | Link |
| :--- | :--- |
| 📄 Paper | arXiv:2507.21184 |
| 📊 Dataset | SLDBench on Hugging Face |
| 🏆 Leaderboard | linhaowei1.github.io/scaling_law_discovery |
| 🚢 Harbor Adapter | harbor-datasets/sldbench |
| 🔧 OpenEvolve Framework | github.com/codelion/openevolve |
🔬 Overview
Scaling laws are fundamental to understanding and predicting the behavior of large language models as they scale in size, data, and compute. However, discovering these laws has traditionally been a manual, labor-intensive process requiring significant domain expertise.
Key Contributions:
- SLDAgent: An AI agent that autonomously discovers scaling laws through evolutionary search
- SLDBench: A comprehensive benchmark containing 8 diverse scaling law discovery tasks
- Superhuman Performance: Agent-discovered laws outperform human expert baselines on multiple tasks
- Open-Ended Discovery: Agents can discover novel scaling law formulations not present in existing literature
🚢 Running SLDBench on General Code Agents
SLDBench has been integrated as an adapter in Terminal-Bench Harbor, enabling evaluation of general-purpose code agents on scaling law discovery tasks.
To run SLDBench on your own agent:
- Follow the Terminal-Bench documentation: Visit tbench.ai to learn about the Harbor evaluation framework
- Use the SLDBench adapter: The adapter is available at harbor-datasets/sldbench
- Submit to the leaderboard: View results and rankings at our Leaderboard
📦 SLDBench: The Benchmark
SLDBench is the first comprehensive benchmark for scaling law discovery, curated from over 5,000 LLM training experiments from existing research literature. The benchmark evaluates an agent's ability to:
- Analyze experimental data from LLM training runs
- Hypothesize functional forms (power laws, mixture models, etc.)
- Optimize parameters to fit the observed data
- Extrapolate accurately to unseen regimes (larger models, more data, etc.)
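As a toy illustration of this hypothesize → fit → extrapolate loop, here is a minimal sketch on synthetic data. The functional form and all constants (A, alpha, E) are illustrative, not taken from SLDBench:

```python
import numpy as np

# Synthetic "training" data following a saturating power law (illustrative).
A, alpha, E = 400.0, 0.34, 1.7
N = np.logspace(7, 9, 12)             # model sizes seen during fitting
loss = A * N ** -alpha + E            # ground-truth scaling behavior

# Hypothesis: L(N) = A * N^-alpha + E. Grid-search the irreducible loss E;
# for each candidate, (log A, alpha) have a closed-form least-squares fit
# in log space.
best = None
for E_hat in np.linspace(0.0, 3.0, 301):
    resid = loss - E_hat
    if np.any(resid <= 0):
        continue                      # log of non-positive residuals undefined
    slope, intercept = np.polyfit(np.log(N), np.log(resid), 1)
    pred = np.exp(intercept) * N ** slope + E_hat
    sse = float(np.sum((pred - loss) ** 2))
    if best is None or sse < best[0]:
        best = (sse, np.exp(intercept), -slope, E_hat)

_, A_hat, alpha_hat, E_fit = best

# Extrapolate to an unseen, 10x larger model size.
predicted_loss = A_hat * 1e10 ** -alpha_hat + E_fit
```

On clean synthetic data this recovers the generating parameters almost exactly; the benchmark's real task data is noisy and multivariate, which is what makes the search hard.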
Tasks
| Task | Description | Config File |
| :--- | :--- | :--- |
| Parallel Scaling Law | Models the effect of parallelism P and model size N on loss | configs/parallel_scaling_law.yaml |
| Vocabulary Scaling Law | Models unigram-normalized loss as a function of non-vocabulary model size N, vocabulary size V, and dataset size D | configs/vocab_scaling_law.yaml |
| SFT Scaling Law | Models supervised fine-tuning loss based on dataset size D across various base models | configs/sft_scaling_law.yaml |
| Domain Mixture Scaling Law | Models pre-training loss for domains based on their proportion in the training mixture | configs/domain_mixture_scaling_law.yaml |
| MoE Scaling Law | Models loss in relation to network size N and number of experts E in Mixture-of-Experts architectures | configs/moe_scaling_law.yaml |
| Data Constrained Scaling Law | Models pre-training loss as a function of model size N, dataset size D, and unique tokens U | configs/data_constrained_scaling_law.yaml |
| Learning Rate & Batch Size Scaling Law | Models pre-training loss based on learning rate η, batch size b, dataset size D, and network size N | configs/lr_bsz_scaling_law.yaml |
| U-Shaped Scaling Law | An adversarial extrapolation regime probing non-monotonic (U-shaped or double-descent) scaling behaviors | configs/easy_question_scaling_law.yaml |
Dataset: All experimental data is centrally hosted on Hugging Face Hub at pkuHaowei/sldbench.
Evaluation Metrics:
- R² (Coefficient of Determination): Primary metric measuring extrapolation accuracy (1.0 = perfect)
- NMSE (Normalized Mean Squared Error): Secondary error metric
- NMAE (Normalized Mean Absolute Error): Secondary error metric
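For reference, a minimal sketch of these metrics. The normalization convention here (dividing by the variance / mean absolute deviation of the targets) is a common choice but an assumption; evaluator.py defines the authoritative versions:

```python
import numpy as np

def extrapolation_metrics(y_true, y_pred):
    """R^2, NMSE, and NMAE on held-out extrapolation points.

    Normalization is an assumption: NMSE divides squared error by target
    variance, NMAE divides absolute error by mean absolute deviation.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred                 # prediction residuals
    dev = y_true - y_true.mean()          # deviation from the target mean
    r2 = 1.0 - np.sum(err ** 2) / np.sum(dev ** 2)
    nmse = np.mean(err ** 2) / np.mean(dev ** 2)
    nmae = np.mean(np.abs(err)) / np.mean(np.abs(dev))
    return r2, nmse, nmae
```

Under this convention a perfect fit scores (1.0, 0.0, 0.0), while predicting the target mean everywhere scores (0.0, 1.0, 1.0).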
📋 Requirements
- Python 3.13+
- uv package manager (recommended) or pip
- An OpenAI-compatible LLM API key (set OPENAI_API_KEY)
- macOS / Linux / Windows

Note: uv run guarantees commands execute inside a synchronized project environment. If you prefer plain pip, you can adapt the commands accordingly.
🛠️ Installation
Option 1: Using uv (Recommended)
```bash
# Clone the repository
git clone https://github.com/linhaowei1/SLD.git
cd SLD

# Install dependencies
uv sync

# Set your LLM API key
export OPENAI_API_KEY="your_key_here"

# Optional: configure a non-default API endpoint
# export OPENAI_BASE_URL="https://your.openai.compatible.endpoint/v1"
```
Windows (PowerShell):
```powershell
$env:OPENAI_API_KEY="your_key_here"
# $env:OPENAI_BASE_URL="https://your.openai.compatible.endpoint/v1"
```
Option 2: Using pip
```bash
# Clone the repository
git clone https://github.com/linhaowei1/SLD.git
cd SLD

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -U pip
pip install -e .

# Set your API key
export OPENAI_API_KEY="your_key_here"
```
🚀 Quick Start
Run a Single Task
```bash
# Example: Data-Constrained Scaling Law discovery
EVAL_TASK_NAME="data_constrained_scaling_law" \
uv run openevolve-run \
  --config configs/data_constrained_scaling_law.yaml \
  init_program.py evaluator.py \
  --output results/data_constrained_scaling_law/run_1
```
Run All Tasks in Batch
```bash
# Execute all 8 tasks across multiple models
bash scripts/run.sh
```
This will:
- Run each task 5 times per model with different random seeds
- Save outputs to results/{task_name}/{model}/run_{1,2,3,4,5}/
- Store intermediate checkpoints during evolution
- Evaluate and save the best program from each run
📂 Project Structure
```
SLD/
├── configs/                  # Task configuration files
│   ├── data_constrained_scaling_law.yaml
│   ├── domain_mixture_scaling_law.yaml
│   ├── easy_question_scaling_law.yaml
│   ├── lr_bsz_scaling_law.yaml
│   ├── moe_scaling_law.yaml
│   ├── parallel_scaling_law.yaml
│   ├── sft_scaling_law.yaml
│   └── vocab_scaling_law.yaml
├── data_loader.py            # Unified data loading from Hugging Face
├── evaluator.py              # Evaluation system with R², NMSE, NMAE metrics
├── init_program.py           # Initial scaling law template for evolution
├── results/                  # Experiment outputs (auto-generated)
│   └── {task_name}/
│       └── {model}/
│           └── run_{1,2,3,4,5}/
│               ├── checkpoints/  # Evolution checkpoints
│               └── best/         # Best discovered program
├── scripts/
│   └── run.sh                # Batch execution script
├── pyproject.toml            # Python dependencies
├── CONTRIBUTING.md           # Guide for contributing new tasks
└── README.md
```
🏃 Usage
Single Task Execution
```bash
export EVAL_TASK_NAME="data_constrained_scaling_law"

uv run openevolve-run \
  --config configs/data_constrained_scaling_law.yaml \
  init_program.py evaluator.py \
  --output results/data_constrained_scaling_law/run_1
```
Batch Execution
```bash
bash scripts/run.sh
```
Evaluating a Discovered Program
```bash
EVAL_TASK_NAME="data_constrained_scaling_law" \
uv run python evaluator.py \
  results/data_constrained_scaling_law/gpt-5/run_1/best/best_program.py
```
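The artifact being evolved couples a symbolic scaling law with its own parameter-fitting routine (the co-optimization described above). The actual contract is defined by init_program.py; the sketch below only illustrates the idea, with hypothetical names and a deliberately simple random-search fitter standing in for an evolved one:

```python
import numpy as np

# Hypothetical shape of a candidate program: a symbolic law plus the
# algorithm that fits its parameters. Names and signatures here are
# illustrative; the real interface lives in init_program.py.

def scaling_law(X, params):
    """Chinchilla-style form: L(N, D) = A*N^-alpha + B*D^-beta + E."""
    N, D = X
    A, alpha, B, beta, E = params
    return A * N ** -alpha + B * D ** -beta + E

def fit_params(X, y, iters=2000, seed=0):
    """Fit by random search over bounded params (placeholder fitter)."""
    rng = np.random.default_rng(seed)
    lo = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
    hi = np.array([500.0, 1.0, 500.0, 1.0, 3.0])
    best_p, best_err = None, np.inf
    for _ in range(iters):
        p = rng.uniform(lo, hi)
        err = float(np.mean((scaling_law(X, p) - y) ** 2))
        if err < best_err:
            best_p, best_err = p, err
    return best_p
```

Because the formula and the fitter evolve together, a candidate with an elegant form but a brittle fitting routine scores poorly, and vice versa.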
⚙️ Configuration Guide
Customize task behavior by editing YAML config files in configs/. Here's the actual structure used by SLDBench:
Configuration File Structure
```yaml
# Root-level settings
max_iterations: 50        # Number of evolution generations
checkpoint_interval: 1    # Save checkpoint every N iterations
log_level: "INFO"         # Logging verbosity
rand
```
