<div align="center">
  <img width="243.5" height="186.7" alt="ScalingOPT logo" src="https://github.com/user-attachments/assets/d0a7d40e-1be4-4a34-a943-e19caef400d4" />
  <h1>ScalingOPT (LLM)</h1>
  <p>
    <b>Optimizer-centric scaling studies for large language models</b>
    <br/>
    <a href="https://tianshijing.github.io/ScalingOpt/">Project Page</a> · <a href="#quick-start">Quick Start</a> · <a href="#optimizers">30+ Optimizers</a> · <a href="#datasets-and-data-pipelines">Datasets</a> · <a href="#training-recipes">Training Recipes</a> · <a href="#evaluation">Evaluation</a>
  </p>
</div>

ScalingOPT is a research-oriented PyTorch codebase for optimizer-centric scaling studies in large language model (LLM) training. It is part of the broader ScalingOpt community effort and is designed to make optimizer comparisons reproducible, fair, and easy to extend.

Highlights

  • Single entrypoint, 30+ optimizers — switch optimizers with --optimizer <name>, no loop rewriting needed.
  • 17 model configs — LLaMA (9M–13B), GPT-2 (124M), Qwen3 (0.6B–1.7B) with full architecture details.
  • 3 dataset pipelines — C4 (HF streaming), The Pile (local JSONL), OpenWebText (nanoGPT binary).
  • Multi-GPU DDP — native torchrun distributed training out of the box.
  • Single-GPU & low-memory — quantized weight training and per-layer optimizer variants for ≤12 GB VRAM.
  • Post-training — SFT, DPO, GRPO via TRL integration; PPO/REINFORCE++ via OpenRLHF.
  • Evaluation — one-command eval on 14+ benchmarks via lm-evaluation-harness.
  • Logging — Weights & Biases + JSONL; tracks loss, perplexity, LR, throughput, and more.
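Behind the `--optimizer <name>` switch is a string-to-constructor dispatch (the actual factory lives in `utils/setup.py` and `utils/optimizer_factory.py`). The sketch below illustrates the registry pattern in pure Python; the class names and registry contents here are hypothetical stand-ins, not the repo's real classes:

```python
# Illustrative sketch of the name -> optimizer dispatch behind `--optimizer <name>`.
# The registry contents are placeholders; the real mapping in utils/setup.py
# covers 30+ optimizers.

OPTIMIZER_REGISTRY = {}

def register(name):
    """Decorator that records an optimizer class under a CLI name."""
    def wrap(cls):
        OPTIMIZER_REGISTRY[name] = cls
        return cls
    return wrap

@register("adamw")
class AdamW:
    def __init__(self, params, lr=1e-3, weight_decay=0.1):
        self.params, self.lr, self.weight_decay = params, lr, weight_decay

@register("muon")
class Muon:
    def __init__(self, params, lr=1e-3):
        self.params, self.lr = params, lr

def build_optimizer(name, params, **kwargs):
    """Resolve a `--optimizer <name>` string to a constructed optimizer."""
    try:
        cls = OPTIMIZER_REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown optimizer {name!r}; choices: {sorted(OPTIMIZER_REGISTRY)}")
    return cls(params, **kwargs)

opt = build_optimizer("adamw", params=[], lr=1e-3)
print(type(opt).__name__)  # AdamW
```

Because every optimizer hides behind the same constructor signature, swapping methods never requires touching the training loop.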

Prerequisites

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| Python | 3.7+ | 3.10+ |
| PyTorch | 2.0+ | 2.2+ (with BF16 support) |
| GPU | 1× (single-GPU mode) | 4–8× NVIDIA A100/H100 |
| CUDA | 11.8+ | 12.1+ |
| OS | Linux | Ubuntu 22.04+ |

Note: macOS/CPU can be used for code development and debugging, but a GPU is required for actual training.

Installation

Step 1: Clone the repository

git clone https://github.com/OpenEnvision-Lab/ScalingOPT.git
cd ScalingOPT

Step 2: Create a virtual environment (recommended)

conda create -n scalingopt python=3.10 -y
conda activate scalingopt

Or with venv:

python -m venv venv
source venv/bin/activate

Step 3: Install PyTorch

Install PyTorch with CUDA support matching your driver version. Visit pytorch.org for the latest command, for example:

pip install torch --index-url https://download.pytorch.org/whl/cu121

Step 4: Install all dependencies

pip install -r requirements.txt

This installs the full dependency stack: transformers, datasets, wandb, tiktoken, loguru, bitsandbytes, evaluate, tqdm, schedulefree, and more.

Step 5: Install the optimizer library

pip install -e .

This installs scalingopt-torch in editable mode, making all optimizers in scalingopt_torch/ importable.

Verify installation

python -c "import scalingopt_torch; print('scalingopt_torch version:', scalingopt_torch.__version__)"
python -c "import torch; print('PyTorch:', torch.__version__); print('CUDA available:', torch.cuda.is_available())"

Quick Start

Train a LLaMA-60M model on C4 with AdamW (single node, 4 GPUs):

torchrun --standalone --nproc_per_node 4 main_pretrain.py \
  --model_config configs/llama_60m.json \
  --dataset allenai/c4 --dataset_config en \
  --tokenizer t5-base \
  --batch_size 32 --total_batch_size 512 \
  --max_length 256 \
  --lr 1e-3 --warmup_steps 1000 --num_training_steps 10000 \
  --weight_decay 0.1 --grad_clipping 1.0 \
  --optimizer adamw \
  --eval_every 1000 --save_every 5000 \
  --dtype bfloat16
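Under the usual per-device vs. effective batch convention (not a guarantee about this repo's internals), `--batch_size 32` and `--total_batch_size 512` on 4 GPUs imply 512 / (32 × 4) = 4 gradient-accumulation steps per optimizer update. A quick sanity check:

```python
def accumulation_steps(total_batch_size, per_device_batch, world_size):
    """Effective batch = per_device_batch * world_size * accumulation steps."""
    global_per_step = per_device_batch * world_size
    assert total_batch_size % global_per_step == 0, (
        "total_batch_size must be divisible by batch_size * num_gpus")
    return total_batch_size // global_per_step

# Quick Start settings: 32 per GPU, 4 GPUs, effective batch 512
print(accumulation_steps(512, 32, 4))  # 4
```

Keeping `total_batch_size` fixed while varying GPU count keeps runs comparable across hardware setups.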

Want to try a different optimizer? Just change --optimizer:

# APOLLO
--optimizer apollo_adamw --rank 256 --scale_type channel --proj random --update_proj_gap 200 --apollo_scale 1

# Muon
--optimizer muon

# Adam-Mini
--optimizer adam_mini
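The baseline `adamw` recipe above uses `--lr 1e-3 --weight_decay 0.1`. For reference, here is a scalar sketch of the AdamW update rule with decoupled weight decay (the textbook formula, not this repo's implementation):

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on a scalar parameter p with gradient g.

    Returns the new (p, m, v). Weight decay is decoupled: applied
    directly to p rather than folded into the gradient.
    """
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * weight_decay * p          # decoupled decay
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adamw_step(p, g=0.5, m=m, v=v, t=t)
print(p)  # slightly below 1.0 after three small steps
```

Every optimizer in the table below replaces some part of this update: the moment estimates, the preconditioner, or the memory layout of the optimizer state.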

Repository Structure

ScalingOPT/
├── main_pretrain.py                 # Pretraining entrypoint (DDP via torchrun)
├── main_sft.py                      # SFT entrypoint (TRL SFTTrainer, full + LoRA)
├── main_dpo.py                      # DPO entrypoint (TRL DPOTrainer, full + LoRA)
├── main_grpo.py                     # GRPO entrypoint (TRL GRPOTrainer, full + LoRA)
├── main_eval.py                     # Evaluation on popular benchmarks (lm-eval-harness)
├── setup.py                         # Package setup for scalingopt-torch
├── requirements.txt                 # All dependencies (merged, deduplicated)
│
├── configs/                         # Model architecture configs (JSON)
│   ├── llama_9m.json ... llama_13b.json    # LLaMA: 9M to 13B params
│   ├── gpt2_124m.json                      # GPT-2: 124M params
│   └── qwen3_0.6b.json, qwen3_1.7b.json   # Qwen3: 0.6B to 1.7B params
│
├── scalingopt_torch/                # Optimizer library (pip install -e .)
│   ├── __init__.py                  #   Exports all optimizer classes (v1.0.3)
│   ├── adamw.py, adamw8bit.py       #   GaLore AdamW / 8-bit variants
│   ├── adafactor.py, adam_mini.py   #   Adafactor / Adam-Mini
│   ├── apollo.py, q_apollo.py       #   APOLLO / Quantized APOLLO
│   ├── muon.py, moonlight.py, mano.py  # Muon / Moonlight / Mano
│   ├── soap.py, shampoo.py, sso.py  #   Second-order methods
│   ├── mars.py, mars_m.py           #   MARS / MARS-Muon
│   ├── spam.py, stable_spam.py      #   Sparse momentum methods
│   ├── lamb.py, lars.py             #   Large-batch optimizers
│   ├── lomo.py, adalomo.py          #   Low-memory optimizers
│   ├── conda.py, conda_projector.py #   Compressed gradient projection
│   ├── prodigy.py, sophia.py, ...   #   Adaptive LR methods
│   └── *_projector.py               #   SVD / random projection utilities
│
├── utils/                           # Training infrastructure
│   ├── optimizer_factory.py         #   Standalone optimizer factory (any framework)
│   ├── argparse.py                  #   CLI argument parsing
│   ├── dataloader.py                #   Dataset loading & tokenization
│   ├── setup.py                     #   Model & optimizer construction
│   ├── eval.py                      #   Evaluation utilities
│   ├── training_utils.py            #   Schedulers & helpers
│   ├── modeling_llama.py            #   Local LLaMA implementation
│   ├── quantization.py              #   Int8 weight quantization
│   └── fake_quantization.py         #   Simulated quantization
│
├── data/
│   └── openwebtext/
│       └── prepare.py               # OpenWebText → train.bin / val.bin
│
├── scripts/                         # Ready-to-run experiment scripts
│   ├── pretrain_c4/                 #   C4 pretraining (LLaMA configs)
│   ├── pretrain_pile/               #   Pile pretraining (Qwen configs)
│   ├── pretrain_openwebtext/        #   OpenWebText pretraining (GPT-2)
│   ├── single_gpu/                  #   Single-GPU / low-memory runs
│   ├── sft_trl/                     #   SFT scripts (full + LoRA)
│   ├── dpo_trl/                     #   DPO scripts (full + LoRA)
│   ├── grpo_trl/                    #   GRPO scripts (full + LoRA)
│   ├── ppo_openrlhf/               #   OpenRLHF PPO / GRPO / REINFORCE++
│   ├── eval/                        #   Evaluation scripts (lm-eval-harness)
│   └── example.sh                   #   Checkpoint resume example
│
├── LICENSE                          # CC BY-NC 4.0
├── NOTICE
└── THIRD_PARTY_NOTICES.md           # Upstream sources & licenses

Optimizers

All optimizers are selected via --optimizer <name> in main_pretrain.py. The authoritative list is in utils/setup.py.

Supported Optimizers

| Category | Optimizer Name(s) | Description |
|----------|-------------------|-------------|
| Baselines | adam, adamw, sgd, adafactor, adam8bit | Standard first-order methods |
| GaLore Family | galore_adamw, galore_adafactor, galore_adamw8bit | Gradient Low-Rank Projection |
| GaLore Per-Layer | galore_adamw8bit_per_layer | Layer-wise GaLore (saves memory) |
| APOLLO Family | apollo_adamw, q_apollo, q_apollo_per_layer | Approximate Gradient Scaling |
| Muon-based | muon, moonlight, mano | Orthogonal / matrix optimization |
| Second-order | soap, shampoo, sso, root | Preconditioned methods |
| Variance-reduced | mars, mars_m | MARS / MARS-Muon hybrid |
| Adaptive | adam_mini, ademamix, came, sophia, prodigy | Advanced adaptive LR methods |
| Large-batch | adan, lamb, lars | Designed for large-batch training |
| Low-memory | lomo, adalomo | Low-Memory Optimization |
| Sparse | spam, stable_spam | Sparse momentum methods |
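Whatever the optimizer, the Quick Start recipe pairs it with `--warmup_steps 1000 --num_training_steps 10000`; the schedulers themselves live in `utils/training_utils.py`. Below is a generic linear-warmup + cosine-decay sketch, a common default in pretraining codebases of this kind, though not necessarily this repo's exact formula:

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (base_lr - min_lr) * cosine

# With the Quick Start settings (lr 1e-3, 1000 warmup, 10000 total steps):
print(warmup_cosine_lr(499, 1e-3, 1000, 10000))    # mid-warmup: 5e-4
print(warmup_cosine_lr(10000, 1e-3, 1000, 10000))  # end of training: 0.0
```

Note that some methods in the table (e.g. prodigy, or schedule-free variants from the `schedulefree` dependency) set or adapt the learning rate themselves, so the external schedule matters less for them.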
