ScalingOPT
ScalingOPT is a research-oriented PyTorch codebase for optimizer-centric scaling studies in large language model (LLM) training. It is part of the broader ScalingOpt community effort and is designed to make optimizer comparisons reproducible, fair, and easy to extend.
Highlights
- Single entrypoint, 30+ optimizers — switch optimizers with --optimizer <name>; no training-loop rewrites needed.
- 17 model configs — LLaMA (9M–13B), GPT-2 (124M), Qwen3 (0.6B–1.7B) with full architecture details.
- 3 dataset pipelines — C4 (HF streaming), The Pile (local JSONL), OpenWebText (nanoGPT binary).
- Multi-GPU DDP — native torchrun distributed training out of the box.
- Single-GPU & low-memory — quantized weight training and per-layer optimizer variants for ≤12 GB VRAM.
- Post-training — SFT, DPO, GRPO via TRL integration; PPO/REINFORCE++ via OpenRLHF.
- Evaluation — one-command eval on 14+ benchmarks via lm-evaluation-harness.
- Logging — Weights & Biases + JSONL; tracks loss, perplexity, LR, throughput, and more.
Table of Contents
- Prerequisites
- Installation
- Quick Start
- Repository Structure
- Optimizers
- Datasets and Data Pipelines
- Model Configurations
- Training Recipes
- SFT / DPO / GRPO Training
- Evaluation
- Full CLI Reference
- Checkpointing and Resuming
- Logging
- License
- Citation and Attribution
Prerequisites
| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| Python | 3.7+ | 3.10+ |
| PyTorch | 2.0+ | 2.2+ (with BF16 support) |
| GPU | 1× (single-GPU mode) | 4–8× NVIDIA A100/H100 |
| CUDA | 11.8+ | 12.1+ |
| OS | Linux | Ubuntu 22.04+ |
Note: macOS/CPU can be used for code development and debugging, but a GPU is required for actual training.
Installation
Step 1: Clone the repository
```bash
git clone https://github.com/OpenEnvision-Lab/ScalingOPT.git
cd ScalingOPT
```
Step 2: Create a virtual environment (recommended)
```bash
conda create -n scalingopt python=3.10 -y
conda activate scalingopt
```
Or with venv:
```bash
python -m venv venv
source venv/bin/activate
```
Step 3: Install PyTorch
Install PyTorch with CUDA support matching your driver version. Visit pytorch.org for the latest command, for example:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
Step 4: Install all dependencies
```bash
pip install -r requirements.txt
```
This installs the full dependency stack: transformers, datasets, wandb, tiktoken, loguru, bitsandbytes, evaluate, tqdm, schedulefree, and more.
Step 5: Install the optimizer library
```bash
pip install -e .
```
This installs scalingopt-torch in editable mode, making all optimizers in scalingopt_torch/ importable.
Verify installation
```bash
python -c "import scalingopt_torch; print('scalingopt_torch version:', scalingopt_torch.__version__)"
python -c "import torch; print('PyTorch:', torch.__version__); print('CUDA available:', torch.cuda.is_available())"
```
Quick Start
Train a LLaMA-60M model on C4 with AdamW (single node, 4 GPUs):
```bash
torchrun --standalone --nproc_per_node 4 main_pretrain.py \
    --model_config configs/llama_60m.json \
    --dataset allenai/c4 --dataset_config en \
    --tokenizer t5-base \
    --batch_size 32 --total_batch_size 512 \
    --max_length 256 \
    --lr 1e-3 --warmup_steps 1000 --num_training_steps 10000 \
    --weight_decay 0.1 --grad_clipping 1.0 \
    --optimizer adamw \
    --eval_every 1000 --save_every 5000 \
    --dtype bfloat16
```
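With --batch_size 32 per GPU on 4 GPUs and --total_batch_size 512, the launcher implies gradient accumulation. A quick sanity check of the batch arithmetic, assuming the common DDP convention that total = per-GPU batch × world size × accumulation steps (check the repo's utils for its exact rule):

```python
def accumulation_steps(total_batch_size, per_gpu_batch, world_size):
    # Assumes total = per_gpu_batch * world_size * accumulation_steps,
    # the usual DDP convention; illustrative, not the repo's exact code.
    eff = per_gpu_batch * world_size
    if total_batch_size % eff:
        raise ValueError("total_batch_size must be divisible by batch_size * world_size")
    return total_batch_size // eff

steps = accumulation_steps(512, 32, 4)
tokens_per_step = 512 * 256            # total_batch_size * max_length
print(steps, tokens_per_step)          # -> 4 131072
print(tokens_per_step * 10_000)        # tokens seen over 10k steps -> 1310720000
```

So the quick-start run above accumulates over 4 micro-batches per step and sees roughly 1.3B tokens in total.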
Want to try a different optimizer? Just change --optimizer:
```bash
# APOLLO
--optimizer apollo_adamw --rank 256 --scale_type channel --proj random --update_proj_gap 200 --apollo_scale 1

# Muon
--optimizer muon

# Adam-Mini
--optimizer adam_mini
```
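The quick-start flags also set --warmup_steps 1000 and --num_training_steps 10000. A common interpretation of these two flags is linear warmup followed by cosine decay; a sketch under that assumption (illustrative only; the repo's actual schedulers live in utils/training_utils.py and may differ):

```python
import math

def lr_at(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000, min_lr=0.0):
    # Linear warmup to base_lr, then cosine decay to min_lr.
    # Hypothetical helper, not ScalingOPT's actual scheduler API.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# LR ramps up during warmup, peaks at base_lr, then decays toward min_lr.
print(lr_at(0), lr_at(999), lr_at(5500), lr_at(10000))
```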
Repository Structure
```
ScalingOPT/
├── main_pretrain.py              # Pretraining entrypoint (DDP via torchrun)
├── main_sft.py                   # SFT entrypoint (TRL SFTTrainer, full + LoRA)
├── main_dpo.py                   # DPO entrypoint (TRL DPOTrainer, full + LoRA)
├── main_grpo.py                  # GRPO entrypoint (TRL GRPOTrainer, full + LoRA)
├── main_eval.py                  # Evaluation on popular benchmarks (lm-eval-harness)
├── setup.py                      # Package setup for scalingopt-torch
├── requirements.txt              # All dependencies (merged, deduplicated)
│
├── configs/                      # Model architecture configs (JSON)
│   ├── llama_9m.json ... llama_13b.json   # LLaMA: 9M to 13B params
│   ├── gpt2_124m.json                     # GPT-2: 124M params
│   └── qwen3_0.6b.json, qwen3_1.7b.json   # Qwen3: 0.6B to 1.7B params
│
├── scalingopt_torch/             # Optimizer library (pip install -e .)
│   ├── __init__.py               # Exports all optimizer classes (v1.0.3)
│   ├── adamw.py, adamw8bit.py            # GaLore AdamW / 8-bit variants
│   ├── adafactor.py, adam_mini.py        # Adafactor / Adam-Mini
│   ├── apollo.py, q_apollo.py            # APOLLO / Quantized APOLLO
│   ├── muon.py, moonlight.py, mano.py    # Muon / Moonlight / Mano
│   ├── soap.py, shampoo.py, sso.py       # Second-order methods
│   ├── mars.py, mars_m.py                # MARS / MARS-Muon
│   ├── spam.py, stable_spam.py           # Sparse momentum methods
│   ├── lamb.py, lars.py                  # Large-batch optimizers
│   ├── lomo.py, adalomo.py               # Low-memory optimizers
│   ├── conda.py, conda_projector.py      # Compressed gradient projection
│   ├── prodigy.py, sophia.py, ...        # Adaptive LR methods
│   └── *_projector.py                    # SVD / random projection utilities
│
├── utils/                        # Training infrastructure
│   ├── optimizer_factory.py      # Standalone optimizer factory (any framework)
│   ├── argparse.py               # CLI argument parsing
│   ├── dataloader.py             # Dataset loading & tokenization
│   ├── setup.py                  # Model & optimizer construction
│   ├── eval.py                   # Evaluation utilities
│   ├── training_utils.py         # Schedulers & helpers
│   ├── modeling_llama.py         # Local LLaMA implementation
│   ├── quantization.py           # Int8 weight quantization
│   └── fake_quantization.py      # Simulated quantization
│
├── data/
│   └── openwebtext/
│       └── prepare.py            # OpenWebText → train.bin / val.bin
│
├── scripts/                      # Ready-to-run experiment scripts
│   ├── pretrain_c4/              # C4 pretraining (LLaMA configs)
│   ├── pretrain_pile/            # Pile pretraining (Qwen configs)
│   ├── pretrain_openwebtext/     # OpenWebText pretraining (GPT-2)
│   ├── single_gpu/               # Single-GPU / low-memory runs
│   ├── sft_trl/                  # SFT scripts (full + LoRA)
│   ├── dpo_trl/                  # DPO scripts (full + LoRA)
│   ├── grpo_trl/                 # GRPO scripts (full + LoRA)
│   ├── ppo_openrlhf/             # OpenRLHF PPO / GRPO / REINFORCE++
│   ├── eval/                     # Evaluation scripts (lm-eval-harness)
│   └── example.sh                # Checkpoint resume example
│
├── LICENSE                       # CC BY-NC 4.0
├── NOTICE
└── THIRD_PARTY_NOTICES.md        # Upstream sources & licenses
```
Optimizers
All optimizers are selected via --optimizer <name> in main_pretrain.py. The authoritative list is in utils/setup.py.
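Selection by name usually boils down to a name-to-constructor registry. A minimal sketch of that pattern (class and function names here are illustrative stubs, not ScalingOPT's actual API; the real mapping lives in utils/setup.py):

```python
# Minimal name-to-constructor registry, the pattern behind a single
# --optimizer <name> flag. Illustrative stubs only.
OPTIMIZER_REGISTRY = {}

def register(name):
    def deco(cls):
        OPTIMIZER_REGISTRY[name] = cls
        return cls
    return deco

@register("adamw")
class AdamWStub:
    def __init__(self, params, lr=1e-3, weight_decay=0.1):
        self.params, self.lr, self.weight_decay = params, lr, weight_decay

@register("muon")
class MuonStub:
    def __init__(self, params, lr=1e-3):
        self.params, self.lr = params, lr

def build_optimizer(name, params, **kwargs):
    # Fail loudly with the list of valid names, like a CLI would.
    try:
        return OPTIMIZER_REGISTRY[name](params, **kwargs)
    except KeyError:
        raise ValueError(f"unknown optimizer {name!r}; choose from {sorted(OPTIMIZER_REGISTRY)}")

opt = build_optimizer("muon", params=[], lr=3e-4)
print(type(opt).__name__)  # -> MuonStub
```

Adding a new optimizer then means registering one class; the training loop never changes.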
Supported Optimizers
| Category | Optimizer Name(s) | Description |
|----------|-------------------|-------------|
| Baselines | adam, adamw, sgd, adafactor, adam8bit | Standard first-order methods |
| GaLore Family | galore_adamw, galore_adafactor, galore_adamw8bit | Gradient Low-Rank Projection |
| GaLore Per-Layer | galore_adamw8bit_per_layer | Layer-wise GaLore (saves memory) |
| APOLLO Family | apollo_adamw, q_apollo, q_apollo_per_layer | Approximate Gradient Scaling |
| Muon-based | muon, moonlight, mano | Orthogonal / matrix optimization |
| Second-order | soap, shampoo, sso, root | Preconditioned methods |
| Variance-reduced | mars, mars_m | MARS / MARS-Muon hybrid |
| Adaptive | adam_mini, ademamix, came, sophia, prodigy | Advanced adaptive LR methods |
| Large-batch | adan, lamb, lars | Designed for large-batch training |
| Low-memory | lomo, adalomo | Low-Memory Optimization |
| Sparse | spam, stable_spam | Sparse momentum methods |
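To see why low-rank methods such as the GaLore family save memory, compare optimizer-state footprints for an m×n weight matrix: AdamW keeps two full-size moment buffers, while a rank-r projection keeps moments only in the projected space. Back-of-envelope accounting (illustrative, not the repo's exact bookkeeping):

```python
def adamw_state_values(m, n):
    # AdamW keeps two moment buffers (exp_avg, exp_avg_sq), one value per parameter.
    return 2 * m * n

def lowrank_state_values(m, n, rank):
    # GaLore-style rank-r projection: moments live in the m x rank projected
    # space, plus an n x rank projection matrix. Rough accounting only.
    return 2 * m * rank + n * rank

m, n, r = 4096, 4096, 256
ratio = lowrank_state_values(m, n, r) / adamw_state_values(m, n)
print(f"optimizer-state ratio vs AdamW: {ratio:.4f}")  # -> 0.0938
```

At rank 256 on a 4096×4096 matrix, optimizer state shrinks to under a tenth of AdamW's, which is why --rank is the key memory/quality knob for these methods.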