<div align="center">
  <img width="243.5" height="186.7" alt="ScalingOPT logo" src="https://github.com/user-attachments/assets/d0a7d40e-1be4-4a34-a943-e19caef400d4" />
  <h1>ScalingOPT (LLM)</h1>
  <p>
    <b>Optimizer-centric scaling studies for large language models</b>
    <br/>
    <a href="https://tianshijing.github.io/ScalingOpt/">Project Page</a> · <a href="#quick-start">Quick Start</a> · <a href="#optimizers">30+ Optimizers</a> · <a href="#datasets-and-data-pipelines">Datasets</a> · <a href="#training-recipes">Training Recipes</a> · <a href="#evaluation">Evaluation</a>
  </p>
</div>

ScalingOPT is a research-oriented PyTorch codebase for optimizer-centric scaling studies in large language model (LLM) training. It is part of the broader ScalingOpt community effort and is designed to make optimizer comparisons reproducible, fair, and easy to extend.

Highlights

  • Single entrypoint, 30+ optimizers — switch optimizers with --optimizer <name>, no loop rewriting needed.
  • 17 model configs — LLaMA (9M–13B), GPT-2 (124M), Qwen3 (0.6B–1.7B) with full architecture details.
  • 3 dataset pipelines — C4 (HF streaming), The Pile (local JSONL), OpenWebText (nanoGPT binary).
  • Multi-GPU DDP — native torchrun distributed training out of the box.
  • Single-GPU & low-memory — quantized weight training and per-layer optimizer variants for ≤12 GB VRAM.
  • Post-training — SFT, DPO, GRPO via TRL integration; PPO/REINFORCE++ via OpenRLHF.
  • Evaluation — one-command eval on 14+ benchmarks via lm-evaluation-harness.
  • Logging — Weights & Biases + JSONL; tracks loss, perplexity, LR, throughput, and more.
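Behind the `--optimizer <name>` switch is a string-to-constructor dispatch (the actual factory lives in `utils/setup.py` and `utils/optimizer_factory.py`). The sketch below illustrates the registry pattern in pure Python; the class names and registry contents here are hypothetical stand-ins, not the repo's real classes:

```python
# Illustrative sketch of the name -> optimizer dispatch behind `--optimizer <name>`.
# The registry contents are placeholders; the real mapping in utils/setup.py
# covers 30+ optimizers.

OPTIMIZER_REGISTRY = {}

def register(name):
    """Decorator that records an optimizer class under a CLI name."""
    def wrap(cls):
        OPTIMIZER_REGISTRY[name] = cls
        return cls
    return wrap

@register("adamw")
class AdamW:
    def __init__(self, params, lr=1e-3, weight_decay=0.1):
        self.params, self.lr, self.weight_decay = params, lr, weight_decay

@register("muon")
class Muon:
    def __init__(self, params, lr=1e-3):
        self.params, self.lr = params, lr

def build_optimizer(name, params, **kwargs):
    """Resolve a `--optimizer <name>` string to a constructed optimizer."""
    try:
        cls = OPTIMIZER_REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown optimizer {name!r}; choices: {sorted(OPTIMIZER_REGISTRY)}")
    return cls(params, **kwargs)

opt = build_optimizer("adamw", params=[], lr=1e-3)
print(type(opt).__name__)  # AdamW
```

Because every optimizer hides behind the same constructor signature, swapping methods never requires touching the training loop.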

Prerequisites

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| Python | 3.7+ | 3.10+ |
| PyTorch | 2.0+ | 2.2+ (with BF16 support) |
| GPU | 1× (single-GPU mode) | 4–8× NVIDIA A100/H100 |
| CUDA | 11.8+ | 12.1+ |
| OS | Linux | Ubuntu 22.04+ |

Note: macOS/CPU can be used for code development and debugging, but a GPU is required for actual training.

Installation

Step 1: Clone the repository

git clone https://github.com/OpenEnvision-Lab/ScalingOPT.git
cd ScalingOPT

Step 2: Create a virtual environment (recommended)

conda create -n scalingopt python=3.10 -y
conda activate scalingopt

Or with venv:

python -m venv venv
source venv/bin/activate

Step 3: Install PyTorch

Install PyTorch with CUDA support matching your driver version. Visit pytorch.org for the latest command, for example:

pip install torch --index-url https://download.pytorch.org/whl/cu121

Step 4: Install all dependencies

pip install -r requirements.txt

This installs the full dependency stack: transformers, datasets, wandb, tiktoken, loguru, bitsandbytes, evaluate, tqdm, schedulefree, and more.

Step 5: Install the optimizer library

pip install -e .

This installs scalingopt-torch in editable mode, making all optimizers in scalingopt_torch/ importable.

Verify installation

python -c "import scalingopt_torch; print('scalingopt_torch version:', scalingopt_torch.__version__)"
python -c "import torch; print('PyTorch:', torch.__version__); print('CUDA available:', torch.cuda.is_available())"

Quick Start

Train a LLaMA-60M model on C4 with AdamW (single node, 4 GPUs):

torchrun --standalone --nproc_per_node 4 main_pretrain.py \
  --model_config configs/llama_60m.json \
  --dataset allenai/c4 --dataset_config en \
  --tokenizer t5-base \
  --batch_size 32 --total_batch_size 512 \
  --max_length 256 \
  --lr 1e-3 --warmup_steps 1000 --num_training_steps 10000 \
  --weight_decay 0.1 --grad_clipping 1.0 \
  --optimizer adamw \
  --eval_every 1000 --save_every 5000 \
  --dtype bfloat16
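Under the usual per-device vs. effective batch convention (not a guarantee about this repo's internals), `--batch_size 32` and `--total_batch_size 512` on 4 GPUs imply 512 / (32 × 4) = 4 gradient-accumulation steps per optimizer update. A quick sanity check:

```python
def accumulation_steps(total_batch_size, per_device_batch, world_size):
    """Effective batch = per_device_batch * world_size * accumulation steps."""
    global_per_step = per_device_batch * world_size
    assert total_batch_size % global_per_step == 0, (
        "total_batch_size must be divisible by batch_size * num_gpus")
    return total_batch_size // global_per_step

# Quick Start settings: 32 per GPU, 4 GPUs, effective batch 512
print(accumulation_steps(512, 32, 4))  # 4
```

Keeping `total_batch_size` fixed while varying GPU count keeps runs comparable across hardware setups.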

Want to try a different optimizer? Just change --optimizer:

# APOLLO
--optimizer apollo_adamw --rank 256 --scale_type channel --proj random --update_proj_gap 200 --apollo_scale 1

# Muon
--optimizer muon

# Adam-Mini
--optimizer adam_mini
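The baseline `adamw` recipe above uses `--lr 1e-3 --weight_decay 0.1`. For reference, here is a scalar sketch of the AdamW update rule with decoupled weight decay (the textbook formula, not this repo's implementation):

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on a scalar parameter p with gradient g.

    Returns the new (p, m, v). Weight decay is decoupled: applied
    directly to p rather than folded into the gradient.
    """
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * weight_decay * p          # decoupled decay
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adamw_step(p, g=0.5, m=m, v=v, t=t)
print(p)  # slightly below 1.0 after three small steps
```

Every optimizer in the table below replaces some part of this update: the moment estimates, the preconditioner, or the memory layout of the optimizer state.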

Repository Structure

ScalingOPT/
├── main_pretrain.py                 # Pretraining entrypoint (DDP via torchrun)
├── main_sft.py                      # SFT entrypoint (TRL SFTTrainer, full + LoRA)
├── main_dpo.py                      # DPO entrypoint (TRL DPOTrainer, full + LoRA)
├── main_grpo.py                     # GRPO entrypoint (TRL GRPOTrainer, full + LoRA)
├── main_eval.py                     # Evaluation on popular benchmarks (lm-eval-harness)
├── setup.py                         # Package setup for scalingopt-torch
├── requirements.txt                 # All dependencies (merged, deduplicated)
│
├── configs/                         # Model architecture configs (JSON)
│   ├── llama_9m.json ... llama_13b.json    # LLaMA: 9M to 13B params
│   ├── gpt2_124m.json                      # GPT-2: 124M params
│   └── qwen3_0.6b.json, qwen3_1.7b.json   # Qwen3: 0.6B to 1.7B params
│
├── scalingopt_torch/                # Optimizer library (pip install -e .)
│   ├── __init__.py                  #   Exports all optimizer classes (v1.0.3)
│   ├── adamw.py, adamw8bit.py       #   GaLore AdamW / 8-bit variants
│   ├── adafactor.py, adam_mini.py   #   Adafactor / Adam-Mini
│   ├── apollo.py, q_apollo.py       #   APOLLO / Quantized APOLLO
│   ├── muon.py, moonlight.py, mano.py  # Muon / Moonlight / Mano
│   ├── soap.py, shampoo.py, sso.py  #   Second-order methods
│   ├── mars.py, mars_m.py           #   MARS / MARS-Muon
│   ├── spam.py, stable_spam.py      #   Sparse momentum methods
│   ├── lamb.py, lars.py             #   Large-batch optimizers
│   ├── lomo.py, adalomo.py          #   Low-memory optimizers
│   ├── conda.py, conda_projector.py #   Compressed gradient projection
│   ├── prodigy.py, sophia.py, ...   #   Adaptive LR methods
│   └── *_projector.py               #   SVD / random projection utilities
│
├── utils/                           # Training infrastructure
│   ├── optimizer_factory.py         #   Standalone optimizer factory (any framework)
│   ├── argparse.py                  #   CLI argument parsing
│   ├── dataloader.py                #   Dataset loading & tokenization
│   ├── setup.py                     #   Model & optimizer construction
│   ├── eval.py                      #   Evaluation utilities
│   ├── training_utils.py            #   Schedulers & helpers
│   ├── modeling_llama.py            #   Local LLaMA implementation
│   ├── quantization.py              #   Int8 weight quantization
│   └── fake_quantization.py         #   Simulated quantization
│
├── data/
│   └── openwebtext/
│       └── prepare.py               # OpenWebText → train.bin / val.bin
│
├── scripts/                         # Ready-to-run experiment scripts
│   ├── pretrain_c4/                 #   C4 pretraining (LLaMA configs)
│   ├── pretrain_pile/               #   Pile pretraining (Qwen configs)
│   ├── pretrain_openwebtext/        #   OpenWebText pretraining (GPT-2)
│   ├── single_gpu/                  #   Single-GPU / low-memory runs
│   ├── sft_trl/                     #   SFT scripts (full + LoRA)
│   ├── dpo_trl/                     #   DPO scripts (full + LoRA)
│   ├── grpo_trl/                    #   GRPO scripts (full + LoRA)
│   ├── ppo_openrlhf/               #   OpenRLHF PPO / GRPO / REINFORCE++
│   ├── eval/                        #   Evaluation scripts (lm-eval-harness)
│   └── example.sh                   #   Checkpoint resume example
│
├── LICENSE                          # CC BY-NC 4.0
├── NOTICE
└── THIRD_PARTY_NOTICES.md           # Upstream sources & licenses

Optimizers

All optimizers are selected via --optimizer <name> in main_pretrain.py. The authoritative list is in utils/setup.py.

Supported Optimizers

| Category | Optimizer Name(s) | Description |
|----------|-------------------|-------------|
| Baselines | adam, adamw, sgd, adafactor, adam8bit | Standard first-order methods |
| GaLore Family | galore_adamw, galore_adafactor, galore_adamw8bit | Gradient Low-Rank Projection |
| GaLore Per-Layer | galore_adamw8bit_per_layer | Layer-wise GaLore (saves memory) |
| APOLLO Family | apollo_adamw, q_apollo, q_apollo_per_layer | Approximate Gradient Scaling |
| Muon-based | muon, moonlight, mano | Orthogonal / matrix optimization |
| Second-order | soap, shampoo, sso, root | Preconditioned methods |
| Variance-reduced | mars, mars_m | MARS / MARS-Muon hybrid |
| Adaptive | adam_mini, ademamix, came, sophia, prodigy | Advanced adaptive LR methods |
| Large-batch | adan, lamb, lars | Designed for large-batch training |
| Low-memory | lomo, adalomo | Low-Memory Optimization |
| Sparse | spam, stable_spam | Sparse momentum methods |
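Whatever the optimizer, the Quick Start recipe pairs it with `--warmup_steps 1000 --num_training_steps 10000`; the schedulers themselves live in `utils/training_utils.py`. Below is a generic linear-warmup + cosine-decay sketch, a common default in pretraining codebases of this kind, though not necessarily this repo's exact formula:

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (base_lr - min_lr) * cosine

# With the Quick Start settings (lr 1e-3, 1000 warmup, 10000 total steps):
print(warmup_cosine_lr(499, 1e-3, 1000, 10000))    # mid-warmup: 5e-4
print(warmup_cosine_lr(10000, 1e-3, 1000, 10000))  # end of training: 0.0
```

Note that some methods in the table (e.g. prodigy, or schedule-free variants from the `schedulefree` dependency) set or adapt the learning rate themselves, so the external schedule matters less for them.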
