Pythia: Interpreting Transformers Across Time and Scale

This repository is for EleutherAI's project Pythia which combines interpretability analysis and scaling laws to understand how knowledge develops and evolves during training in autoregressive transformers. For detailed info on the models, their training, and their properties, please see our paper Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling.

The Pythia suite was developed with the explicit purpose of enabling research in interpretability, learning dynamics, and ethics and transparency for which existing model suites were inadequate. The key features of the Pythia suite are:

  1. All models, data, and code used in the paper are publicly released, enabling full reproducibility of results. All results in our paper have been independently verified by at least one other lab.
  2. All models feature 154 checkpoints saved throughout training, enabling the study of learning dynamics of LLMs.
  3. All models were trained on the same data in the same order, enabling researchers to explore causal interventions on the training process.

At time of release, Pythia was the only model suite in the world to meet these desiderata. In fact, the 154 checkpoints we released for our 12B parameter models represented more partially trained checkpoints for each model than the rest of the world had ever released for all 12B+ models combined. Our work has inspired several others to create similar projects, including LLM360's Amber and K2-65B, AI2's OLMo, and Zyphra's BlackMamba.

Aside from the Pythia suite itself, this repository also acts as a hub containing information, code, and reproducibility instructions for the following papers:

  • Emergent and Predictable Memorization in Large Language Models [code] [paper]
  • PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs [code] [paper]

Changelog

[March 10, 2025] Added info for the PolyPythias paper.

[July 9, 2024] Substantially revamped the readme, including better historical contextualization and promoting lots of cool research people have done with Pythia. Also added links to subsequently trained models.

[November 2, 2023] We have added 14M and 31M models at the request of some researchers. We plan on training deduped versions of these models in the future.

[April 3, 2023] We have released a new version of all Pythia models, fixing various inconsistencies in the original suite. Please see Appendix B in the Pythia paper for details on the changes. The old models ("v0") remain available here and may be useful for ablation studies.

[January 20, 2023] We renamed the Pythia model suite to include both embedding-layer and unembedding-layer parameters in our total parameter counts, in line with many other model suites and because we believe this convention better reflects the on-device memory usage of these models. We also discovered that, due to a typo, one of our models was smaller than intended, and replaced it with a model of the intended size. See here for more details.

Models

We train and release a suite of 8 model sizes on the Pile (paper, datasheet), as well as on the Pile with deduplication applied. All 8 model sizes are trained on the exact same data, in the exact same order. Each model saw 299,892,736,000 (~= 300B) tokens during training. This corresponds to just under 1 epoch on the Pile for the "standard" models, and ~= 1.5 epochs on the deduplicated Pile (which contains 207B tokens per epoch). All models are trained with mixed precision, using fp16 for all models except EleutherAI/pythia-1b, which was trained with bf16 because in fp16 the model experienced an irreconcilable loss spike late in training.
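The token and epoch figures above can be sanity-checked with a little arithmetic. This sketch assumes the "2M" batch size in the table below means 1024 sequences of 2048 tokens each (2,097,152 tokens per step); dividing the total token count by that batch size recovers the number of training steps.

```python
# Sanity-check the training token arithmetic quoted above.
# ASSUMPTION: "2M" batch size = 1024 sequences x 2048 tokens per step.
batch_tokens = 1024 * 2048            # 2,097,152 tokens per step
total_tokens = 299_892_736_000        # tokens seen by every model

steps = total_tokens // batch_tokens
print(steps)                          # -> 143000 training steps

# ~1.5 epochs on the deduplicated Pile (207B tokens per epoch)
dedup_epochs = total_tokens / 207e9
print(round(dedup_epochs, 2))         # -> 1.45
```

The division is exact (143,000 steps with no remainder), which is a useful consistency check on the quoted token count.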

After our initial release, we trained 14M and 31M parameter models at the request of alignment researchers interested in scaling sparse autoencoders.

| Params | n_layers | d_model | n_heads | d_head | Batch Size | Learning Rate | Hugging Face Checkpoints |
| ------ | -------- | ------- | ------- | ------ | ---------- | ------------- | ------------------------ |
| 14M    | 6        | 128     | 4       | 32     | 2M         | 1.0e-3        | Standard                 |
| 31M    | 6        | 256     | 8       | 32     | 2M         | 1.0e-3        | Standard                 |
| 70M    | 6        | 512     | 8       | 64     | 2M         | 1.0e-3        | Standard, Deduped        |
| 160M   | 12       | 768     | 12      | 64     | 2M         | 6.0e-4        | Standard, Deduped        |
| 410M   | 24       | 1024    | 16      | 64     | 2M         | 3.0e-4        | Standard, Deduped        |
| 1B     | 16       | 2048    | 8       | 256    | 2M         | 3.0e-4        | Standard, Deduped        |
| 1.4B   | 24       | 2048    | 16      | 128    | 2M         | 2.0e-4        | Standard, Deduped        |
| 2.8B   | 32       | 2560    | 32      | 80     | 2M         | 1.6e-4        | Standard, Deduped        |
| 6.9B   | 32       | 4096    | 32      | 128    | 2M         | 1.2e-4        | Standard, Deduped        |
| 12B    | 36       | 5120    | 40      | 128    | 2M         | 1.2e-4        | Standard, Deduped        |

To promote research on the learning dynamics of LLMs, we make 154 checkpoints available for each model, representing steps 0 (initialization), 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000, and then every 1,000 subsequent steps. We also upload the pre-tokenized data files and a script to reconstruct the dataloader as seen during training for all models. See the Reproducing Training section for more details.
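The checkpoint schedule above can be enumerated programmatically. This sketch assumes training ran to step 143,000 (300B tokens at 2M-token batches), and that each checkpoint is published as a Hugging Face revision named `step<N>`; the model name and step in the commented loading example are illustrative, not prescriptive.

```python
# Enumerate the 154 checkpoint steps described above:
# 0 (init), 1, 2, 4, ..., 512, then every 1,000 steps up to 143,000.
log_spaced = [2**i for i in range(10)]        # 1, 2, 4, ..., 512
linear = list(range(1000, 143_001, 1000))     # 1000, 2000, ..., 143000
steps = [0] + log_spaced + linear
print(len(steps))                             # -> 154 checkpoints

# Each checkpoint lives on a Hugging Face branch named "step<N>":
revision = f"step{steps[13]}"                 # -> "step3000"

# Loading one checkpoint (requires `transformers`; downloads weights):
# from transformers import GPTNeoXForCausalLM
# model = GPTNeoXForCausalLM.from_pretrained(
#     "EleutherAI/pythia-70m-deduped", revision=revision
# )
```

Note that the 11 log-spaced early checkpoints (including step 0) plus the 143 linearly spaced ones account for all 154 checkpoints.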

Config files used to train these models with the GPT-NeoX library can be found in the models/ directory within this repository, as well as in the GPT-NeoX library itself.

We made a mistake while originally training these models, resulting in some inconsistencies across runs. We reran the entire model suite with these inconsistencies fixed; the original runs remain available under a "-v0" suffix, e.g. EleutherAI/pythia-160m-v0. See the Pythia paper for further details on how the v0 models differ from the main suite.

The loss curves for all models are contained in our (messy!) wandb project here.

A rough and partial correspondence between models and wandb runs is given by:

| Model               | Wandb |
| ------------------- | ----- |
| Pythia-2.8b         | Link  |
| Pythia-2.8b-deduped | Link  |
| Pythia-1b           | Link  |
| Pythia-1.4b         | Link  |
| Pythia-1.4b-deduped | Link  |
| Pythia-160m         | Link  |
| Pythia-160m-deduped | Link  |

Multiple random seeds

The random seed used to train the Pythia models is the GPT-NeoX default: 1234. To enable research into how randomness affects model behavior, we have been training additional models with different random seeds. We have currently trained and released the following models using each random seed from 1 to 9:

  • Pythia 14M
  • Pythia 31M
  • Pythia 70M
  • Pythia 160M
  • Pythia 410M
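With five sizes and nine seeds each, the seed sweep comprises 45 runs. A sketch of how the full set might be enumerated, assuming the repos follow a `-seed<N>` naming convention on the Hugging Face Hub (an assumption on my part; check the hub for the exact repo ids):

```python
# Build the list of seed-variant model names.
# ASSUMPTION: repos are named "EleutherAI/pythia-<size>-seed<N>" --
# verify the exact ids on the Hugging Face Hub before relying on them.
sizes = ["14m", "31m", "70m", "160m", "410m"]
seeds = range(1, 10)   # seeds 1..9 (seed 1234 is the original suite)

repos = [f"EleutherAI/pythia-{size}-seed{seed}"
         for size in sizes
         for seed in seeds]
print(len(repos))      # -> 45 seed-variant models
```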

All of these models are t
