Axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
AxBench is a scalable benchmark that evaluates interpretability techniques on two axes: concept detection and model steering.
- 🤗 HuggingFace: AxBench Collections
- Colab: Tutorial on using our dictionary via pyvene
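To make the two axes concrete: a rank-1 method represents a concept as a single direction w in the residual stream, and concept detection scores a text by how strongly hidden states project onto w. A minimal sketch, assuming a Gemma-2 model and max-pooling over token positions (illustrative, not AxBench's exact implementation):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative setup: any model whose hidden states you can read works here.
tok = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModel.from_pretrained("google/gemma-2-2b", output_hidden_states=True)

def detect(text: str, w: torch.Tensor, layer: int = 10) -> float:
    """Score how strongly `text` expresses the concept direction `w`,
    max-pooled over token positions at the given layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        h = model(**inputs).hidden_states[layer][0]  # (seq_len, hidden_size)
    return (h @ w).max().item()  # peak projection onto the concept direction
```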
Related papers
- HyperSteer: Activation Steering at Scale with Hypernetworks [preprint]
- Improved Representation Steering for Language Models [preprint]
- SAEs Are Good for Steering -- If You Select the Right Features [preprint]
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders [ICML 2025 (spotlight)]
🏆 Rank-1 steering leaderboard
📢 Please open a PR to enter the leaderboard.
| Method | 2B L10 | 2B L20 | 9B L20 | 9B L31 | Avg |
|-------------------------------------|-------:|-------:|-------:|-------:|------:|
| HyperSteer [Sun et al., 2025] | - | 0.742 | 1.091 | - | 0.917 |
| Prompt | 0.698 | 0.731 | 1.075 | 1.072 | 0.894 |
| RePS [Wu et al., 2025] | 0.756 | 0.606 | 0.892 | 0.624 | 0.720 |
| ReFT-r1 | 0.633 | 0.509 | 0.630 | 0.401 | 0.543 |
| SAE (filtered) [Arad et al., 2025] | - | - | 0.546 | 0.470 | 0.508 |
| DiffMean | 0.297 | 0.178 | 0.322 | 0.158 | 0.239 |
| SAE | 0.177 | 0.151 | 0.191 | 0.140 | 0.165 |
| SAE-A | 0.166 | 0.132 | 0.186 | 0.143 | 0.157 |
| LAT | 0.117 | 0.130 | 0.127 | 0.134 | 0.127 |
| PCA | 0.107 | 0.083 | 0.128 | 0.104 | 0.105 |
| Probe | 0.095 | 0.091 | 0.108 | 0.099 | 0.098 |
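Every method on this leaderboard steers by adding a scaled concept vector to the residual stream at a single layer. A minimal sketch of that intervention with a plain PyTorch forward hook; the layer index, scale, and random stand-in vector are illustrative only (the repo's own interventions go through pyvene, per the Colab tutorial above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

def steering_hook(w: torch.Tensor, alpha: float):
    # Decoder blocks return a tuple whose first element is the residual
    # stream; add the scaled concept vector at every position.
    def hook(module, args, output):
        return (output[0] + alpha * w,) + output[1:]
    return hook

w = torch.randn(model.config.hidden_size)  # stand-in for a learned vector
handle = model.model.layers[20].register_forward_hook(steering_hook(w, alpha=8.0))
out = model.generate(**tok("Tell me about your day.", return_tensors="pt"),
                     max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```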
Highlights
- Scalable evaluation harness: Framework for generating synthetic training + eval data from concept lists (e.g. GemmaScope SAE labels).
- Comprehensive implementations: 10+ interpretability methods evaluated, along with finetuning and prompting baselines.
- 16K concept training data: Full-scale datasets for supervised dictionary learning (SDL).
- Two pretrained SDL models: Drop-in replacements for standard SAEs.
- LLM-in-the-loop training: Generate your own datasets for less than $0.01 per concept.
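The LLM-in-the-loop pipeline works by prompting an API model to produce positive and negative examples for each concept. A minimal sketch of one such call, with an illustrative prompt and model name (the real pipeline is axbench/scripts/generate.py, described below):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def concept_examples(concept: str, n: int = 10) -> list[str]:
    """Ask an API model for sentences exhibiting `concept`.
    The prompt wording here is illustrative, not AxBench's template."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": f"Write {n} short sentences, one per line, that "
                       f"clearly exhibit the concept: {concept}",
        }],
    )
    return resp.choices[0].message.content.strip().splitlines()

positives = concept_examples("references to the Golden Gate Bridge")
```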
Additional experiments
We include exploratory notebooks under axbench/examples, such as:
| Experiment | Description |
|----------------------------------------|-------------------------------------------------------------------------------|
| basics.ipynb | Analyzes basic geometry of learned dictionaries. |
| subspace_gazer.ipynb | Visualizes learned subspaces. |
| lang->subspace.ipynb | Fine-tunes a hyper-network to map natural language to subspaces or steering vectors (sketched after this table). |
| platonic.ipynb | Explores the platonic representation hypothesis in subspace learning. |
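On the hyper-network experiment: the idea is a small network that maps a natural-language concept description to a steering vector, so unseen concepts need no per-concept training. A minimal sketch of such a module, assuming a generic sentence-encoder embedding as input (the encoder choice and MLP shapes are illustrative, not the notebook's architecture):

```python
import torch
import torch.nn as nn

class LangToSteering(nn.Module):
    """Map a pooled embedding of a concept description to a steering
    vector in the target model's residual stream."""
    def __init__(self, text_dim: int, hidden_dim: int, model_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, model_dim),
        )

    def forward(self, concept_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(concept_emb)

# text_dim matches your sentence encoder; 2304 is Gemma-2-2B's hidden size.
hyper = LangToSteering(text_dim=768, hidden_dim=1024, model_dim=2304)
w = hyper(torch.randn(768))  # candidate steering vector for one description
```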
Instructions for AxBenching your methods
Installation
We strongly recommend uv for managing your Python virtual environment, but any venv manager will work.
git clone git@github.com:stanfordnlp/axbench.git
cd axbench
uv sync # if using uv
Set up your API keys for OpenAI and Neuronpedia:
import os
os.environ["OPENAI_API_KEY"] = "your_openai_api_key_here"
os.environ["NP_API_KEY"] = "your_neuronpedia_api_key_here"
Download the necessary datasets to axbench/data:
uv run axbench/data/download-seed-sentences.py
cd axbench/data
bash download-2b.sh
bash download-9b.sh
bash download-alpaca.sh
Try a simple demo.
To run a complete demo with a single config file:
bash axbench/demo/demo.sh
To run a complete demo for HyperSteer:
bash axbench/demo/hypersteer_demo.sh
Data generation
(If using our pre-generated data, you can skip this.)
Generate training data:
uv run axbench/scripts/generate.py --config axbench/demo/sweep/simple.yaml --mode training --dump_dir axbench/demo
Generate inference data:
uv run axbench/scripts/generate.py --config axbench/demo/sweep/simple.yaml --mode latent --dump_dir axbench/demo
Generate preference-based training data:
uv run axbench/scripts/generate.py --config axbench/demo/sweep/simple.yaml \
--mode dpo_training --dump_dir axbench/demo \
--model_name google/gemma-2-2b-it \
--inference_batch_size 64
To modify the data generation process, edit axbench/demo/sweep/simple.yaml.
Training
Train and save your methods:
uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/train.py \
--config axbench/demo/sweep/simple.yaml \
--dump_dir axbench/demo
(Replace $gpu_count with the number of GPUs to use.)
For additional config:
torchrun --nproc_per_node=$gpu_count axbench/scripts/train.py \
--config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
--dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
--overwrite_data_dir axbench/concept500/prod_2b_l10_v1/generate
where --dump_dir is the output directory and --overwrite_data_dir is where the training data resides. You can also override other parameters, such as --layer 10, for customized tuning.
Inference
Concept detection
Run inference:
uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \
--config axbench/demo/sweep/simple.yaml \
--dump_dir axbench/demo \
--mode latent
For additional config using custom directories:
uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \
--config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
--dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
--overwrite_metadata_dir axbench/concept500/prod_2b_l10_v1/generate \
--overwrite_inference_data_dir axbench/concept500/prod_2b_l10_v1/inference \
--mode latent
Imbalanced concept detection
For real-world scenarios with fewer than 1% positive examples, we upsample negatives (100:1) and re-evaluate. Use:
uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \
--config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
--dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
--overwrite_metadata_dir axbench/concept500/prod_2b_l10_v1/generate \
--overwrite_inference_data_dir axbench/concept500/prod_2b_l10_v1/inference \
--mode latent_imbalance
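The imbalanced setting only changes the composition of the evaluation pool. A sketch of the 100:1 construction (the helper below is hypothetical, not the script's internals):

```python
import random

def make_imbalanced_eval(positives, negatives, ratio=100, seed=0):
    """Pair each positive with `ratio` sampled negatives, mimicking a
    stream where under 1% of inputs contain the concept."""
    rng = random.Random(seed)
    data = [(x, 1) for x in positives]
    data += [(rng.choice(negatives), 0) for _ in range(ratio * len(positives))]
    rng.shuffle(data)
    return data  # list of (text, label) pairs at a 1:100 positive ratio
```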
Model steering
For steering experiments:
uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \
--config axbench/demo/sweep/simple.yaml \
--dump_dir axbench/demo \
--mode steering
Or a custom run:
uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \
--config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
--dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
--overwrite_metadata_dir axbench/concept500/prod_2b_l10_v1/generate \
--overwrite_inference_data_dir axbench/concept500/prod_2b_l10_v1/inference \
--mode steering
Evaluation
Concept detection
To evaluate concept detection results:
uv run axbench/scripts/evaluate.py \
--config axbench/demo/sweep/simple.yaml \
--dump_dir axbench/demo \
--mode latent
Enable wandb logging:
uv run axbench/scripts/evaluate.py \
--config axbench/demo/sweep/simple.yaml \
--dump_dir axbench/demo \
--mode latent \
--report_to wandb \
--wandb_entity "your_wandb_entity"
Or evaluate using your custom config:
uv run axbench/scripts/evaluate.py \
--config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
--dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
--mode latent
Model steering on evaluation set
To evaluate steering:
uv run axbench/scripts/evaluate.py \
--config axbench/demo/sweep/simple.yaml \
--dump_dir axbench/demo \
--mode steering
Or a custom config:
uv run axbench/scripts/evaluate.py \
--config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
--dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
--mode steering
Model steering on test set
Note that the command above evaluates on the evaluation set; we use those results to select the best steering factor. After selecting it, run the evaluation on the test set:
uv run axbench/scripts/evaluate.py \
--config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
--dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
--mode steering_test
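Concretely, factor selection means picking, per method, the steering factor with the best evaluation-set score, then reporting only that factor's score on the test set. A sketch with a hypothetical score dictionary (the real selection happens inside the evaluation scripts):

```python
# Hypothetical shape: evaluation-set steering scores per candidate factor.
eval_scores = {1.0: 0.41, 2.0: 0.58, 4.0: 0.63, 8.0: 0.52}

best = max(eval_scores, key=eval_scores.get)  # factor chosen on the eval set
print(f"Selected factor {best}; report only its score on the test set.")
```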
Analyses
Once you have finished evaluation, you can run the analyses with the notebook provided at axbench/scripts/analyses.ipynb. All of the results in the paper are produced by this notebook.
You will need to point the relevant directories to your own results by modifying the notebook. If you introduce new models, datasets, or evaluation metrics, you can add your own analysis by following the notebook.
Reproducing our results
Please see `axbench/
