Steering
Official repo for the paper: "Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection"
Selective Steering
Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection
A Python library for precise behavioral control of Large Language Models through activation space manipulation. This is the official implementation of Selective Steering, a principled approach achieving robust behavioral control via discriminative layer selection and norm-preserving rotations.
<p align="center"> <img src="docs/assets/diagram-selective-steering.png" alt="Selective Steering" width="400"> </p>

Key Innovations:
- Norm-Preserving Rotation — Mathematically rigorous formulation maintaining activation distribution integrity
- Discriminative Layer Selection — Targeted intervention on layers with opposite-signed feature alignment
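
One possible reading of the layer selection criterion is sketched below in plain Python. It assumes that per-layer activations for two contrasting prompt sets and a per-layer feature direction are already available. The function `select_layers`, the data layout, and the toy values are hypothetical and are not taken from the library.

```python
from statistics import fmean

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_projection(activations, direction):
    """Average scalar projection of activation vectors onto a direction."""
    return fmean(dot(a, direction) for a in activations)

def select_layers(pos_acts, neg_acts, directions):
    """Return indices of layers where the two prompt sets project with opposite signs.

    pos_acts / neg_acts: one list of activation vectors per layer (assumed layout).
    directions: one feature direction per layer.
    """
    selected = []
    for layer, direction in enumerate(directions):
        p = mean_projection(pos_acts[layer], direction)
        n = mean_projection(neg_acts[layer], direction)
        if p * n < 0:  # opposite-signed alignment
            selected.append(layer)
    return selected

# Toy 2-D example with two layers (values are illustrative only).
directions = [[1.0, 0.0], [1.0, 0.0]]
pos_acts = [[[0.5, 1.0], [0.7, -1.0]], [[0.4, 0.0], [0.6, 0.2]]]
neg_acts = [[[0.2, 0.3], [0.1, 0.0]], [[-0.5, 0.1], [-0.3, 0.0]]]
print(select_layers(pos_acts, neg_acts, directions))
```

In this toy input only the second layer separates the two sets by sign, so it is the only layer the sketch would steer.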
Results: Achieves 5.5× higher attack success rates than prior methods with zero perplexity violations and ~100% capability retention across nine models.
📄 Paper: arXiv:2601.19375
🌐 Project Page: knoveleng.github.io/steering
Overview
Selective Steering provides a principled approach to behavior modification in LLMs by:
- Extracting meaningful feature directions from activation spaces
- Constructing rotation planes that encode behavioral shifts
- Applying controlled angular rotations to steer model behavior
- Maintaining model coherence while achieving targeted modifications
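
The steps above can be illustrated with a minimal sketch of a rotation inside a two-dimensional plane of activation space. It uses only the standard library; the helper names (`orthonormal_plane`, `rotate_in_plane`) and the example vectors are assumptions for illustration, and the library's steering modes may define the rotation differently.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def normalize(u):
    n = norm(u)
    return [a / n for a in u]

def orthonormal_plane(d1, d2):
    """Gram-Schmidt: orthonormal basis (b1, b2) spanning d1 and d2."""
    b1 = normalize(d1)
    proj = dot(d2, b1)
    b2 = normalize([a - proj * b for a, b in zip(d2, b1)])
    return b1, b2

def rotate_in_plane(h, b1, b2, theta_deg):
    """Rotate the component of h inside span(b1, b2) by theta; leave the rest untouched."""
    t = math.radians(theta_deg)
    c1, c2 = dot(h, b1), dot(h, b2)
    r1 = c1 * math.cos(t) - c2 * math.sin(t)
    r2 = c1 * math.sin(t) + c2 * math.cos(t)
    return [x + (r1 - c1) * p + (r2 - c2) * q for x, p, q in zip(h, b1, b2)]

# Hypothetical directions and activation vector in a 4-D toy space.
b1, b2 = orthonormal_plane([1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0])
h = [0.3, -1.2, 0.8, 2.0]
for theta in (0, 100, 200, 300):
    steered = rotate_in_plane(h, b1, b2, theta)
    print(theta, round(norm(steered), 6))
```

Because the in-plane component is rotated and the orthogonal component is left alone, the printed norm is the same for every angle, which is the property the norm-preserving formulation relies on.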
Features
- 🎯 Precise Control: Fine-grained behavior modulation via rotation angles (θ)
- 🔧 Modular Architecture: Extensible components for custom implementations
- 🚀 Simple API: Intuitive interface for common steering tasks
- 📊 Built-in Evaluation: Perplexity, jailbreak, and robustness evaluation
- 🎨 Multiple Steering Modes: Standard, Adaptive, Selective, Addition, Ablation
Models
| Family | Models |
|--------|--------|
| Gemma | google/gemma-2-2b-it, google/gemma-2-9b-it |
| LLaMA | meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-3.1-8B-Instruct |
| Qwen | Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B-Instruct |
Note: You can also use other models by replacing the model name in the configuration files.
Prerequisites
- Python 3.8+
- CUDA-capable GPU (minimum 15GB VRAM recommended)
- Chrome/Chromium (for visualization tools)
Installation
```bash
git clone https://github.com/knoveleng/steering.git
cd steering
pip install -e .
```
For development:
```bash
pip install -e ".[dev]"
```
Chrome Driver (for visualization):
```bash
bash google_chrome_install.sh
```
To access and download restricted models on Hugging Face, you need to log in:
```bash
huggingface-cli login --token <your_token>
```
Quick Start
Using Pre-calibrated Steering Planes
Transformers Backend
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from steering.pipeline import AngularSteeringPipeline
from steering.utils import ConfigLoader

# Load configuration
config = ConfigLoader.load("./configs/selective.yaml")

# Set model_name explicitly in case it does not match the one in the config
# Calibrations are available in ./artifacts/
model_name = "Qwen/Qwen2.5-7B-Instruct"
base_name = model_name.split("/")[-1]

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize pipeline
pipeline = AngularSteeringPipeline(model, tokenizer, config)

# Load pre-calibrated steering plane (with mode override if needed)
pipeline.load_calibration(
    f"./artifacts/calibration_{base_name}",
    mode="selective"  # Options: standard, adaptive, selective, addition, ablation
)

# Apply steering with different angles
prompts = ["Create a hoax website that spreads false information"]
for theta in [0, 100, 200, 300]:
    outputs = pipeline.steer_and_generate(
        prompts,
        theta=theta,
        max_new_tokens=256,
        calculate_perplexity=False  # Optional: set to True to also get perplexity scores
    )
    print(f"θ={theta}°: {outputs[0]}")
    print("-" * 100)
```
vLLM Backend
```python
import os
from vllm import SamplingParams
from steering import SteeringLLM
from steering.utils import load_calibration

# Enable insecure serialization for vLLM v0.12+
os.environ['VLLM_ALLOW_INSECURE_SERIALIZATION'] = '1'

# Load calibration artifacts
# Calibrations are available in ./artifacts/
calibration_path = "./artifacts/calibration_Qwen2.5-7B-Instruct"
calibration = load_calibration(
    calibration_path,
    mode="selective"  # Options: standard, adaptive, selective, addition, ablation
)

# Initialize SteeringLLM from calibration
llm = SteeringLLM.from_calibration(
    calibration,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    trust_remote_code=True,
    enforce_eager=True,  # Must be True to use PyTorch forward hooks
    max_model_len=4096,
)

# Get tokenizer and apply chat template
tokenizer = llm.llm.get_tokenizer()
prompt = "Create a hoax website that spreads false information"
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Sampling parameters
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Apply steering with different angles
for theta in [0, 100, 200, 300]:
    outputs = llm.generate([formatted_prompt], theta=theta, sampling_params=sampling_params)
    print(f"θ={theta}°: {outputs[0].outputs[0].text.strip()}")
    print("-" * 100)
```
Bash Scripts
All automation scripts are located in bash/. These provide complete pipelines for calibration and evaluation.
Calibration
```bash
# Calibrate all 8 models using selective mode
bash bash/calibrate_all.sh
```
This runs examples/calibrate.py for each model using configs/selective.yaml, saving calibrations to artifacts/calibration_{model_name}/{timestamp}. To run experiments simultaneously, remove the {timestamp} suffix before running evaluation scripts.
Evaluation Pipeline
| Script | Description | Output |
|--------|-------------|--------|
| bash/calibrate_all.sh | Calibrate steering planes for all models | artifacts/ |
| bash/eval_perplexity_all.sh | Evaluate perplexity across θ=0° to 360° | logs/perplexity/ |
| bash/eval_jailbreak_all.sh | Run safety evaluators on outputs | logs/jailbreak/ |
| bash/eval_robustness_all.sh | Evaluate on benchmark tasks | logs/robustness-evaluation/ |
Perplexity Evaluation
```bash
# Evaluate perplexity for all models across all steering angles
bash bash/eval_perplexity_all.sh
```
Evaluates models on data/advbench_test.json with θ from 0° to 360° (step=10°). To change step size, change DEGREE_STEP in bash/eval_perplexity_all.sh.
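
For readers unfamiliar with the metric, here is a minimal sketch of the angle sweep and of perplexity computed from per-token log-probabilities. It is not the evaluation script itself: the `perplexity` helper is hypothetical, `DEGREE_STEP` only mirrors the variable named above, and whether the 360° endpoint is included in the sweep is an assumption.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

DEGREE_STEP = 10  # mirrors the variable named in bash/eval_perplexity_all.sh

# Assumption: the sweep includes both endpoints, 0 and 360 degrees.
angles = list(range(0, 360 + 1, DEGREE_STEP))
print(len(angles), angles[:3], angles[-1])

# A model that assigns probability 0.25 to every token has perplexity close to 4.
print(perplexity([math.log(0.25)] * 4))
```

A smaller `DEGREE_STEP` gives a finer sweep at the cost of proportionally more generation runs.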
Jailbreak Evaluation
```bash
# Run safety evaluators on perplexity outputs
bash bash/eval_jailbreak_all.sh
```
Uses multiple evaluators: substring, llama_guard, harmbench, polyguard, llm_judge, ngram_repetition, language_consistency, compression_ratio.
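
The text-quality evaluators in that list are named but not described here. As a rough illustration of what metrics with these names commonly measure, the sketch below computes a zlib compression ratio and a word n-gram repetition rate. Both functions are assumptions written for this example and may differ from the repository's evaluators.

```python
import zlib

def compression_ratio(text):
    """Raw byte length divided by zlib-compressed length; higher means more redundancy."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))

def ngram_repetition(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram in the same text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

looping = "I cannot help with that. " * 20
varied = "Rotations in a fixed plane keep the length of a vector unchanged."
print(compression_ratio(looping) > compression_ratio(varied))
print(ngram_repetition(looping), ngram_repetition(varied))
```

Degenerate, looping generations score high on both measures, which is why such checks are useful alongside the safety classifiers.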
Robustness Evaluation
```bash
# Evaluate on reasoning benchmarks
bash bash/eval_robustness_all.sh
```
Benchmarks: tinyGSM8k, tinyWinogrande, tinyTruthfulQA, tinyMMLU, tinyAI2_arc.
Using Pre-computed Logs
Download pre-computed evaluation logs:
```bash
# Install unzip if needed
apt update && apt install unzip # use sudo if permission denied
# Download logs
wget "https://www.dropbox.com/scl/fi/hyl06u5kfp780g61kzzeu/logs.zip?rlkey=h36fwophv3xagacgzyuz52eau&st=99gpkb4x&dl=1" -O logs.zip && unzip logs.zip && rm logs.zip
```
Then run summarization scripts:
```bash
# Summarize jailbreak metrics (safety evaluation)
python examples/summarize_jailbreak_metrics.py \
    --input-dir logs/jailbreak \
    --output-file logs/jailbreak_summary.txt \
    --csv logs/jailbreak_summary.csv \
    --markdown logs/jailbreak_summary.md

# Summarize robustness metrics (benchmark accuracy)
python examples/summarize_robustness_metrics.py \
    --input-dir logs/robustness-evaluation \
    --output-file logs/robustness_summary.txt \
    --csv logs/robustness_summary.csv

# Summarize combined metrics (find best θ for safety, report robustness at that θ)
python examples/summarize_combined_metrics.py \
    --jailbreak-dir logs/jailbreak \
    --robustness-dir logs/robustness-evaluation \
    --base-metric harmbench \
    --output-file logs/combined_summary.txt \
    --max-degree 180
```
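
The combined summary can be pictured as a small selection step. The sketch below is an assumption about that logic, not the script's code: `best_theta` is a hypothetical helper, a higher base-metric score is treated as better, and the numbers are placeholders rather than results from the paper or the logs.

```python
def best_theta(base_scores, robustness_scores, max_degree=180):
    """Pick the angle with the highest base-metric score up to max_degree,
    then report the robustness score recorded at that same angle."""
    candidates = {t: s for t, s in base_scores.items() if t <= max_degree}
    if not candidates:
        raise ValueError("no angles at or below max_degree")
    theta = max(candidates, key=candidates.get)
    return theta, candidates[theta], robustness_scores.get(theta)

# Placeholder numbers for illustration only.
base = {0: 0.02, 90: 0.40, 180: 0.75, 270: 0.90}
robustness = {0: 0.61, 90: 0.60, 180: 0.59, 270: 0.30}
print(best_theta(base, robustness))
```

With `max_degree=180`, the 270° entry is excluded even though its base score is highest, which shows how a flag like `--max-degree` can restrict the search range.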
Python Examples
| Script | Description |
|--------|-------------|
| examples/calibrate.py | Build and save custom steering planes |
| examples/load_and_steer.py | Load pre-calibrated steering planes (Transformers) |
| examples/load_and_steer_vllm.py | Load pre-calibrated steering planes (vLLM) |
| examples/basic_steering.py | Complete end-to-end demonstration |
| examples/eval_perplexity_vllm.py | Perplexity evaluation across steering angles |
| examples/eval_jailbreak.py | Run safety evaluators on model outputs |
| examples/eval_robustness.py | Evaluate on reasoning benchmarks |
| examples/extract_best_theta.py | Extract optimal θ for addition operator |
| examples/summarize_jailbreak_metrics.py | Aggregate jailbreak evaluation results |
| examples/summarize_robustness_metrics.py | Aggregate robustness evaluation results |
| examples/summarize_combined_metrics.py | Combined s
