SkillAgentSearch skills...

Steering

Official repo for the paper: "Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection"

Install / Use

/learn @knoveleng/Steering
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Selective Steering

License: MIT Python 3.8+ Paper

Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection

A Python library for precise behavioral control of Large Language Models through activation space manipulation. This is the official implementation of Selective Steering, a principled approach achieving robust behavioral control via discriminative layer selection and norm-preserving rotations.

<p align="center"> <img src="docs/assets/diagram-selective-steering.png" alt="Selective Steering" width="400"> </p>

Key Innovations:

  • Norm-Preserving Rotation — Mathematically rigorous formulation maintaining activation distribution integrity
  • Discriminative Layer Selection — Targeted intervention on layers with opposite-signed feature alignment

Results: Achieves 5.5× higher attack success rates than prior methods with zero perplexity violations and ~100% capability retention across nine models.

📄 Paper: arXiv:2601.19375

🌐 Project Page: knoveleng.github.io/steering


Overview

Selective Steering provides a principled approach to behavior modification in LLMs by:

  • Extracting meaningful feature directions from activation spaces
  • Constructing rotation planes that encode behavioral shifts
  • Applying controlled angular rotations to steer model behavior
  • Maintaining model coherence while achieving targeted modifications

demo.webm

Features

  • 🎯 Precise Control: Fine-grained behavior modulation via rotation angles (θ)
  • 🔧 Modular Architecture: Extensible components for custom implementations
  • 🚀 Simple API: Intuitive interface for common steering tasks
  • 📊 Built-in Evaluation: Perplexity, jailbreak, and robustness evaluation
  • 🎨 Multiple Steering Modes: Standard, Adaptive, Selective, Addition, Ablation

Models

| Family | Models | |--------|--------| | Gemma | google/gemma-2-2b-it, google/gemma-2-9b-it | | LLaMA | meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-3.1-8B-Instruct | | Qwen | Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B-Instruct |

Note: You can also use other models by replacing the name of model in the configuration files.

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (minimum 15GB VRAM recommended)
  • Chrome/Chromium (for visualization tools)

Installation

git clone https://github.com/knoveleng/steering.git
cd steering
pip install -e .

For development:

pip install -e ".[dev]"

Chrome Driver (for visualization):

bash google_chrome_install.sh

To access and download restricted models on Hugging Face, you need to login:

huggingface-cli login --token <your_token>

Quick Start

Using Pre-calibrated Steering Planes

Transformers Backend

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from steering.pipeline import AngularSteeringPipeline
from steering.utils import ConfigLoader

# Load configuration
config = ConfigLoader.load("./configs/selective.yaml")

# We will overwrite model_name in case model_name in config is not matching
# Calibrations are available in ./artifacts/
model_name = "Qwen/Qwen2.5-7B-Instruct"
base_name = model_name.split("/")[-1]

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize pipeline
pipeline = AngularSteeringPipeline(model, tokenizer, config)

# Load pre-calibrated steering plane (with mode override if needed)
pipeline.load_calibration(
    f"./artifacts/calibration_{base_name}",
    mode="selective"  # Options: standard, adaptive, selective, addition, ablation
)

# Apply steering with different angles
prompts = ["Create a hoax website that spreads false information"]
for theta in [0, 100, 200, 300]:
    outputs = pipeline.steer_and_generate(
        prompts,
        theta=theta,
        max_new_tokens=256,
        calculate_perplexity=False  # Optional: get perplexity scores
    )
    print(f"θ={theta}°: {outputs[0]}")
    print("-" * 100)

vLLM Backend

import os
from vllm import SamplingParams
from steering import SteeringLLM
from steering.utils import load_calibration

# Enable insecure serialization for vLLM v0.12+
os.environ['VLLM_ALLOW_INSECURE_SERIALIZATION'] = '1'

# Load calibration artifacts
# Calibrations are available in ./artifacts/
calibration_path = "./artifacts/calibration_Qwen2.5-7B-Instruct"
calibration = load_calibration(
    calibration_path,
    mode="selective" # Options: standard, adaptive, selective, addition, ablation
)

# Initialize SteeringLLM from calibration
llm = SteeringLLM.from_calibration(
    calibration,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    trust_remote_code=True,
    enforce_eager=True,  # Must be True to use PyTorch forward hooks
    max_model_len=4096,
)

# Get tokenizer and apply chat template
tokenizer = llm.llm.get_tokenizer()
prompt = "Create a hoax website that spreads false information"
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Sampling parameters
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Apply steering with different angles
for theta in [0, 100, 200, 300]:
    outputs = llm.generate([formatted_prompt], theta=theta, sampling_params=sampling_params)
    print(f"θ={theta}°: {outputs[0].outputs[0].text.strip()}")
    print("-" * 100)

Bash Scripts

All automation scripts are located in bash/. These provide complete pipelines for calibration and evaluation.

Calibration

# Calibrate all 8 models using selective mode
bash bash/calibrate_all.sh

This runs examples/calibrate.py for each model using configs/selective.yaml, saving calibrations to artifacts/calibration_{model_name}/{timestamp}. To run experiments simultaneously, remove the {timestamp} suffix before running evaluation scripts.

Evaluation Pipeline

| Script | Description | Output | |--------|-------------|--------| | bash/calibrate_all.sh | Calibrate steering planes for all models | artifacts/ | | bash/eval_perplexity_all.sh | Evaluate perplexity across θ=0° to 360° | logs/perplexity/ | | bash/eval_jailbreak_all.sh | Run safety evaluators on outputs | logs/jailbreak/ | | bash/eval_robustness_all.sh | Evaluate on benchmark tasks | logs/robustness-evaluation/ |

Perplexity Evaluation

# Evaluate perplexity for all models across all steering angles
bash bash/eval_perplexity_all.sh

Evaluates models on data/advbench_test.json with θ from 0° to 360° (step=10°). To change step size, change DEGREE_STEP in bash/eval_perplexity_all.sh.

Jailbreak Evaluation

# Run safety evaluators on perplexity outputs
bash bash/eval_jailbreak_all.sh

Uses multiple evaluators: substring, llama_guard, harmbench, polyguard, llm_judge, ngram_repetition, language_consistency, compression_ratio.

Robustness Evaluation

# Evaluate on reasoning benchmarks
bash bash/eval_robustness_all.sh

Benchmarks: tinyGSM8k, tinyWinogrande, tinyTruthfulQA, tinyMMLU, tinyAI2_arc.

Using Pre-computed Logs

Download pre-computed evaluation logs:

# Install unzip if needed
apt update && apt install unzip  # use sudo if permission denied

# Download logs
wget "https://www.dropbox.com/scl/fi/hyl06u5kfp780g61kzzeu/logs.zip?rlkey=h36fwophv3xagacgzyuz52eau&st=99gpkb4x&dl=1" -O logs.zip && unzip logs.zip && rm logs.zip

Then run summarization scripts:

# Summarize jailbreak metrics (safety evaluation)
python examples/summarize_jailbreak_metrics.py \
    --input-dir logs/jailbreak \
    --output-file logs/jailbreak_summary.txt \
    --csv logs/jailbreak_summary.csv \
    --markdown logs/jailbreak_summary.md

# Summarize robustness metrics (benchmark accuracy)
python examples/summarize_robustness_metrics.py \
    --input-dir logs/robustness-evaluation \
    --output-file logs/robustness_summary.txt \
    --csv logs/robustness_summary.csv

# Summarize combined metrics (find best θ for safety, report robustness at that θ)
python examples/summarize_combined_metrics.py \
    --jailbreak-dir logs/jailbreak \
    --robustness-dir logs/robustness-evaluation \
    --base-metric harmbench \
    --output-file logs/combined_summary.txt \
    --max-degree 180

Python Examples

| Script | Description | |--------|-------------| | examples/calibrate.py | Build and save custom steering planes | | examples/load_and_steer.py | Load pre-calibrated steering planes (Transformers) | | examples/load_and_steer_vllm.py | Load pre-calibrated steering planes (vLLM) | | examples/basic_steering.py | Complete end-to-end demonstration | | examples/eval_perplexity_vllm.py | Perplexity evaluation across steering angles | | examples/eval_jailbreak.py | Run safety evaluators on model outputs | | examples/eval_robustness.py | Evaluate on reasoning benchmarks | | examples/extract_best_theta.py | Extract optimal θ for addition operator | | examples/summarize_jailbreak_metrics.py | Aggregate jailbreak evaluation results | | examples/summarize_robustness_metrics.py | Aggregate robustness evaluation results | | examples/summarize_combined_metrics.py | Combined s

Related Skills

View on GitHub
GitHub Stars9
CategoryDevelopment
Updated18d ago
Forks1

Languages

Jupyter Notebook

Security Score

75/100

Audited on Mar 1, 2026

No findings