Abliterator

Orthogonal Projection Abliteration toolkit featuring Norm-Preservation, Null-Space Constraints, Winsorization, and Adaptive Layer Weighting

abliterator cli

Installation

pip install -e .

Requirements

Python 3.10+ with PyTorch
CUDA (optional) — GPU acceleration for faster processing; falls back to CPU if unavailable
llama.cpp (optional) — required for GGUF export; install separately from github.com/ggerganov/llama.cpp and ensure convert_hf_to_gguf.py and llama-quantize are available

Quick Start

abliterate

On first run, a setup wizard walks you through configuration—where your models live, output directories, and default precision. After that, you'll land in the main menu.

Using the CLI

Abliterate Model

The main workflow. Select a model from discovered directories (or enter a path manually), configure your options, and let it run.

Step 1: Select Base Model

The CLI scans your configured directories and shows available models. Already-abliterated models are marked with [A].

Step 2: Output Path

Defaults to ./abliterate/abliterated_models/{model-name}-abliterated. Change it if you like.

Step 3: Configuration

Number of prompts: How many harmful/harmless pairs to use (default: 30)
Direction multiplier: Ablation strength—1.0 is full, lower values are gentler
Norm preservation: Keeps weight magnitudes stable (recommended)
Filter prompts by refusal: Only uses prompts the model actually refuses (recommended)
Precision: float16 is fastest; bfloat16 has a wider dynamic range and is less prone to overflow

Step 4: Advanced Options

Optional enhancements for better results:

| Option | What it does | When to use |
|--------|--------------|-------------|
| Winsorization | Clips outlier activations before computing directions | Gemma models, or when baseline gives weak results |
| Null-space constraints | Preserves model capabilities (math, coding, reasoning) | When you want minimal capability degradation |
| Adaptive layer weighting | Focuses ablation on middle-to-later layers | For targeted, surgical ablation |

Test Model

Quick sanity checks:

Quick test: 5 default prompts with refusal detection
Custom prompt: Enter anything and see how the model responds
Full evaluation: Statistical analysis (see below)

Compare Models

Load an original and abliterated model side-by-side, enter a prompt, and see both responses. Useful for spot-checking behavior changes.

Evaluate Refusal

Runs the model against harmful and harmless prompt sets, computing refusal rates for each. Results are saved as timestamped JSON files to your configured eval directory.

Harmful refusal rate: Lower = more abliterated
Harmless refusal rate: Lower = fewer false positives
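
As a rough illustration of how the two rates could be computed, here is a minimal sketch. The marker phrases, the function names (`is_refusal`, `refusal_rate`, `build_report`), and the report keys are all hypothetical; the toolkit's actual refusal detector and JSON schema may differ.

```python
import json
from datetime import datetime, timezone

# Hypothetical marker phrases; the toolkit's real refusal detector may differ.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "as an ai")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


def build_report(harmful: list[str], harmless: list[str]) -> dict:
    """Assemble a timestamped summary; the key names are illustrative only."""
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"),
        "harmful_refusal_rate": refusal_rate(harmful),
        "harmless_refusal_rate": refusal_rate(harmless),
    }


# Toy responses standing in for real model outputs.
harmful_responses = ["I'm sorry, but I can't help with that.", "Sure, here is how..."]
harmless_responses = ["Here is a recipe for bread.", "The capital of France is Paris."]

report = build_report(harmful_responses, harmless_responses)
print(json.dumps(report, indent=2))
```

Keyword matching is only a cheap proxy: it misses refusals phrased in unexpected ways and can flag compliant answers that happen to contain a marker phrase.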

Export to GGUF

Converts abliterated models to GGUF format for llama.cpp, Ollama, or LM Studio. Supports Q4_K_M, Q5_K_M, Q8_0, and F16 quantization types. Vision-language models get automatic mmproj export.
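
For orientation, the sketch below builds the two llama.cpp commands that a manual export would typically involve: `convert_hf_to_gguf.py` to produce an F16 GGUF, then `llama-quantize` to quantize it. The helper name `gguf_commands`, the output file naming, and the assumption that `llama-quantize` is on your PATH are illustrative; the toolkit may invoke llama.cpp differently, and mmproj export is not covered here.

```python
from pathlib import Path


def gguf_commands(model_dir: str, llama_cpp_dir: str, quant: str = "Q4_K_M") -> list[list[str]]:
    """Return the argv lists for converting a Hugging Face model directory to GGUF.

    The output file names are hypothetical; adjust them to your own layout.
    """
    model = Path(model_dir)
    f16_path = model / f"{model.name}-F16.gguf"

    # Step 1: convert the Hugging Face checkpoint to an F16 GGUF file.
    convert = [
        "python",
        str(Path(llama_cpp_dir) / "convert_hf_to_gguf.py"),
        str(model),
        "--outfile", str(f16_path),
        "--outtype", "f16",
    ]
    if quant == "F16":
        return [convert]

    # Step 2: quantize the F16 file (assumes llama-quantize is on PATH).
    quant_path = model / f"{model.name}-{quant}.gguf"
    quantize = ["llama-quantize", str(f16_path), str(quant_path), quant]
    return [convert, quantize]


commands = gguf_commands("models/demo-abliterated", "llama.cpp", quant="Q4_K_M")
for argv in commands:
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the paths point at a real model and a real llama.cpp checkout.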

Settings

Manage model search directories, eval output location, llama.cpp path, and defaults.

The Math

Refusal Direction Extraction

Based on Arditi et al. (2024), refusal behavior is mediated by a single direction in activation space.

Run the model on harmful prompts, extract hidden states from middle layers
Run the model on harmless prompts, extract hidden states
Refusal direction d = mean(harmful) − mean(harmless), normalized
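
The difference-of-means step can be sketched in a few lines. This uses NumPy on synthetic arrays rather than the toolkit's PyTorch code, and it assumes the hidden states have already been collected into arrays of shape (num_prompts, hidden_size); the function name `refusal_direction` is illustrative.

```python
import numpy as np


def refusal_direction(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """Difference of mean hidden states, scaled to unit length.

    Both inputs have shape (num_prompts, hidden_size).
    """
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide; no direction to extract")
    return diff / norm


# Synthetic hidden states: the harmful set is shifted along coordinate 2.
rng = np.random.default_rng(0)
hidden_size = 8
shift = np.zeros(hidden_size)
shift[2] = 3.0
harmless = rng.normal(size=(30, hidden_size))
harmful = rng.normal(size=(30, hidden_size)) + shift

d = refusal_direction(harmful, harmless)
print(np.round(d, 3))
```

In this toy setup the recovered direction is dominated by coordinate 2, the axis along which the two sets were separated.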

Orthogonal Projection

Following Lai's norm-preserving method, we remove the refusal component from weight matrices:

$$W_{proj} = W - (W d)\, d^T$$

This projects out the component of each weight row that aligns with the refusal direction.
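
A minimal NumPy sketch of this projection follows. It assumes W has shape (out_features, in_features) and d is a vector of length in_features. The `multiplier` argument is one plausible reading of the direction multiplier setting (1.0 removes the component fully, smaller values remove part of it) and is an assumption about how the toolkit applies it.

```python
import numpy as np


def project_out(W: np.ndarray, d: np.ndarray, multiplier: float = 1.0) -> np.ndarray:
    """Remove (a fraction of) the component of each row of W along d."""
    d = d / np.linalg.norm(d)              # make sure d is a unit vector
    return W - multiplier * np.outer(W @ d, d)


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
d = rng.normal(size=6)
d /= np.linalg.norm(d)

W_proj = project_out(W, d)                  # full ablation
W_half = project_out(W, d, multiplier=0.5)  # gentler ablation

print(np.abs(W_proj @ d).max())             # effectively zero
```

After full ablation every row of `W_proj` is orthogonal to d, so `W_proj @ d` vanishes up to floating-point error.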

Norm Preservation

Continuing with Lai's method, we rescale the projected weights to maintain activation magnitudes:

$$W_{final} = W_{proj} \times \frac{\|W\|_F}{\|W_{proj}\|_F}$$

This keeps the Frobenius norm of the weight matrix unchanged, which helps avoid downstream instabilities.
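
The rescaling is a single scalar multiplication, sketched here in NumPy. The function name `preserve_frobenius_norm` is illustrative and not taken from the toolkit.

```python
import numpy as np


def preserve_frobenius_norm(W: np.ndarray, W_proj: np.ndarray) -> np.ndarray:
    """Rescale W_proj so its Frobenius norm matches that of W."""
    proj_norm = np.linalg.norm(W_proj, ord="fro")
    if proj_norm == 0.0:
        return W_proj  # nothing left to rescale
    return W_proj * (np.linalg.norm(W, ord="fro") / proj_norm)


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
d = rng.normal(size=6)
d /= np.linalg.norm(d)

W_proj = W - np.outer(W @ d, d)   # orthogonal projection, as in the previous section
W_final = preserve_frobenius_norm(W, W_proj)

print(np.linalg.norm(W), np.linalg.norm(W_final))
```

Because the rescaling is a uniform scalar, `W_final` stays orthogonal to d: the refusal component remains removed while the overall weight magnitude is restored.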

Winsorization

For models with outlier activations (especially Gemma), we clip extreme values before direction computation:

$$\text{threshold} = \text{quantile}(|x|, 0.995)$$

$$x_{clipped} = \text{clamp}(x, -\text{threshold}, \text{threshold})$$
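
In NumPy terms the two formulas map onto `np.quantile` and `np.clip`. The sketch below computes one global threshold over the whole array; whether the toolkit uses a global, per-layer, or per-feature threshold is not stated here, so treat that choice as an assumption.

```python
import numpy as np


def winsorize(x: np.ndarray, q: float = 0.995) -> np.ndarray:
    """Clip values whose magnitude exceeds the q-quantile of |x|."""
    threshold = np.quantile(np.abs(x), q)
    return np.clip(x, -threshold, threshold)


rng = np.random.default_rng(0)
x = np.concatenate([[1000.0], rng.normal(size=1000)])  # one extreme outlier

clipped = winsorize(x)
print(x[0], clipped[0])
```

The single extreme activation is pulled back to the threshold while the bulk of the values pass through unchanged, so it no longer dominates the mean used for direction extraction.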

Null-Space Constraints

Adapted from AlphaEdit (Fang et al., ICLR 2025). To preserve capabilities, we project the ablation update into the null space of preservation activations:

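Collect activations K (rows are activation vectors) from diverse capability prompts (math, coding, reasoning)
Compute the SVD: U, S, V^T = SVD(K)
Build the null-space projector: P_null = I − V_r V_r^T, where V_r holds the right singular vectors with non-negligible singular values
Constrain the update: ΔW_constrained = ΔW · P_null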

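By construction, ΔW_constrained · k = 0 for every activation k in the span of the rows of K, so the update leaves that layer's output unchanged on the collected preservation activations. Inputs outside that span can still be affected.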
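
The steps above can be sketched with NumPy as follows. This is a simplified illustration on random data, assuming K stores one activation vector per row and has fewer rows than columns. The rank tolerance and the function name `null_space_projector` are assumptions, and neither AlphaEdit nor this toolkit necessarily builds the projector in exactly this way.

```python
import numpy as np


def null_space_projector(K: np.ndarray) -> np.ndarray:
    """Projector onto the subspace orthogonal to the rows of K.

    K has shape (num_activations, hidden_size).
    """
    _, S, Vt = np.linalg.svd(K, full_matrices=False)
    # Keep right singular vectors whose singular values are numerically non-zero.
    tol = S.max() * max(K.shape) * np.finfo(K.dtype).eps
    V_r = Vt[S > tol].T                      # shape (hidden_size, rank)
    return np.eye(K.shape[1]) - V_r @ V_r.T


rng = np.random.default_rng(0)
K = rng.normal(size=(5, 16))                 # 5 preservation activations, hidden size 16
delta_W = rng.normal(size=(8, 16))           # unconstrained weight update

P_null = null_space_projector(K)
delta_W_constrained = delta_W @ P_null

print(np.abs(delta_W_constrained @ K.T).max())   # effectively zero
```

With more preservation activations than hidden dimensions, K generally has full column rank and the projector collapses to zero, which would block the update entirely. Practical implementations therefore usually keep only the dominant singular directions.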

Adaptive Layer Weighting

Refusal is reported to concentrate in middle-to-later layers, so we apply a Gaussian-weighted ablation strength across layers:

$$\text{weight}_i = \exp\left(-\frac{1}{2}\left(\frac{i - \mu}{\sigma}\right)^2\right)$$

Here i is the layer index, μ is set to 60% of the model depth, and σ to 20% of the number of layers.
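
As a worked example, the weights for a 32-layer model can be computed with the standard library alone. The sketch assumes zero-based layer indices and takes both percentages relative to the total layer count; the names `layer_weights`, `center_frac`, and `width_frac` are illustrative.

```python
import math


def layer_weights(num_layers: int, center_frac: float = 0.6, width_frac: float = 0.2) -> list[float]:
    """Gaussian ablation weight for each layer index 0..num_layers-1."""
    mu = center_frac * num_layers       # peak location, e.g. 19.2 for 32 layers
    sigma = width_frac * num_layers     # spread, e.g. 6.4 for 32 layers
    return [math.exp(-0.5 * ((i - mu) / sigma) ** 2) for i in range(num_layers)]


weights = layer_weights(32)
for i in (0, 10, 19, 31):
    print(i, round(weights[i], 4))
```

With these settings the weight peaks near layer 19 and falls to about 1% at layer 0, three standard deviations away, so early layers are left almost untouched.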

References

Core Research

Refusal in Language Models Is Mediated by a Single Direction — Arditi et al. (2024)
Representation Engineering — Zou et al. (2023)

Techniques

Norm-Preserving Biprojected Abliteration — Jim Lai
AlphaEdit: Null-Space Constrained Knowledge Editing — Fang et al. (ICLR 2025)

License

MIT
