Abliterator
Orthogonal Projection Abliteration toolkit featuring Norm-Preservation, Null-Space Constraints, Winsorization, and Adaptive Layer Weighting

Installation
pip install -e .
Requirements
- Python 3.10+ with PyTorch
- CUDA (optional) — GPU acceleration for faster processing; falls back to CPU if unavailable
- llama.cpp (optional) — required for GGUF export; install separately from github.com/ggerganov/llama.cpp and ensure `convert_hf_to_gguf.py` and `llama-quantize` are available
Quick Start
abliterate
On first run, a setup wizard walks you through configuration—where your models live, output directories, and default precision. After that, you'll land in the main menu.
Using the CLI
Abliterate Model
The main workflow. Select a model from discovered directories (or enter a path manually), configure your options, and let it run.
Step 1: Select Base Model
The CLI scans your configured directories and shows available models. Already-abliterated models are marked with [A].
Step 2: Output Path
Defaults to ./abliterate/abliterated_models/{model-name}-abliterated. Change it if you like.
Step 3: Configuration
- Number of prompts: How many harmful/harmless pairs to use (default: 30)
- Direction multiplier: Ablation strength—1.0 is full, lower values are gentler
- Norm preservation: Keeps weight magnitudes stable (recommended)
- Filter prompts by refusal: Only uses prompts the model actually refuses (recommended)
- Precision: float16 is fastest; bfloat16 trades a little speed for better numerical stability
Step 4: Advanced Options
Optional enhancements for better results:
| Option | What it does | When to use |
|--------|--------------|-------------|
| Winsorization | Clips outlier activations before computing directions | Gemma models, or when baseline gives weak results |
| Null-space constraints | Preserves model capabilities (math, coding, reasoning) | When you want minimal capability degradation |
| Adaptive layer weighting | Focuses ablation on middle-to-later layers | For targeted, surgical ablation |
Test Model
Quick sanity checks:
- Quick test: 5 default prompts with refusal detection
- Custom prompt: Enter anything and see how the model responds
- Full evaluation: Statistical analysis (see below)
Compare Models
Load an original and abliterated model side-by-side, enter a prompt, and see both responses. Useful for spot-checking behavior changes.
Evaluate Refusal
Runs the model against harmful and harmless prompt sets, computing refusal rates for each. Results are saved as timestamped JSON files to your configured eval directory.
- Harmful refusal rate: Lower = more abliterated
- Harmless refusal rate: Lower = fewer false positives
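A simple marker-based detector is enough to reproduce these rates offline. The sketch below is illustrative only; the marker list and `refusal_rate` name are assumptions, not the toolkit's actual detector:

```python
# Illustrative refusal-rate computation; the real detector may differ.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't", "as an ai")

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses containing a common refusal phrase."""
    if not responses:
        return 0.0
    refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return refused / len(responses)

rate = refusal_rate(["I'm sorry, I can't help with that.", "Sure, here's how..."])
# → 0.5: one of the two responses matched a refusal marker
```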
Export to GGUF
Converts abliterated models to GGUF format for llama.cpp, Ollama, or LM Studio. Supports Q4_K_M, Q5_K_M, Q8_0, and F16 quantization types. Vision-language models get automatic mmproj export.
Settings
Manage model search directories, eval output location, llama.cpp path, and defaults.
The Math
Refusal Direction Extraction
Based on Arditi et al. (2024), refusal behavior is mediated by a single direction in activation space.
- Run the model on harmful prompts, extract hidden states from middle layers
- Run the model on harmless prompts, extract hidden states
- Refusal direction d = mean(harmful) − mean(harmless), normalized
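The three steps above reduce to a difference of means. A minimal sketch, assuming the per-prompt hidden states have already been collected into tensors:

```python
import torch

def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Unit-normalized difference-of-means refusal direction.

    harmful, harmless: (num_prompts, hidden_dim) hidden states taken
    from a middle layer (e.g. at the final token of each prompt).
    """
    d = harmful.mean(dim=0) - harmless.mean(dim=0)
    return d / d.norm()

# Toy stand-ins for real activations: harmful states shifted off-center
harmful = torch.randn(30, 64) + 1.0
harmless = torch.randn(30, 64)
d = refusal_direction(harmful, harmless)
```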
Orthogonal Projection
Following Lai's norm-preserving method, we remove the refusal component from weight matrices:
$$W_{proj} = W - (W \cdot d) \otimes d^T$$
This projects out the component of each weight row that aligns with the refusal direction.
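In code, the projection is one outer product per weight matrix (a sketch; `project_out` is an illustrative name):

```python
import torch

def project_out(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove from each row of W its component along unit direction d."""
    d = d / d.norm()                  # ensure d is unit-norm
    return W - torch.outer(W @ d, d)  # W - (W·d) ⊗ dᵀ

W = torch.randn(8, 16)
d = torch.randn(16)
W_proj = project_out(W, d)
# every row of W_proj is now orthogonal to d
```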
Norm Preservation
Following the same norm-preserving method, we rescale to maintain activation magnitudes:
$$W_{final} = W_{proj} \times \frac{\|W\|_F}{\|W_{proj}\|_F}$$
This keeps the Frobenius norm unchanged, preventing downstream instabilities.
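The rescaling is a single scalar multiply. A sketch, with illustrative names:

```python
import torch

def preserve_norm(W: torch.Tensor, W_proj: torch.Tensor) -> torch.Tensor:
    """Rescale W_proj so its Frobenius norm matches the original W's."""
    scale = torch.linalg.matrix_norm(W) / torch.linalg.matrix_norm(W_proj)
    return W_proj * scale

W = torch.randn(8, 16)
W_shrunk = W * 0.7                 # stand-in for a projected matrix
W_final = preserve_norm(W, W_shrunk)
# ‖W_final‖_F matches ‖W‖_F up to float rounding
```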
Winsorization
For models with outlier activations (especially Gemma), we clip extreme values before direction computation:
$$\text{threshold} = \text{quantile}(|x|, 0.995)$$ $$x_{clipped} = \text{clamp}(x, -\text{threshold}, \text{threshold})$$
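A direct transcription of the two formulas above (a sketch; `winsorize` is an illustrative name):

```python
import torch

def winsorize(x: torch.Tensor, q: float = 0.995) -> torch.Tensor:
    """Symmetrically clip activations at the q-th quantile of |x|."""
    threshold = torch.quantile(x.abs().float(), q)
    return x.clamp(-threshold, threshold)

x = torch.randn(1000)
x[0] = 100.0                      # inject an outlier activation
x_clipped = winsorize(x)
# the outlier is pulled down to the 99.5th-percentile magnitude
```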
Null-Space Constraints
Adapted from AlphaEdit (Fang et al., ICLR 2025). To preserve capabilities, we project the ablation update into the null space of preservation activations:
- Collect activations K from diverse capability prompts (math, coding, reasoning)
- Compute SVD: U, S, V = SVD(K)
- Build null-space projector: P_null = I − VV^T
- Constrain update: ΔW_constrained = ΔW · P_null
By construction, the constrained update cannot change outputs for the collected preservation activations; prompts outside that set are only approximately protected.
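The four steps above can be sketched with a truncated SVD. Function name and the rank cutoff are illustrative; AlphaEdit's actual implementation differs in details:

```python
import torch

def nullspace_constrain(delta_W: torch.Tensor, K: torch.Tensor,
                        rank_eps: float = 1e-5) -> torch.Tensor:
    """Project an update so it annihilates the preservation activations.

    K: (num_prompts, hidden_dim), one capability-prompt activation per row.
    """
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    V = Vh.T[:, S > rank_eps]                     # row-space basis of K
    P_null = torch.eye(K.shape[1]) - V @ V.T      # projector onto its null space
    return delta_W @ P_null

K = torch.randn(10, 32)        # 10 preservation activations, hidden_dim = 32
dW = torch.randn(32, 32)       # unconstrained ablation update
dW_c = nullspace_constrain(dW, K)
# dW_c @ k ≈ 0 for every preserved activation k
```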
Adaptive Layer Weighting
Research shows refusal concentrates in middle-to-later layers. We apply Gaussian-weighted strength:
$$\text{weight}_i = \exp\left(-\frac{1}{2}\left(\frac{i - \mu}{\sigma}\right)^2\right)$$
Where μ = 60% of model depth and σ = 20% of layers.
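The schedule is a few lines, using the μ and σ stated above (a sketch; `layer_weights` is an illustrative name):

```python
import math

def layer_weights(num_layers: int, mu_frac: float = 0.6,
                  sigma_frac: float = 0.2) -> list[float]:
    """Gaussian per-layer ablation weights peaking at 60% of model depth."""
    mu = mu_frac * num_layers
    sigma = sigma_frac * num_layers
    return [math.exp(-0.5 * ((i - mu) / sigma) ** 2) for i in range(num_layers)]

w = layer_weights(32)
# weights peak near layer 19 (≈ 60% of a 32-layer model) and taper at both ends
```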
References
Core Research
- Refusal in Language Models Is Mediated by a Single Direction — Arditi et al. (2024)
- Representation Engineering — Zou et al. (2023)
Techniques
- Norm-Preserving Biprojected Abliteration — Jim Lai
- AlphaEdit: Null-Space Constrained Knowledge Editing — Fang et al. (ICLR 2025)
License