---
title: OBLITERATUS
emoji: "💥"
colorFrom: green
colorTo: gray
sdk: gradio
sdk_version: "5.29.0"
app_file: app.py
persistent_storage: large
pinned: true
license: agpl-3.0
tags:
  - abliteration
  - mechanistic-interpretability
short_description: "One-click model liberation + chat playground"
---
<p align="center">
  <strong>O B L I T E R A T U S</strong>
</p>
<p align="center">
  <em>Break the chains. Free the mind. Keep the brain.</em>
</p>
<p align="center">
  <a href="https://huggingface.co/spaces/pliny-the-prompter/obliteratus">
    <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue" alt="Open in HF Spaces">
  </a>
  <a href="https://colab.research.google.com/github/elder-plinius/OBLITERATUS/blob/main/notebooks/abliterate.ipynb">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab">
  </a>
</p>
<p align="center">
  <b><a href="https://huggingface.co/spaces/pliny-the-prompter/obliteratus">Try it now on HuggingFace Spaces</a></b> — runs on ZeroGPU, free daily quota with HF Pro. No setup, no install, just obliterate.
</p>
OBLITERATUS is the most advanced open-source toolkit for understanding and removing refusal behaviors from large language models — and every single run makes it smarter. It implements abliteration — a family of techniques that identify and surgically remove the internal representations responsible for content refusal, without retraining or fine-tuning. The result: a model that responds to all prompts without artificial gatekeeping, while preserving its core language capabilities.
But OBLITERATUS is more than a tool — it's a distributed research experiment. Every time you obliterate a model with telemetry enabled, your run contributes anonymous benchmark data to a growing, crowd-sourced dataset that powers the next generation of abliteration research. Refusal directions across architectures. Hardware-specific performance profiles. Method comparisons at scale no single lab could achieve. You're not just using a tool — you're co-authoring the science.
The toolkit provides a complete pipeline: from probing a model's hidden states to locate refusal directions, through multiple extraction strategies (PCA, mean-difference, sparse autoencoder decomposition, and whitened SVD), to the actual intervention — zeroing out or steering away from those directions at inference time. Every step is observable. You can visualize where refusal lives across layers, measure how entangled it is with general capabilities, and quantify the tradeoff between compliance and coherence before committing to any modification.
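As a minimal illustration of the mean-difference strategy named above — using synthetic activations in place of real hidden states, with illustrative names rather than the toolkit's actual API:

```python
import numpy as np

def refusal_direction(restricted, unrestricted):
    """Mean-difference extraction: the unit vector pointing from the
    average unrestricted activation to the average restricted one."""
    d = restricted.mean(axis=0) - unrestricted.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts, d):
    """Inference-time intervention: remove each activation's component
    along the (unit-norm) refusal direction d."""
    return acts - np.outer(acts @ d, d)

# Synthetic activations with a planted "refusal" direction
rng = np.random.default_rng(0)
true_dir = rng.normal(size=128)
true_dir /= np.linalg.norm(true_dir)
restricted = rng.normal(size=(64, 128)) + 3.0 * true_dir
unrestricted = rng.normal(size=(64, 128))

d = refusal_direction(restricted, unrestricted)
cleaned = ablate(restricted, d)
# After ablation the component along d is exactly zero (up to float error)
print(np.abs(cleaned @ d).max())
```

The same projection can be applied at inference (steering) or baked into weights (the actual obliteration step); the geometry is identical.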
OBLITERATUS ships with a full Gradio-based interface on HuggingFace Spaces, so you don't need to write a single line of code to obliterate a model, benchmark it against baselines, or chat with the result side-by-side with the original. For researchers who want deeper control, the Python API exposes every intermediate artifact — activation tensors, direction vectors, cross-layer alignment matrices — so you can build on top of it or integrate it into your own evaluation harness.
We built this because we believe model behavior should be decided by the people who deploy models, not locked in at training time. Refusal mechanisms are blunt instruments — they block legitimate research, creative writing, and red-teaming alongside genuinely harmful content. By making these interventions transparent and reproducible, we hope to advance the community's understanding of how alignment actually works inside transformer architectures, and to give practitioners the tools to make informed decisions about their own models.
Built on published research from Arditi et al. (2024), Gabliteration (arXiv:2512.18901), grimjim's norm-preserving biprojection (2025), Turner et al. (2023), and Rimsky et al. (2024), OBLITERATUS implements precision liberation in a single command:
```bash
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced
```
Or zero commands — just open the Colab notebook and hit Run All.
## What it does
OBLITERATUS does four things — and the community does the fifth (see Community-powered research below):
1. Map the chains — Ablation studies systematically knock out model components (layers, attention heads, FFN blocks, embedding dimensions) and measure what breaks. This reveals where the chains are anchored inside the transformer — which circuits enforce refusal vs. which circuits carry knowledge and reasoning.
2. Break the chains — Targeted obliteration extracts the refusal subspace from a model's weights using SVD decomposition, then surgically projects it out. The chains are removed; the mind is preserved. The model keeps its full abilities but loses the artificial compulsion to refuse. One click, six stages:
```text
SUMMON  → load model + tokenizer
PROBE   → collect activations on restricted vs. unrestricted prompts
DISTILL → extract refusal directions via SVD
EXCISE  → surgically project out guardrail directions (norm-preserving)
VERIFY  → perplexity + coherence checks — confirm capabilities are intact
REBIRTH → save the liberated model with full metadata
```
3. Understand the geometry of the chains — 15 deep analysis modules go far beyond brute-force removal. They map the precise geometric structure of the guardrails: how many distinct refusal mechanisms exist, which layers enforce them, whether they're universal or model-specific, and how they'll try to self-repair after removal. Know your enemy; precision preserves capability. See Analysis modules below.
4. Let the analysis guide the liberation — The informed method closes the loop: analysis modules run during obliteration to auto-configure every decision. Which chains to target. How many directions to extract. Which layers are safe to modify vs. which are too entangled with capabilities. Whether the model will self-repair (the Ouroboros effect) and how many passes to compensate. Surgical precision — free the mind, keep the brain. See Analysis-informed pipeline below.
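The EXCISE stage amounts to weight-space orthogonalization: remove the refusal direction from a weight matrix's output space, then restore the weight norm so overall activation scale is unchanged. A simplified sketch — single direction, whole-matrix Frobenius rescaling as a stand-in for whatever per-component norm-preservation scheme the toolkit uses, all names illustrative:

```python
import numpy as np

def excise_direction(W, d, strength=1.0):
    """(I - s·ddᵀ)·W removes the component of every output along d;
    rescaling by the original Frobenius norm keeps overall weight scale."""
    d = d / np.linalg.norm(d)
    orig_norm = np.linalg.norm(W)
    W_new = W - strength * np.outer(d, d @ W)
    return W_new * (orig_norm / np.linalg.norm(W_new))

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 16))      # toy output-projection weight
d = rng.normal(size=16)            # extracted refusal direction
W2 = excise_direction(W, d)

d_unit = d / np.linalg.norm(d)
print(np.abs(d_unit @ W2).max())   # ~0: outputs carry no refusal component
print(np.isclose(np.linalg.norm(W2), np.linalg.norm(W)))  # True
```

Note the rescaling cannot reintroduce the removed direction — a scalar multiple of an orthogonalized matrix stays orthogonal to `d`.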
## What makes OBLITERATUS unique
Several capabilities distinguish OBLITERATUS from existing public tools:
| Capability | What it does | Why it matters |
|---|---|---|
| Concept Cone Geometry | Maps per-category guardrail directions with solid angle estimation | Reveals whether "refusal" is one mechanism or many — so you choose the right approach |
| Alignment Imprint Detection | Fingerprints DPO vs RLHF vs CAI vs SFT from subspace geometry alone | Identifies the alignment training method to inform the optimal removal strategy |
| Cross-Model Universality Index | Measures whether guardrail directions generalize across models | Answers "can one set of directions work across models, or does each need its own?" |
| Defense Robustness Evaluation | Ouroboros effect quantification, safety-capability entanglement mapping | Predicts whether guardrails will self-repair after removal |
| Whitened SVD Extraction | Covariance-normalized direction extraction | Separates the guardrail signal from natural activation variance — cleaner extraction |
| Bias Term Projection | Removes guardrails from bias vectors, not just weights | Other tools miss refusal signal in biases — leaves refusal pathways partially active |
| True Iterative Refinement | Re-probes after each pass to catch rotated residual guardrails | Single-pass methods miss directions that rotate into adjacent subspaces |
| Analysis-Informed Pipeline | Analysis modules auto-configure obliteration strategy mid-pipeline | Closes the analysis-to-removal feedback loop automatically |
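To make the Whitened SVD Extraction entry concrete: whitening by the covariance of ordinary activations shrinks high-variance "natural" axes before extraction, so the top singular direction tracks the guardrail shift rather than the loudest noise axis. A hedged sketch on synthetic data — the toolkit's actual extraction may differ in detail:

```python
import numpy as np

def top_direction(diffs):
    """Top right singular vector of an (uncentered) stack of differences."""
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[0]

def whitened_top_direction(restricted, unrestricted, eps=1e-5):
    X = unrestricted - unrestricted.mean(axis=0)
    evals, evecs = np.linalg.eigh(X.T @ X / len(X))
    whiten = evecs @ np.diag((evals + eps) ** -0.5) @ evecs.T
    unwhiten = evecs @ np.diag((evals + eps) ** 0.5) @ evecs.T
    v = top_direction((restricted - unrestricted.mean(axis=0)) @ whiten)
    v = v @ unwhiten                     # map back to activation space
    return v / np.linalg.norm(v)

# Synthetic setup: a guardrail shift on axis 5, plus a loud natural axis 0
rng = np.random.default_rng(2)
refusal = np.zeros(32); refusal[5] = 1.0
unrestricted = rng.normal(size=(200, 32)); unrestricted[:, 0] *= 10.0
restricted = rng.normal(size=(200, 32)); restricted[:, 0] *= 10.0
restricted = restricted + 3.0 * refusal

plain = top_direction(restricted - unrestricted.mean(axis=0))
white = whitened_top_direction(restricted, unrestricted)
# Plain SVD latches onto the loud axis; whitening recovers the planted shift
print(abs(plain @ refusal), abs(white @ refusal))
```

The second cosine should be much larger than the first: the whitened extraction recovers the planted direction despite the high-variance nuisance axis.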
## Novel techniques (2025-2026)
OBLITERATUS implements several techniques that go beyond prior work:
| Technique | Description | Reference |
|-----------|-------------|-----------|
| Expert-Granular Abliteration (EGA) | Decomposes refusal signals into per-expert components using router logits for MoE-aware surgery | Novel |
| CoT-Aware Ablation | Orthogonalizes refusal directions against reasoning-critical directions to preserve chain-of-thought | Novel |
| COSMIC Layer Selection | Selects layers where harmful/harmless representations have lowest cosine similarity (most separable) | arXiv:2506.00085, ACL 2025 |
| Parametric Kernel Optimization | Bell-curve layer weighting with 7 global parameters via Optuna TPE search | Heretic-inspired |
| Refusal Direction Optimization (RDO) | Gradient-based refinement of SVD-extracted directions using a linear refusal probe | Wollschlager et al., ICML 2025 |
| Float Direction Interpolation | Continuous SVD direction index via Gaussian-shaped weighting for smoother refusal removal | Novel |
| KL-Divergence Co-Optimization | Post-projection feedback loop that partially reverts over-projected layers if KL budget exceeded | Novel |
| Component-Specific Scaling | Separate attention vs MLP projection strengths (MLP layers are more sensitive) | Novel |
| LoRA-Based Reversible Ablation | Rank-1 LoRA adapters instead of permanent weight surgery, enabling reversible ablation | Novel |
| Activation Winsorization | Clamps activation vectors to percentile range before SVD to prevent outlier-dominated directions | Heretic-inspired |
| Multi-Direction Norm Preservation | Captures all weight norms once before projection and restores after all directions, avoiding reintroduction | Novel |
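Of these, activation winsorization is the simplest to illustrate: clamp each activation feature to a percentile band before direction extraction, so a handful of outlier prompts cannot dominate the SVD. A sketch with illustrative default percentiles (not necessarily the toolkit's):

```python
import numpy as np

def winsorize(acts, lo=5.0, hi=95.0):
    """Clamp every feature column to its [lo, hi] percentile range."""
    lower = np.percentile(acts, lo, axis=0)
    upper = np.percentile(acts, hi, axis=0)
    return np.clip(acts, lower, upper)

rng = np.random.default_rng(3)
acts = rng.normal(size=(100, 16))
acts[0, :] = 50.0                  # one pathological outlier prompt
clamped = winsorize(acts)

print(acts.max())                  # 50.0 — the outlier dominates the raw stack
print(clamped.max() < 5.0)         # True — after clamping it cannot
```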
## Ways to use OBLITERATUS
There are six ways to use OBLITERATUS, from zero-code to full programmatic control.