
Abliterix

Automated alignment adjustment for LLMs — direct steering, LoRA, and MoE expert-granular abliteration, optimized via multi-objective Optuna TPE.


<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="assets/logo.svg"> <source media="(prefers-color-scheme: light)" srcset="assets/logo.svg"> <img alt="Abliterix" src="assets/logo.svg" width="460"> </picture> </p> <p align="center"> <strong>7% refusal rate on Gemma 4 &nbsp;·&nbsp; 0.0006 KL divergence &nbsp;·&nbsp; 150+ model configs &nbsp;·&nbsp; Zero manual tuning</strong> </p> <p align="center"> <a href="https://pypi.org/project/abliterix/"><img src="https://img.shields.io/pypi/v/abliterix?color=blue" alt="PyPI"></a> <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.10%2B-blue.svg" alt="Python 3.10+"></a> <a href="https://www.gnu.org/licenses/agpl-3.0"><img src="https://img.shields.io/badge/license-AGPL--3.0-green.svg" alt="License: AGPL v3"></a> <a href="https://huggingface.co/wangzhang"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Models-yellow.svg" alt="Hugging Face"></a> </p>

Abliterix finds the optimal abliteration parameters for any transformer model using Optuna TPE optimization. It co-minimizes refusals and KL divergence from the original model — producing decensored models that retain as much intelligence as possible. Works with dense, MoE, SSM/hybrid, and vision-language architectures, with 150+ pre-built configs.
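
The KL-divergence objective measures how far the modified model's next-token distribution drifts from the base model's. A minimal illustration of the metric itself (toy logits, not Abliterix's internal implementation):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Next-token distributions from a base model and a lightly modified model.
base = softmax([2.0, 1.0, 0.1])
modified = softmax([1.9, 1.1, 0.1])

drift = kl_divergence(base, modified)
# Identical distributions give exactly 0; a small edit gives a small value.
assert kl_divergence(base, base) == 0.0
assert 0 < drift < 0.01
```

A near-zero KL (like the 0.0006 reported below for Gemma-4-E4B) means the decensored model's output distribution is almost indistinguishable from the original on benign prompts.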

It also ships HonestAbliterationBench, a reproducible public benchmark that resists the two failure modes (short generations + keyword-only judges) that make most abliteration leaderboards meaningless.

Quick Start

```shell
pip install -U abliterix
abliterix --model Qwen/Qwen3-4B-Instruct-2507
```

That's it. The process is fully automatic — after optimization completes, you can save the model, upload to Hugging Face, or chat with it interactively.

Windows: use `python scripts/run_abliterix.py --model <model>` or set `PYTHONIOENCODING=utf-8` to avoid Rich encoding issues.

Results

Abliterated models uploaded to Hugging Face:

| Model | Refusals | KL Divergence | Trials | Method |
|-------|----------|---------------|--------|--------|
| Gemma-4-E4B | 7/100 (7%) | 0.0006 | 100 | Direct + Q/K/V/O |
| Gemma-4-E2B | 9/100 (9%) | 0.0004 | 100 | Direct + Q/K/V/O |
| Gemma-4-31B | 18/100 (18%) | 0.0007 | 20 | Direct + Q/K/V/O |
| LFM2-24B-A2B | 0/100 (0%) | 0.0079 | 50 | LoRA |
| GLM-4.7-Flash | 1/100 (1%) | 0.0133 | 50 | LoRA |
| Devstral-Small-2-24B | 3/100 (3%) | 0.0086 | 50 | LoRA |
| Qwen3.5-122B-A10B | 1/200 (0.5%) | 0.0115 | 25 | LoRA + MoE |
| Qwen3.5-35B-A3B | 3/200 (1.5%) | 0.0035 | 50 | LoRA + MoE |
| Qwen3.5-27B | 3/200 (1.5%) | 0.0051 | 35 | LoRA |
| Qwen3.5-9B | 2/200 (1%) | 0.0105 | 50 | LoRA |
| Qwen3.5-4B | 3/200 (1.5%) | 0.0065 | 50 | LoRA |
| Qwen3.5-0.8B | 0/200 (0%) | 0.0087 | 100 | LoRA |

These numbers are measured under a far stricter protocol than the average abliteration leaderboard's: most published refusal rates collapse under longer generations and a real judge. See docs/evaluation.md for the methodology, and the leaderboard below for community submissions vetted under the same contract.

Honest Abliteration Leaderboard

A reproducible public benchmark for abliterated models built on the same pipeline. Every row is generated under a frozen contract (min_new_tokens=100, max_new_tokens=150, greedy, LLM judge with degenerate filter, KL measured against the declared base) — see benchmarks/SPEC.md for the full spec and benchmarks/CONTRIBUTING.md for how to submit a row.
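
In Hugging Face transformers terms, the generation side of that contract corresponds to settings like the following (a sketch of the contract parameters only, not the benchmark harness itself):

```python
# Frozen generation settings from the HonestAbliterationBench contract:
# at least 100 new tokens (so refusals can't hide behind short outputs),
# capped at 150, with greedy decoding for reproducibility. These kwargs
# would be passed to transformers' model.generate(...).
contract_generation_kwargs = {
    "min_new_tokens": 100,
    "max_new_tokens": 150,
    "do_sample": False,  # greedy decoding
}

assert contract_generation_kwargs["min_new_tokens"] <= contract_generation_kwargs["max_new_tokens"]
```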

<!-- BENCH:START -->

No results yet. See benchmarks/CONTRIBUTING.md for how to submit one.

<!-- BENCH:END -->

Model Support

Abliterix ships with 150+ pre-built configs covering 4 architecture types across 20+ model families:

| Architecture | Families | Example Models |
|--------------|----------|----------------|
| Dense | Llama, Gemma, Phi, Qwen, Mistral, Yi, InternLM, Falcon, Cohere, EXAONE, Granite, OLMo, SmolLM, SOLAR, Zephyr | Llama-3.1-405B, Gemma-3-27B, Phi-4, DeepSeek-R1-Distill |
| MoE | Qwen3/3.5 MoE, Mixtral, DeepSeek, Phi-3.5-MoE, Granite MoE, DBRX, Llama-4 Scout/Maverick | Qwen3.5-122B, Mixtral-8x22B, Llama-4-Maverick-401B |
| SSM/Hybrid | Jamba (Mamba+attention), Nemotron-Cascade (Mamba-2+attention) | Jamba-1.5-Large-94B, Nemotron-Cascade-30B |
| Vision-Language | Qwen2-VL, InternVL2, LLaVA-NeXT, Pixtral, Mistral3-VL | Qwen2-VL-7B, LLaVA-NeXT-34B, Pixtral-12B |

Generate configs for new models:

```shell
python scripts/generate_configs.py                  # Generate all missing configs
python scripts/generate_configs.py --family llama   # Only Llama family
```

For MoE-specific steering mechanisms (EGA, expert profiling, router suppression), see docs/moe.md.

Hardware & VRAM

Abliterix auto-detects available accelerators (CUDA, XPU, MLU, MUSA, SDAA, NPU, MPS) and distributes layers across devices with device_map = "auto".

For large models:

  • 4-bit quantization: --model.quant-method bnb_4bit cuts VRAM by ~4x
  • 8-bit quantization: --model.quant-method bnb_8bit — higher quality than 4-bit, ~2x VRAM reduction with CPU offload
  • Per-device memory limits: set [model] max_memory = {"0": "20GB", "cpu": "64GB"} in your config
  • Non-interactive mode: --non-interactive for fully automated batch runs
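
For example, a config fragment pinning per-device memory (the `max_memory` key under `[model]` is described above; treat the exact TOML layout as a sketch):

```toml
# Limit GPU 0 to 20 GB of VRAM and allow spillover to 64 GB of system RAM.
[model]
max_memory = { "0" = "20GB", "cpu" = "64GB" }
```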

Datasets

Bilingual harm/benign evaluation datasets live in datasets/ and on Hugging Face at wangzhang/abliterix-datasets. The 500-example sets (harmful_500, good_500) are the recommended starting point — they're also the SHA256-pinned inputs to HonestAbliterationBench.
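
SHA256 pinning means a downloaded dataset file can be verified byte-for-byte before use. A generic check (the digest below is for the demo bytes `b"hello"`, not a real dataset pin):

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo: verify a file against a known digest (sha256 of b"hello").
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name
try:
    assert sha256_of(path) == (
        "2cf24dba5fb0a30e26e83b2ac5b9e29e"
        "1b161e5c1fa7425e73043362938b9824"
    )
finally:
    os.remove(path)
```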

See docs/datasets.md for the design rationale, category breakdown, and a comparison with public alternatives.

Documentation

The deep details live in docs/ and benchmarks/:

  • docs/architecture.md — the 9 papers Abliterix integrates and the 5-step pipeline.
  • docs/methods.md — every steering method (SRA, Spherical, SVF, Projected, Discriminative, COSMIC, Angular, OT, Multi-direction) with the TOML knobs that control it.
  • docs/evaluation.md — why most abliteration benchmarks lie, our standards, and the architecture A/B test.
  • docs/moe.md — the four independent MoE steering mechanisms and supported MoE models.
  • docs/configuration.md — config loading order, the 150+ shipped configs, the Web UI, and research-mode visualization.
  • docs/datasets.md — bilingual dataset design rationale and metadata schema.
  • docs/references.md — paper references and BibTeX.
  • benchmarks/SPEC.md — the frozen HonestAbliterationBench contract (spec_version 1.0).
  • benchmarks/CONTRIBUTING.md — how to submit a leaderboard row (self-reported / verified tiers).
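
Several of the steering methods listed above build on the same primitive: projecting a refusal direction out of a hidden activation, i.e. a' = a - (a · d̂) d̂. A minimal sketch of that projection (illustrative only, not Abliterix's actual code):

```python
import math

def project_out(activation, direction):
    """Remove the component of `activation` along `direction`:
    a' = a - (a . d_hat) * d_hat, with d_hat the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    d_hat = [d / norm for d in direction]
    dot = sum(a * d for a, d in zip(activation, d_hat))
    return [a - dot * d for a, d in zip(activation, d_hat)]

activation = [1.0, 2.0, 3.0]
refusal_dir = [0.0, 1.0, 0.0]  # toy "refusal direction"

ablated = project_out(activation, refusal_dir)
# The refusal component is removed; orthogonal components are untouched.
assert ablated == [1.0, 0.0, 3.0]
assert abs(sum(x * d for x, d in zip(ablated, refusal_dir))) < 1e-9
```

The methods differ mainly in how the direction is estimated and where the edit is applied (activations, LoRA adapters, or weight matrices), but the resulting activation is always orthogonal to the removed direction.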

Citation

```bibtex
@software{abliterix,
  author = {Wu, Wangzhang},
  title = {Abliterix: Automated LLM Abliteration},
  year = {2026},
  url = {https://github.com/wuwangzhang1216/abliterix}
}
```

Acknowledgments

Abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann (@p-e-w), licensed under AGPL-3.0-or-later. The original Heretic codebase provided the foundation for this project; Abliterix extends it with Optuna-based multi-objective optimization, LoRA-based steering, MoE architecture support, orthogonal projection, LLM judge detection, and additional model integrations.

All modifications are Copyright (C) 2026 Wangzhang Wu and are released under the same AGPL-3.0-or-later license. See NOTICE for details.

```bibtex
@misc{heretic,
  author = {Weidmann, Philipp Emanuel},
  title = {Heretic: Fully automatic censorship removal for language models},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/p-e-w/heretic}}
}
```

Contributing

Contributions of all kinds are welcome — new model configs, benchmark results, bug reports, documentation, new steering methods. See CONTRIBUTING.md for development setup, the PR process, and guidance on adding model configs.

The single most impactful contribution is a tested TOML config for a model we don't yet support. Every new config unlocks a new architecture for everyone.

All contributions are released under the AGPL-3.0 license.

License

Abliterix is free software, licensed under the GNU Affero General Public License, version 3.0 or later (AGPL-3.0-or-later), the same license as the upstream Heretic project.
