UnicodeFix
Normalizes Unicode to ASCII equivalents and removes Unicode artifacts from AI-generated text from ChatGPT, Anthropic, Google and more.
UnicodeFix - "Wolf Edition v1.2.2" - it solves "problems."
Last updated: 2026-03-28

- UnicodeFix - "Wolf Edition v1.2.2" - it solves "problems."
- Finally - a tool that blasts AI fingerprints, torches those infuriating smart quotes, and leaves your code & docs squeaky clean for real humans.
- Two modes (cleaner + auditor)
- Why Is This Happening?
- Installation
- Usage
- Python API Usage
- Brief Examples
- What's New / What's Cool
- Shortcut for macOS
- What's in This Repository
- Testing and CI/CD
- Contributing
- Support This and Other Projects
- Changelog
- License
Finally - a tool that blasts AI fingerprints, torches those infuriating smart quotes, and leaves your code & docs squeaky clean for real humans.
Ever open up a file and instantly know it came from ChatGPT, Copilot, or one of their AI cousins? (Yeah, so can everyone else now.) UnicodeFix vaporizes all the weird dashes, curly quotes, invisible space ninjas, and digital "tells" that out you as an AI user - or just make your stuff fail linters and code reviews.
Whether you're a student, a dev, or an open-source rebel: this is your "eraser for AI breadcrumbs."
Yes, it helps students cheat on their homework. It also makes blog posts and AI-proofed emails look like you sweated over every character. Nearly a thousand people have grabbed it. Nobody's bought me a coffee yet, but hey… there's a first time for everything.
Two modes (cleaner + auditor)
- Clean mode (default): scrub Unicode artifacts from files or stdin → stdout.
- Audit mode (--report): scan text for anomalies + (optional) semantic metrics. Works for CI gates, pre-commit hooks, and yes - professors looking for shenanigans.
A combination of Jules and Vincent... plus Winston Wolf. It solves problems.
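To make the audit idea concrete, here is a minimal sketch of "count anomalies, then gate on a threshold." It is not UnicodeFix's scanner: treating every non-ASCII character as an anomaly and grouping by Unicode general category are assumptions made for illustration, and `audit` and `exit_code` are hypothetical names. The gate rule follows the `--threshold` help text (exit 1 if total anomalies >= N).

```python
import unicodedata
from collections import Counter


def audit(text: str) -> Counter:
    """Count non-ASCII characters, grouped by Unicode general category.

    Assumption for this sketch: anything outside ASCII is an "anomaly".
    The real tool uses its own categories.
    """
    counts = Counter()
    for ch in text:
        if ord(ch) > 127:
            counts[unicodedata.category(ch)] += 1
    return counts


def exit_code(counts: Counter, threshold: int) -> int:
    """Return 1 when the total anomaly count reaches the threshold, else 0."""
    return 1 if sum(counts.values()) >= threshold else 0


# Curly quotes, an em dash, an ellipsis, and a zero-width space.
sample = "\u201cHello\u201d \u2014 world\u2026\u200b"
counts = audit(sample)
print(dict(counts), "exit code:", exit_code(counts, threshold=5))
```

In a CI job the returned code would be passed to `sys.exit()`, which is roughly the shape of a threshold gate.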
Why Is This Happening?
Some folks think all this Unicode cruft is a side-effect of generative AI's training data. Others believe it's a deliberate move - baked-in "watermarks" to ID machine-generated text. Either way: these artifacts leave a trail. UnicodeFix wipes it.
Be careful: professors and reviewers may even start planting Unicode honeypots in starter code or essays - UnicodeFix torches those too. In this "AI Arms Race," diff and vimdiff are your night-vision goggles.
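A plain diff often renders invisible characters as nothing at all, so the changed line looks identical on both sides. The sketch below is one way to make them visible before diffing, using only the standard library. It is an illustration, not part of UnicodeFix; `show_hidden` and `diff_hidden` are hypothetical helper names.

```python
import difflib


def show_hidden(line: str) -> str:
    """Replace every non-ASCII character with a visible <U+XXXX> tag."""
    return "".join(ch if ord(ch) < 128 else f"<U+{ord(ch):04X}>" for ch in line)


def diff_hidden(before: str, after: str) -> list:
    """Unified diff of two texts with non-ASCII characters made visible."""
    a = [show_hidden(line) for line in before.split("\n")]
    b = [show_hidden(line) for line in after.split("\n")]
    return list(difflib.unified_diff(a, b, "before", "after", lineterm=""))


# A zero-width space planted at the end of the first line.
before = 'x = "ok"\u200b\nprint(x)'
after = 'x = "ok"\nprint(x)'
for line in diff_hidden(before, after):
    print(line)
```

Running it on a file before and after cleaning shows exactly which code points were removed or folded.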
Installation
Clone the repository and run the setup script:
git clone https://github.com/unixwzrd/UnicodeFix.git
cd UnicodeFix
# Installs from pyproject.toml.
# Reuses an active non-base Conda env if you already have one.
# Otherwise it creates or reuses a local .venv.
./setup.sh
The setup.sh script:
- Uses pyproject.toml as the single source of truth for dependencies
- Reuses your active non-base Conda environment when one is already active
- Otherwise creates or reuses a local .venv
- Installs the package directly instead of requiring a second manual pip install step
Optional install modes:
./setup.sh --dev # editable install + dev tooling
./setup.sh --nlp # optional NLP/metrics dependencies
./setup.sh now also installs a local Git pre-push hook by default when run inside the repo. That hook runs the same local gate as CI:
scripts/run_checks.sh
Use ./setup.sh --no-hooks if you need to skip hook installation.
See setup.sh for the nitty-gritty. If the executable bit is stripped by your tooling, bash setup.sh --nlp works too.
For serious environment nerds: VenvUtil is my full-featured Python env toolkit.
Usage
Once installed and activated:
(ConnectomeAI) [unixwzrd@xanax: unicodefix]$ cleanup-text --help
usage: cleanup-text [-h] [-i] [-Q] [-D] [--keep-fullwidth-brackets] [-n] [-o OUTPUT] [-t] [-p] [--report] [--csv | --json] [--label LABEL] [--threshold THRESHOLD] [--metrics] [--metrics-help] [--exit-zero] [--no-color] [-q] [infile ...]

Clean Unicode quirks from text. STDIN→STDOUT if no files; otherwise writes .clean files or -o.

positional arguments:
  infile                Input file(s)

options:
  -h, --help            show this help message and exit
  -i, --invisible       Preserve invisible Unicode (ZW*, bidi controls)
  -Q, --keep-smart-quotes
                        Preserve Unicode smart quotes
  -D, --keep-dashes     Preserve Unicode EN/EM dashes
  --keep-fullwidth-brackets
                        Preserve fullwidth square brackets (【】)
  -n, --no-newline      Do not add a final newline
  -o OUTPUT, --output OUTPUT
                        Output filename or '-' for STDOUT (only valid with one input)
  -t, --temp            In-place clean via .tmp swap, then write back
  -p, --preserve-tmp    With -t, keep the .tmp file after success
  --report              Audit counts per category (no changes)
  --csv                 With --report, emit CSV (one row per file)
  --json                With --report, emit JSON
  --label LABEL         When reading from STDIN ('-'), use this display name in report/CSV
  --threshold THRESHOLD
                        With --report, exit 1 if total anomalies >= N
  --metrics             Include semantic metrics and imply report mode
  --metrics-help        Explain metrics and arrows (↑/↓).
  --exit-zero           Always exit with code 0 (useful for pre-commit reporting)
  --no-color            Disable ANSI colors (plain output)
  -q, --quiet           Suppress status lines on stderr
New options
- -Q, --keep-smart-quotes: Preserve Unicode smart quotes (curly single/double quotes). Useful when preparing prose/blog posts where typographic quotes are intentional. Default behavior converts them to ASCII for shell/CI safety.
- -D, --keep-dashes: Preserve Unicode dash and hyphen variants. Useful when stylistic punctuation is desired in prose. Default behavior folds non-breaking hyphens and EN-style dashes to -, and EM-style bars to -.
- --keep-fullwidth-brackets: Preserve fullwidth square brackets (【】). By default, they are folded to ASCII [] to keep monospace alignment in terminals and fixed-width tables.
- -R, --report: Audit text for anomalies, human-readable.
- -J, --json: Audit text for anomalies, JSON format.
- -T, --threshold: Fail CI if anomalies exceed threshold.
- --metrics: Attach experimental semantic metrics (entropy, AI-score, etc.) and implicitly switch to report mode unless you explicitly request cleaned output with -o or -t, in which case the clean output is written and the report is shown on stderr.
- --metrics-help: Print friendly descriptions of each metric and the ↑/↓ hints.
- --exit-zero: Force a zero exit code for report mode (handy for informative hooks/CI jobs).
- -H, --help: Show help message and exit.
- -V, --version: Show version and exit.
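For a feel of what "folding" with opt-out flags looks like, here is a small sketch built on `str.translate`. The mapping tables are a tiny illustrative subset chosen for this example, not UnicodeFix's actual tables, and `fold` with its `keep_*` parameters is a hypothetical function that only mirrors the idea behind -Q, -D, and --keep-fullwidth-brackets.

```python
# Illustrative subsets only; the real tool covers far more code points.
FOLD_QUOTES = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
}
FOLD_DASHES = {
    "\u2011": "-",  # non-breaking hyphen
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
}
FOLD_BRACKETS = {
    "\u3010": "[", "\u3011": "]",   # the bracket pair shown in the help text
}


def fold(text, keep_quotes=False, keep_dashes=False, keep_brackets=False):
    """Fold selected Unicode punctuation to ASCII unless a keep flag is set."""
    table = {}
    if not keep_quotes:
        table.update(FOLD_QUOTES)
    if not keep_dashes:
        table.update(FOLD_DASHES)
    if not keep_brackets:
        table.update(FOLD_BRACKETS)
    return text.translate(str.maketrans(table))


print(fold("\u201cHi\u201d \u2013 \u3010x\u3011"))
print(fold("\u201cHi\u201d \u2013 \u3010x\u3011", keep_quotes=True))
```

Building the translation table per call keeps each "keep" switch independent, so preserving quotes never affects how dashes or brackets are handled.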
When to preserve invisible characters (-i)
In most code/CI workflows, invisible/bidi controls are accidental and should be removed (default). Rare cases to preserve (-i):
- Linguistic text where ZWJ/ZWNJ influence shaping
- Intentional watermarks/markers in text
- Forensic/debug inspections before deciding what to strip
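The linguistic case is the one most likely to bite: stripping ZWNJ from Persian text, for example, changes how the word is shaped. The sketch below strips Unicode format characters (general category Cf, which includes zero-width and bidi controls) while optionally keeping the two joiners. It is an illustration only. `strip_invisible` and `keep_joiners` are hypothetical names, and this narrower "keep joiners only" switch is not a UnicodeFix flag; the tool's -i preserves invisible characters as a group.

```python
import unicodedata

JOINERS = {"\u200c", "\u200d"}  # ZWNJ and ZWJ


def strip_invisible(text: str, keep_joiners: bool = False) -> str:
    """Drop format characters (category Cf), optionally keeping ZWJ/ZWNJ."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            if keep_joiners and ch in JOINERS:
                out.append(ch)
            continue  # every other format character is dropped
        out.append(ch)
    return "".join(out)


# Zero-width space and a right-to-left override hidden in ASCII text.
print(repr(strip_invisible("a\u200bb\u202ec")))
# Persian letters separated by ZWNJ, preserved on request.
print(repr(strip_invisible("\u0646\u200c\u0647", keep_joiners=True)))
```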
Python API Usage
UnicodeFix provides a clean Python API for programmatic text cleaning and analysis. Import and use the functions directly in your Python code:
from unicodefix.transforms import clean_text, handle_newlines
from unicodefix.scanner import scan_text_for_report
from unicodefix.report import print_human, print_json
from unicodefix.metrics import compute_metrics # Experimental
# Clean text with default settings (aggressive normalization)
cleaned = clean_text("“Hello” — world…")
# Clean with preservation options
cleaned = clean_text(
text="‘Smart quotes’ and — dashes",
preserve_quotes=True, # Keep smart quotes
preserve_dashes=True, # Keep em/en dashes
preserve_invisible=False # Remove invisible chars (default
