
UnicodeFix

Normalizes Unicode to ASCII equivalents and removes Unicode artifacts from AI-generated text produced by ChatGPT, Anthropic, Google, and more.


UnicodeFix - *Wolf Edition v1.2.2* - it solves "problems."

Last updated: 2026-03-28



Finally - a tool that blasts AI fingerprints, torches those infuriating smart quotes, and leaves your code & docs squeaky clean for real humans.

Ever open up a file and instantly know it came from ChatGPT, Copilot, or one of their AI cousins? (Yeah, so can everyone else now.) UnicodeFix vaporizes all the weird dashes, curly quotes, invisible space ninjas, and digital "tells" that out you as an AI user - or just make your stuff fail linters and code reviews.

Whether you're a student, a dev, or an open-source rebel: this is your "eraser for AI breadcrumbs."

Yes, it helps students cheat on their homework. It also makes blog posts and AI-proofed emails look like you sweated over every character. Nearly a thousand people have grabbed it. Nobody's bought me a coffee yet, but hey… there's a first time for everything.


Two modes (cleaner + auditor)

  • Clean mode (default): scrub Unicode artifacts from files or stdin → stdout.
  • Audit mode (--report): scan text for anomalies + (optional) semantic metrics. Works for CI gates, pre-commit hooks, and yes - professors looking for shenanigans.
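A CI gate built on audit mode reduces to comparing an anomaly count against a limit. The sketch below is not UnicodeFix code: `gate_exit_code` is a hypothetical helper that mirrors the semantics stated in the tool's own help text (`--threshold`: exit 1 if total anomalies >= N; `--exit-zero`: always exit 0). What the real tool returns when no threshold is given is an assumption here.

```python
from typing import Optional


def gate_exit_code(total_anomalies: int,
                   threshold: Optional[int] = None,
                   exit_zero: bool = False) -> int:
    """Hypothetical helper: map an anomaly count to a process exit code."""
    if exit_zero:
        # Reporting-only mode: never fail the hook or CI job.
        return 0
    if threshold is not None and total_anomalies >= threshold:
        # The count reached the limit, so the gate fails.
        return 1
    # Assumption: without a threshold, the audit is informational only.
    return 0


if __name__ == "__main__":
    print(gate_exit_code(5, threshold=5))                   # at the limit
    print(gate_exit_code(4, threshold=5))                   # under the limit
    print(gate_exit_code(50, threshold=5, exit_zero=True))  # forced success
```

Note that the comparison is inclusive, so a threshold of 1 fails on the first anomaly found.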

A combination of Jules and Vincent... plus Winston Wolf. It solves problems.


Why Is This Happening?

Some folks think all this Unicode cruft is a side-effect of generative AI's training data. Others believe it's a deliberate move - baked-in "watermarks" to ID machine-generated text. Either way: these artifacts leave a trail. UnicodeFix wipes it.

Be careful: professors and reviewers may even start planting Unicode honeypots in starter code or essays. UnicodeFix torches those too. In this "AI Arms Race," diff and vimdiff are your night-vision goggles.


Installation

Clone the repository and run the setup script:

git clone https://github.com/unixwzrd/UnicodeFix.git
cd UnicodeFix

# Installs from pyproject.toml.
# Reuses an active non-base Conda env if you already have one.
# Otherwise it creates or reuses a local .venv.
./setup.sh

The setup.sh script:

  • Uses pyproject.toml as the single source of truth for dependencies
  • Reuses your active non-base Conda environment when one is already active
  • Otherwise creates or reuses a local .venv
  • Installs the package directly instead of requiring a second manual pip install step
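The environment-selection rule in the list above can be read as a small decision function. This is only a sketch of that rule in Python, not a port of setup.sh: `choose_environment` is a hypothetical name, and detecting the active Conda environment through the `CONDA_DEFAULT_ENV` variable is an assumption about how such a check could be done.

```python
from pathlib import Path
from typing import Mapping, Tuple


def choose_environment(environ: Mapping[str, str], repo_dir: Path) -> Tuple[str, str]:
    """Hypothetical sketch: decide which Python environment to install into."""
    # Assumption: conda exports CONDA_DEFAULT_ENV while an environment is active.
    conda_env = environ.get("CONDA_DEFAULT_ENV", "")
    if conda_env and conda_env != "base":
        return ("conda", conda_env)          # reuse the active non-base env
    venv = repo_dir / ".venv"
    if venv.is_dir():
        return ("reuse-venv", str(venv))     # a local .venv already exists
    return ("create-venv", str(venv))        # nothing to reuse, create one


if __name__ == "__main__":
    import os
    print(choose_environment(os.environ, Path.cwd()))
```

Passing the environment mapping in as an argument keeps the function easy to test without touching the real shell state.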

Optional install modes:

./setup.sh --dev   # editable install + dev tooling
./setup.sh --nlp   # optional NLP/metrics dependencies

./setup.sh now also installs a local Git pre-push hook by default when run inside the repo. That hook runs the same local gate as CI:

scripts/run_checks.sh

Use ./setup.sh --no-hooks if you need to skip hook installation.

See setup.sh for the nitty-gritty. If the executable bit is stripped by your tooling, bash setup.sh --nlp works too.

For serious environment nerds: VenvUtil is my full-featured Python env toolkit.


Usage

Once installed and activated:

(ConnectomeAI) [unixwzrd@xanax: unicodefix]$ cleanup-text --help
usage: cleanup-text [-h] [-i] [-Q] [-D] [--keep-fullwidth-brackets] [-n] [-o OUTPUT] [-t] [-p] [--report] [--csv | --json] [--label LABEL] [--threshold THRESHOLD] [--metrics] [--metrics-help] [--exit-zero] [--no-color] [-q] [infile ...]

Clean Unicode quirks from text. STDIN→STDOUT if no files; otherwise writes .clean files or -o.

positional arguments:
  infile                Input file(s)

options:
  -h, --help            show this help message and exit
  -i, --invisible       Preserve invisible Unicode (ZW*, bidi controls)
  -Q, --keep-smart-quotes
                        Preserve Unicode smart quotes
  -D, --keep-dashes     Preserve Unicode EN/EM dashes
  --keep-fullwidth-brackets
                        Preserve fullwidth square brackets (【】)
  -n, --no-newline      Do not add a final newline
  -o OUTPUT, --output OUTPUT
                        Output filename or '-' for STDOUT (only valid with one input)
  -t, --temp            In-place clean via .tmp swap, then write back
  -p, --preserve-tmp    With -t, keep the .tmp file after success
  --report              Audit counts per category (no changes)
  --csv                 With --report, emit CSV (one row per file)
  --json                With --report, emit JSON
  --label LABEL         When reading from STDIN ('-'), use this display name in report/CSV
  --threshold THRESHOLD
                        With --report, exit 1 if total anomalies >= N
  --metrics             Include semantic metrics and imply report mode
  --metrics-help        Explain metrics and arrows (↑/↓).
  --exit-zero           Always exit with code 0 (useful for pre-commit reporting)
  --no-color            Disable ANSI colors (plain output)
  -q, --quiet           Suppress status lines on stderr
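The help text describes `-t` as an in-place clean via a .tmp swap, with `-p` keeping the .tmp file afterwards. A minimal sketch of that pattern follows. It is an illustration only: `clean_in_place` and its `cleaner` argument are hypothetical, and the real tool may order or name these steps differently.

```python
import shutil
from pathlib import Path
from typing import Callable


def clean_in_place(path: Path,
                   cleaner: Callable[[str], str],
                   preserve_tmp: bool = False) -> Path:
    """Hypothetical sketch: write cleaned text to <name>.tmp, then swap it in."""
    tmp = path.with_name(path.name + ".tmp")
    # The original file is untouched until the cleaned copy is fully written.
    tmp.write_text(cleaner(path.read_text(encoding="utf-8")), encoding="utf-8")
    if preserve_tmp:
        # Keep the .tmp file around and copy its contents over the original.
        shutil.copyfile(tmp, path)
    else:
        # Rename over the original; the .tmp file disappears in the process.
        tmp.replace(path)
    return tmp
```

Writing to a sibling file first means a crash halfway through cleaning leaves the original intact.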

New options

  • -Q, --keep-smart-quotes: Preserve Unicode smart quotes (curly single/double quotes). Useful when preparing prose/blog posts where typographic quotes are intentional. Default behavior converts them to ASCII for shell/CI safety.
  • -D, --keep-dashes: Preserve Unicode dash and hyphen variants. Useful when stylistic punctuation is desired in prose. Default behavior folds non-breaking hyphens and EN-style dashes to -, and EM-style bars to -.
  • --keep-fullwidth-brackets: Preserve fullwidth square brackets (【】). By default, they are folded to ASCII [] to keep monospace alignment in terminals and fixed-width tables.
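  • --report: Audit text for anomalies and print per-category counts in human-readable form; no changes are written.
  • --json: With --report, emit the audit as JSON (--csv emits one CSV row per file instead).
  • --threshold N: With --report, exit 1 if total anomalies >= N, so a CI job fails.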
  • --metrics: Attach experimental semantic metrics (entropy, AI-score, etc.) and implicitly switch to report mode unless you explicitly request cleaned output with -o or -t, in which case the clean output is written and the report is shown on stderr.
  • --metrics-help: Print friendly descriptions of each metric and the ↑/↓ hints.
  • --exit-zero: Force a zero exit code for report mode (handy for informative hooks/CI jobs).
  • -h, --help: Show help message and exit.
  • -V, --version: Show version and exit.
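To make the default folding and the keep-flags concrete, here is a small stand-alone sketch using `str.translate`. It is not the UnicodeFix implementation: the character tables are a tiny illustrative subset, `fold_to_ascii` and its parameters are hypothetical names, and the dash targets simply follow the descriptions in the list above.

```python
# Illustrative subsets only; a real cleaner covers many more code points.
QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
DASHES = {"\u2011": "-", "\u2013": "-", "\u2014": "-"}  # NB hyphen, en, em
BRACKETS = {"\u3010": "[", "\u3011": "]"}               # fullwidth brackets
# Mapping to None deletes the character (zero-width and bidi controls).
INVISIBLE = dict.fromkeys(["\u200b", "\u200c", "\u200d", "\u202a",
                           "\u202b", "\u202c", "\u202d", "\u202e"])


def fold_to_ascii(text, keep_quotes=False, keep_dashes=False,
                  keep_brackets=False, keep_invisible=False):
    """Hypothetical sketch: fold selected Unicode characters to ASCII."""
    table = {}
    if not keep_quotes:
        table.update(QUOTES)
    if not keep_dashes:
        table.update(DASHES)
    if not keep_brackets:
        table.update(BRACKETS)
    if not keep_invisible:
        table.update(INVISIBLE)
    return text.translate(str.maketrans(table))


if __name__ == "__main__":
    sample = "\u201cHi\u201d \u2013 ok\u200b"
    print(fold_to_ascii(sample))                    # "Hi" - ok
    print(fold_to_ascii(sample, keep_quotes=True))  # curly quotes survive
```

Each keep-flag just leaves one table out of the translation map, which is why the options compose freely.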

When to preserve invisible characters (-i)

In most code/CI workflows, invisible/bidi controls are accidental and should be removed (default). Rare cases to preserve (-i):

  • Linguistic text where ZWJ/ZWNJ influence shaping
  • Intentional watermarks/markers in text
  • Forensic/debug inspections before deciding what to strip
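For the forensic case, it helps to see exactly which invisible characters a text contains before deciding what to strip. The sketch below uses only the standard library and does not depend on UnicodeFix. `list_invisibles` is a hypothetical helper, and treating Unicode category Cf (format controls, which includes ZWJ/ZWNJ and the bidi controls) as "invisible" is a simplifying assumption.

```python
import unicodedata


def list_invisibles(text):
    """Return (index, code point, name) for each format-control character."""
    hits = []
    for index, ch in enumerate(text):
        # Category "Cf" covers zero-width joiners and bidi control characters.
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits


if __name__ == "__main__":
    print(list_invisibles("a\u200db"))
    # [(1, 'U+200D', 'ZERO WIDTH JOINER')]
```

An empty result means the text has no format controls, so -i would make no difference for that input.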

Python API Usage

UnicodeFix provides a clean Python API for programmatic text cleaning and analysis. Import and use the functions directly in your Python code:

from unicodefix.transforms import clean_text, handle_newlines
from unicodefix.scanner import scan_text_for_report
from unicodefix.report import print_human, print_json
from unicodefix.metrics import compute_metrics  # Experimental

# Clean text with default settings (aggressive normalization).
# Escapes spell out the input: curly double quotes, an em dash, an ellipsis.
cleaned = clean_text("\u201cHello\u201d \u2014 world\u2026")

# Clean with preservation options
cleaned = clean_text(
    text="\u2018Smart quotes\u2019 and \u2014 dashes",
    preserve_quotes=True,      # Keep smart quotes
    preserve_dashes=True,      # Keep em/en dashes
    preserve_invisible=False,  # Remove invisible chars (default)
)