
<div align="center">

SqueakyCleanText


A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.

</div>

Using an AI coding assistant? This repo includes an llms.txt with the full API surface, config reference, and Q&A - optimised for Claude, Cursor, Copilot, and ChatGPT.

In the world of machine learning and natural language processing, clean and well-structured text data is crucial for building effective downstream models and managing token limits in language models.

SqueakyCleanText simplifies the process by automatically addressing common text issues: it removes PII, anonymizes named entities (persons, organisations, locations), and leaves your data clean and well-structured for language models and classical ML pipelines, with minimal effort on your part.

Key Features

  • Named Entity Recognition (NER):
    • Multi-backend: ONNX (default, torch-free), PyTorch, GLiNER, and ensemble modes
    • Zero-shot custom entities via GLiNER (e.g., PRODUCT, EVENT, SKILL)
    • Multi-language support (English, Dutch, German, Spanish, French, Portuguese, Italian)
    • Ensemble voting across backends for improved accuracy
    • Configurable confidence thresholds
    • Lazy model loading (models load on demand per language)
    • Shared ONNX sessions across same-model languages (~600 MB RAM saved)
    • Automatic text chunking for long documents (CJK/Arabic safe)
    • GPU acceleration support (CUDA for ONNX and PyTorch)
    • Model warm-up API to pre-load on startup
  • Text Normalization:
    • Corrects text encoding problems and handles bad Unicode characters
    • Removes or replaces HTML tags and URLs with configurable tokens
    • Handles emails, phone numbers, and other contact details
    • Multilingual date detection and replacement (ISO 8601, month names, common formats)
    • Fuzzy date matching for misspelled months (requires [fuzzy] extra)
    • Year and number standardization
    • Configurable emoji removal
    • Configurable bracket/brace content removal
    • Removes isolated letters and symbols
    • Normalizes whitespace and handles currency symbols
    • Smart case folding (preserves NER tokens like <PERSON>)
  • Language Support:
    • Automatic language detection (English, Dutch, German, Spanish)
    • Language-specific NER models; French, Portuguese, Italian via multilingual model
    • Language-aware stopword removal
    • Extensible: add custom languages with stopwords, month names, and NER models
  • Dual Output Formats:
    • Language Model format (preserves structure with tokens)
    • Statistical Model format (optimized for classical ML)
  • Performance:
    • ONNX Runtime inference (torch-free base install, ~3-5x faster than PyTorch)
    • Thread-parallel batch processing via ThreadPoolExecutor
    • Async batch processing (aprocess_batch) for FastAPI / aiohttp
    • Lazy model loading (only loads models as needed)
    • Shared ONNX sessions for same-model languages (saves ~600 MB for FR/PT/IT)
    • Memory-efficient processing of large texts
    • GPU acceleration (CUDA) for both ONNX and PyTorch backends
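One of the normalization steps above, smart case folding, has to lowercase text without mangling placeholder tokens like <PERSON>. A minimal sketch of that idea (an illustration only, not the library's implementation):

```python
import re

# Sketch of "smart case folding": lowercase ordinary text while preserving
# NER placeholder tokens such as <PERSON> or <EMAIL>. Illustrative only.
TOKEN_RE = re.compile(r"<[A-Z_]+>")

def smart_casefold(text: str) -> str:
    """Lowercase everything except <UPPERCASE> placeholder tokens."""
    parts = []
    last = 0
    for match in TOKEN_RE.finditer(text):
        parts.append(text[last:match.start()].lower())  # fold ordinary text
        parts.append(match.group())                     # keep token as-is
        last = match.end()
    parts.append(text[last:].lower())
    return "".join(parts)

print(smart_casefold("Contact <PERSON> At <EMAIL> Today"))
# contact <PERSON> at <EMAIL> today
```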

Default Flow of Cleaning Text

Benefits

For Language Models

  • Maintains text structure while anonymizing sensitive information
  • Configurable token replacements
  • Preserves context while removing noise
  • Handles long documents through intelligent chunking

For Statistical Models

  • Removes stopwords and punctuation
  • Case normalization
  • Special symbol removal
  • Optimized for classification tasks

Advanced NER Processing

  • Ensemble approach reduces missed entities
  • Language-specific models improve accuracy
  • Confidence thresholds for precision control
  • Efficient batch processing for large datasets
  • Automatic handling of long documents
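The automatic handling of long documents mentioned above comes down to splitting text into chunks that fit a model's input budget. A simplified, whitespace-based sketch of the idea (the library's chunker also handles CJK/Arabic text without whitespace; this only illustrates the greedy-packing approach):

```python
# Minimal sketch of chunking a long document for NER models with a fixed
# input budget: split on whitespace and pack words greedily. Not the
# library's actual chunker.
def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)   # close the current chunk
            current = word           # start a new one with this word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

pieces = chunk_text("lorem ipsum " * 200, max_chars=100)
assert all(len(p) <= 100 for p in pieces)
```

Entity spans found in each chunk can then be offset back into the original document's coordinates.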

Installation

pip install SqueakyCleanText

The base install uses ONNX Runtime for NER inference - no PyTorch or Transformers required.

Optional Extras

| Extra | Command | What it adds |
|-------|---------|--------------|
| GPU | pip install SqueakyCleanText[gpu] | CUDA-accelerated ONNX inference |
| Fuzzy dates | pip install SqueakyCleanText[fuzzy] | Fuzzy month name matching (rapidfuzz) |
| PyTorch NER | pip install SqueakyCleanText[torch] | PyTorch/Transformers NER backend |
| GLiNER | pip install SqueakyCleanText[gliner] | GLiNER zero-shot NER |
| GLiNER2 | pip install SqueakyCleanText[gliner2] | GLiNER2 (knowledgator) backend |
| Synthetic | pip install SqueakyCleanText[synthetic] | Faker-based synthetic replacement (realistic fake values instead of <TAG> tokens) |
| Presidio | pip install SqueakyCleanText[presidio] | Presidio-analyzer for presidio_gliner backend |
| Classify | pip install SqueakyCleanText[classify] | GLiClass document-level pre-classification |
| All NER | pip install SqueakyCleanText[all-ner] | All NER backends combined |
| Development | pip install SqueakyCleanText[dev] | Testing and linting tools |

You can combine extras: pip install SqueakyCleanText[gpu,fuzzy,gliner]

Usage

Basic Usage

from sct import TextCleaner

# Initialize the TextCleaner
cleaner = TextCleaner()

# Input text
text = "Contact John Doe at john.doe@company.com. Meeting on 2023-10-01."

# Process the text
lm_text, stat_text, lang = cleaner.process(text)

print(f"Language Model format:    {lm_text}")
# Output: "Contact <PERSON> at <EMAIL>. Meeting on <YEAR>."

print(f"Statistical Model format: {stat_text}")
# Output: "contact meeting"

print(f"Detected Language: {lang}")
# Output: "ENGLISH"
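The features list mentions thread-parallel batch processing via ThreadPoolExecutor. A generic sketch of that pattern, using a stand-in cleaning function in place of the library call (with sct installed, you would substitute cleaner.process(text) for clean_one):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a per-text call such as cleaner.process(text);
# here it just normalizes whitespace and case.
def clean_one(text: str) -> str:
    return " ".join(text.split()).lower()

def process_batch(texts: list[str], max_workers: int = 4) -> list[str]:
    # pool.map preserves input order, so results line up with `texts`
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clean_one, texts))

print(process_batch(["  Hello   World ", "FOO bar"]))
# ['hello world', 'foo bar']
```

Threads work well here because model inference releases the GIL during native code; for async web servers, the library's aprocess_batch serves the same role.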

Using TextCleanerConfig

from sct import TextCleaner, TextCleanerConfig

# Create an immutable configuration
cfg = TextCleanerConfig(
    check_ner_process=True,
    ner_confidence_threshold=0.85,
    positional_tags=('PER', 'LOC', 'ORG', 'MISC'),
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_numbers="<PHONE>",
    language="en",  # Pin to English (also accepts 'ENGLISH', 'eng')
)

# Initialize with config
cleaner = TextCleaner(cfg=cfg)

Language Specification

All language parameters accept Lingua names ('ENGLISH'), ISO 639-1 ('en'), or ISO 639-3 ('eng') codes:

# Pin to one language (skip auto-detection)
cfg = TextCleanerConfig(language='de', check_ner_process=False)

# Restrict detection to specific languages (auto-detect among them)
cfg = TextCleanerConfig(language=('en', 'nl', 'de'), check_ner_process=False)

# Add extra languages for detection
cfg = TextCleanerConfig(extra_languages=('fr', 'pt'), check_ner_process=False)
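Internally, accepting all three spellings amounts to normalizing aliases to one canonical name. A hypothetical sketch of such a lookup (the table and function names are illustrative, not sct's internal API):

```python
# Hypothetical alias table mapping ISO 639-1, ISO 639-3, and Lingua-style
# names to one canonical form. Illustrative only; not the library's code.
_ALIASES = {
    "en": "ENGLISH", "eng": "ENGLISH", "english": "ENGLISH",
    "de": "GERMAN",  "deu": "GERMAN",  "german": "GERMAN",
    "nl": "DUTCH",   "nld": "DUTCH",   "dutch": "DUTCH",
    "es": "SPANISH", "spa": "SPANISH", "spanish": "SPANISH",
}

def normalize_language(code: str) -> str:
    # Case-insensitive lookup so 'EN', 'en', and 'ENGLISH' all resolve
    try:
        return _ALIASES[code.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown language spec: {code!r}") from None

assert normalize_language("en") == normalize_language("ENGLISH") == "ENGLISH"
```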

GLiNER: Zero-Shot Custom NER

Use GLiNER to recognize any entity type without retraining:

from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='gliner',
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location', 'product', 'event'),
    gliner_label_map={
        'person': 'PER', 'organization': 'ORG', 'location': 'LOC',
        # 'product' and 'event' are unmapped - they become <PRODUCT>, <EVENT> tokens
    },
    gliner_threshold=0.4,
)

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process(
    "John bought an iPhone at the Apple Store in Berlin during CES 2025."
)
# lm_text: "<PERSON> bought an <PRODUCT> at the <ORGANISATION> in <LOCATION> during <EVENT>."

Ensemble NER

Combine ONNX/Torch models with GLiNER for improved recall via ensemble voting:

from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='ensemble_onnx',  # or 'ensemble_torch'
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location'),
    gliner_label_map={'person': 'PER', 'organization': 'ORG', 'location': 'LOC'},
)

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process("Angela Merkel visited the Bundestag in Berlin.")

PII Detection Mode

Automatically configure GLiNER for comprehensive PII detection with 60+ entity types (personal, financial, healthcare, identity, digital):

from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(ner_mode='pii')

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process(
    "John Smith's SSN is 123-45-6789, email john@example.com, DOB 1990-01-15"
)
# Entities are anonymized: names, SSNs, emails, dates of birth, and 50+ more PII types

PII mode auto-configures: ner_backend='gliner', uses knowledgator/gliner-pii-base-v1.0, sets threshold to 0.3 (recall-focused), and expands positional tags. User-provided values always take priority.
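That precedence rule (mode supplies defaults, user values win) can be expressed as a plain dict merge. A sketch with illustrative keys, not the library's internals:

```python
# Sketch of the "mode defaults, user overrides" precedence described above.
# Keys and values mirror the documented PII-mode defaults but the merge
# function itself is illustrative.
PII_MODE_DEFAULTS = {
    "ner_backend": "gliner",
    "gliner_model": "knowledgator/gliner-pii-base-v1.0",
    "gliner_threshold": 0.3,
}

def resolve_config(user_settings: dict) -> dict:
    # In a {**a, **b} merge the later dict wins on key conflicts,
    # so user-provided values take priority over mode defaults.
    return {**PII_MODE_DEFAULTS, **user_settings}

cfg = resolve_config({"gliner_threshold": 0.5})
assert cfg["gliner_threshold"] == 0.5   # user override wins
assert cfg["ner_backend"] == "gliner"   # mode default retained
```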

Alternative PII models (pass as gliner_model):

| Model | Type | Size | Labels | F1 |
|-------|------|------|--------|-----|
| knowledgator/gliner-pii-base-v1.0 | Uni-encoder | 330MB (ONNX FP16) | 60+ | 80.99% |
| nvidia/gliner-PII | Bi-encoder | 570MB | 55+ | — |
| [gretelai/gretel-gliner-bi-base-v1.0](https://huggingfac
