
<div align="center">

SqueakyCleanText


A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.

</div>

Using an AI coding assistant? This repo includes an llms.txt with the full API surface, config reference, and Q&A - optimised for Claude, Cursor, Copilot, and ChatGPT.

In the world of machine learning and natural language processing, clean and well-structured text data is crucial for building effective downstream models and managing token limits in language models.

SqueakyCleanText simplifies the process by automatically addressing common text issues: it removes PII, anonymizes named entities (persons, organisations, locations), and leaves your data clean and well-structured for language models and classical ML pipelines, with minimal effort on your part.

Key Features

  • Named Entity Recognition (NER):
    • Multi-backend: ONNX (default, torch-free), PyTorch, GLiNER, and ensemble modes
    • Zero-shot custom entities via GLiNER (e.g., PRODUCT, EVENT, SKILL)
    • Multi-language support (English, Dutch, German, Spanish, French, Portuguese, Italian)
    • Ensemble voting across backends for improved accuracy
    • Configurable confidence thresholds
    • Lazy model loading (models load on demand per language)
    • Shared ONNX sessions across same-model languages (~600 MB RAM saved)
    • Automatic text chunking for long documents (CJK/Arabic safe)
    • GPU acceleration support (CUDA for ONNX and PyTorch)
    • Model warm-up API to pre-load on startup
  • Text Normalization:
    • Corrects text encoding problems and handles bad Unicode characters
    • Removes or replaces HTML tags and URLs with configurable tokens
    • Handles emails, phone numbers, and other contact details
    • Multilingual date detection and replacement (ISO 8601, month names, common formats)
    • Fuzzy date matching for misspelled months (requires [fuzzy] extra)
    • Year and number standardization
    • Configurable emoji removal
    • Configurable bracket/brace content removal
    • Removes isolated letters and symbols
    • Normalizes whitespace and handles currency symbols
    • Smart case folding (preserves NER tokens like <PERSON>)
  • Language Support:
    • Automatic language detection (English, Dutch, German, Spanish)
    • Language-specific NER models; French, Portuguese, Italian via multilingual model
    • Language-aware stopword removal
    • Extensible: add custom languages with stopwords, month names, and NER models
  • Dual Output Formats:
    • Language Model format (preserves structure with tokens)
    • Statistical Model format (optimized for classical ML)
  • Performance:
    • ONNX Runtime inference (torch-free base install, ~3-5x faster than PyTorch)
    • Thread-parallel batch processing via ThreadPoolExecutor
    • Async batch processing (aprocess_batch) for FastAPI / aiohttp
    • Lazy model loading (only loads models as needed)
    • Shared ONNX sessions for same-model languages (saves ~600 MB for FR/PT/IT)
    • Memory-efficient processing of large texts
    • GPU acceleration (CUDA) for both ONNX and PyTorch backends
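One of the normalization steps above, smart case folding, has to lowercase text without mangling placeholder tokens like <PERSON>. A minimal sketch of that idea (an illustration only, not the library's implementation):

```python
import re

# Sketch of "smart case folding": lowercase ordinary text while preserving
# NER placeholder tokens such as <PERSON> or <EMAIL>. Illustrative only.
TOKEN_RE = re.compile(r"<[A-Z_]+>")

def smart_casefold(text: str) -> str:
    """Lowercase everything except <UPPERCASE> placeholder tokens."""
    parts = []
    last = 0
    for match in TOKEN_RE.finditer(text):
        parts.append(text[last:match.start()].lower())  # fold ordinary text
        parts.append(match.group())                     # keep token as-is
        last = match.end()
    parts.append(text[last:].lower())
    return "".join(parts)

print(smart_casefold("Contact <PERSON> At <EMAIL> Today"))
# contact <PERSON> at <EMAIL> today
```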

Default Flow of Cleaning Text

Benefits

For Language Models

  • Maintains text structure while anonymizing sensitive information
  • Configurable token replacements
  • Preserves context while removing noise
  • Handles long documents through intelligent chunking

For Statistical Models

  • Removes stopwords and punctuation
  • Case normalization
  • Special symbol removal
  • Optimized for classification tasks

Advanced NER Processing

  • Ensemble approach reduces missed entities
  • Language-specific models improve accuracy
  • Confidence thresholds for precision control
  • Efficient batch processing for large datasets
  • Automatic handling of long documents
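The automatic handling of long documents mentioned above comes down to splitting text into chunks that fit a model's input budget. A simplified, whitespace-based sketch of the idea (the library's chunker also handles CJK/Arabic text without whitespace; this only illustrates the greedy-packing approach):

```python
# Minimal sketch of chunking a long document for NER models with a fixed
# input budget: split on whitespace and pack words greedily. Not the
# library's actual chunker.
def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)   # close the current chunk
            current = word           # start a new one with this word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

pieces = chunk_text("lorem ipsum " * 200, max_chars=100)
assert all(len(p) <= 100 for p in pieces)
```

Entity spans found in each chunk can then be offset back into the original document's coordinates.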

Installation

pip install SqueakyCleanText

The base install uses ONNX Runtime for NER inference - no PyTorch or Transformers required.

Optional Extras

| Extra | Command | What it adds |
|-------|---------|--------------|
| GPU | pip install SqueakyCleanText[gpu] | CUDA-accelerated ONNX inference |
| Fuzzy dates | pip install SqueakyCleanText[fuzzy] | Fuzzy month name matching (rapidfuzz) |
| PyTorch NER | pip install SqueakyCleanText[torch] | PyTorch/Transformers NER backend |
| GLiNER | pip install SqueakyCleanText[gliner] | GLiNER zero-shot NER |
| GLiNER2 | pip install SqueakyCleanText[gliner2] | GLiNER2 (knowledgator) backend |
| Synthetic | pip install SqueakyCleanText[synthetic] | Faker-based synthetic replacement (realistic fake values instead of <TAG> tokens) |
| Presidio | pip install SqueakyCleanText[presidio] | Presidio-analyzer for presidio_gliner backend |
| Classify | pip install SqueakyCleanText[classify] | GLiClass document-level pre-classification |
| All NER | pip install SqueakyCleanText[all-ner] | All NER backends combined |
| Development | pip install SqueakyCleanText[dev] | Testing and linting tools |

You can combine extras: pip install SqueakyCleanText[gpu,fuzzy,gliner]

Usage

Basic Usage

from sct import TextCleaner

# Initialize the TextCleaner
cleaner = TextCleaner()

# Input text
text = "Contact John Doe at john.doe@company.com. Meeting on 2023-10-01."

# Process the text
lm_text, stat_text, lang = cleaner.process(text)

print(f"Language Model format:    {lm_text}")
# Output: "Contact <PERSON> at <EMAIL>. Meeting on <YEAR>."

print(f"Statistical Model format: {stat_text}")
# Output: "contact meeting"

print(f"Detected Language: {lang}")
# Output: "ENGLISH"
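The features list mentions thread-parallel batch processing via ThreadPoolExecutor. A generic sketch of that pattern, using a stand-in cleaning function in place of the library call (with sct installed, you would substitute cleaner.process(text) for clean_one):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a per-text call such as cleaner.process(text);
# here it just normalizes whitespace and case.
def clean_one(text: str) -> str:
    return " ".join(text.split()).lower()

def process_batch(texts: list[str], max_workers: int = 4) -> list[str]:
    # pool.map preserves input order, so results line up with `texts`
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clean_one, texts))

print(process_batch(["  Hello   World ", "FOO bar"]))
# ['hello world', 'foo bar']
```

Threads work well here because model inference releases the GIL during native code; for async web servers, the library's aprocess_batch serves the same role.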

Using TextCleanerConfig

from sct import TextCleaner, TextCleanerConfig

# Create an immutable configuration
cfg = TextCleanerConfig(
    check_ner_process=True,
    ner_confidence_threshold=0.85,
    positional_tags=('PER', 'LOC', 'ORG', 'MISC'),
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_numbers="<PHONE>",
    language="en",  # Pin to English (also accepts 'ENGLISH', 'eng')
)

# Initialize with config
cleaner = TextCleaner(cfg=cfg)

Language Specification

All language parameters accept Lingua names ('ENGLISH'), ISO 639-1 ('en'), or ISO 639-3 ('eng') codes:

# Pin to one language (skip auto-detection)
cfg = TextCleanerConfig(language='de', check_ner_process=False)

# Restrict detection to specific languages (auto-detect among them)
cfg = TextCleanerConfig(language=('en', 'nl', 'de'), check_ner_process=False)

# Add extra languages for detection
cfg = TextCleanerConfig(extra_languages=('fr', 'pt'), check_ner_process=False)
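Internally, accepting all three spellings amounts to normalizing aliases to one canonical name. A hypothetical sketch of such a lookup (the table and function names are illustrative, not sct's internal API):

```python
# Hypothetical alias table mapping ISO 639-1, ISO 639-3, and Lingua-style
# names to one canonical form. Illustrative only; not the library's code.
_ALIASES = {
    "en": "ENGLISH", "eng": "ENGLISH", "english": "ENGLISH",
    "de": "GERMAN",  "deu": "GERMAN",  "german": "GERMAN",
    "nl": "DUTCH",   "nld": "DUTCH",   "dutch": "DUTCH",
    "es": "SPANISH", "spa": "SPANISH", "spanish": "SPANISH",
}

def normalize_language(code: str) -> str:
    # Case-insensitive lookup so 'EN', 'en', and 'ENGLISH' all resolve
    try:
        return _ALIASES[code.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown language spec: {code!r}") from None

assert normalize_language("en") == normalize_language("ENGLISH") == "ENGLISH"
```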

GLiNER: Zero-Shot Custom NER

Use GLiNER to recognize any entity type without retraining:

from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='gliner',
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location', 'product', 'event'),
    gliner_label_map={
        'person': 'PER', 'organization': 'ORG', 'location': 'LOC',
        # 'product' and 'event' are unmapped - they become <PRODUCT>, <EVENT> tokens
    },
    gliner_threshold=0.4,
)

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process(
    "John bought an iPhone at the Apple Store in Berlin during CES 2025."
)
# lm_text: "<PERSON> bought an <PRODUCT> at the <ORGANISATION> in <LOCATION> during <EVENT>."

Ensemble NER

Combine ONNX/Torch models with GLiNER for improved recall via ensemble voting:

from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='ensemble_onnx',  # or 'ensemble_torch'
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location'),
    gliner_label_map={'person': 'PER', 'organization': 'ORG', 'location': 'LOC'},
)

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process("Angela Merkel visited the Bundestag in Berlin.")

PII Detection Mode

Automatically configure GLiNER for comprehensive PII detection with 60+ entity types (personal, financial, healthcare, identity, digital):

from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(ner_mode='pii')

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process(
    "John Smith's SSN is 123-45-6789, email john@example.com, DOB 1990-01-15"
)
# Entities are anonymized: names, SSNs, emails, dates of birth, and 50+ more PII types

PII mode auto-configures: ner_backend='gliner', uses knowledgator/gliner-pii-base-v1.0, sets threshold to 0.3 (recall-focused), and expands positional tags. User-provided values always take priority.
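That precedence rule (mode supplies defaults, user values win) can be expressed as a plain dict merge. A sketch with illustrative keys, not the library's internals:

```python
# Sketch of the "mode defaults, user overrides" precedence described above.
# Keys and values mirror the documented PII-mode defaults but the merge
# function itself is illustrative.
PII_MODE_DEFAULTS = {
    "ner_backend": "gliner",
    "gliner_model": "knowledgator/gliner-pii-base-v1.0",
    "gliner_threshold": 0.3,
}

def resolve_config(user_settings: dict) -> dict:
    # In a {**a, **b} merge the later dict wins on key conflicts,
    # so user-provided values take priority over mode defaults.
    return {**PII_MODE_DEFAULTS, **user_settings}

cfg = resolve_config({"gliner_threshold": 0.5})
assert cfg["gliner_threshold"] == 0.5   # user override wins
assert cfg["ner_backend"] == "gliner"   # mode default retained
```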

Alternative PII models (pass as gliner_model):

| Model | Type | Size | Labels | F1 |
|-------|------|------|--------|-----|
| knowledgator/gliner-pii-base-v1.0 | Uni-encoder | 330MB (ONNX FP16) | 60+ | 80.99% |
| nvidia/gliner-PII | Bi-encoder | 570MB | 55+ | — |
| [gretelai/gretel-gliner-bi-base-v1.0](https://huggingfac
