SqueakyCleanText
Text preprocessing and PII anonymisation for NLP/ML. ONNX NER ensemble, language detection, stopword removal. Built for statistical ML and language models.
A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.
Using an AI coding assistant? This repo includes an llms.txt with the full API surface, config reference, and Q&A - optimised for Claude, Cursor, Copilot, and ChatGPT.
In the world of machine learning and natural language processing, clean and well-structured text data is crucial for building effective downstream models and managing token limits in language models.
SqueakyCleanText simplifies the process by automatically addressing common text issues - removing PII, anonymizing named entities (persons, organisations, locations), and ensuring your data is clean and well-structured for language models and classical ML pipelines with minimal effort on your part.
Key Features
- Named Entity Recognition (NER):
- Multi-backend: ONNX (default, torch-free), PyTorch, GLiNER, and ensemble modes
- Zero-shot custom entities via GLiNER (e.g., PRODUCT, EVENT, SKILL)
- Multi-language support (English, Dutch, German, Spanish, French, Portuguese, Italian)
- Ensemble voting across backends for improved accuracy
- Configurable confidence thresholds
- Lazy model loading (models load on demand per language)
- Shared ONNX sessions across same-model languages (~600 MB RAM saved)
- Automatic text chunking for long documents (CJK/Arabic safe)
- GPU acceleration support (CUDA for ONNX and PyTorch)
- Model warm-up API to pre-load on startup
- Text Normalization:
- Corrects text encoding problems and handles bad Unicode characters
- Removes or replaces HTML tags and URLs with configurable tokens
- Handles emails, phone numbers, and other contact details
- Multilingual date detection and replacement (ISO 8601, month names, common formats)
- Fuzzy date matching for misspelled months (requires [fuzzy] extra)
- Year and number standardization
- Configurable emoji removal
- Configurable bracket/brace content removal
- Removes isolated letters and symbols
- Normalizes whitespace and handles currency symbols
- Smart case folding (preserves NER tokens like <PERSON>)
- Language Support:
- Automatic language detection (English, Dutch, German, Spanish)
- Language-specific NER models; French, Portuguese, Italian via multilingual model
- Language-aware stopword removal
- Extensible: add custom languages with stopwords, month names, and NER models
- Dual Output Formats:
- Language Model format (preserves structure with tokens)
- Statistical Model format (optimized for classical ML)
- Performance:
- ONNX Runtime inference (torch-free base install, ~3-5x faster than PyTorch)
- Thread-parallel batch processing via ThreadPoolExecutor
- Async batch processing (aprocess_batch) for FastAPI / aiohttp
- Lazy model loading (only loads models as needed)
- Shared ONNX sessions for same-model languages (saves ~600 MB for FR/PT/IT)
- Memory-efficient processing of large texts
- GPU acceleration (CUDA) for both ONNX and PyTorch backends
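The thread-parallel batch path amounts to mapping the cleaner over documents with a ThreadPoolExecutor. A minimal sketch of the idea (not the library's internals; clean_one is a stand-in for TextCleaner.process):

```python
from concurrent.futures import ThreadPoolExecutor

def clean_one(text: str) -> str:
    # Stand-in for TextCleaner.process: strip and lowercase as a toy "clean".
    return text.strip().lower()

def process_batch(texts, max_workers=4):
    # pool.map preserves input order while cleaning documents in parallel threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clean_one, texts))

print(process_batch(["  Hello ", "WORLD"]))  # → ['hello', 'world']
```

Because NER inference releases the GIL inside ONNX Runtime, threads (rather than processes) are enough to overlap work across documents.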

Benefits
For Language Models
- Maintains text structure while anonymizing sensitive information
- Configurable token replacements
- Preserves context while removing noise
- Handles long documents through intelligent chunking
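Chunking a long document roughly means greedily packing whole words into windows so no chunk exceeds the model's limit. A simplified sketch (the library additionally handles CJK/Arabic scripts, which this toy version does not):

```python
def chunk(text: str, max_len: int = 40):
    # Greedily pack whole words into chunks of at most max_len characters,
    # breaking only on whitespace so entities are never cut mid-word.
    chunks, cur = [], ""
    for word in text.split():
        if cur and len(cur) + 1 + len(word) > max_len:
            chunks.append(cur)
            cur = word
        else:
            cur = f"{cur} {word}" if cur else word
    if cur:
        chunks.append(cur)
    return chunks

print(chunk("alpha beta gamma delta", max_len=11))
# → ['alpha beta', 'gamma delta']
```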
For Statistical Models
- Removes stopwords and punctuation
- Case normalization
- Special symbol removal
- Optimized for classification tasks
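Case normalization has to avoid mangling NER placeholder tokens. A minimal sketch of token-preserving lowercasing (illustrative, not the library's implementation):

```python
import re

TOKEN = re.compile(r"<[A-Z]+>")  # NER placeholders like <PERSON>, <EMAIL>

def smart_lower(text: str) -> str:
    # Lowercase everything except angle-bracket NER tokens.
    parts, last = [], 0
    for m in TOKEN.finditer(text):
        parts.append(text[last:m.start()].lower())
        parts.append(m.group())
        last = m.end()
    parts.append(text[last:].lower())
    return "".join(parts)

print(smart_lower("Contact <PERSON> At <EMAIL> Today"))
# → "contact <PERSON> at <EMAIL> today"
```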
Advanced NER Processing
- Ensemble approach reduces missed entities
- Language-specific models improve accuracy
- Confidence thresholds for precision control
- Efficient batch processing for large datasets
- Automatic handling of long documents
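Span-level voting across backends can be pictured as keeping only the entities that enough backends agree on. A toy sketch under that assumption (the library's actual voting rule may differ):

```python
from collections import Counter

def vote(predictions, min_votes=2):
    """Keep entity spans predicted by at least min_votes backends.

    Each element of predictions is one backend's output: a list of
    (start, end, label) tuples. Majority voting suppresses spurious
    single-backend detections while recovering entities a backend missed.
    """
    counts = Counter(span for backend in predictions for span in set(backend))
    return sorted(span for span, n in counts.items() if n >= min_votes)

onnx_spans   = [(0, 8, "PER"), (20, 26, "LOC")]
torch_spans  = [(0, 8, "PER")]
gliner_spans = [(0, 8, "PER"), (20, 26, "LOC")]
print(vote([onnx_spans, torch_spans, gliner_spans]))
# → [(0, 8, 'PER'), (20, 26, 'LOC')]
```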
Installation
```bash
pip install SqueakyCleanText
```
The base install uses ONNX Runtime for NER inference - no PyTorch or Transformers required.
Optional Extras
| Extra | Command | What it adds |
|-------|---------|--------------|
| GPU | pip install SqueakyCleanText[gpu] | CUDA-accelerated ONNX inference |
| Fuzzy dates | pip install SqueakyCleanText[fuzzy] | Fuzzy month name matching (rapidfuzz) |
| PyTorch NER | pip install SqueakyCleanText[torch] | PyTorch/Transformers NER backend |
| GLiNER | pip install SqueakyCleanText[gliner] | GLiNER zero-shot NER |
| GLiNER2 | pip install SqueakyCleanText[gliner2] | GLiNER2 (knowledgator) backend |
| Synthetic | pip install SqueakyCleanText[synthetic] | Faker-based synthetic replacement (realistic fake values instead of <TAG> tokens) |
| Presidio | pip install SqueakyCleanText[presidio] | Presidio-analyzer for presidio_gliner backend |
| Classify | pip install SqueakyCleanText[classify] | GLiClass document-level pre-classification |
| All NER | pip install SqueakyCleanText[all-ner] | All NER backends combined |
| Development | pip install SqueakyCleanText[dev] | Testing and linting tools |
You can combine extras: pip install SqueakyCleanText[gpu,fuzzy,gliner]
Usage
Basic Usage
```python
from sct import TextCleaner

# Initialize the TextCleaner
cleaner = TextCleaner()

# Input text
text = "Contact John Doe at john.doe@company.com. Meeting on 2023-10-01."

# Process the text
lm_text, stat_text, lang = cleaner.process(text)

print(f"Language Model format: {lm_text}")
# Output: "Contact <PERSON> at <EMAIL>. Meeting on <YEAR>."

print(f"Statistical Model format: {stat_text}")
# Output: "contact meeting"

print(f"Detected Language: {lang}")
# Output: "ENGLISH"
```
Using TextCleanerConfig
```python
from sct import TextCleaner, TextCleanerConfig

# Create an immutable configuration
cfg = TextCleanerConfig(
    check_ner_process=True,
    ner_confidence_threshold=0.85,
    positional_tags=('PER', 'LOC', 'ORG', 'MISC'),
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_numbers="<PHONE>",
    language="en",  # Pin to English (also accepts 'ENGLISH', 'eng')
)

# Initialize with config
cleaner = TextCleaner(cfg=cfg)
```
Language Specification
All language parameters accept Lingua names ('ENGLISH'), ISO 639-1 ('en'), or ISO 639-3 ('eng') codes:
```python
# Pin to one language (skip auto-detection)
cfg = TextCleanerConfig(language='de', check_ner_process=False)

# Restrict detection to specific languages (auto-detect among them)
cfg = TextCleanerConfig(language=('en', 'nl', 'de'), check_ner_process=False)

# Add extra languages for detection
cfg = TextCleanerConfig(extra_languages=('fr', 'pt'), check_ner_process=False)
```
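All three spellings resolve to the same canonical language; conceptually the mapping looks like this toy resolver (illustrative only, with a deliberately small alias table - not the library's implementation):

```python
# Deliberately small alias table: Lingua name, ISO 639-1, ISO 639-3.
ALIASES = {
    "ENGLISH": "ENGLISH", "en": "ENGLISH", "eng": "ENGLISH",
    "GERMAN": "GERMAN", "de": "GERMAN", "deu": "GERMAN",
    "DUTCH": "DUTCH", "nl": "DUTCH", "nld": "DUTCH",
}

def resolve(code: str) -> str:
    # Try the code as given, then the lowercase (ISO) and uppercase (Lingua) forms.
    canon = ALIASES.get(code) or ALIASES.get(code.lower()) or ALIASES.get(code.upper())
    if canon is None:
        raise ValueError(f"unknown language code: {code!r}")
    return canon

print(resolve("en"), resolve("eng"), resolve("GERMAN"))
# → ENGLISH ENGLISH GERMAN
```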
GLiNER: Zero-Shot Custom NER
Use GLiNER to recognize any entity type without retraining:
```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='gliner',
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location', 'product', 'event'),
    gliner_label_map={
        'person': 'PER', 'organization': 'ORG', 'location': 'LOC',
        # 'product' and 'event' are unmapped - they become <PRODUCT>, <EVENT> tokens
    },
    gliner_threshold=0.4,
)
cleaner = TextCleaner(cfg=cfg)

lm_text, stat_text, lang = cleaner.process(
    "John bought an iPhone at the Apple Store in Berlin during CES 2025."
)
# lm_text: "<PERSON> bought an <PRODUCT> at the <ORGANISATION> in <LOCATION> during <EVENT>."
```
Ensemble NER
Combine ONNX/Torch models with GLiNER for improved recall via ensemble voting:
```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='ensemble_onnx',  # or 'ensemble_torch'
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location'),
    gliner_label_map={'person': 'PER', 'organization': 'ORG', 'location': 'LOC'},
)
cleaner = TextCleaner(cfg=cfg)

lm_text, stat_text, lang = cleaner.process("Angela Merkel visited the Bundestag in Berlin.")
```
PII Detection Mode
Automatically configure GLiNER for comprehensive PII detection with 60+ entity types (personal, financial, healthcare, identity, digital):
```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(ner_mode='pii')
cleaner = TextCleaner(cfg=cfg)

lm_text, stat_text, lang = cleaner.process(
    "John Smith's SSN is 123-45-6789, email john@example.com, DOB 1990-01-15"
)
# Entities are anonymized: names, SSNs, emails, dates of birth, and 50+ more PII types
```
PII mode auto-configures: ner_backend='gliner', uses knowledgator/gliner-pii-base-v1.0, sets threshold to 0.3 (recall-focused), and expands positional tags. User-provided values always take priority.
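Because user-provided values take priority, you can keep PII mode while overriding individual settings, e.g. raising the threshold to trade some recall for precision (a config fragment; assumes the [gliner] extra is installed):

```python
from sct import TextCleaner, TextCleanerConfig

# ner_mode='pii' still selects the PII model and expanded positional tags,
# but the explicit threshold overrides the recall-focused 0.3 default.
cfg = TextCleanerConfig(ner_mode='pii', gliner_threshold=0.5)
cleaner = TextCleaner(cfg=cfg)
```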
Alternative PII models (pass as gliner_model):
| Model | Type | Size | Labels | F1 |
|-------|------|------|--------|-----|
| knowledgator/gliner-pii-base-v1.0 | Uni-encoder | 330MB (ONNX FP16) | 60+ | 80.99% |
| nvidia/gliner-PII | Bi-encoder | 570MB | 55+ | — |
| [gretelai/gretel-gliner-bi-base-v1.0](https://huggingfac
