# ✂️ hashformers

Accurate word segmentation for hashtags and other unspaced text, powered by Transformers and beam search. A scalable alternative to heuristic splitters and massive LLMs.
Hashformers is a word segmentation library that fills a gap in the NLP ecosystem between heuristic-based splitters and LLM prompt-based segmentation. It can be used with any language model from the Hugging Face Model Hub, from auto-regressive models like GPT-2 to recent large language models (LLMs).
Hashformers combines language model scoring with a beam search algorithm to split text written without spaces into words. Benchmarks show that it can outperform both heuristic-based splitters and LLM prompt-based approaches on word segmentation tasks.
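To make the core idea concrete, here is a toy sketch of beam-search word segmentation. This is **not** Hashformers' actual implementation: the "language model" below is a made-up unigram table of log-probabilities, whereas Hashformers scores candidates with a Transformer. The algorithm shape is the same, though: expand each partial hypothesis by every possible next word, score it, and keep only the top-k hypotheses per step.

```python
import math

# Toy stand-in for a language model: log-probabilities for a few words.
# (Hashformers would instead score candidates with a Transformer such as GPT-2.)
LOGPROB = {
    "ice": math.log(0.03), "cold": math.log(0.02),
    "i": math.log(0.05),
}
OOV = math.log(1e-12)  # heavy penalty for out-of-vocabulary chunks

def beam_search_segment(text, beam_width=3, max_word_len=10):
    # Each hypothesis is (score, position reached in `text`, words so far).
    beams = [(0.0, 0, [])]
    while any(pos < len(text) for _, pos, _ in beams):
        candidates = []
        for score, pos, words in beams:
            if pos == len(text):  # hypothesis already complete: carry it over
                candidates.append((score, pos, words))
                continue
            # Try every possible next word starting at `pos`.
            for end in range(pos + 1, min(pos + max_word_len, len(text)) + 1):
                word = text[pos:end]
                candidates.append(
                    (score + LOGPROB.get(word, OOV), end, words + [word])
                )
        # Prune: keep only the top-k hypotheses by score.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return " ".join(beams[0][2])

print(beam_search_segment("icecold"))  # -> "ice cold"
```

Beam search keeps the cost linear in the input length (times the beam width) instead of enumerating all `2^(n-1)` possible splits, which is what makes model-scored segmentation tractable.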
<p align="center">
  <h3>
    <a href="https://colab.research.google.com/github/ruanchaves/hashformers/blob/master/hashformers.ipynb"> ✂️ Google Colab Tutorial </a>
  </h3>
</p>

<p align="center">
  <h3>
    <a href="https://github.com/ruanchaves/hashformers/blob/master/tutorials/EVALUATION-January_2026.md"> ✂️ Evaluation Report </a>
  </h3>
</p>

## 🚀 Quick Start
### Installation

```bash
pip install hashformers
```
### Basic Usage

```python
from hashformers import TransformerWordSegmenter as WordSegmenter

# You can use any model from the Hugging Face Model Hub
ws = WordSegmenter(
    segmenter_model_name_or_path="distilgpt2"
)

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)
# ['we need a national park', 'ice cold']
```
### Using Language-Specific Models

```python
# Russian hashtags with RuGPT3
ws = WordSegmenter(
    segmenter_model_name_or_path="ai-forever/rugpt3small_based_on_gpt2"
)

segmentations = ws.segment(["#москвасити"])
print(segmentations)
# ['москва сити']
```
### spaCy Integration
Hashformers can be used as a spaCy pipeline component:
```python
import spacy
import hashformers.spacy  # registers the "hashformers" component

nlp = spacy.blank("en")
nlp.add_pipe("hashformers", config={"model": "distilgpt2"})

doc = nlp("#weneedanationalpark")
print(doc._.segmented)  # "we need a national park"
```
Install with spaCy support (the quotes keep shells like zsh from expanding the brackets):

```bash
pip install "hashformers[spacy]"
```
## When to Use Hashformers?
The table below outlines when to use Hashformers versus other approaches like heuristic-based splitters (e.g., SymSpell, WordNinja) or large LLMs.
| Approach | Examples | Recommended When... | Notes |
|----------|----------|---------------------|-------|
| Heuristic-based | SymSpell, Ekphrasis, WordNinja, Spiral (Ronin) | • Scalability is a primary requirement.<br><br>• The segmentation domain works well with a standard pre-built vocabulary. | Fast and efficient, but requires a pre-built vocabulary, which can be limiting for niche domains or languages. |
| Hashformers | Hashformers | • Scalability is needed.<br><br>• A language model is readily available for your domain or language, but compiling a vocabulary by hand would be too burdensome. | Evidence shows Hashformers can be superior to LLMs of similar scale (0.5B parameters). |
| Large LLMs | OpenAI, local LLM deployment | • Cost, latency, and scalability are not concerns.<br><br>• You are segmenting a low volume of items. | To gain an accuracy advantage over Hashformers, you generally need significantly larger LLMs. |
## 📚 Research & Citations
Hashformers was recognized as state-of-the-art for hashtag segmentation at LREC 2022.
### Papers Using Hashformers
- Zero-shot hashtag segmentation for multilingual sentiment analysis
- Generalizability of Abusive Language Detection Models on Homogeneous German Datasets
- The problem of varying annotations to identify abusive language in social media content
- NUSS: An R package for mixed N-grams and unigram sequence segmentation
### Citation
If you find Hashformers useful, please consider citing our paper:
```bibtex
@misc{rodrigues2021zeroshot,
  title={Zero-shot hashtag segmentation for multilingual sentiment analysis},
  author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
  year={2021},
  eprint={2112.03213},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
## 🤝 Contributing
Pull requests are welcome! Read our paper for details on the framework architecture.
```bash
git clone https://github.com/ruanchaves/hashformers.git
cd hashformers
pip install -e .
```
