Hashformers

Accurate word segmentation for hashtags and text, powered by Transformers and Beam Search. A scalable alternative to heuristic splitters and massive LLMs.


Hashformers is a word segmentation library that fills a gap in the NLP ecosystem between heuristic-based splitters and LLM prompt-based segmentation. It can be used with any language model from the Hugging Face Model Hub, from auto-regressive models like GPT-2 to recent large language models (LLMs).

Hashformers uses language models and a beam search algorithm to segment text without spaces into words. Benchmarks show that it can outperform heuristic-based splitters and LLM prompt-based approaches on word segmentation tasks.
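To illustrate the core idea, here is a minimal beam search over split points. This is a sketch, not the hashformers internals: a hypothetical unigram log-probability table (`LOGP`) stands in for the scores a real language model would assign to candidate segmentations.

```python
# Illustrative beam search for word segmentation. LOGP is a made-up
# log-probability table standing in for a language model.
LOGP = {
    "ice": -2.0, "cold": -2.5, "i": -3.0, "ce": -9.0, "col": -8.0, "d": -7.0,
}

def segment(text, beam_width=3):
    # Each beam entry: (score, words covering a prefix of text).
    beams = [(0.0, [])]
    for _ in range(len(text)):
        candidates = []
        for score, words in beams:
            pos = sum(len(w) for w in words)
            if pos == len(text):
                # Already a complete segmentation; carry it forward unchanged.
                candidates.append((score, words))
                continue
            for end in range(pos + 1, len(text) + 1):
                word = text[pos:end]
                logp = LOGP.get(word, -20.0)  # heavy penalty for unseen words
                candidates.append((score + logp, words + [word]))
        # Keep only the top-scoring partial segmentations.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    # Return the best hypothesis that covers the whole string.
    complete = [b for b in beams if sum(len(w) for w in b[1]) == len(text)]
    return " ".join(max(complete, key=lambda c: c[0])[1])

print(segment("icecold"))  # → "ice cold"
```

In the real library, the per-word scores come from a Transformer language model rather than a lookup table, so no vocabulary has to be compiled by hand.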

<p align="center"> <h3> <a href="https://colab.research.google.com/github/ruanchaves/hashformers/blob/master/hashformers.ipynb"> ✂️ Google Colab Tutorial </a> </h3> </p> <p align="center"> <h3> <a href="https://github.com/ruanchaves/hashformers/blob/master/tutorials/EVALUATION-January_2026.md"> ✂️ Evaluation Report </a> </h3> </p>

🚀 Quick Start

Installation

pip install hashformers

Basic Usage

from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="distilgpt2"
) # You can use any model from the Hugging Face Model Hub

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)
# ['we need a national park', 'ice cold']

Using Language-Specific Models

# Russian hashtags with RuGPT3
ws = WordSegmenter(
    segmenter_model_name_or_path="ai-forever/rugpt3small_based_on_gpt2"
)

segmentations = ws.segment(["#москвасити"])

print(segmentations)
# ['москва сити']

spaCy Integration

Hashformers can be used as a spaCy pipeline component:

import spacy
import hashformers.spacy  # registers the "hashformers" component

nlp = spacy.blank("en")
nlp.add_pipe("hashformers", config={"model": "distilgpt2"})

doc = nlp("#weneedanationalpark")
print(doc._.segmented)  # "we need a national park"

Install with spaCy support:

pip install hashformers[spacy]

When to Use Hashformers?

The table below outlines when to use Hashformers versus other approaches like heuristic-based splitters (e.g., SymSpell, WordNinja) or large LLMs.

| Approach | Examples | Recommended When... | Notes |
|----------|----------|---------------------|-------|
| Heuristic-based | SymSpell, Ekphrasis, WordNinja, Spiral (Ronin) | • Scalability is a primary requirement.<br><br>• The segmentation domain works well with a standard pre-built vocabulary. | Fast and efficient, but requires a pre-built vocabulary, which can be limiting for niche domains or languages. |
| Hashformers | Hashformers | • Scalability is needed.<br><br>• A language model is readily available for your domain or language, but compiling a vocabulary manually is too burdensome. | Evidence shows Hashformers can be superior to LLMs of similar scale (0.5B parameters). |
| Large LLMs | OpenAI, local LLM deployment | • Cost, latency, and scalability are not concerns.<br><br>• You are segmenting a low volume of items. | To gain an accuracy advantage over Hashformers, you generally need significantly larger LLMs. |
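For contrast, a heuristic vocabulary-based splitter can be sketched in a few lines. This toy greedy longest-match splitter (with a made-up mini vocabulary; not the actual SymSpell or WordNinja algorithms) shows why such approaches depend entirely on their pre-built vocabulary:

```python
# Toy greedy longest-match splitter, illustrating the vocabulary-bound
# approach of heuristic splitters. VOCAB is a hypothetical mini dictionary.
VOCAB = {"we", "need", "a", "nation", "national", "park"}

def greedy_split(text):
    words, pos = [], 0
    while pos < len(text):
        # Try the longest vocabulary word starting at pos.
        for end in range(len(text), pos, -1):
            if text[pos:end] in VOCAB:
                words.append(text[pos:end])
                pos = end
                break
        else:
            # Out-of-vocabulary character: emit it on its own.
            words.append(text[pos])
            pos += 1
    return " ".join(words)

print(greedy_split("weneedanationalpark"))  # → "we need a national park"
```

Any word missing from `VOCAB` falls apart into single characters, which is exactly the failure mode Hashformers avoids by scoring candidates with a language model instead of a fixed word list.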


📚 Research & Citations

Hashformers was recognized as state-of-the-art for hashtag segmentation at LREC 2022.

Papers Using Hashformers

Citation

If you find Hashformers useful, please consider citing our paper:

@misc{rodrigues2021zeroshot,
      title={Zero-shot hashtag segmentation for multilingual sentiment analysis}, 
      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
      year={2021},
      eprint={2112.03213},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

🤝 Contributing

Pull requests are welcome! Read our paper for details on the framework architecture.

git clone https://github.com/ruanchaves/hashformers.git
cd hashformers
pip install -e .
