Deepcut
A Thai word tokenization library using Deep Neural Network.

What's new
- v0.7.0: Migrated from Keras to TensorFlow 2.0
- v0.6.0: Allow excluding stop words and custom dictionaries; updated weights with semi-supervised learning
- v0.5.2: Better pretrained weight matrix
- v0.5.1: Faster tokenization through code refactoring
- The examples folder provides starter scripts for Thai text classification problems
- DeepcutJS: try tokenizing Thai text in your web browser
Performance
The convolutional neural network is trained on 90% of NECTEC's BEST corpus (consisting of four sections: article, news, novel, and encyclopedia) and tested on the remaining 10%. It is a binary classification model that predicts whether each character is the beginning of a word. The results below are calculated on the 'true' class only:
| Precision | Recall | F1 |
| --------- | ------ | ----- |
| 97.8% | 98.5% | 98.1% |
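The reported F1 is the harmonic mean of the precision and recall above, which is easy to verify:

```python
# F1 is the harmonic mean of precision and recall
precision = 0.978
recall = 0.985
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.981, matching the table
```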
Installation
Install the stable release (TensorFlow 2.0) using pip:
pip install deepcut
For latest development release (recommended),
pip install git+https://github.com/rkcosmos/deepcut.git
If you want to use TensorFlow 1.x with standalone Keras, you will need
pip install deepcut==0.6.1
Docker
First, install and run Docker on your machine. Then you can build and run deepcut as follows:
docker build -t deepcut:dev . # build the docker image
docker run --rm -it deepcut:dev # run it; -it makes the session interactive, --rm removes the container and its file system on exit
This will open a shell for us to play with deepcut.
Usage
import deepcut
deepcut.tokenize('ตัดคำได้ดีมาก')
The output is a list of tokens:
['ตัดคำ','ได้','ดี','มาก']
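As described above, the model labels each character as beginning-of-word or not; the tokenizer then splits the text at the predicted boundaries. A minimal illustration of that final step (not deepcut's actual code, with hand-made labels standing in for model predictions):

```python
def labels_to_tokens(text, begins):
    """Rebuild tokens from per-character beginning-of-word labels (1 = new word)."""
    tokens = []
    for ch, b in zip(text, begins):
        if b == 1 or not tokens:
            tokens.append(ch)       # this character starts a new token
        else:
            tokens[-1] += ch        # extend the current token
    return tokens

# hand-made labels for the 13 characters of 'ตัดคำได้ดีมาก'
labels = [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
print(labels_to_tokens('ตัดคำได้ดีมาก', labels))  # ['ตัดคำ', 'ได้', 'ดี', 'มาก']
```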
Bag-of-word transformation
We implemented a tokenizer that works similarly to scikit-learn's CountVectorizer. Example usage:
from deepcut import DeepcutTokenizer
tokenizer = DeepcutTokenizer(ngram_range=(1, 1), max_df=1.0, min_df=0.0)
X = tokenizer.fit_tranform(['ฉันบินได้', 'ฉันกินข้าว', 'ฉันอยากบิน']) # 3 x 6 CSR sparse matrix
print(tokenizer.vocabulary_) # {'บิน': 0, 'ได้': 1, 'ฉัน': 2, 'อยาก': 3, 'ข้าว': 4, 'กิน': 5}, column index of sparse matrix
X_test = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน']) # use the built tokenizer vocabulary to transform new text
print(X_test.shape) # 2 x 6 CSR sparse matrix
tokenizer.save_model('tokenizer.pickle') # save the tokenizer to use later
You can load the saved tokenizer later:
tokenizer = deepcut.load_model('tokenizer.pickle')
X_sample = tokenizer.transform(['ฉันกิน', 'ฉันไม่อยากบิน'])
print(X_sample.shape) # getting the same 2 x 6 CSR sparse matrix as X_test
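Under the hood this is standard bag-of-words bookkeeping. A pure-Python sketch (independent of deepcut) of how a vocabulary maps pre-tokenized documents to count vectors; note that deepcut returns a CSR sparse matrix and may assign column indices in a different order, while this sketch uses dense lists and first-seen order:

```python
from collections import Counter

def fit_vocabulary(tokenized_docs):
    """Assign a column index to every distinct token, in first-seen order."""
    vocab = {}
    for doc in tokenized_docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def transform(tokenized_docs, vocab):
    """Count tokens per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in tokenized_docs:
        counts = Counter(tok for tok in doc if tok in vocab)
        rows.append([counts[tok] for tok in vocab])
    return rows

# the three example sentences, already tokenized
docs = [['ฉัน', 'บิน', 'ได้'], ['ฉัน', 'กิน', 'ข้าว'], ['ฉัน', 'อยาก', 'บิน']]
vocab = fit_vocabulary(docs)
X = transform(docs, vocab)
print(len(X), len(vocab))  # 3 documents x 6 vocabulary entries
```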
Custom Dictionary
Users can add a custom dictionary by providing the path to a .txt file containing one word per line, like the following:
ขี้เกียจ
โรงเรียน
ดีมาก
Pass the file path as the custom_dict argument of the tokenize function, e.g.
deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict='/path/to/custom_dict.txt')
deepcut.tokenize('ตัดคำได้ดีมาก', custom_dict=['ดีมาก']) # alternatively, provide the custom dictionary as a list of words
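The dictionary file format is simply one word per line, so loading it yourself is straightforward. A small sketch (not deepcut's own loader) that writes and reads back such a file:

```python
import os
import tempfile

# write a small example dictionary, one word per line
path = os.path.join(tempfile.gettempdir(), 'custom_dict.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('ขี้เกียจ\nโรงเรียน\nดีมาก\n')

def load_custom_dict(path):
    """Read a one-word-per-line dictionary file, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

print(load_custom_dict(path))  # ['ขี้เกียจ', 'โรงเรียน', 'ดีมาก']
```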
Notes
Some texts might not be segmented as we would expect (e.g. 'โรงเรียน' -> ['โรง', 'เรียน']). This happens because:
- the BEST corpus (the training data) tokenizes words this way (it uses compound words as a criterion for segmentation), or
- the words are unseen/new. Ideally this would be cured by a better corpus, but that is not very practical, so we are considering semi-supervised learning to incorporate new examples.
Any suggestions and comments are welcome; please post them in the issue section.
Contributors
Citations
If you use deepcut in your project or publication, please cite the library as follows
Rakpong Kittinaradorn, Titipat Achakulvisut, Korakot Chaovavanich, Kittinan Srithaworn,
Pattarawat Chormai, Chanwit Kaewkasi, Tulakan Ruangrong, Krichkorn Oparad.
(2019, September 23). DeepCut: A Thai word tokenization library using Deep Neural Network. Zenodo. http://doi.org/10.5281/zenodo.3457707
or BibTeX entry:
@misc{Kittinaradorn2019,
author = {Rakpong Kittinaradorn and Titipat Achakulvisut and Korakot Chaovavanich and Kittinan Srithaworn and Pattarawat Chormai and Chanwit Kaewkasi and Tulakan Ruangrong and Krichkorn Oparad},
title = {{DeepCut: A Thai word tokenization library using Deep Neural Network}},
month = sep,
year = 2019,
doi = {10.5281/zenodo.3457707},
version = {1.0},
publisher = {Zenodo},
url = {http://doi.org/10.5281/zenodo.3457707}
}
Partner Organizations
- True Corporation
We are open for contribution and collaboration.