PyThaiNLP: Thai Natural Language Processing in Python

pythainlp.org | Tutorials | License info | Model cards | Adopters | Documentation in Thai (เอกสารภาษาไทย)
Designed to be a Thai-focused counterpart to NLTK, PyThaiNLP provides standard tools for linguistic analysis under an Apache-2.0 license, with its data and models covered by CC0-1.0 and CC-BY-4.0.
pip install pythainlp
| Version | Python version | Changes | Documentation |
|:-------:|:--------------:|:-------:|:-------------:|
| 5.3.3 | 3.9+ | Log | pythainlp.org/docs |
| dev | 3.9+ | Log | pythainlp.org/dev-docs |
Features
- Linguistic units: Sentence, word, and subword segmentation (sent_tokenize, word_tokenize, subword_tokenize).
- Tagging: Part-of-speech tagging (pos_tag).
- Transliteration: Romanization (transliterate) and IPA conversion.
- Correction: Spelling suggestion and correction (spell, correct).
- Utilities: Soundex, collation, number-to-text (bahttext), datetime formatting (thai_strftime), and keyboard layout correction.
- Data: Built-in Thai character sets, word lists, and stop words.
- CLI: Command-line interface via thainlp.

      thainlp data catalog  # List datasets
      thainlp help          # Show usage
Installation options
To install with specific extras (e.g., translate, wordnet, full):
pip install "pythainlp[extra1,extra2,...]"
Available extras include:
- compact: install a stable and small subset of dependencies (recommended)
- translate: machine translation support
- wordnet: WordNet support
- full: install all optional dependencies (may introduce conflicts)
The documentation website maintains the
full list of extras.
To see the specific libraries included in each extra,
please inspect the [project.optional-dependencies] section of
pyproject.toml.
Environment variables
| Variable | Description | Status |
|---|---|---|
| PYTHAINLP_DATA | Path to the data directory (default: ~/pythainlp-data). | Current |
| PYTHAINLP_DATA_DIR | Legacy alias for PYTHAINLP_DATA. Emits a DeprecationWarning. Setting both raises ValueError. | Deprecated; use PYTHAINLP_DATA |
| PYTHAINLP_OFFLINE | Set to 1 to disable automatic corpus downloads. Explicit download() calls still work. | Current |
| PYTHAINLP_READ_ONLY | Set to 1 to enable read-only mode, which prevents implicit background writes to PyThaiNLP's internal data directory (corpus downloads, catalog updates, directory creation). Explicit user-initiated saves to user-specified paths are unaffected. | Current |
| PYTHAINLP_READ_MODE | Legacy alias for PYTHAINLP_READ_ONLY. Emits a DeprecationWarning. Setting both raises ValueError. | Deprecated; use PYTHAINLP_READ_ONLY |
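
The alias rules in the table can be sketched as a small resolver. This is only an illustration of the documented behavior, not PyThaiNLP's actual implementation: the helper name resolve_env and its signature are hypothetical, and treating "set" as "present in the environment" is an assumption.

```python
import os
import warnings
from typing import Mapping, Optional


def resolve_env(
    current: str,
    legacy: str,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the value of `current`, falling back to the deprecated `legacy` name.

    Hypothetical helper that mirrors the rules in the table above.
    """
    env = os.environ if environ is None else environ
    has_current = current in env
    has_legacy = legacy in env

    # Setting both the current name and its legacy alias is an error.
    if has_current and has_legacy:
        raise ValueError(f"Set only one of {current} and {legacy}.")

    # The legacy alias still works but emits a DeprecationWarning.
    if has_legacy:
        warnings.warn(
            f"{legacy} is deprecated; use {current} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[legacy]

    # Returns None when neither name is set.
    return env.get(current)
```

For example, resolve_env("PYTHAINLP_DATA", "PYTHAINLP_DATA_DIR") would read the process environment, while passing a plain dict as the third argument makes the rules easy to try out in isolation.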
Data directory
PyThaiNLP downloads data (see the data catalog db.json at
pythainlp-corpus)
to ~/pythainlp-data by default.
Set the PYTHAINLP_DATA environment variable to override this location.
(PYTHAINLP_DATA_DIR is still accepted but deprecated.)
When using PyThaiNLP in distributed computing environments
(e.g., Apache Spark), set the PYTHAINLP_DATA environment variable
inside the function that will be distributed to worker nodes.
See details in
the documentation.
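
A minimal sketch of that pattern follows. The path /tmp/pythainlp-data and the function names configure_worker and tokenize_partition are hypothetical; the point is only that the variable is set, and pythainlp is imported, inside the code that runs on the worker.

```python
import os
from typing import Iterable, Iterator, List

# Hypothetical worker-local path; pick one that is usable on every worker node.
WORKER_DATA_DIR = "/tmp/pythainlp-data"


def configure_worker(data_dir: str = WORKER_DATA_DIR) -> str:
    """Point PyThaiNLP at a data directory in the current (worker) process."""
    os.environ["PYTHAINLP_DATA"] = data_dir
    return data_dir


def tokenize_partition(texts: Iterable[str]) -> Iterator[List[str]]:
    """Tokenize one partition of texts; intended to run on a worker node."""
    # Set the variable inside the distributed function so that it is set
    # in the worker process, not only on the driver.
    configure_worker()

    # Import on the worker, after the variable is set.
    from pythainlp.tokenize import word_tokenize

    for text in texts:
        yield word_tokenize(text)
```

With PySpark, a function of this shape could be passed to mapPartitions on an RDD of strings, for example rdd.mapPartitions(tokenize_partition).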
Offline mode
Set PYTHAINLP_OFFLINE=1 to disable automatic corpus downloads.
When this variable is set and a corpus is not already cached locally,
a FileNotFoundError is raised instead of attempting a network download.
Explicit calls to pythainlp.corpus.download() are unaffected.
Use pythainlp.is_offline_mode() to check the current state programmatically.
import pythainlp
print(pythainlp.is_offline_mode()) # True if PYTHAINLP_OFFLINE=1
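
Because a missing corpus surfaces as FileNotFoundError in offline mode, calling code may want to handle that case explicitly. The wrapper below is a hedged sketch: corpus_path_or_none is a hypothetical helper, and the lookup function is passed in as a parameter so that the example does not depend on any particular PyThaiNLP call.

```python
from typing import Callable, Optional


def corpus_path_or_none(
    get_path: Callable[[str], Optional[str]],
    name: str,
) -> Optional[str]:
    """Return the corpus path, or None if the corpus is not cached locally.

    `get_path` is whatever lookup function the caller uses; it is expected
    to raise FileNotFoundError when offline mode prevents a download.
    """
    try:
        return get_path(name)
    except FileNotFoundError:
        # Offline mode is on and the corpus is not in the local cache.
        return None
```

In practice get_path might be get_corpus_path from pythainlp.corpus, with name set to a corpus listed in the data catalog.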
Read-only mode
Set PYTHAINLP_READ_ONLY=1 to prevent implicit background writes to PyThaiNLP's
internal data directory. This blocks corpus downloads, catalog updates, and
automatic data directory creation — writes that happen as side effects the user
may not be aware of.
Note: Read-only mode is more restrictive than offline mode.
PYTHAINLP_OFFLINE=1 blocks only automatic downloads triggered by get_corpus_path(); explicit pythainlp.corpus.download() calls still work. PYTHAINLP_READ_ONLY=1 also blocks explicit download() calls, because any download requires writing to the data directory. Use PYTHAINLP_READ_ONLY when the data directory is on a read-only file system (e.g., a read-only Docker volume or a shared cluster mount).
Operations where the user explicitly specifies an output path are unaffected
(e.g., model.save("path"), tagger.train(..., save_loc="path"),
thainlp misspell --output myfile.txt).
Use pythainlp.is_read_only_mode() to check the current state programmatically.
import pythainlp
print(pythainlp.is_read_only_mode()) # True if PYTHAINLP_READ_ONLY=1
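
The difference between the two modes, as described in the note in this section, can be summarized as a small decision function. This models the documented rules only; download_allowed is a hypothetical name and is not part of PyThaiNLP.

```python
def download_allowed(explicit: bool, offline: bool, read_only: bool) -> bool:
    """Model whether a corpus download may proceed under the documented rules.

    Illustrative only; this is not a PyThaiNLP function.
    """
    if read_only:
        # Any download writes to the data directory, so read-only mode
        # blocks automatic and explicit downloads alike.
        return False
    if offline and not explicit:
        # Offline mode blocks only automatic downloads.
        return False
    return True


# Print the full truth table for both kinds of download.
for offline in (False, True):
    for read_only in (False, True):
        print(
            f"offline={offline!s:5} read_only={read_only!s:5} "
            f"automatic={download_allowed(False, offline, read_only)!s:5} "
            f"explicit={download_allowed(True, offline, read_only)}"
        )
```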
Testing
We test core functionality on all officially supported Python versions.
See tests/README.md for test matrix and other details.
Contribute to PyThaiNLP
Please fork the repository and open a pull request. See CONTRIBUTING.md for guidelines and algorithm references.
Citations
If you use the PyThaiNLP library in your project,
please cite the software as follows:
Phatthiyaphaibun, Wannaphong, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, and Pattarawat Chormai. “PyThaiNLP: Thai Natural Language Processing in Python”. Zenodo, 2 June 2024. https://doi.org/10.5281/zenodo.3519354.
with this BibTeX entry:
@software{pythainlp,
title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
author = "Phatthiyaphaibun, Wannaphong and
Chaovavanich, Korakot and
Polpanumas, Charin and
Suriyawongkul, Arthit and
Lowphansirikul, Lalita and
Chormai, Pattarawat",
doi = {10.5281/zenodo.3519354},
license = {Apache-2.0},
month = jun,
url = {https://github.com/PyThaiNLP/pythainlp/},
version = {v5.0.4},
year = {2024},
}
To cite our NLP-OSS 2023 academic paper, use the following:
Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai Natural Language Processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore, Singapore. Empirical Methods in Natural Language Processing.
with this BibTeX entry:
@inproceedings{phatthiyaphaibun-etal-2023-pythainlp,
title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
author = "Phatthiyaphaibun, Wannaphong and
Chaovavanich, Korakot and
Polpanumas, Charin and
Suriyawongkul, Arthit and
Lowphansirikul, Lalita and
Chormai, Pattarawat and
Limkonchotiwat, Peerat and
Suntorntip, Thanathip and
Udomcharoenchaikit, Can",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Empirical Methods in Natural Language Processing",
url = "
