PyThaiNLP: Thai Natural Language Processing in Python


pythainlp.org | Tutorials | License info | Model cards | Adopters | เอกสารภาษาไทย

Designed to be a Thai-focused counterpart to NLTK, PyThaiNLP provides standard tools for linguistic analysis under an Apache-2.0 license, with its data and models covered by CC0-1.0 and CC-BY-4.0.

pip install pythainlp

| Version | Python version | Changes | Documentation |
|:-------:|:--------------:|:-------:|:-------------:|
| 5.3.3 | 3.9+ | Log | pythainlp.org/docs |
| dev | 3.9+ | Log | pythainlp.org/dev-docs |

Features

  • Linguistic units: Sentence, word, and subword segmentation (sent_tokenize, word_tokenize, subword_tokenize).

  • Tagging: Part-of-speech tagging (pos_tag).

  • Transliteration: Romanization (transliterate) and IPA conversion.

  • Correction: Spelling suggestion and correction (spell, correct).

  • Utilities: Soundex, collation, number-to-text (bahttext), datetime formatting (thai_strftime), and keyboard layout correction.

  • Data: Built-in Thai character sets, word lists, and stop words.

  • CLI: Command-line interface via thainlp.

    thainlp data catalog  # List datasets
    thainlp help          # Show usage
    

Installation options

To install with specific extras (e.g., translate, wordnet, full):

pip install "pythainlp[extra1,extra2,...]"

Available extras include:

  • compact — install a stable and small subset of dependencies (recommended)
  • translate — machine translation support
  • wordnet — WordNet support
  • full — install all optional dependencies (may introduce conflicts)

The documentation website maintains the full list of extras. To see the specific libraries included in each extra, please inspect the [project.optional-dependencies] section of pyproject.toml.

Environment variables

| Variable | Description | Status |
|---|---|---|
| PYTHAINLP_DATA | Path to the data directory (default: ~/pythainlp-data). | Current |
| PYTHAINLP_DATA_DIR | Legacy alias for PYTHAINLP_DATA. Emits a DeprecationWarning. Setting both raises ValueError. | Deprecated; use PYTHAINLP_DATA |
| PYTHAINLP_OFFLINE | Set to 1 to disable automatic corpus downloads. Explicit download() calls still work. | Current |
| PYTHAINLP_READ_ONLY | Set to 1 to enable read-only mode, which prevents implicit background writes to PyThaiNLP's internal data directory (corpus downloads, catalog updates, directory creation). Explicit user-initiated saves to user-specified paths are unaffected. | Current |
| PYTHAINLP_READ_MODE | Legacy alias for PYTHAINLP_READ_ONLY. Emits a DeprecationWarning. Setting both raises ValueError. | Deprecated; use PYTHAINLP_READ_ONLY |

Data directory

PyThaiNLP downloads data (see the data catalog db.json at pythainlp-corpus) to ~/pythainlp-data by default. Set the PYTHAINLP_DATA environment variable to override this location. (PYTHAINLP_DATA_DIR is still accepted but deprecated.)
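The precedence rules for the data directory variables can be sketched in plain Python. This is an illustrative model of the documented behavior, not PyThaiNLP's actual implementation; the function name `resolve_data_dir` and its `env` parameter are hypothetical.

```python
import os
import warnings
from pathlib import Path


def resolve_data_dir(env=None):
    """Hypothetical helper modelling the documented lookup rules."""
    env = os.environ if env is None else env
    current = env.get("PYTHAINLP_DATA")
    legacy = env.get("PYTHAINLP_DATA_DIR")

    # Setting both the current and the legacy variable is an error.
    if current and legacy:
        raise ValueError(
            "Set only one of PYTHAINLP_DATA and PYTHAINLP_DATA_DIR."
        )

    # The legacy alias still works but warns.
    if legacy:
        warnings.warn(
            "PYTHAINLP_DATA_DIR is deprecated; use PYTHAINLP_DATA.",
            DeprecationWarning,
            stacklevel=2,
        )
        return Path(legacy).expanduser()

    if current:
        return Path(current).expanduser()

    # Documented default location.
    return Path.home() / "pythainlp-data"
```

Passing a dictionary as `env` makes the rules easy to check without touching the real environment, for example `resolve_data_dir({"PYTHAINLP_DATA": "/srv/thai-data"})`.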

When using PyThaiNLP in distributed computing environments (e.g., Apache Spark), set the PYTHAINLP_DATA environment variable inside the function that will be distributed to worker nodes. See details in the documentation.
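A minimal sketch of that pattern follows. The helper names `configure_worker` and `tokenize_partition` and the path `/tmp/pythainlp-data` are hypothetical; the point is only that the environment variable is set inside the function that runs on the worker, before PyThaiNLP is imported there.

```python
import os


def configure_worker(data_dir):
    """Point PyThaiNLP at a data directory that exists on this worker."""
    os.environ["PYTHAINLP_DATA"] = data_dir


def tokenize_partition(lines, data_dir="/tmp/pythainlp-data"):
    """Tokenize an iterable of strings; intended to run on a worker node."""
    # Set the variable first, inside the distributed function.
    configure_worker(data_dir)

    # Import lazily so the library is loaded in the worker process
    # after the environment variable has been set.
    from pythainlp.tokenize import word_tokenize

    for line in lines:
        yield word_tokenize(line)
```

With PySpark, a function of this shape could be passed to something like `rdd.mapPartitions(tokenize_partition)`; the directory must be writable on each worker unless the corpora are already cached there.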

Offline mode

Set PYTHAINLP_OFFLINE=1 to disable automatic corpus downloads. When this variable is set and a corpus is not already cached locally, a FileNotFoundError is raised instead of attempting a network download. Explicit calls to pythainlp.corpus.download() are unaffected. Use pythainlp.is_offline_mode() to check the current state programmatically.

import pythainlp
print(pythainlp.is_offline_mode())  # True if PYTHAINLP_OFFLINE=1

Read-only mode

Set PYTHAINLP_READ_ONLY=1 to prevent implicit background writes to PyThaiNLP's internal data directory. This blocks corpus downloads, catalog updates, and automatic data directory creation — writes that happen as side effects the user may not be aware of.

Note: Read-only mode is more restrictive than offline mode. PYTHAINLP_OFFLINE=1 blocks only automatic downloads triggered by get_corpus_path(); explicit pythainlp.corpus.download() calls still work. PYTHAINLP_READ_ONLY=1 also blocks explicit download() calls, because any download requires writing to the data directory. Use PYTHAINLP_READ_ONLY when the data directory is on a read-only file system (e.g., a read-only Docker volume or a shared cluster mount).

Operations where the user explicitly specifies an output path are unaffected (e.g., model.save("path"), tagger.train(..., save_loc="path"), thainlp misspell --output myfile.txt).
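The difference between the two modes can be summarized as a small decision function. This is a model of the behavior described above, written for illustration only; the `Op` enum and `is_allowed` are hypothetical names and not part of PyThaiNLP.

```python
from enum import Enum


class Op(Enum):
    AUTO_DOWNLOAD = "automatic corpus download"
    EXPLICIT_DOWNLOAD = "explicit download() call"
    CATALOG_UPDATE = "catalog update"
    CREATE_DATA_DIR = "data directory creation"
    USER_SAVE = "save to a user-specified path"


# Operations that write to PyThaiNLP's internal data directory.
INTERNAL_WRITES = {
    Op.AUTO_DOWNLOAD,
    Op.EXPLICIT_DOWNLOAD,
    Op.CATALOG_UPDATE,
    Op.CREATE_DATA_DIR,
}


def is_allowed(op, *, offline=False, read_only=False):
    """Return True if the documented rules permit the operation."""
    # Read-only mode blocks every write to the internal data directory.
    if read_only and op in INTERNAL_WRITES:
        return False
    # Offline mode blocks only automatic downloads.
    if offline and op is Op.AUTO_DOWNLOAD:
        return False
    return True
```

For example, `is_allowed(Op.EXPLICIT_DOWNLOAD, offline=True)` is True while `is_allowed(Op.EXPLICIT_DOWNLOAD, read_only=True)` is False, which is the sense in which read-only mode is the stricter of the two.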

Use pythainlp.is_read_only_mode() to check the current state programmatically.

import pythainlp
print(pythainlp.is_read_only_mode())  # True if PYTHAINLP_READ_ONLY=1

Testing

We test core functionality on all officially supported Python versions.

See tests/README.md for the test matrix and other details.

Contribute to PyThaiNLP

Please fork and create a pull request. See CONTRIBUTING.md for guidelines and algorithm references.

Citations

If you use the PyThaiNLP library in your project, please cite the software as follows:

Phatthiyaphaibun, Wannaphong, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, and Pattarawat Chormai. “PyThaiNLP: Thai Natural Language Processing in Python”. Zenodo, 2 June 2024. https://doi.org/10.5281/zenodo.3519354.

with this BibTeX entry:

@software{pythainlp,
    title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
    author = "Phatthiyaphaibun, Wannaphong  and
      Chaovavanich, Korakot  and
      Polpanumas, Charin  and
      Suriyawongkul, Arthit  and
      Lowphansirikul, Lalita  and
      Chormai, Pattarawat",
    doi = {10.5281/zenodo.3519354},
    license = {Apache-2.0},
    month = jun,
    url = {https://github.com/PyThaiNLP/pythainlp/},
    version = {v5.0.4},
    year = {2024},
}

To cite our NLP-OSS 2023 academic paper, please cite the paper as follows:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai Natural Language Processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore, Singapore. Empirical Methods in Natural Language Processing.

with this BibTeX entry:

@inproceedings{phatthiyaphaibun-etal-2023-pythainlp,
    title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
    author = "Phatthiyaphaibun, Wannaphong  and
      Chaovavanich, Korakot  and
      Polpanumas, Charin  and
      Suriyawongkul, Arthit  and
      Lowphansirikul, Lalita  and
      Chormai, Pattarawat  and
      Limkonchotiwat, Peerat  and
      Suntorntip, Thanathip  and
      Udomcharoenchaikit, Can",
    editor = "Tan, Liling  and
      Milajevs, Dmitrijs  and
      Chauhan, Geeticka  and
      Gwinnup, Jeremy  and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Empirical Methods in Natural Language Processing",
    url = "