PyShortTextCategorization
Various Algorithms for Short Text Mining
Short Text Mining in Python
Introduction
shorttext is a Python package that facilitates supervised and unsupervised
learning for short text categorization. Because short texts contain few words
and carry little information on their own, an intermediate representation of
the texts and documents is needed before they are fed into any classification
algorithm. The package supports several types of such representations,
including topic modeling and word-embedding algorithms.
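As a rough illustration of what an intermediate representation buys you, the sketch below averages word vectors into one vector per text and classifies by cosine similarity. It does not use the shorttext API: `TOY_VECTORS`, `embed`, `cosine`, and `classify` are hypothetical names, and the 3-dimensional vectors are made-up values standing in for a real pre-trained word-embedding model.

```python
import math

# Made-up 3-dimensional "embeddings"; a real model would supply these.
TOY_VECTORS = {
    "python": [0.9, 0.1, 0.0],
    "code": [0.8, 0.2, 0.1],
    "guitar": [0.0, 0.9, 0.2],
    "music": [0.1, 0.8, 0.3],
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(text, class_examples):
    """Return the best label and the per-label cosine similarities."""
    query = embed(text)
    scores = {
        label: cosine(query, embed(" ".join(examples)))
        for label, examples in class_examples.items()
    }
    return max(scores, key=scores.get), scores


label, scores = classify(
    "code python tips",
    {"programming": ["python", "code"], "music": ["guitar", "music"]},
)
print(label, scores)
```

Even though the query shares no exact phrase with the training examples, the dense vectors place it near the "programming" class, which is the point of using an intermediate representation for sparse short texts.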
The package shorttext runs on Python 3.9, 3.10, 3.11, 3.12, and 3.13.
Characteristics:
- example data provided (including subject keywords and NIH RePORT);
- text preprocessing;
- pre-trained word-embedding support;
- gensim topic models (LDA, LSI, Random Projections) and autoencoder;
- topic model representation supported for supervised learning using scikit-learn;
- cosine distance classification;
- neural network classification (including ConvNet and C-LSTM);
- maximum entropy classification;
- metrics of phrase differences, including soft Jaccard score (using Damerau-Levenshtein distance) and Word Mover's distance (WMD);
- character-level sequence-to-sequence (seq2seq) learning; and
- spell correction.
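To make the soft Jaccard score in the list above more concrete, here is a minimal pure-Python sketch. It is one plausible formulation, not necessarily the one shorttext implements: it uses the optimal string alignment variant of the Damerau-Levenshtein distance, turns it into a similarity in [0, 1], and matches tokens greedily one-to-one. The function names `osa_distance`, `similarity`, and `soft_jaccard` are hypothetical.

```python
from itertools import product


def osa_distance(a, b):
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a, b):
    """1 minus the edit distance normalized by the longer word's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest


def soft_jaccard(tokens1, tokens2):
    """Soft Jaccard: greedy one-to-one token matching by similarity."""
    pairs = sorted(
        (
            (similarity(t1, t2), i, j)
            for (i, t1), (j, t2) in product(enumerate(tokens1), enumerate(tokens2))
        ),
        reverse=True,
    )
    used1, used2 = set(), set()
    soft_intersection = 0.0
    for sim, i, j in pairs:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            soft_intersection += sim
    union = len(tokens1) + len(tokens2) - soft_intersection
    return soft_intersection / union if union else 1.0


print(soft_jaccard(["diet", "coke"], ["deit", "cola"]))
```

Unlike the ordinary Jaccard index, near-matches such as a transposed "deit" still contribute to the intersection, which is why this kind of score is useful for noisy short phrases.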
Documentation
Documentation and tutorials for shorttext can be found here: http://shorttext.rtfd.io/.
See the tutorial for how to use the package, and the FAQ.
Installation
To install the package, run pip in a console:
pip install shorttext
Or, to install the most recent development version from GitHub, run:
pip install git+https://github.com/stephenhky/PyShortTextCategorization@master
See the installation guide for more details.
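After installing, a quick sanity check can confirm that the interpreter is one of the supported versions listed in the introduction and that the package is visible to it. This is a small sketch using only the standard library; the helper names `python_supported` and `installed_version` are hypothetical and not part of shorttext.

```python
import sys
from importlib import metadata

# Python versions the README lists as supported.
SUPPORTED = [(3, 9), (3, 10), (3, 11), (3, 12), (3, 13)]


def python_supported(version_info=sys.version_info):
    """True if the (major, minor) version is in the supported list."""
    return tuple(version_info[:2]) in SUPPORTED


def installed_version(package="shorttext"):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python supported:", python_supported())
print("shorttext version:", installed_version())
```

If the second line prints `None`, pip most likely installed the package into a different environment than the one running the script.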
Issues
To report an issue, go to the Issues tab of the GitHub page and start a thread. Developers are welcome to submit pull requests of their own to fix any errors.
Contributors
If you would like to contribute, feel free to submit pull requests to the develop branch.
You can reach me in advance by e-mail or through the Issues page.
Useful Links
- Documentation: http://shorttext.readthedocs.io
- GitHub: https://github.com/stephenhky/PyShortTextCategorization
- PyPI: https://pypi.org/project/shorttext/
- "Package shorttext 1.0.0 released," Medium
- "Python Package for Short Text Mining", WordPress
- "Document-Term Matrix: Text Mining in R and Python," WordPress
- An earlier version of this repository is a demonstration of the following blog post: Short Text Categorization using Deep Neural Networks and Word-Embedding Models
News
- 03/22/2026: shorttext 3.1.1 released.
- 03/02/2026: shorttext 3.1.0 released.
- 10/27/2025: shorttext 3.0.1 released.
- 08/10/2025: shorttext 3.0.0 released.
- 06/02/2025: shorttext 2.2.1 released. (Acknowledgement: Minseo Kim)
- 05/29/2025: shorttext 2.2.0 released. (Acknowledgement: Minseo Kim)
- 05/08/2025: shorttext 2.1.1 released.
- 12/14/2024: shorttext 2.1.0 released.
- 07/12/2024: shorttext 2.0.0 released.
- 12/21/2023: shorttext 1.6.1 released.
- 08/26/2023: shorttext 1.6.0 released.
- 06/19/2023: shorttext 1.5.9 released.
- 09/23/2022: shorttext 1.5.8 released.
- 09/22/2022: shorttext 1.5.7 released.
- 08/29/2022: shorttext 1.5.6 released.
- 05/28/2022: shorttext 1.5.5 released.
- 12/15/2021: shorttext 1.5.4 released.
- 07/11/2021: shorttext 1.5.3 released.
- 07/06/2021: shorttext 1.5.2 released.
- 04/10/2021: shorttext 1.5.1 released.
- 04/09/2021: shorttext 1.5.0 released.
- 02/11/2021: shorttext 1.4.8 released.
- 01/11/2021: shorttext 1.4.7 released.
- 01/03/2021: shorttext 1.4.6 released.
- 12/28/2020: shorttext 1.4.5 released.
- 12/24/2020: shorttext 1.4.4 released.
- 11/10/2020: shorttext 1.4.3 released.
- 10/18/2020: shorttext 1.4.2 released.
- 09/23/2020: shorttext 1.4.1 released.
- 09/02/2020: shorttext 1.4.0 released.
- 07/23/2020: shorttext 1.3.0 released.
- 06/05/2020: shorttext 1.2.6 released.
- 05/20/2020: shorttext 1.2.5 released.
- 05/13/2020: shorttext 1.2.4 released.
- 04/28/2020: shorttext 1.2.3 released.
- 04/07/2020: shorttext 1.2.2 released.
- 03/23/2020: shorttext 1.2.1 released.
- 03/21/2020: shorttext 1.2.0 released.
- 12/01/2019: shorttext 1.1.6 released.
- 09/24/2019: shorttext 1.1.5 released.
- 07/20/2019: shorttext 1.1.4 released.
- 07/07/2019: shorttext 1.1.3 released.
- 06/05/2019: shorttext 1.1.2 released.
- 04/23/2019: shorttext 1.1.1 released.
- 03/03/2019: shorttext 1.1.0 released.
- 02/14/2019: shorttext 1.0.8 released.
- 01/30/2019: shorttext 1.0.7 released.
- 01/29/2019: shorttext 1.0.6 released.
- 01/13/2019: shorttext 1.0.5 released.
- 10/03/2018: shorttext 1.0.4 released.
- 08/06/2018: shorttext 1.0.3 released.
- 07/24/2018: shorttext 1.0.2 released.
- 07/17/2018: shorttext 1.0.1 released.
- 07/14/2018: shorttext 1.0.0 released.
- 06/18/2018: shorttext 0.7.2 released.
- 05/30/2018: shorttext 0.7.1 released.
- 05/17/2018: shorttext 0.7.0 released.
- 02/27/2018: shorttext 0.6.0 released.
- 01/19/2018: shorttext 0.5.11 released.
- 01/15/2018: shorttext 0.5.10 released.
- 12/14/2017: shorttext 0.5.9 released.
- 11/08/2017: shorttext 0.5.8 released.
- 10/27/2017: shorttext 0.5.7 released.
- 10/17/2017: shorttext 0.5.6 released.
- 09/28/2017: shorttext 0.5.5 released.
- 09/08/2017: shorttext 0.5.4 released.
- 09/02/2017: end of GSoC project. (Report)
- 08/22/2017: shorttext 0.5.1 released.
- 07/28/2017: shorttext 0.4.1 released.
- 07/26/2017: shorttext 0.4.0 released.
- 06/16/2017: shorttext 0.3.8 released.
- 06/12/2017: shorttext 0.3.7 released.
- 06/02/2017: shorttext 0.3.6 released.
- 05/30/2017: GSoC project (Chinmaya Pancholi, with gensim)
- 05/16/2017: shorttext 0.3.5 released.
- 04/27/2017: shorttext 0.3.4 released.
- 04/19/2017: shorttext 0.3.3 released.
- 03/28/2017: shorttext 0.3.2 released.
- 03/14/2017: shorttext 0.3.1 released.
- 02/23/2017: shorttext 0.2.1 released.
- 12/21/2016: shorttext 0.2.0 released.
- 11/25/2016: shorttext 0.1.2 released.
- 11/21/2016: shorttext 0.1.1 released.
Acknowledgements
