# STTM: A Library of Short Text Topic Modeling
This is a Java (version 1.8) open-source library for short text topic modeling algorithms. The library is designed to facilitate the development of short text topic modeling algorithms and to make it easy to compare new models against existing ones. STTM is open-sourced at https://github.com/qiang2100/STTM.
STTM is maintained by Jipeng Qiang (Yangzhou, China).
<center style="padding: 40px"><img width="70%" src="https://github.com/qiang2100/STTM/blob/master/Architecture.png" /></center>
## Algorithms

- Short text topic models: Dirichlet Multinomial Mixture (DMM, KDD 2014), Biterm Topic Model (BTM, TKDE 2016), Word Network Topic Model (WNTM, KAIS 2018), Pseudo-Document-Based Topic Model (PTM, KDD 2016), Self-Aggregation-Based Topic Model (SATM, IJCAI 2015), Embedding-based Topic Model (ETM, PAKDD 2017), Generalized Pólya Urn (GPU) based Dirichlet Multinomial Mixture model (GPU-DMM, SIGIR 2016), GPU-based Poisson Dirichlet Multinomial Mixture model (GPU-PDMM, TIS 2017), and Latent Feature model with DMM (LF-DMM, TACL 2015).
- Long text topic models: Latent Dirichlet Allocation (LDA) and Latent Feature model with LDA (LF-LDA, TACL 2015).

LF-DMM and LF-LDA come from the LFTM package at https://github.com/datquocnguyen/LFTM.
## Datasets

We provide the following six short text datasets for evaluation. The summary statistics and semantic topics of three of these datasets (SearchSnippets, StackOverflow, and Biomedical) are described in the paper; the statistics of two others (Tweet and GoogleNews) are described in the "DMM" paper. K is the number of topics, Num. is the number of documents in the dataset, Len. is the average/maximum document length, and V is the vocabulary size.

| Dataset | K | Num. | Len. | V |
| -------- | -----: | :----: | :----: | :----: |
| SearchSnippets | 8 | 12,295 | 14.4/37 | 5,547 |
| StackOverflow | 20 | 16,407 | 5.03/17 | 2,638 |
| Biomedical | 20 | 19,448 | 7.44/28 | 4,498 |
| Tweet | 89 | 2,472 | 8.55/20 | 5,098 |
| GoogleNews | 152 | 11,109 | 6.23/14 | 8,110 |
| Pascal_Flickr | 20 | 4,834 | 5.37/19 | 3,431 |
- SearchSnippets: This dataset was selected from the results of web search transactions using predefined phrases from 8 different domains.
- StackOverflow: This is the challenge data published on Kaggle.com. The raw dataset consists of 3,370,528 samples from July 31, 2012 to August 14, 2012. Here, 20,000 question titles were randomly selected from 20 different tags.
- Biomedical: This dataset uses the challenge data published on BioASQ's official website.
- Tweet: In the 2011 and 2012 microblog tracks at the Text REtrieval Conference (TREC), 109 queries were used in total. Using a standard pooling strategy, the NIST assessors judged the tweets submitted for each query by the participants as spam, not relevant, relevant, or highly relevant. We regard the queries as clusters and the highly relevant tweets of each query as the documents in that cluster. After removing queries with no highly relevant tweets, we constructed a dataset with 89 clusters and 2,472 tweets in total.
- GoogleNews: On the Google News site, news articles are grouped into clusters (stories) automatically. We took a snapshot of Google News on November 27, 2013, and crawled the titles and snippets of 11,109 news articles belonging to 152 clusters.
- Pascal_Flickr: The Pascal Captions dataset consists of captions solicited from Mechanical Turk workers for photographs from Flickr and from the Pattern Analysis, Statistical Modeling, and Computational Learning (PASCAL) Visual Object Classes Challenge (Everingham et al., 2010). It includes twenty categories of images and 4,834 captions; each category has fifty images with approximately five captions per image. We use the category as the gold-standard cluster.
## Evaluation

- Topic coherence: To compute topic coherence, an additional dataset (Wikipedia) is needed as a single meta-document to score word pairs using term co-occurrence, following the paper "Automatic Evaluation of Topic Coherence". Here, we calculate the pointwise mutual information (PMI) of each word pair, estimated from a corpus of over one million English Wikipedia articles. Using a sliding window of 10 words to identify co-occurrence, we compute the PMI of each given word pair. The Wikipedia dump can be downloaded Here. Then the dump can be converted from XML to plain text by executing "python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text". Finally, due to the large size, we only use a part of it.
- Cluster evaluation (Purity and NMI): By choosing the topic with the maximum probability for each text, we obtain a cluster label for that text. Then we compare the cluster labels against the gold labels using the Purity and NMI metrics (see the sketch after this list).
- Classification evaluation: With topic modeling, we can represent each document by its topic distribution p(z|d). Hence, the quality of the topics can be assessed by the accuracy of text classification using this topic-level representation, as an indirect evaluation: better classification accuracy means the learned topics are more discriminative and representative. Here, we employ a linear-kernel Support Vector Machine (SVM) classifier from LIBLINEAR with default parameter settings. Classification accuracy is computed through fivefold cross-validation on both datasets.
## Quickstart

### Step 1: Infer latent topics from the corpus

Users can find the pre-compiled file STTM.jar and the source code in the jar and src folders, respectively. Users can recompile the source code with Eclipse or IDEA.

File format of input corpus: Similar to the file corpus.txt in the dataset folder, STTM assumes that each line in the input corpus represents a document, where a document is a sequence of words/tokens separated by whitespace characters. Users should preprocess the input corpus before training the short text topic models, for example: down-casing, removing non-alphabetic characters and stop words, removing words shorter than 3 characters, and removing words appearing fewer than a certain number of times (a minimal sketch follows).
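STTM itself does not ship this preprocessing step; the sketch below is one illustrative way to do it, where the stop-word set and the minCount threshold are placeholder parameters to be adapted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch (not part of STTM): clean a corpus given as
// one document per line, whitespace-tokenized.
public class Preprocess {
    public static List<String> clean(List<String> docs, Set<String> stopWords, int minCount) {
        // First pass: lower-case, strip non-alphabetic characters, drop stop
        // words and words shorter than 3 characters, and count frequencies.
        Map<String, Integer> freq = new HashMap<>();
        List<List<String>> tokenized = new ArrayList<>();
        for (String doc : docs) {
            List<String> tokens = new ArrayList<>();
            for (String w : doc.toLowerCase().replaceAll("[^a-z]+", " ").trim().split("\\s+")) {
                if (w.length() >= 3 && !stopWords.contains(w)) {
                    tokens.add(w);
                    freq.merge(w, 1, Integer::sum);
                }
            }
            tokenized.add(tokens);
        }
        // Second pass: remove rare words and rebuild one document per line.
        List<String> cleaned = new ArrayList<>();
        for (List<String> tokens : tokenized) {
            cleaned.add(tokens.stream()
                              .filter(w -> freq.get(w) >= minCount)
                              .collect(Collectors.joining(" ")));
        }
        return cleaned;
    }
}
```

Each returned line can be written directly to the corpus file that STTM consumes.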
Now, we can train the models in the STTM tool by executing:
$ java [-Xmx1G] -jar jar/STTM.jar -model <LDA or BTM or PTM or SATM or DMM or WNTM> -corpus <Input_corpus_file_path> [-ntopics <int>] [-alpha <double>] [-beta <double>] [-niters <int>] [-twords <int>] [-name <String>] [-sstep <int>]
!!! note "Note"
    If users train models that rely on word embeddings, they first need to download pre-trained word embeddings. In this package, the code is based on GloVe (Global Vectors).
$ java [-Xmx1G] -jar jar/STTM.jar -model <GPUDMM or GPU-PDMM or LFDMM or LFLDA> -corpus <Input_corpus_file_path> -vectors <Input_Word2vec_file_Path> [-ntopics <int>] [-alpha <double>] [-beta <double>] [-niters <int>] [-twords <int>] [-name <String>] [-sstep <int>]
where parameters in [ ] are optional. More parameters for the individual methods are shown in src/utility/CmdArgs.

- `-model`: Specify the topic model (e.g., LDA, BTM, or DMM).
- `-corpus`: Specify the path to the input corpus file.
- `-vectors`: Specify the path to the word2vec file.
- `-ntopics <int>`: Specify the number of topics. The default value is 20.
- `-alpha <double>`: Specify the hyper-parameter alpha. Following [6, 8], the default alpha value is 0.1.
- `-beta <double>`: Specify the hyper-parameter beta. The default beta value is 0.01, which is a common setting in the literature.
- `-niters <int>`: Specify the number of Gibbs sampling iterations. The default value is 1000.
- `-twords <int>`: Specify the number of the most probable topical words to output. The default value is 20.
- `-name <String>`: Specify a name for the topic modeling experiment. The default value is "model".
- `-sstep <int>`: Specify a step at which to save the sampling outputs. The default value is 0 (i.e., only the output from the last sample is saved).
Examples:
$ java -jar jar/STTM.jar -model BTM -corpus dataset/corpus.txt -name corpusBTM
The output files are saved in the "results" folder: corpusBTM.theta, corpusBTM.phi, corpusBTM.topWords, corpusBTM.topicAssignments, and corpusBTM.paras, containing the document-to-topic distributions, the topic-to-word distributions, the top topical words, the topic assignments, and the model parameters, respectively.
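To work with the output programmatically, the .theta file can be loaded into a matrix. This is a minimal sketch under the assumption that each line holds the whitespace-separated topic probabilities of one document; the class is a hypothetical helper, not part of STTM.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Illustrative sketch (not part of STTM): load a .theta file, assuming
// one document per line with whitespace-separated topic probabilities.
public class ThetaLoader {
    public static double[][] load(String path) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(path));
        double[][] theta = new double[lines.size()][];
        for (int d = 0; d < lines.size(); d++) {
            String[] parts = lines.get(d).trim().split("\\s+");
            theta[d] = new double[parts.length];
            for (int k = 0; k < parts.length; k++) {
                theta[d][k] = Double.parseDouble(parts[k]);
            }
        }
        return theta;
    }
}
```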
### Step 2: Evaluate the inferred models using clustering, coherence, or classification

For clustering, we treat each topic as a cluster and assign every document to the topic that has the highest probability given that document, as in the sketch below.
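Under the same assumptions as the ThetaLoader sketch above, the cluster label of a document is simply the argmax over its row of the theta matrix (again illustrative code, not STTM's API):

```java
// Illustrative sketch: assign each document the topic with the
// highest probability in its row of the document-topic matrix.
public static int[] clusterLabels(double[][] theta) {
    int[] labels = new int[theta.length];
    for (int d = 0; d < theta.length; d++) {
        int best = 0;
        for (int k = 1; k < theta[d].length; k++) {
            if (theta[d][k] > theta[d][best]) {
                best = k;
            }
        }
        labels[d] = best;
    }
    return labels;
}
```

The resulting labels, together with the gold labels, can then be passed to the Purity and NMI sketch from the Evaluation section.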