Gibbs Sampling Dirichlet Multinomial Mixture Model (GSDMM) in Short-Text Clustering

Computational Linguistics Final Project, Winter Semester 2019, University of Saarland
GSDMM implementation as described in:
Yin, J. and Wang, J., 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering, SIGKDD.
Experiments with different beta values on the Stack Overflow titles dataset, made available by Kaggle.com, from the paper by:
Xu, J., et al., 2015. Short Text Clustering via Convolutional Neural Networks, NAACL.
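
To show where the beta parameter enters the model, here is a minimal sketch of the collapsed Gibbs sampling step from Yin and Wang (2014). It is not the code in gsdmm.py: the function name `gsdmm`, its arguments, and the toy corpus are assumptions made for illustration, and documents are assumed to be already converted to lists of integer word ids.

```python
import numpy as np


def gsdmm(docs, vocab_size, K=8, alpha=0.1, beta=0.1, n_iters=10, seed=0):
    """Illustrative GSDMM sampler; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    m_z = np.zeros(K, dtype=int)                 # number of documents in cluster z
    n_z = np.zeros(K, dtype=int)                 # number of words in cluster z
    n_zw = np.zeros((K, vocab_size), dtype=int)  # occurrences of word w in cluster z
    labels = rng.integers(K, size=D)             # random initial assignment

    def move(d, z, sign):
        # Add (sign=+1) or remove (sign=-1) document d from cluster z.
        m_z[z] += sign
        n_z[z] += sign * len(docs[d])
        for w in docs[d]:
            n_zw[z, w] += sign

    for d in range(D):
        move(d, labels[d], +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            move(d, labels[d], -1)  # exclude document d from the counts
            # Log of p(z_d = z | everything else), up to a constant.
            # The factor 1 / (D - 1 + K * alpha) is the same for every z.
            log_p = np.log(m_z + alpha)
            seen = {}
            for w in doc:
                j = seen.get(w, 0)  # earlier occurrences of w in this document
                log_p += np.log(n_zw[:, w] + beta + j)
                seen[w] = j + 1
            for i in range(len(doc)):
                log_p -= np.log(n_z + vocab_size * beta + i)
            p = np.exp(log_p - log_p.max())  # subtract the max for stability
            labels[d] = rng.choice(K, p=p / p.sum())
            move(d, labels[d], +1)
    return labels, m_z


# Toy corpus (made up): six short documents over a six-word vocabulary.
docs = [[0, 1, 2], [0, 1], [3, 4, 5], [4, 5], [0, 2], [3, 5]]
labels, m_z = gsdmm(docs, vocab_size=6, K=4, n_iters=20)
print("labels:", labels, "non-empty clusters:", int((m_z > 0).sum()))
```

K is only an upper bound on the number of clusters: clusters that lose all their documents tend to stay empty, so the number of non-empty clusters at the end is the model's estimate.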

Project File Structure

GSDMM
- README.md
- gsdmm_noonPokaratsiriGoldstein.pdf: project report
- data: all corpus and label files are here
  - title_StackOverflow.txt
  - label_StackOverflow.txt
- logs: logs of run_gsdmm.py execution
  - run_gsdmm_{run_id}.log
- output: plots of gsdmm performance and representative words in clusters
  - cluster_per_iteration_at_different_beta.png
  - performance_at_different_beta.png
  - gsdmm_clusters_and_representative_words_{run_id}.out
- pickled: pickle files from run_gsdmm.py
  - predicted_{run_id}_freq_words_by_beta.pickle
  - predicted_{run_id}_labels_by_beta.pickle
  - predicted_{run_id}_num_clusters_by_it_per_beta_list.pickle
  - true_most_frequent_words_by_topic.pickle
- source_code: config file for default parameters and all source code files (execute run_gsdmm.py from this directory)
  - default_config.cfg: default parameters for running the program
  - eval.py: calculates NMI, homogeneity, and completeness, and plots the graphs
  - gsdmm.py: implements the GSDMM algorithm
  - preprocess.py: tokenizes and preprocesses the corpus file
  - run_gsdmm.py: main program that runs the experiment
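
The three scores that eval.py reports are all available in scikit-learn. The sketch below computes them on made-up labels; whether eval.py calls these exact functions, or uses the same NMI normalization, is an assumption.

```python
from sklearn.metrics import (
    completeness_score,
    homogeneity_score,
    normalized_mutual_info_score,
)

# Toy labels (made up): 6 documents, 2 true topics, 3 predicted clusters.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 2, 2, 2]

nmi = normalized_mutual_info_score(true_labels, pred_labels)
homogeneity = homogeneity_score(true_labels, pred_labels)
completeness = completeness_score(true_labels, pred_labels)
print(f"NMI={nmi:.3f}  homogeneity={homogeneity:.3f}  completeness={completeness:.3f}")
```

In this toy case every predicted cluster contains documents from a single topic, so homogeneity is perfect, while completeness is lower because topic 0 is split across two clusters.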

Requirements

Python 3.7

numpy
scikit-learn (imported as sklearn)
matplotlib
nltk
tqdm

Instructions

cd to the source_code directory to run the program.
python run_gsdmm.py -h displays all command-line options.
Command-line options override the options in the default_config.cfg file.
python run_gsdmm.py runs the GSDMM experiments with the default values in the .cfg file.
The last run_id was 3; change to a different run_id to execute the full program.
The program outputs 2 plots (the plot titles are self-explanatory) and an output file showing the number of clusters predicted by GSDMM and the words in each cluster with their frequencies.
Running the program with the same run_id simply loads the data from the pickled files and re-plots the 2 graphs.
Runtime: for K = 100 (starting with 100 clusters as an upper bound), each beta value in the experiment takes approximately 1 hour to compute. For K = 50, each beta value takes approximately 30 minutes.
The default setting experiments with 5 beta values, so the total runtime of the entire program is approximately 5-6 hours.
Please see the log file for runtime details; it includes timestamps from the last run.
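
The rule that command-line options override the config file can be sketched with the standard library alone. This is not the code in run_gsdmm.py: the section name, the option names `run_id` and `K`, and the helper `load_settings` are hypothetical, so check `python run_gsdmm.py -h` for the real options.

```python
import argparse
import configparser

# Stand-in for default_config.cfg; section and option names are hypothetical.
DEFAULTS = """
[DEFAULT]
run_id = 3
K = 100
"""


def load_settings(argv, cfg_text=DEFAULTS):
    """Read defaults from the config text, then let CLI arguments override them."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names case-sensitive (K, not k)
    config.read_string(cfg_text)
    defaults = dict(config["DEFAULT"])

    parser = argparse.ArgumentParser(description="Config defaults with CLI overrides")
    parser.add_argument("--run_id", type=int, default=int(defaults["run_id"]))
    parser.add_argument("--K", type=int, default=int(defaults["K"]))
    return parser.parse_args(argv)


print(load_settings([]))                 # values come from the config text
print(load_settings(["--run_id", "4"]))  # command-line value replaces the default
```

Feeding the config values in as argparse defaults keeps a single source of truth: any option not given on the command line falls back to the file.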
