TopicSignificanceRank

Calculates topic coherence for LDA models built with sklearn, rather than with Gensim. Based on the information here: http://qpleple.com/topic-coherence-to-evaluate-topic-models/

Goal

I wanted a way to compute coherence using sklearn, rather than rely on the Gensim model or be stuck with the perplexity or log-likelihood scores built into sklearn's sklearn.decomposition.LatentDirichletAllocation model.

Writing the Code

I used this website for the various calculations. Note that UCI appears to be based on an external corpus, which is not how this code was developed or tested.
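As a rough illustration of the kind of calculation involved, here is a minimal sketch of the UMass score for a single topic, assuming the usual definition: the sum over ordered word pairs of log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents. The function name `umass_coherence` and its arguments are hypothetical and are not taken from this repo's TopicCoherence class. A sparse matrix from CountVectorizer would need `.toarray()` first.

```python
import numpy as np

def umass_coherence(doc_term, top_word_ids):
    """Sketch of UMass coherence for one topic (hypothetical helper).

    doc_term: dense (n_docs, n_terms) array of term counts.
    top_word_ids: term indices for the topic, most probable first.
    Assumes every top word occurs in at least one document.
    """
    present = np.asarray(doc_term) > 0      # True where a term occurs in a doc
    doc_freq = present.sum(axis=0)          # D(w): number of docs containing w
    score = 0.0
    for m in range(1, len(top_word_ids)):
        for l in range(m):
            w_m, w_l = top_word_ids[m], top_word_ids[l]
            # D(w_m, w_l): number of docs containing both words
            co_freq = np.sum(present[:, w_m] & present[:, w_l])
            score += np.log((co_freq + 1) / doc_freq[w_l])
    return float(score)

# Toy matrix: terms 0 and 1 co-occur often, terms 1 and 2 never do.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [0, 0, 1]])
coherent = umass_coherence(X, [0, 1])
incoherent = umass_coherence(X, [1, 2])
```

Scores closer to zero (or above) indicate top words that tend to appear in the same documents; strongly negative scores indicate words that rarely co-occur.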

Testing the Code

Following this website, I processed the Newsgroup dataset the same way they did, used the same CountVectorizer, and then ran my TopicCoherence class on the resulting vocabulary. I ran the same grid search they did. They got 10 topics as the best (using perplexity). According to sklearn, there should be 20 topics.

I got 25 topics. This may be partially due to the issues with UCI (see below). Overall, this is still closer to 20 than the perplexity result.
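The selection procedure can be sketched roughly as follows. This is not the grid search used in the test above: it swaps in a plain loop over candidate topic counts, a tiny made-up corpus in place of the Newsgroup data, and a hypothetical helper `mean_umass` that averages per-topic UMass scores (the averaging is an assumption, not necessarily what the TopicCoherence class does).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def mean_umass(present, components, n_top=5):
    """Average UMass coherence over topics (hypothetical helper)."""
    doc_freq = present.sum(axis=0)
    scores = []
    for topic in components:
        top = np.argsort(topic)[::-1][:n_top]   # top word ids, best first
        s = 0.0
        for m in range(1, len(top)):
            for l in range(m):
                co = np.sum(present[:, top[m]] & present[:, top[l]])
                s += np.log((co + 1) / doc_freq[top[l]])
        scores.append(s)
    return float(np.mean(scores))

# Toy corpus, for illustration only.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are common household pets",
    "the stock market fell as investors sold shares",
    "investors bought shares when the market rose",
    "my cat chased the dog around the house",
    "shares of the bank rose on the stock market",
]
X = CountVectorizer().fit_transform(docs)
present = X.toarray() > 0                       # binary document-term matrix

results = {}
for k in (2, 3, 4):                             # candidate topic counts
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    results[k] = mean_umass(present, lda.components_)

best_k = max(results, key=results.get)          # highest average coherence
```

On a real corpus the candidate range would be wider, and the coherence curve is worth inspecting rather than relying only on the single best value.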

Using This

Though it was not specified here, other sources seem to indicate that UCI is computed against a different corpus than the one being modeled, so keep this in mind. UMASS is computed on the corpus itself. It is recommended to use one or the other (I have used UMASS in actual practice).
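To make the difference concrete, here is a minimal sketch of a UCI-style score, assuming it is a sum of pointwise mutual information terms whose probabilities are estimated on a separate reference corpus. The function `uci_coherence`, the `eps` smoothing value, and the use of whole-document co-occurrence (rather than sliding windows over the reference text) are all simplifying assumptions for illustration.

```python
import numpy as np

def uci_coherence(reference_doc_term, top_word_ids, eps=1e-12):
    """Sketch of a UCI-style (PMI) coherence score (hypothetical helper).

    reference_doc_term: dense (n_docs, n_terms) counts from an EXTERNAL
        corpus, indexed with the same vocabulary as top_word_ids.
    Assumes every top word occurs at least once in the reference corpus.
    """
    present = np.asarray(reference_doc_term) > 0
    n_docs = present.shape[0]
    p = present.sum(axis=0) / n_docs            # p(w) in the reference corpus
    score = 0.0
    for i in range(len(top_word_ids)):
        for j in range(i + 1, len(top_word_ids)):
            wi, wj = top_word_ids[i], top_word_ids[j]
            p_joint = np.sum(present[:, wi] & present[:, wj]) / n_docs
            score += np.log((p_joint + eps) / (p[wi] * p[wj]))
    return float(score)

# Toy reference corpus: terms 0 and 1 always co-occur, terms 0 and 2 never do.
reference = np.array([[1, 1, 0],
                      [1, 1, 0],
                      [0, 0, 1],
                      [0, 0, 1]])
related = uci_coherence(reference, [0, 1])
unrelated = uci_coherence(reference, [0, 2])
```

Passing the modeled corpus itself as `reference_doc_term` would run, but it would no longer match the external-corpus setup described above.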
