LDATopicCoherence

Building a scorer for grid searches based on topic coherence for use with sklearn's LDA model.

Install / Use

/learn @NeverForged/LDATopicCoherence
README

TopicSignificanceRank

Based on the information here: http://qpleple.com/topic-coherence-to-evaluate-topic-models/. Calculates topic coherence for LDA through sklearn, rather than through Gensim.

Goal

Wanted a method to find coherence using sklearn, rather than relying on the Gensim model or being stuck with perplexity or log-likelihood as built into sklearn's sklearn.decomposition.LatentDirichletAllocation model.
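As a sketch of what such a scorer computes, here is a minimal UMass coherence function built only on sklearn's CountVectorizer output and a fitted LatentDirichletAllocation model. The function name and toy corpus below are illustrative, not the repository's actual code:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def umass_coherence(lda, doc_term, top_n=10):
    """Mean UMass coherence across topics.
    doc_term: document-term count matrix from CountVectorizer."""
    binary = (doc_term > 0).astype(int)               # word present in doc?
    occ = np.asarray(binary.sum(axis=0)).ravel()      # D(w): docs containing w
    cooc = np.asarray((binary.T @ binary).todense())  # D(wi, wj): co-occurrence
    scores = []
    for topic in lda.components_:
        top = np.argsort(topic)[::-1][:top_n]         # top words by topic weight
        s = 0.0
        for i in range(1, len(top)):
            for j in range(i):
                # UMass pair score: log((D(wi, wj) + 1) / D(wj)),
                # where wj is ranked above wi in the topic
                s += np.log((cooc[top[i], top[j]] + 1) / occ[top[j]])
        scores.append(s)
    return float(np.mean(scores))

# Example: fit LDA on a toy corpus and score it
docs = ["apple banana fruit", "banana fruit smoothie", "dog cat pet",
        "cat dog leash", "fruit apple juice", "pet dog bark"]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(umass_coherence(lda, X, top_n=5))
```

Higher (less negative) values indicate topics whose top words co-occur more often in the corpus's own documents.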

Writing the Code

Used the website above to implement the various calculations. Note that UCI is normally computed against an external reference corpus, which is not how this was developed or tested.
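For comparison with the UMass measure, here is a rough sketch of the UCI calculation, approximating the word probabilities with document-level co-occurrence on the modelled corpus itself (the original measure estimates them with a sliding window over an external reference corpus, per the caveat above). The function name and toy matrix are hypothetical:

```python
import numpy as np

def uci_coherence(top_words, binary_dt, eps=1e-12):
    """UCI coherence for one topic's top-word column indices.
    binary_dt: dense 0/1 document-term array.
    Averages pointwise mutual information (PMI) over word pairs."""
    n_docs = binary_dt.shape[0]
    p_w = binary_dt.sum(axis=0) / n_docs       # P(w)
    p_ww = (binary_dt.T @ binary_dt) / n_docs  # P(wi, wj)
    total, pairs = 0.0, 0
    for i in range(len(top_words)):
        for j in range(i + 1, len(top_words)):
            wi, wj = top_words[i], top_words[j]
            # PMI, with eps to avoid log(0) for never-co-occurring pairs
            total += np.log((p_ww[wi, wj] + eps) / (p_w[wi] * p_w[wj]))
            pairs += 1
    return total / pairs

# Hypothetical usage: 4 documents, 3 vocabulary words
B = np.array([[1, 1, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
print(uci_coherence([0, 1], B))  # words 0 and 1 co-occur often, so PMI > 0
```

Because the probabilities here come from the same corpus being modelled, this is only an approximation of UCI as originally defined.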

Testing the Code

Based on this website, I processed the Newsgroup dataset the same way they did, used the same CountVectorizer, and ran my TopicCoherence class on the resulting vocabulary. I then ran the same grid search they did. They got 10 as the best number of topics (using perplexity). According to sklearn, there should be 20 topics.

I got 25 topics. This may be partially due to the issues with UCI (see below). Overall, it is still better than perplexity.
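A grid search like this one can be wired up by passing GridSearchCV a callable with the scorer signature (estimator, X, y). A compact, self-contained sketch using a UMass-style coherence score on a toy corpus (names and corpus are illustrative, not the repository's code):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

def umass_scorer(estimator, X, y=None, top_n=5):
    """Callable scorer: GridSearchCV passes the fitted estimator
    and the held-out fold; return a scalar (higher is better)."""
    binary = (X > 0).astype(int)
    # Guard against top words absent from this CV fold
    occ = np.maximum(np.asarray(binary.sum(axis=0)).ravel(), 1)
    cooc = np.asarray((binary.T @ binary).todense())
    total = 0.0
    for topic in estimator.components_:
        top = np.argsort(topic)[::-1][:top_n]
        for i in range(1, len(top)):
            for j in range(i):
                total += np.log((cooc[top[i], top[j]] + 1) / occ[top[j]])
    return total / len(estimator.components_)

docs = ["apple banana fruit", "banana fruit smoothie", "dog cat pet",
        "cat dog leash", "fruit apple juice", "pet dog bark"] * 5
X = CountVectorizer().fit_transform(docs)
search = GridSearchCV(
    LatentDirichletAllocation(max_iter=10, random_state=0),
    {"n_components": [2, 4]},
    scoring=umass_scorer, cv=2)
search.fit(X)
print(search.best_params_)
```

GridSearchCV maximizes the score, so the coherence (which is higher when topics are more coherent) can be returned directly, unlike perplexity, which would need its sign flipped.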

Using This

Though it was not specified on the page above, other sources indicate that UCI is computed against a different corpus than the one being modelled, so keep this in mind. UMASS is based on the corpus itself. Pick one measure and use it consistently (I have used UMASS in actual practice).
