TopicModelComparison

Scripts and code for replicating the experiments published in Exploring Topic Coherence over many models and many topics


Detailed Setup for Exploring Topic Coherence over many models and many topics

We (Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler) published the paper [Exploring Topic Coherence over many models and many topics][1] (link to appear soon), which compares several topic models using a variety of measures in an attempt to determine which model should be used in which application. The evaluation also compares automatic coherence measures as a quick, task-free method for comparing a variety of models. Below is a detailed series of steps for replicating the results from the paper.

The evaluation setup breaks down into the following steps:

  1. Select a corpus and pre-process it.
  2. Remove stop words and infrequent words, and format the corpus.
  3. Perform topic modeling on all documents.
  4. Compute topic coherence measures for the induced topics.
  5. Compute word similarities using semantic pairing tests.
  6. Compute classifier accuracy using the induced topics.

Each of these steps is automated in the bash scripts provided in this repository. To run those scripts, read the last section for downloading the needed components and setting parameters, then watch the scripts blaze through the setup.

The rest of this writeup explains each step in more detail than was permitted in the published paper.

Selecting the corpus

The evaluation requires the use of a semantically labeled corpus that has a relatively cohesive focus. The original paper used all articles from 2003 of the [New York Times Annotated Corpus][2] provided by the [Linguistics Data Consortium][3].
Any similarly structured corpus should work.

The New York Times corpus requires some pre-processing before it can be easily used in the evaluation. The original corpus comes as a series of tarballed XML files, where each file looks something like this:

<nitf change.date="month day, year" change.time="HH:MM" version="-//IPTC//DTD NITF 3.3//EN">
<head>
  <title>Article Title</title>
  <meta content="Section Name" name="online_sections"/>
</head>
<body>
  <body.contents>
    <block class="full_text">
      <p>Article text</p>
    </block>
  </body.contents>
</body>
</nitf>

This leaves out a lot of details, but it covers the key items we will need: (1) the full text of the article and (2) all online_sections for the article. Extracting this can be kinda hairy. The following snippet gives the gist of how to extract and format the necessary data:

import scala.xml.XML

// `file` is the path to a single article's XML file.
val doc = XML.loadFile(file)
val sections = (doc \\ "meta").filter(node => (node \ "@name").text == "online_sections")
                              .map(node => (node \ "@content").text)
                              .mkString(";")
val text = (doc \\ "block").filter(node => (node \ "@class").text == "full_text")
                           .flatMap(node => (node \ "p").map(_.text.replace("\n", " ").trim))
                           .mkString(" ")

Before printing the data, we also need to tokenize everything. We used the OpenNLP MaxEnt tokenizers. First download the English MaxEnt tokenizer model [here][4], then do the following before processing each document:

import java.io.FileInputStream

import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}
import scala.io.Source

val tokenizerModel = new TokenizerModel(new FileInputStream(modelFileName))
val tokenizer = new TokenizerME(tokenizerModel)
val stopWords = Source.fromFile(args(1)).getLines.toSet
def acceptToken(token: String) = !stopWords.contains(token)

And then do the following to each piece of text extracted:

val tokenizedText = tokenizer.tokenize(text.toLowerCase).filter(acceptToken).mkString(" ")
printf("%s\t%s\n", sections, tokenizedText)

This should generate one line per document in the format

section_1(;section_n)*<TAB>doc_text

With properly tokenized text and a series of stop words removed.
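As a quick sanity check on that format, every line should contain exactly one tab separating the section list from the document text; a minimal sketch, using made-up sample lines:

```shell
# Verify that each line of the one-document-per-line file has exactly
# one tab between the section list and the tokenized text.
# The sample data below is invented for illustration.
printf 'Sports;Arts\tyankees win game\nBusiness\tmarkets fall\n' > onedoc.sample
awk -F'\t' 'NF != 2 { bad++ } END { print bad + 0, "malformed lines" }' onedoc.sample
```

A non-zero count points at documents whose text or section list contains a stray tab.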

Filtering tokens

In order to limit the memory requirements of our processing steps, we discard any word that is not in the list of word similarity pairs or the top 100k most frequent tokens in the corpus. The following bash lines will accomplish this:

cat $oneDocFile | cut -f 2 | tr " " "\n" | sort | uniq -c | \
                  sort -n -k 1 -r | head -n 100000 | \
                  awk '{ print $2 }' > $topTokens
cat wordsim*.terms.txt $topTokens | sort -u > .temp.txt
mv .temp.txt $topTokens
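The frequency-ranking half of that pipeline can be tried on a toy document to see what it keeps (the input line is invented):

```shell
# Toy run of the token-frequency pipeline above: count tokens in the
# text column, rank by descending count, and keep the top 2.
printf 'Sports\tthe cat sat on the mat the cat\n' | cut -f 2 | tr " " "\n" | \
    sort | uniq -c | sort -n -k 1 -r | head -n 2 | awk '{ print $2 }'
```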

Once we've gotten the top tokens that'll be used during processing, we do one more filtering pass over the corpus to reduce each document to only the accepted words and to discard any documents that contain no useful content words. Running [FilterCorpus][4] with the top tokens file and the corpus file will produce a properly filtered corpus.

Topic Modeling

With all the pre-processing completed, we can now generate topics for the corpus. We do this using two different methods: (1) Latent Dirichlet Allocation and (2) Latent Semantic Analysis. Unless otherwise stated, we performed topic modeling with each method for 1 to 100 topics, and for 110 to 500 topics in steps of 10.
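That sweep of topic counts is easy to generate with seq; a small sketch (the variable name is illustrative, not from the repository's scripts):

```shell
# Topic counts used in the paper: 1..100, then 110..500 in steps of 10,
# for 140 settings in total.
nTopicsList=$( { seq 1 100; seq 110 10 500; } )
echo "$nTopicsList" | wc -l
```

Each value in the list then feeds the $nTopics parameter of the commands below.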

Processing for Latent Dirichlet Allocation

We use [Mallet's][5] fast parallel implementation of Latent Dirichlet Allocation to do the topic modeling. Since [Mallet's][5] interface does not let us easily limit the set of tokens or set the indices we want each token to have, we provide a class to do this: [TopicModelNewYorkTimes][2]. This takes five arguments:

  1. The set of content words to represent.
  2. The number of top words to report for each topic.
  3. The documents to represent.
  4. The number of topics.
  5. A name for the output data.

And we run this with the following command:

scala edu.ucla.sspace.TopicModelNewYorkTimes $topTokens 10 $oneDocFile $nTopics nyt03_LDA_$nTopics

for each value in the specified range of topics. The command performs LDA and stores the term by topic matrix in nyt03_LDA_$nTopics-ws.dat, the document by topic matrix in nyt03_LDA_$nTopics-ds.dat, and the top 10 words for each topic in nyt03_LDA_$nTopics.top10.

Processing for Latent Semantic Analysis

Latent Semantic Analysis at its core decomposes a term by document matrix into two smaller latent matrices using one of two methods: (1) [Singular Value Decomposition][6] or (2) [Non-negative Matrix Factorization][7]. We do this in two steps:

  1. Build a weighted term document matrix.
  2. Factorize the matrix using either SVD or NMF.

We use the [BuildTermDocMatrix][8] class to perform the first step. It takes four arguments:

  1. A list of words to represent.
  2. A feature transformation method; valid options are tfidf, logentropy, and none.
  3. The corpus to represent.
  4. An output filename.

We run this once on our properly formatted corpus with the top set of tokens, using this command:

scala edu.ucla.sspace.BuildTermDocMatrix $topTokens logentropy $oneDocFile $oneDocFile.mat

With the term document matrix, we then decompose it using the [MatrixFactorNewYorkTimes][9] method, which uses either SVD or NMF to decompose the matrix and stores a term by latent factor matrix and a document by latent factor matrix to disk. A sample run of this looks like:

scala edu.ucla.sspace.MatrixFactorNewYorkTimes $oneDocFile.mat nmf 10 nyt03_NMF_10-ws.dat nyt03_NMF_10-ds.dat 

This will decompose the term-document matrix using 10 latent features, or topics, and store the term by topic matrix in nyt03_NMF_10-ws.dat and the document by topic matrix in nyt03_NMF_10-ds.dat. Because the SVD is deterministic and the result for 500 topics contains the results for all smaller topic counts, we run the SVD just once with 500 topics and use the appropriate number of SVD-based topics later on.
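If the *-ws.dat files are stored as plain whitespace-separated rows (an assumption about the on-disk format, not something stated above), slicing the k-topic solution out of the 500-topic SVD result amounts to keeping the first k columns:

```shell
# Toy 2x5 "term by latent factor" matrix; keep the first 3 factors.
# The single-space-separated layout here is an assumption.
printf '1 2 3 4 5\n6 7 8 9 10\n' > svd500.sample
cut -d' ' -f1-3 svd500.sample
```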

After producing all the decompositions, we extract the top terms for each model using [ExtractTopTerms][10]. A run of this looks like:

scala edu.ucla.sspace.ExtractTopTerms $topTokens $nTopics nyt03_NMF_$nTopics-ws.dat > nyt03_NMF_$nTopics.top10

Computing Topic Coherence for all topics

Computing the topic coherence depends critically on computing some similarity value between two words that appear in the same topic. We do this in a multi-step process:

  1. Compute the list of all words appearing in any topic.
  2. Compute Pointwise Mutual Information scores between all listed words within an external corpus (for the UCI metric).
  3. Compute document co-occurrence scores for all listed words in the New York Times corpus (for the UMass metric).
  4. Start a server for each set of scores and query it for the coherence of each topic.

To compute the set of all words appearing in any topic, we just use this bash command:

cat *.top10 | tr " " "\n" | sort -u > $allTopicTerms

The [ExtractUCIStats][11] class will do just as it says: extract the raw scores needed for the UCI metric, i.e. Pointwise Mutual Information scores between each pair of topic words as they appear within a sliding window of K words in an external corpus. We use a sliding window of 20 words and the [Wackypedia][12] corpus as our external dataset. Similarly, [ExtractUMassStats][13] will extract the raw scores needed for the UMass metric, i.e. document co-occurrence counts for topic words as they appear in the New York Times corpus. These two commands will run these classes as desired:

scala edu.ucla.sspace.ExtractUCIStats $allTopicTerms $uciStatsMatrix $externalCorpusFile
scala edu.ucla.sspace.ExtractUMassStats $allTopicTerms $umassStatsMatrix $oneDocFile

Then, for each metric, we start up an [Avro][14] based [CoherenceServer][15] that computes the coherence of a topic from the raw scores computed between individual words. The server works the same with both sets of scores computed above; the only change is the input matrix. We then query the server for each topic and record the computed coherence. A key argument for computing the coherence is the epsilon value used to smooth the coherence scores so that they remain real valued: without it, a word pair that never co-occurs would send the log term to negative infinity.
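As an illustration of what such a server computes, here is a toy UMass-style score for a three-word topic, using one common formulation (the sum over word pairs of log((D(w_i, w_j) + epsilon) / D(w_j))) with invented document and co-document counts; it is a sketch of the idea, not the server's actual code:

```shell
# Toy UMass-style coherence for topic words a, b, c with made-up counts.
# D[w]   = number of documents containing w
# C[i,j] = number of documents containing both words
awk 'BEGIN {
  eps = 1e-12
  D["a"] = 10; D["b"] = 8; D["c"] = 5       # invented document frequencies
  C["a,b"] = 6; C["a,c"] = 3; C["b,c"] = 2  # invented co-document counts
  score  = log((C["a,b"] + eps) / D["b"])
  score += log((C["a,c"] + eps) / D["c"])
  score += log((C["b,c"] + eps) / D["c"])
  printf "%.2f\n", score
}'
```

Scores closer to zero indicate topic words that appear in many of the same documents; eps only matters for pairs that never co-occur.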
