Supervised Latent Dirichlet Allocation for Classification
This is a C++ implementation of supervised latent Dirichlet allocation (sLDA) for classification.
Note that this code depends on the GNU Scientific Library (GSL).
Compiling
git clone https://github.com/chbrown/slda
cd slda
make
You may need to install GSL first. For example, on a Mac with Homebrew:
brew install gsl
Estimation
Estimate the model by executing:
slda est <data path> <label path> <settings path> <alpha> <k> <initialization> <output directory>
- <data path> should point to a single file containing your training data.
  This should be a file where each line is of the form:
  <M> <term_1>:<count> <term_2>:<count> ... <term_N>:<count>
  where <M> is the number of unique terms in the document, and the <count> associated with each term is how many times that term appeared in the document. (For an example, see test/images/train-data.dat.)
- <label path> points to a file of labels.
  - Each line should consist of a single integer, starting with 0, up to C-1, if we have C classes.
  - This file should have the same number of lines as the file specified by <data path>.
- <settings path> should point to a file with various settings, e.g., settings.txt
- <alpha> is a floating point hyperparameter (a prior)
- <k> is the number of topics
- <initialization> specifies the initialization method. There are three options:
  - "seeded"
  - "random"
  - <model path> (a path to some pre-existing model)
- <output directory> should point to a directory where the estimator's output will be stored. This directory will be created if it does not already exist.
The estimator outputs models in two types of files:
- <iteration>.model is the model saved in the binary format, which is easy and fast to use for inference.
- <iteration>.model.text is the model saved in the text format, which is convenient for printing topics or further analysis using a scripting language.

It also produces variational posterior Dirichlets in a file called:
<iteration>.gamma

Running the estimator on the 8-class image dataset produces the output:
010.gamma 010.model 010.model.text 020.gamma 020.model 020.model.text final.gamma final.model final.model.text likelihood.dat word-assignments.dat
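The <data path> and <label path> formats described above can be produced from tokenized documents with a short script. The following is a minimal Python sketch, not part of this repository. The helper names `format_document` and `write_corpus` are hypothetical, and the sketch assumes that terms are integer vocabulary indices, which this README does not state explicitly.

```python
from collections import Counter


def format_document(term_ids):
    """Return one data-file line: '<M> <term>:<count> ...' for a single document."""
    counts = Counter(term_ids)
    # <M> is the number of unique terms, not the total number of tokens.
    pairs = " ".join(f"{term}:{count}" for term, count in sorted(counts.items()))
    return f"{len(counts)} {pairs}".rstrip()


def write_corpus(documents, labels, data_path, label_path):
    """Write parallel data and label files, one line per document."""
    if len(documents) != len(labels):
        raise ValueError("need exactly one label per document")
    for label in labels:
        # Labels are integers from 0 up to C-1 for C classes.
        if not isinstance(label, int) or label < 0:
            raise ValueError(f"invalid label: {label!r}")
    with open(data_path, "w") as data_file, open(label_path, "w") as label_file:
        for term_ids, label in zip(documents, labels):
            data_file.write(format_document(term_ids) + "\n")
            label_file.write(f"{label}\n")
```

For instance, `format_document([3, 0, 3, 7, 3])` yields the line `3 0:1 3:3 7:1`: three unique terms, with term 3 appearing three times.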
Example usage:
./slda est test/images/train-data.dat test/images/train-label.dat \
settings.txt 1.0 10 random tmp/
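Before running the estimator, it can help to confirm that the two input files agree with each other. The sketch below is an illustrative Python check, not something the slda binary provides. The function name `check_inputs` is hypothetical, and skipping blank lines is an assumption made here for convenience.

```python
def check_inputs(data_path, label_path):
    """Return the number of documents, raising ValueError on inconsistencies."""
    with open(data_path) as f:
        # Assumption: blank lines are ignored rather than counted as documents.
        data_lines = [line.split() for line in f if line.strip()]
    with open(label_path) as f:
        labels = [int(line) for line in f if line.strip()]

    # The label file must have the same number of lines as the data file.
    if len(data_lines) != len(labels):
        raise ValueError(
            f"{len(data_lines)} documents but {len(labels)} labels"
        )

    for number, fields in enumerate(data_lines, start=1):
        declared = int(fields[0])  # <M>, the number of unique terms
        pairs = fields[1:]
        if declared != len(pairs):
            raise ValueError(
                f"document {number}: <M> is {declared} but found {len(pairs)} terms"
            )
        for pair in pairs:
            term, count = pair.split(":")
            if int(count) <= 0:
                raise ValueError(f"document {number}: non-positive count in {pair!r}")

    # Labels run from 0 up to C-1.
    if labels and min(labels) < 0:
        raise ValueError("labels must be non-negative integers")
    return len(labels)
```

A call such as `check_inputs("test/images/train-data.dat", "test/images/train-label.dat")` would return the number of training documents if both files are consistent.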
Inference
To perform inference on a different set of data (in the same format as for estimation), execute:
slda inf <data path> <label path> <settings path> <model path> <output directory>
- <data path>, <label path>, and <settings path> are all the same as in the estimation step.
- <model path> is the binary final.model file from the estimation step.
- <output directory> is the output directory, where the predicted labels will be stored.
  - inf-gamma.dat describes the variational posterior Dirichlets
  - inf-labels.dat displays the predicted labels
  - inf-likelihood.dat depicts each document's likelihood
  - Each output file has one line per input document.
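The Dirichlet parameters in inf-gamma.dat can be turned into per-document topic proportions by normalizing each row, since the mean of a Dirichlet distribution is its parameter vector divided by the sum of its entries. The Python sketch below assumes that each line holds whitespace-separated positive numbers, one per topic. That layout is an assumption and is not documented in this README. The function name `topic_proportions` is hypothetical.

```python
def topic_proportions(gamma_path):
    """Return one list of expected topic proportions per document."""
    proportions = []
    with open(gamma_path) as f:
        for line in f:
            # Assumed layout: k whitespace-separated Dirichlet parameters per line.
            values = [float(x) for x in line.split()]
            if not values:
                continue  # skip blank lines
            total = sum(values)
            # Dirichlet mean: gamma_i / sum_j gamma_j
            proportions.append([value / total for value in values])
    return proportions
```

Each returned row sums to 1 and can be read as the expected share of each topic in that document.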
Example usage:
./slda inf test/images/test-data.dat test/images/test-label.dat \
settings.txt tmp/final.model tmp/
This will also produce a final line of output, evaluating against the labels
specified in the <label path> argument:
average accuracy: 0.679
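The predicted labels can also be compared against the true labels outside the tool. The Python sketch below assumes that inf-labels.dat contains one integer label per line, in the same order as the <label path> file, and that the reported figure is the fraction of documents labelled correctly. Both are assumptions. The function name `accuracy` is hypothetical.

```python
def accuracy(predicted_path, true_path):
    """Return the fraction of documents whose predicted label matches the true one."""
    with open(predicted_path) as f:
        predicted = [int(line) for line in f if line.strip()]
    with open(true_path) as f:
        actual = [int(line) for line in f if line.strip()]
    if len(predicted) != len(actual):
        raise ValueError("prediction and label files differ in length")
    if not actual:
        raise ValueError("no labels to evaluate")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With the example above, this would be called as `accuracy("tmp/inf-labels.dat", "test/images/test-label.dat")`.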
Sample data
The sample data in test/images was downloaded from
http://www.cs.cmu.edu/~chongw/data/images.tgz on July 12, 2013.
Description of data from original site:
A preprocessed 8-class image dataset from Labelme.
UIUC Sports annotation files: annotations and meta information. The source image files can be found here. (Note: there might be some discrepancies and I don't seem to know why...)
License
Copyright © 2009, Chong Wang, David Blei and Li Fei-Fei
Licensed under both the GPL v2 and GPL v3, as well as any future version of the GNU General Public License.
