Slda

Supervised Latent Dirichlet Allocation for Classification

This is a C++ implementation of supervised latent Dirichlet allocation (sLDA) for classification.

Note that this code depends on the GNU Scientific Library.

Compiling

git clone https://github.com/chbrown/slda
cd slda
make

You may need to install GSL first. For example, on a Mac with Homebrew:

brew install gsl

Estimation

Estimate the model by executing:

slda est <data path> <label path> <settings path> <alpha> <k> <initialization> <output directory>

<data path> should point to a single file containing your training data.
- This should be a file where each line is of the form:
```
  <M> <term_1>:<count> <term_2>:<count> ... <term_N>:<count>
```
- where <M> is the number of unique terms in the document, and the <count> associated with each term is how many times that term appeared in the document. (For an example, see test/images/train-data.dat.)
<label path> should point to a file of labels.
- Each line should consist of a single integer class label from 0 to C-1, where C is the number of classes.
- This file should have the same number of lines as the file specified by <data path>.
<settings path> should point to a file with various settings, e.g., settings.txt.
<alpha> is a floating-point hyperparameter (the Dirichlet prior on per-document topic proportions).
<k> is the number of topics.
<initialization> specifies the initialization method. There are three options:
- "seeded"
- "random"
- <model path> (a path to some pre-existing model)
<output directory> should point to a directory where the estimator's output will be stored. This directory will be created if it does not already exist.
- The estimator outputs models in two types of files:
  - <iteration>.model is the model saved in binary format, which is easy and fast to use for inference.
  - <iteration>.model.text is the model saved in text format, which is convenient for printing topics or for further analysis using a scripting language.
- It also produces variational posterior Dirichlets in a file called:
  - <iteration>.gamma
- Running the estimator on the 8-class image dataset produces the output:
```
  010.gamma
  010.model
  010.model.text
  020.gamma
  020.model
  020.model.text
  final.gamma
  final.model
  final.model.text
  likelihood.dat
  word-assignments.dat
```

Example usage:

./slda est test/images/train-data.dat test/images/train-label.dat \
    settings.txt 1.0 10 random tmp/
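
Before running the estimator, it can help to check that the input files match the formats described above. The following is a minimal Python sketch, not part of slda itself. The function names `parse_document` and `check_labels` are hypothetical, and the sketch assumes that terms are integer indices and that <M> equals the number of term:count pairs on the line.

```python
def parse_document(line):
    """Parse one '<M> <term>:<count> ...' line into a {term: count} dict.

    Assumption: terms and counts are integers, and <M> must equal the
    number of distinct terms listed on the line.
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty document line")
    declared = int(fields[0])
    counts = {}
    for pair in fields[1:]:
        term, count = pair.split(":")
        counts[int(term)] = int(count)
    if len(counts) != declared:
        raise ValueError(
            f"line declares {declared} unique terms but lists {len(counts)}"
        )
    return counts


def check_labels(labels, num_documents):
    """Check that there is one non-negative integer label per document.

    Returns the smallest class count C consistent with labels in 0..C-1.
    """
    if len(labels) != num_documents:
        raise ValueError(
            f"{len(labels)} labels for {num_documents} documents"
        )
    if not labels:
        raise ValueError("no labels given")
    if min(labels) < 0:
        raise ValueError("labels must start at 0")
    return max(labels) + 1


# Example with in-memory data rather than real files.
docs = [parse_document(line) for line in ["3 0:2 5:1 9:4", "1 7:3"]]
num_classes = check_labels([0, 1], len(docs))
```

The same functions could be applied line by line to files such as test/images/train-data.dat and test/images/train-label.dat.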

Inference

To perform inference on a different set of data (in the same format as for estimation), execute:

slda inf <data path> <label path> <settings path> <model path> <output directory>

<data path>, <label path>, and <settings path> are all the same as in the estimation step.
<model path> is the binary final.model file from the estimation step.
<output directory> is the directory where the inference output, including the predicted labels, will be stored.
- Each output file has one line per input document.
  - inf-gamma.dat contains the variational posterior Dirichlet parameters
  - inf-labels.dat contains the predicted labels
  - inf-likelihood.dat contains each document's likelihood
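
For further analysis, the variational Dirichlet parameters can be turned into per-document topic proportions by normalizing each row, since the mean of a Dirichlet distribution with parameters gamma is gamma divided by its sum. The sketch below is a hedged illustration in Python: it assumes each line of inf-gamma.dat holds one whitespace-separated positive number per topic, which the description above does not spell out, and `topic_proportions` and `dominant_topic` are hypothetical helper names.

```python
def topic_proportions(gamma_line):
    """Return the Dirichlet mean (gamma / sum(gamma)) for one document.

    Assumption: the line is a whitespace-separated list of positive
    numbers, one per topic.
    """
    gammas = [float(value) for value in gamma_line.split()]
    total = sum(gammas)
    if not gammas or total <= 0:
        raise ValueError("expected at least one positive gamma value")
    return [g / total for g in gammas]


def dominant_topic(gamma_line):
    """Return the 0-based index of the topic with the largest proportion."""
    proportions = topic_proportions(gamma_line)
    return max(range(len(proportions)), key=proportions.__getitem__)


# Example with a made-up two-topic line.
proportions = topic_proportions("1.0 3.0")
```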

Example usage:

./slda inf test/images/test-data.dat test/images/test-label.dat \
    settings.txt tmp/final.model tmp/

The inference command also prints a final line of output, reporting accuracy against the labels specified in the <label path> argument:

average accuracy: 0.679
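
A figure like this can be cross-checked from the output files. The sketch below assumes that inf-labels.dat holds one integer predicted label per line, in the same order as the label file, and that accuracy means the fraction of documents whose predicted label equals the true one; the tool's own definition of "average accuracy" may differ. `read_labels` and `accuracy` are hypothetical helper names.

```python
def read_labels(path):
    """Read one integer label per non-blank line (assumed file layout)."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]


def accuracy(true_labels, predicted_labels):
    """Return the fraction of positions where the two label lists agree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists differ in length")
    if not true_labels:
        raise ValueError("no labels to compare")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Example with made-up labels: three of four predictions match.
score = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
```

With the example commands above, the inputs would be `read_labels("test/images/test-label.dat")` and `read_labels("tmp/inf-labels.dat")`.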

Sample data

The sample data in test/images was downloaded from http://www.cs.cmu.edu/~chongw/data/images.tgz on July 12, 2013.

Description of data from original site:

A preprocessed 8-class image dataset from Labelme.

UIUC Sports annotation files: annotations and meta information. The source image files can be found here. (Note: there might be some discrepancies and I don't seem to know why...)

License

Licensed under both the GPL v2 and GPL v3, as well as any future version of the GNU General Public License.
