hlda
Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model
Note: this repository is intended for educational purposes only. For production use, I'd recommend https://github.com/bab2min/tomotopy, which is more mature and production-ready.
Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non‑parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and easily accommodates growing data collections. The hLDA model combines this prior with a likelihood based on a hierarchical variant of Latent Dirichlet Allocation.
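Concretely, at each level of the tree a document either follows an existing branch with probability proportional to the number of documents that previously chose it, or starts a new branch with probability proportional to the concentration parameter gamma. A minimal sketch of that choice (the function name and data layout are illustrative, not part of this package's API):

```python
import random

def ncrp_choose_child(child_counts, gamma, rng):
    """Pick a child node under the nested Chinese restaurant process.

    child_counts: dict mapping child id -> number of documents that
    previously passed through that child.
    Returns an existing child id, or None to signal a new branch.
    """
    total = sum(child_counts.values()) + gamma
    r = rng.uniform(0, total)
    for child, count in child_counts.items():
        r -= count
        if r <= 0:
            return child  # follow a popular existing branch
    return None  # start a brand-new branch, probability gamma / total
```

With gamma = 1 and a single existing child that 9 documents have already chosen, the existing branch is followed about 90% of the time, which is how the prior favours reusing popular paths while always leaving room for new ones.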
The original papers describing the algorithm are:
- Hierarchical Topic Models and the Nested Chinese Restaurant Process
- The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies
Overview
This repository contains a pure Python implementation of the Gibbs sampler for hLDA. It is intended for experimentation and as a reference implementation. The code follows the approach used in the original Mallet implementation but with a simplified interface and a fixed depth for the tree.
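Internally, the sampler alternates between resampling each document's path through the fixed-depth tree (via the nested CRP) and resampling each word's level along that path. A hedged sketch of the collapsed Gibbs update for a single word's level, assuming symmetric Dirichlet priors; the function and count names are illustrative, not this package's internals:

```python
import random

def sample_level(doc_level_counts, topic_word_counts, topic_totals,
                 alpha, eta, vocab_size, rng):
    """One collapsed Gibbs draw of a word's level along a document's path.

    doc_level_counts[k]  -- words in this document currently at level k
    topic_word_counts[k] -- occurrences of this word in the level-k topic
    topic_totals[k]      -- total words assigned to the level-k topic
    """
    weights = []
    for k in range(len(doc_level_counts)):
        doc_part = doc_level_counts[k] + alpha
        topic_part = (topic_word_counts[k] + eta) / (topic_totals[k] + vocab_size * eta)
        weights.append(doc_part * topic_part)
    # Sample a level proportionally to the unnormalised weights.
    r = rng.uniform(0, sum(weights))
    for k, w in enumerate(weights):
        r -= w
        if r <= 0:
            return k
    return len(weights) - 1
```

The first factor favours levels the document already uses heavily; the second favours levels whose topic already explains this word well.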
Key features include:
- Python 3.11+ support with minimal third‑party dependencies.
- A small set of example scripts demonstrating how to run the sampler.
- Utilities for visualising the resulting topic hierarchy.
- A test suite verifying the sampler on synthetic data and a small BBC corpus.
Installation
The package can be installed directly from PyPI:
pip install hlda
Alternatively, to develop locally, clone this repository and install it in editable mode:
git clone https://github.com/joewandy/hlda.git
cd hlda
pip install -e .
pre-commit install
Usage
The easiest way to get started is by using the sample BBC dataset provided in the
data/ directory. You can run the full demonstration from the command line:
python examples/bbc_demo.py --data-dir data/bbc/tech --iterations 20
If you installed the package from PyPI you can run the same demo via the
hlda-run command:
hlda-run --data-dir data/bbc/tech --iterations 20
To write the learned hierarchy to disk in JSON format, pass
--export-tree <file> when running the script:
python scripts/run_hlda.py --data-dir data/bbc/tech --export-tree tree.json
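The exported JSON can then be inspected with standard tooling. The exact schema is not documented here, so the node layout below (a "words" list plus a "children" list per topic node) is an assumption for illustration only:

```python
import json

# Hypothetical tree layout; the real schema of tree.json may differ.
sample = json.loads("""
{"words": ["the", "to", "of"],
 "children": [
   {"words": ["phone", "mobile"], "children": []},
   {"words": ["game", "games"],   "children": []}
 ]}
""")

def tree_lines(node, depth=0, out=None):
    """Collect one indented line of top words per topic node."""
    out = [] if out is None else out
    out.append("  " * depth + ", ".join(node["words"]))
    for child in node.get("children", []):
        tree_lines(child, depth + 1, out)
    return out

for line in tree_lines(sample):
    print(line)
```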
If you make use of the BBC dataset, please cite the publication by Greene and
Cunningham (2006) as detailed in CITATION.cff.
Example scripts for the BBC dataset and synthetic data are available in the
examples/ directory.
Within Python you can also construct the sampler directly:
from hlda.sampler import HierarchicalLDA
corpus = [["word", "word", ...], ...] # list of tokenised documents
vocab = sorted({w for doc in corpus for w in doc})
hlda = HierarchicalLDA(corpus, vocab, alpha=1.0, gamma=1.0, eta=0.1,
                       num_levels=3, seed=0)
hlda.estimate(iterations=50, display_topics=10)
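Here corpus is a list of tokenised documents and vocab the sorted set of distinct tokens, as in the snippet above. A minimal preprocessing sketch for turning raw strings into that shape (the regex tokeniser is illustrative; real pipelines usually also lower-case less naively and remove stop words):

```python
import re

docs = ["Apple unveils new phone", "New game console sells out"]

# Lower-case and keep runs of letters as tokens.
corpus = [re.findall(r"[a-z]+", d.lower()) for d in docs]
vocab = sorted({w for doc in corpus for w in doc})
```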
Integration with scikit-learn
The package provides a HierarchicalLDAEstimator that follows the scikit-learn API. This allows using the sampler inside a standard Pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from hlda.sklearn_wrapper import HierarchicalLDAEstimator
vectorizer = CountVectorizer()
prep = FunctionTransformer(
    lambda X: (
        [[i for i, c in enumerate(row) for _ in range(int(c))] for row in X.toarray()],
        list(vectorizer.get_feature_names_out()),
    ),
    validate=False,
)
pipeline = Pipeline([
    ("vect", vectorizer),
    ("prep", prep),
    ("hlda", HierarchicalLDAEstimator(num_levels=3, iterations=10, seed=0)),
])
pipeline.fit(documents)
assignments = pipeline.transform(documents)
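The lambda in the prep step expands each bag-of-words count row back into the flat list of vocabulary indices that the sampler expects as input. The same expansion in isolation, on a plain list of counts:

```python
def counts_to_tokens(row):
    """Turn a count vector into a flat list of repeated word indices."""
    return [i for i, c in enumerate(row) for _ in range(int(c))]

counts_to_tokens([2, 0, 1])  # word 0 twice, word 2 once -> [0, 0, 2]
```

This round trip loses word order, which is fine here: hLDA, like LDA, treats each document as a bag of words.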
Running the tests
The repository includes a small test suite that checks the sampler on both the BBC corpus and synthetic data. After installing the development dependencies you can run:
pytest -q
All tests should pass in a few seconds.
License
This project is licensed under the terms of the MIT license. See
LICENSE.txt for details.