hlda
Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model
Note: this repository is intended for educational purposes only. For production use, I'd recommend https://github.com/bab2min/tomotopy, which is more mature and production-ready.
Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non‑parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and easily accommodates growing data collections. The hLDA model combines this prior with a likelihood based on a hierarchical variant of Latent Dirichlet Allocation.
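Concretely, at each level of the tree a document either follows an existing branch with probability proportional to the number of documents that previously chose it, or starts a new branch with probability proportional to the concentration parameter gamma. A minimal sketch of that choice (the function name and data layout are illustrative, not part of this package's API):

```python
import random

def ncrp_choose_child(child_counts, gamma, rng):
    """Pick a child node under the nested Chinese restaurant process.

    child_counts: dict mapping child id -> number of documents that
    previously passed through that child.
    Returns an existing child id, or None to signal a new branch.
    """
    total = sum(child_counts.values()) + gamma
    r = rng.uniform(0, total)
    for child, count in child_counts.items():
        r -= count
        if r <= 0:
            return child  # follow a popular existing branch
    return None  # start a brand-new branch, probability gamma / total
```

With gamma = 1 and a single existing child that 9 documents have already chosen, the existing branch is followed about 90% of the time, which is how the prior favours reusing popular paths while always leaving room for new ones.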
The original papers describing the algorithm are:
- Hierarchical Topic Models and the Nested Chinese Restaurant Process
- The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies
Overview
This repository contains a pure Python implementation of the Gibbs sampler for hLDA. It is intended for experimentation and as a reference implementation. The code follows the approach used in the original Mallet implementation but with a simplified interface and a fixed depth for the tree.
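Internally, the sampler alternates between resampling each document's path through the fixed-depth tree (via the nested CRP) and resampling each word's level along that path. A hedged sketch of the collapsed Gibbs update for a single word's level, assuming symmetric Dirichlet priors; the function and count names are illustrative, not this package's internals:

```python
import random

def sample_level(doc_level_counts, topic_word_counts, topic_totals,
                 alpha, eta, vocab_size, rng):
    """One collapsed Gibbs draw of a word's level along a document's path.

    doc_level_counts[k]  -- words in this document currently at level k
    topic_word_counts[k] -- occurrences of this word in the level-k topic
    topic_totals[k]      -- total words assigned to the level-k topic
    """
    weights = []
    for k in range(len(doc_level_counts)):
        doc_part = doc_level_counts[k] + alpha
        topic_part = (topic_word_counts[k] + eta) / (topic_totals[k] + vocab_size * eta)
        weights.append(doc_part * topic_part)
    # Sample a level proportionally to the unnormalised weights.
    r = rng.uniform(0, sum(weights))
    for k, w in enumerate(weights):
        r -= w
        if r <= 0:
            return k
    return len(weights) - 1
```

The first factor favours levels the document already uses heavily; the second favours levels whose topic already explains this word well.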
Key features include:
- Python 3.11+ support with minimal third‑party dependencies.
- A small set of example scripts demonstrating how to run the sampler.
- Utilities for visualising the resulting topic hierarchy.
- A test suite verifying the sampler on synthetic data and a small BBC corpus.
Installation
The package can be installed directly from PyPI:
pip install hlda
Alternatively, to develop locally, clone this repository and install it in editable mode:
git clone https://github.com/joewandy/hlda.git
cd hlda
pip install -e .
pre-commit install
Usage
The easiest way to get started is by using the sample BBC dataset provided in the
data/ directory. You can run the full demonstration from the command line:
python examples/bbc_demo.py --data-dir data/bbc/tech --iterations 20
If you installed the package from PyPI you can run the same demo via the
hlda-run command:
hlda-run --data-dir data/bbc/tech --iterations 20
To write the learned hierarchy to disk in JSON format, pass
--export-tree <file> when running the script:
python scripts/run_hlda.py --data-dir data/bbc/tech --export-tree tree.json
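The exported JSON can then be inspected with standard tooling. The exact schema is not documented here, so the node layout below (a "words" list plus a "children" list per topic node) is an assumption for illustration only:

```python
import json

# Hypothetical tree layout; the real schema of tree.json may differ.
sample = json.loads("""
{"words": ["the", "to", "of"],
 "children": [
   {"words": ["phone", "mobile"], "children": []},
   {"words": ["game", "games"],   "children": []}
 ]}
""")

def tree_lines(node, depth=0, out=None):
    """Collect one indented line of top words per topic node."""
    out = [] if out is None else out
    out.append("  " * depth + ", ".join(node["words"]))
    for child in node.get("children", []):
        tree_lines(child, depth + 1, out)
    return out

for line in tree_lines(sample):
    print(line)
```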
If you make use of the BBC dataset, please cite the publication by Greene and
Cunningham (2006) as detailed in CITATION.cff.
Example scripts for the BBC dataset and synthetic data are available in the
examples/ directory.
Within Python you can also construct the sampler directly:
from hlda.sampler import HierarchicalLDA
corpus = [["word", "word", ...], ...] # list of tokenised documents
vocab = sorted({w for doc in corpus for w in doc})
hlda = HierarchicalLDA(corpus, vocab, alpha=1.0, gamma=1.0, eta=0.1,
                       num_levels=3, seed=0)
hlda.estimate(iterations=50, display_topics=10)
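Here corpus is a list of tokenised documents and vocab the sorted set of distinct tokens, as in the snippet above. A minimal preprocessing sketch for turning raw strings into that shape (the regex tokeniser is illustrative; real pipelines usually also lower-case less naively and remove stop words):

```python
import re

docs = ["Apple unveils new phone", "New game console sells out"]

# Lower-case and keep runs of letters as tokens.
corpus = [re.findall(r"[a-z]+", d.lower()) for d in docs]
vocab = sorted({w for doc in corpus for w in doc})
```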
Integration with scikit-learn
The package provides a HierarchicalLDAEstimator that follows the scikit-learn API. This allows using the sampler inside a standard Pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from hlda.sklearn_wrapper import HierarchicalLDAEstimator
vectorizer = CountVectorizer()
prep = FunctionTransformer(
    lambda X: (
        [[i for i, c in enumerate(row) for _ in range(int(c))] for row in X.toarray()],
        list(vectorizer.get_feature_names_out()),
    ),
    validate=False,
)
pipeline = Pipeline([
    ("vect", vectorizer),
    ("prep", prep),
    ("hlda", HierarchicalLDAEstimator(num_levels=3, iterations=10, seed=0)),
])
pipeline.fit(documents)
assignments = pipeline.transform(documents)
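The lambda in the prep step expands each bag-of-words count row back into the flat list of vocabulary indices that the sampler expects as input. The same expansion in isolation, on a plain list of counts:

```python
def counts_to_tokens(row):
    """Turn a count vector into a flat list of repeated word indices."""
    return [i for i, c in enumerate(row) for _ in range(int(c))]

counts_to_tokens([2, 0, 1])  # word 0 twice, word 2 once -> [0, 0, 2]
```

This round trip loses word order, which is fine here: hLDA, like LDA, treats each document as a bag of words.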
Running the tests
The repository includes a small test suite that checks the sampler on both the BBC corpus and synthetic data. After installing the development dependencies you can run:
pytest -q
All tests should pass in a few seconds.
License
This project is licensed under the terms of the MIT license. See
LICENSE.txt for details.