
|topmost-logo| TopMost
======================

.. |topmost-logo| image:: docs/source/_static/topmost-logo.png
   :width: 38

.. image:: https://img.shields.io/github/stars/bobxwu/topmost?logo=github
   :target: https://github.com/bobxwu/topmost/stargazers
   :alt: Github Stars

.. image:: https://static.pepy.tech/badge/topmost
   :target: https://pepy.tech/project/topmost
   :alt: Downloads

.. image:: https://img.shields.io/pypi/v/topmost
   :target: https://pypi.org/project/topmost
   :alt: PyPi

.. image:: https://readthedocs.org/projects/topmost/badge/?version=latest
   :target: https://topmost.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: https://img.shields.io/github/license/bobxwu/topmost
   :target: https://www.apache.org/licenses/LICENSE-2.0/
   :alt: License

.. image:: https://img.shields.io/github/contributors/bobxwu/topmost
   :target: https://github.com/bobxwu/topmost/graphs/contributors/
   :alt: Contributors

.. image:: https://img.shields.io/badge/arXiv-2309.06908-<COLOR>.svg
   :target: https://arxiv.org/pdf/2309.06908.pdf
   :alt: arXiv

TopMost provides the complete lifecycle of topic modeling, including datasets, preprocessing, models, training, and evaluation. It covers the most popular topic modeling scenarios: basic, dynamic, hierarchical, and cross-lingual topic modeling.

| ACL 2024 Demo paper: `Towards the TopMost: A Topic Modeling System Toolkit <https://arxiv.org/pdf/2309.06908.pdf>`_.
| Survey paper on neural topic models (Artificial Intelligence Review): `A Survey on Neural Topic Models: Methods, Applications, and Challenges <https://arxiv.org/pdf/2401.15351.pdf>`_.

If you want to use TopMost, please cite it as:

::

    @inproceedings{wu2024topmost,
        title = "Towards the {T}op{M}ost: A Topic Modeling System Toolkit",
        author = "Wu, Xiaobao  and Pan, Fengjun  and Luu, Anh Tuan",
        editor = "Cao, Yixin  and Feng, Yang  and Xiong, Deyi",
        booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
        month = aug,
        year = "2024",
        address = "Bangkok, Thailand",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.acl-demos.4",
        pages = "31--41"
    }

    @article{wu2024survey,
        title={A Survey on Neural Topic Models: Methods, Applications, and Challenges},
        author={Wu, Xiaobao and Nguyen, Thong and Luu, Anh Tuan},
        journal={Artificial Intelligence Review},
        url={https://doi.org/10.1007/s10462-023-10661-7},
        year={2024},
        publisher={Springer}
    }


.. contents:: Table of Contents
   :depth: 2

Overview
========

TopMost offers the following topic modeling scenarios with models, evaluation metrics, and datasets:

.. image:: https://github.com/BobXWu/TopMost/raw/main/docs/source/_static/architecture.svg
   :width: 390
   :align: center

+------------------------------+---------------+--------------------------------------------+-----------------+
| Scenario                     | Model         | Evaluation Metric                          | Datasets        |
+==============================+===============+============================================+=================+
| Basic Topic Modeling         | | LDA_        | | TC                                       | | 20NG          |
|                              | | NMF_        | | TD                                       | | IMDB          |
|                              | | ProdLDA_    | | Clustering                               | | NeurIPS       |
|                              | | DecTM_      | | Classification                           | | ACL           |
|                              | | ETM_        |                                            | | NYT           |
|                              | | NSTM_       |                                            | | Wikitext-103  |
|                              | | TSCTM_      |                                            |                 |
|                              | | BERTopic_   |                                            |                 |
|                              | | ECRTM_      |                                            |                 |
|                              | | FASTopic_   |                                            |                 |
+------------------------------+---------------+--------------------------------------------+-----------------+
| Hierarchical Topic Modeling  | | HDP_        | | TC over levels                           | | 20NG          |
|                              | | SawETM_     | | TD over levels                           | | IMDB          |
|                              | | HyperMiner_ | | Clustering over levels                   | | NeurIPS       |
|                              | | ProGBN_     | | Classification over levels               | | ACL           |
|                              | | TraCo_      |                                            | | NYT           |
|                              |               |                                            | | Wikitext-103  |
+------------------------------+---------------+--------------------------------------------+-----------------+
| Dynamic Topic Modeling       | | DTM_        | | TC over time slices                      | | NeurIPS       |
|                              | | DETM_       | | TD over time slices                      | | ACL           |
|                              | | CFDTM_      | | Clustering                               | | NYT           |
|                              |               | | Classification                           |                 |
+------------------------------+---------------+--------------------------------------------+-----------------+
| Cross-lingual Topic Modeling | | NMTM_       | | TC (CNPMI)                               | | ECNews        |
|                              | | InfoCTM_    | | TD over languages                        | | Amazon Review |
|                              |               | | Classification (Intra and Cross-lingual) | | Rakuten Amazon|
+------------------------------+---------------+--------------------------------------------+-----------------+

Quick Start
===========

Install TopMost
---------------

Install TopMost with pip:

.. code-block:: console

    $ pip install topmost

Here we use FASTopic_ to discover topics in 20newsgroups. The trainer returns the top words of the discovered topics (``top_words``) and the topic distributions of documents (``doc_topic_dist``). The preprocessing steps are configurable; see our documentation.

.. code-block:: python

    from topmost.data import RawDataset
    from topmost.preprocess import Preprocess
    from topmost.trainers import FASTopicTrainer
    from sklearn.datasets import fetch_20newsgroups

    docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
    preprocess = Preprocess(vocab_size=10000)

    dataset = RawDataset(docs, preprocess, device="cuda")

    trainer = FASTopicTrainer(dataset, verbose=True)
    top_words, doc_topic_dist = trainer.train()

    new_docs = [
        "This is a document about space, including words like space, satellite, launch, orbit.",
        "This is a document about Microsoft Windows, including words like windows, files, dos."
    ]

    new_theta = trainer.test(new_docs)
    print(new_theta.argmax(1))
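To illustrate what the last two lines do: each row of ``new_theta`` is a distribution over topics for one document, and ``argmax`` picks the most probable topic per document. A minimal stand-alone sketch (plain Python with toy numbers, no TopMost required):

```python
# Toy doc-topic distributions: 2 documents x 3 topics, each row sums to 1.
new_theta = [
    [0.7, 0.2, 0.1],  # document 0 is dominated by topic 0
    [0.1, 0.1, 0.8],  # document 1 is dominated by topic 2
]

def argmax_row(row):
    """Index of the largest entry, i.e. the most probable topic."""
    return max(range(len(row)), key=lambda i: row[i])

assigned = [argmax_row(row) for row in new_theta]
print(assigned)  # -> [0, 2]
```

In the quick start above, the two new documents should likewise each map to the topic whose top words best match them.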

Usage
=====

Download a preprocessed dataset
-------------------------------

.. code-block:: python

    import topmost

    topmost.download_dataset('20NG', cache_path='./datasets')

Train a model
-------------

.. code-block:: python

    device = "cuda"  # or "cpu"

    # load a preprocessed dataset
    dataset = topmost.BasicDataset("./datasets/20NG", device=device, read_labels=True)
    # create a model
    model = topmost.ProdLDA(dataset.vocab_size)
    model = model.to(device)

    # create a trainer
    trainer = topmost.BasicTrainer(model, dataset)

    # train the model
    top_words, train_theta = trainer.train()

Evaluate
--------

.. code-block:: python

    from topmost import eva

    # topic diversity and coherence
    TD = eva._diversity(top_words)
    TC = eva._coherence(dataset.train_texts, dataset.vocab, top_words)

    # get doc-topic distributions of testing samples
    test_theta = trainer.test(dataset.test_data)
    # clustering
    clustering_results = eva._clustering(test_theta, dataset.test_labels)
    # classification
    cls_results = eva._cls(train_theta, test_theta, dataset.train_labels, dataset.test_labels)
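For intuition, topic diversity (TD) measures the proportion of unique words across all topics' top-word lists: 1.0 means no two topics share a word. A simplified stand-alone sketch of the idea (TopMost's actual ``eva._diversity`` implementation may differ in details):

```python
def topic_diversity(top_words):
    """Fraction of unique words over all topics' top-word lists.

    top_words: list of strings, one per topic, with words separated by
    spaces (the format TopMost trainers return).
    """
    words = [w for topic in top_words for w in topic.split()]
    return len(set(words)) / len(words)

top_words = [
    "space satellite orbit launch",
    "windows files dos software",
    "space shuttle orbit mission",
]
# "space" and "orbit" each appear twice: 10 unique words out of 12.
print(round(topic_diversity(top_words), 3))  # -> 0.833
```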

Test new documents
------------------

.. code-block:: python

    import torch
    from topmost import Preprocess

    new_docs = [
        "This is a new document about space, including words like space, satellite, launch, orbit.",
        "This is a new document about Microsoft Windows, including words like windows, files, dos."
    ]

    preprocess = Preprocess()
    new_parsed_docs, new_bow = preprocess.parse(new_docs, vocab=dataset.vocab)
    new_theta = trainer.test(torch.as_tensor(new_bow.toarray(), device=device).float())
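Conceptually, the ``parse`` step maps each new document to a bag-of-words vector over the training vocabulary, which the trained model then turns into a doc-topic distribution. A simplified stand-in for the bag-of-words step (not TopMost's implementation, which also handles tokenization and returns a sparse matrix):

```python
from collections import Counter

def to_bow(doc, vocab):
    """Count vocabulary words in a document; out-of-vocab words are dropped."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = Counter(doc.lower().split())
    bow = [0] * len(vocab)
    for word, n in counts.items():
        if word in index:
            bow[index[word]] = n
    return bow

vocab = ["space", "satellite", "orbit", "windows", "files"]
print(to_bow("Space satellite orbit space", vocab))  # -> [2, 1, 1, 0, 0]
```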

Installation
============

Stable release
--------------

To install TopMost, run this command in the terminal:

.. code-block:: console

    $ pip install topmost

This is the preferred method to install TopMost, as it will always install the most recent stable release.
