|topmost-logo| TopMost
==============================

.. |topmost-logo| image:: docs/source/_static/topmost-logo.png
    :width: 38
.. image:: https://img.shields.io/github/stars/bobxwu/topmost?logo=github
    :target: https://github.com/bobxwu/topmost/stargazers
    :alt: Github Stars

.. image:: https://static.pepy.tech/badge/topmost
    :target: https://pepy.tech/project/topmost
    :alt: Downloads

.. image:: https://img.shields.io/pypi/v/topmost
    :target: https://pypi.org/project/topmost
    :alt: PyPI

.. image:: https://readthedocs.org/projects/topmost/badge/?version=latest
    :target: https://topmost.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://img.shields.io/github/license/bobxwu/topmost
    :target: https://www.apache.org/licenses/LICENSE-2.0/
    :alt: License

.. image:: https://img.shields.io/github/contributors/bobxwu/topmost
    :target: https://github.com/bobxwu/topmost/graphs/contributors/
    :alt: Contributors

.. image:: https://img.shields.io/badge/arXiv-2309.06908-<COLOR>.svg
    :target: https://arxiv.org/pdf/2309.06908.pdf
    :alt: arXiv
TopMost provides the complete lifecycle of topic modeling, including datasets, preprocessing, models, training, and evaluation. It covers the most popular topic modeling scenarios: basic, dynamic, hierarchical, and cross-lingual topic modeling.
| ACL 2024 Demo paper: `Towards the TopMost: A Topic Modeling System Toolkit <https://arxiv.org/pdf/2309.06908.pdf>`__.
| Survey paper on neural topic models (Artificial Intelligence Review): `A Survey on Neural Topic Models: Methods, Applications, and Challenges <https://arxiv.org/pdf/2401.15351.pdf>`__.

If you want to use TopMost, please cite as
::

    @inproceedings{wu2024topmost,
        title = "Towards the {T}op{M}ost: A Topic Modeling System Toolkit",
        author = "Wu, Xiaobao and Pan, Fengjun and Luu, Anh Tuan",
        editor = "Cao, Yixin and Feng, Yang and Xiong, Deyi",
        booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
        month = aug,
        year = "2024",
        address = "Bangkok, Thailand",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.acl-demos.4",
        pages = "31--41"
    }

    @article{wu2024survey,
        title = {A Survey on Neural Topic Models: Methods, Applications, and Challenges},
        author = {Wu, Xiaobao and Nguyen, Thong and Luu, Anh Tuan},
        journal = {Artificial Intelligence Review},
        url = {https://doi.org/10.1007/s10462-023-10661-7},
        year = {2024},
        publisher = {Springer}
    }

.. contents:: Table of Contents
    :depth: 2

Overview
============
TopMost offers the following topic modeling scenarios with models, evaluation metrics, and datasets:
.. image:: https://github.com/BobXWu/TopMost/raw/main/docs/source/_static/architecture.svg
    :width: 390
    :align: center
+--------------------------+----------------+--------------------------------------------+------------------+
| Scenario                 | Model          | Evaluation Metric                          | Datasets         |
+==========================+================+============================================+==================+
| Basic Topic Modeling     | | LDA_         | | TC                                       | | 20NG           |
|                          | | NMF_         | | TD                                       | | IMDB           |
|                          | | ProdLDA_     | | Clustering                               | | NeurIPS        |
|                          | | DecTM_       | | Classification                           | | ACL            |
|                          | | ETM_         |                                            | | NYT            |
|                          | | NSTM_        |                                            | | Wikitext-103   |
|                          | | TSCTM_       |                                            |                  |
|                          | | BERTopic_    |                                            |                  |
|                          | | ECRTM_       |                                            |                  |
|                          | | FASTopic_    |                                            |                  |
+--------------------------+----------------+--------------------------------------------+------------------+
| Hierarchical             | | HDP_         | | TC over levels                           | | 20NG           |
| Topic Modeling           | | SawETM_      | | TD over levels                           | | IMDB           |
|                          | | HyperMiner_  | | Clustering over levels                   | | NeurIPS        |
|                          | | ProGBN_      | | Classification over levels               | | ACL            |
|                          | | TraCo_       |                                            | | NYT            |
|                          |                |                                            | | Wikitext-103   |
+--------------------------+----------------+--------------------------------------------+------------------+
| Dynamic                  | | DTM_         | | TC over time slices                      | | NeurIPS        |
| Topic Modeling           | | DETM_        | | TD over time slices                      | | ACL            |
|                          | | CFDTM_       | | Clustering                               | | NYT            |
|                          |                | | Classification                           |                  |
+--------------------------+----------------+--------------------------------------------+------------------+
| Cross-lingual            | | NMTM_        | | TC (CNPMI)                               | | ECNews         |
| Topic Modeling           | | InfoCTM_     | | TD over languages                        | | Amazon Review  |
|                          |                | | Classification (Intra and Cross-lingual) | | Rakuten        |
+--------------------------+----------------+--------------------------------------------+------------------+
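
The clustering and classification metrics above evaluate how informative the learned doc-topic distributions are. As an illustration only (an assumption about the metric definitions, not TopMost's exact implementation), clustering is often scored by treating each document's most probable topic as its cluster and comparing against the true labels with Purity and NMI:

.. code-block:: python

    import numpy as np
    from sklearn.metrics import normalized_mutual_info_score

    def purity(y_true, y_pred):
        """Fraction of documents whose cluster's majority label matches their own."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        matched = 0
        for c in np.unique(y_pred):
            members = y_true[y_pred == c]          # labels of documents in cluster c
            matched += np.bincount(members).max()  # size of the majority label
        return matched / len(y_true)

    # toy example: 4 documents, 2 topics; argmax over theta gives cluster assignments
    theta = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.3, 0.7]])
    labels = np.array([0, 0, 1, 1])
    clusters = theta.argmax(axis=1)
    print(purity(labels, clusters))                        # 1.0
    print(normalized_mutual_info_score(labels, clusters))  # 1.0
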
Quick Start
============
Install TopMost
-----------------
Install TopMost with pip:
.. code-block:: console

    $ pip install topmost

Here we try FASTopic_ to get the top words of discovered topics (``top_words``) and the topic distributions of documents (``doc_topic_dist``).
The preprocessing steps are configurable; see our documentation.
.. code-block:: python

    from topmost.data import RawDataset
    from topmost.preprocess import Preprocess
    from topmost.trainers import FASTopicTrainer
    from sklearn.datasets import fetch_20newsgroups

    # load raw documents
    docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

    # preprocess the raw documents and build a dataset
    preprocess = Preprocess(vocab_size=10000)
    dataset = RawDataset(docs, preprocess, device="cuda")

    # train the model
    trainer = FASTopicTrainer(dataset, verbose=True)
    top_words, doc_topic_dist = trainer.train()

    # infer the topic distributions of new documents
    new_docs = [
        "This is a document about space, including words like space, satellite, launch, orbit.",
        "This is a document about Microsoft Windows, including words like windows, files, dos."
    ]
    new_theta = trainer.test(new_docs)
    print(new_theta.argmax(1))

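
The printed values are only topic indices; mapping them back to each topic's top words makes the output readable. A minimal sketch, assuming ``new_theta`` is a ``(num_docs, num_topics)`` array and ``top_words`` is a list with one string per topic (the ``describe`` helper below is ours for illustration, not a TopMost API):

.. code-block:: python

    import numpy as np

    def describe(new_theta, top_words):
        # map each document to the top words of its most probable topic
        topic_ids = np.asarray(new_theta).argmax(axis=1)
        return [top_words[i] for i in topic_ids]

    # toy example: 2 documents, 2 topics
    theta = np.array([[0.1, 0.9], [0.8, 0.2]])
    words = ["space satellite orbit", "windows files dos"]
    print(describe(theta, words))  # ['windows files dos', 'space satellite orbit']
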
Usage
============
Download a preprocessed dataset
---------------------------------
.. code-block:: python

    import topmost

    topmost.download_dataset('20NG', cache_path='./datasets')

Train a model
---------------
.. code-block:: python

    device = "cuda"  # or "cpu"

    # load a preprocessed dataset
    dataset = topmost.BasicDataset("./datasets/20NG", device=device, read_labels=True)
    # create a model
    model = topmost.ProdLDA(dataset.vocab_size)
    model = model.to(device)
    # create a trainer
    trainer = topmost.BasicTrainer(model, dataset)
    # train the model
    top_words, train_theta = trainer.train()

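
``train_theta`` holds the doc-topic distribution of every training document, so simple corpus-level statistics are easy to derive from it. A sketch of ranking topics by their average share of the corpus (the ``topic_summary`` helper is hypothetical, not part of TopMost):

.. code-block:: python

    import numpy as np

    def topic_summary(top_words, train_theta, k=3):
        share = np.asarray(train_theta).mean(axis=0)  # average topic proportion
        order = np.argsort(share)[::-1][:k]           # k most prevalent topics
        return [(int(i), float(share[i]), top_words[i]) for i in order]

    # toy example: 3 documents, 2 topics
    theta = np.array([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
    words = ["space satellite orbit", "windows files dos"]
    print(topic_summary(words, theta, k=2))
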
Evaluate
----------
.. code-block:: python

    from topmost import eva

    # topic diversity and coherence
    TD = eva._diversity(top_words)
    TC = eva._coherence(dataset.train_texts, dataset.vocab, top_words)

    # get the doc-topic distributions of testing samples
    test_theta = trainer.test(dataset.test_data)

    # clustering
    clustering_results = eva._clustering(test_theta, dataset.test_labels)
    # classification
    cls_results = eva._cls(train_theta, test_theta, dataset.train_labels, dataset.test_labels)

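As a reference point for the metrics above, topic diversity (TD) is commonly defined as the fraction of unique words among all topics' top words, so 1.0 means no topic shares a word with another. A minimal sketch of that common definition (``eva._diversity`` may differ in details):

.. code-block:: python

    def topic_diversity(top_words):
        # top_words: one space-separated string of top words per topic
        words = [w for topic in top_words for w in topic.split()]
        return len(set(words)) / len(words)

    print(topic_diversity(["space satellite", "windows files"]))  # 1.0
    print(topic_diversity(["space satellite", "space files"]))    # 0.75
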
Test new documents
--------------------
.. code-block:: python

    import torch
    from topmost import Preprocess

    new_docs = [
        "This is a new document about space, including words like space, satellite, launch, orbit.",
        "This is a new document about Microsoft Windows, including words like windows, files, dos."
    ]

    # parse new documents into bag-of-words with the dataset's vocabulary
    preprocess = Preprocess()
    new_parsed_docs, new_bow = preprocess.parse(new_docs, vocab=dataset.vocab)

    # infer their doc-topic distributions
    new_theta = trainer.test(torch.as_tensor(new_bow.toarray(), device=device).float())

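
``Preprocess.parse`` above returns a sparse bag-of-words matrix aligned to the training vocabulary. The same alignment can be illustrated with scikit-learn's ``CountVectorizer`` and a fixed vocabulary (a standalone sketch, not TopMost's implementation; the small ``vocab`` list stands in for ``dataset.vocab``):

.. code-block:: python

    from sklearn.feature_extraction.text import CountVectorizer

    vocab = ["space", "satellite", "orbit", "windows", "files"]
    vectorizer = CountVectorizer(vocabulary=vocab)

    bow = vectorizer.transform([
        "a document about space, satellite, launch, orbit",
        "a document about windows and files",
    ])
    # each row counts vocabulary words only; out-of-vocabulary words are dropped
    print(bow.toarray())
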
Installation
============
Stable release
----------------
To install TopMost, run this command in the terminal:
.. code-block:: console

    $ pip install topmost

This is the preferred method to install TopMost, as it will always install the most recent stable release.
From
