|topmost-logo| TopMost
==============================

.. |topmost-logo| image:: docs/source/_static/topmost-logo.png
    :width: 38
.. image:: https://img.shields.io/github/stars/bobxwu/topmost?logo=github
    :target: https://github.com/bobxwu/topmost/stargazers
    :alt: Github Stars

.. image:: https://static.pepy.tech/badge/topmost
    :target: https://pepy.tech/project/topmost
    :alt: Downloads

.. image:: https://img.shields.io/pypi/v/topmost
    :target: https://pypi.org/project/topmost
    :alt: PyPI

.. image:: https://readthedocs.org/projects/topmost/badge/?version=latest
    :target: https://topmost.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://img.shields.io/github/license/bobxwu/topmost
    :target: https://www.apache.org/licenses/LICENSE-2.0/
    :alt: License

.. image:: https://img.shields.io/github/contributors/bobxwu/topmost
    :target: https://github.com/bobxwu/topmost/graphs/contributors/
    :alt: Contributors

.. image:: https://img.shields.io/badge/arXiv-2309.06908-<COLOR>.svg
    :target: https://arxiv.org/pdf/2309.06908.pdf
    :alt: arXiv
TopMost provides the complete lifecycle of topic modeling, including datasets, preprocessing, models, training, and evaluation. It covers the most popular topic modeling scenarios: basic, dynamic, hierarchical, and cross-lingual topic modeling.
| ACL 2024 Demo paper: `Towards the TopMost: A Topic Modeling System Toolkit <https://arxiv.org/pdf/2309.06908.pdf>`__.
| Survey paper on neural topic models (Artificial Intelligence Review): `A Survey on Neural Topic Models: Methods, Applications, and Challenges <https://arxiv.org/pdf/2401.15351.pdf>`__.

If you want to use TopMost, please cite as
::

    @inproceedings{wu2024topmost,
        title = "Towards the {T}op{M}ost: A Topic Modeling System Toolkit",
        author = "Wu, Xiaobao and Pan, Fengjun and Luu, Anh Tuan",
        editor = "Cao, Yixin and Feng, Yang and Xiong, Deyi",
        booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
        month = aug,
        year = "2024",
        address = "Bangkok, Thailand",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.acl-demos.4",
        pages = "31--41"
    }

    @article{wu2024survey,
        title = {A Survey on Neural Topic Models: Methods, Applications, and Challenges},
        author = {Wu, Xiaobao and Nguyen, Thong and Luu, Anh Tuan},
        journal = {Artificial Intelligence Review},
        url = {https://doi.org/10.1007/s10462-023-10661-7},
        year = {2024},
        publisher = {Springer}
    }

.. contents:: Table of Contents
    :depth: 2

Overview
============
TopMost offers the following topic modeling scenarios with models, evaluation metrics, and datasets:
.. image:: https://github.com/BobXWu/TopMost/raw/main/docs/source/_static/architecture.svg
    :width: 390
    :align: center
+--------------------------+----------------+--------------------------------------------+------------------+
| Scenario                 | Model          | Evaluation Metric                          | Datasets         |
+==========================+================+============================================+==================+
| Basic Topic Modeling     | | LDA_         | | TC                                       | | 20NG           |
|                          | | NMF_         | | TD                                       | | IMDB           |
|                          | | ProdLDA_     | | Clustering                               | | NeurIPS        |
|                          | | DecTM_       | | Classification                           | | ACL            |
|                          | | ETM_         |                                            | | NYT            |
|                          | | NSTM_        |                                            | | Wikitext-103   |
|                          | | TSCTM_       |                                            |                  |
|                          | | BERTopic_    |                                            |                  |
|                          | | ECRTM_       |                                            |                  |
|                          | | FASTopic_    |                                            |                  |
+--------------------------+----------------+--------------------------------------------+------------------+
| Hierarchical             | | HDP_         | | TC over levels                           | | 20NG           |
| Topic Modeling           | | SawETM_      | | TD over levels                           | | IMDB           |
|                          | | HyperMiner_  | | Clustering over levels                   | | NeurIPS        |
|                          | | ProGBN_      | | Classification over levels               | | ACL            |
|                          | | TraCo_       |                                            | | NYT            |
|                          |                |                                            | | Wikitext-103   |
+--------------------------+----------------+--------------------------------------------+------------------+
| Dynamic                  | | DTM_         | | TC over time slices                      | | NeurIPS        |
| Topic Modeling           | | DETM_        | | TD over time slices                      | | ACL            |
|                          | | CFDTM_       | | Clustering                               | | NYT            |
|                          |                | | Classification                           |                  |
+--------------------------+----------------+--------------------------------------------+------------------+
| Cross-lingual            | | NMTM_        | | TC (CNPMI)                               | | ECNews         |
| Topic Modeling           | | InfoCTM_     | | TD over languages                        | | Amazon Review  |
|                          |                | | Classification (Intra and Cross-lingual) | | Rakuten        |
+--------------------------+----------------+--------------------------------------------+------------------+
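
The clustering and classification metrics above evaluate how informative the learned doc-topic distributions are. As an illustration only (an assumption about the metric definitions, not TopMost's exact implementation), clustering is often scored by treating each document's most probable topic as its cluster and comparing against the true labels with Purity and NMI:

.. code-block:: python

    import numpy as np
    from sklearn.metrics import normalized_mutual_info_score

    def purity(y_true, y_pred):
        """Fraction of documents whose cluster's majority label matches their own."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        matched = 0
        for c in np.unique(y_pred):
            members = y_true[y_pred == c]          # labels of documents in cluster c
            matched += np.bincount(members).max()  # size of the majority label
        return matched / len(y_true)

    # toy example: 4 documents, 2 topics; argmax over theta gives cluster assignments
    theta = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.3, 0.7]])
    labels = np.array([0, 0, 1, 1])
    clusters = theta.argmax(axis=1)
    print(purity(labels, clusters))                        # 1.0
    print(normalized_mutual_info_score(labels, clusters))  # 1.0
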
Quick Start
============
Install TopMost
-----------------
Install TopMost with pip:
.. code-block:: console

    $ pip install topmost

Here we try FASTopic_ to get the top words of discovered topics (``top_words``) and the topic distributions of documents (``doc_topic_dist``).
The preprocessing steps are configurable; see our documentation.
.. code-block:: python

    from topmost.data import RawDataset
    from topmost.preprocess import Preprocess
    from topmost.trainers import FASTopicTrainer
    from sklearn.datasets import fetch_20newsgroups

    # load raw documents
    docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

    # preprocess the raw documents and build a dataset
    preprocess = Preprocess(vocab_size=10000)
    dataset = RawDataset(docs, preprocess, device="cuda")

    # train the model
    trainer = FASTopicTrainer(dataset, verbose=True)
    top_words, doc_topic_dist = trainer.train()

    # infer the topic distributions of new documents
    new_docs = [
        "This is a document about space, including words like space, satellite, launch, orbit.",
        "This is a document about Microsoft Windows, including words like windows, files, dos."
    ]
    new_theta = trainer.test(new_docs)
    print(new_theta.argmax(1))

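
The printed values are only topic indices; mapping them back to each topic's top words makes the output readable. A minimal sketch, assuming ``new_theta`` is a ``(num_docs, num_topics)`` array and ``top_words`` is a list with one string per topic (the ``describe`` helper below is ours for illustration, not a TopMost API):

.. code-block:: python

    import numpy as np

    def describe(new_theta, top_words):
        # map each document to the top words of its most probable topic
        topic_ids = np.asarray(new_theta).argmax(axis=1)
        return [top_words[i] for i in topic_ids]

    # toy example: 2 documents, 2 topics
    theta = np.array([[0.1, 0.9], [0.8, 0.2]])
    words = ["space satellite orbit", "windows files dos"]
    print(describe(theta, words))  # ['windows files dos', 'space satellite orbit']
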
Usage
============
Download a preprocessed dataset
---------------------------------
.. code-block:: python

    import topmost

    topmost.download_dataset('20NG', cache_path='./datasets')

Train a model
---------------
.. code-block:: python

    device = "cuda"  # or "cpu"

    # load a preprocessed dataset
    dataset = topmost.BasicDataset("./datasets/20NG", device=device, read_labels=True)
    # create a model
    model = topmost.ProdLDA(dataset.vocab_size)
    model = model.to(device)
    # create a trainer
    trainer = topmost.BasicTrainer(model, dataset)
    # train the model
    top_words, train_theta = trainer.train()

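
``train_theta`` holds the doc-topic distribution of every training document, so simple corpus-level statistics are easy to derive from it. A sketch of ranking topics by their average share of the corpus (the ``topic_summary`` helper is hypothetical, not part of TopMost):

.. code-block:: python

    import numpy as np

    def topic_summary(top_words, train_theta, k=3):
        share = np.asarray(train_theta).mean(axis=0)  # average topic proportion
        order = np.argsort(share)[::-1][:k]           # k most prevalent topics
        return [(int(i), float(share[i]), top_words[i]) for i in order]

    # toy example: 3 documents, 2 topics
    theta = np.array([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
    words = ["space satellite orbit", "windows files dos"]
    print(topic_summary(words, theta, k=2))
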
Evaluate
----------
.. code-block:: python

    from topmost import eva

    # topic diversity and coherence
    TD = eva._diversity(top_words)
    TC = eva._coherence(dataset.train_texts, dataset.vocab, top_words)

    # get the doc-topic distributions of testing samples
    test_theta = trainer.test(dataset.test_data)

    # clustering
    clustering_results = eva._clustering(test_theta, dataset.test_labels)
    # classification
    cls_results = eva._cls(train_theta, test_theta, dataset.train_labels, dataset.test_labels)

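As a reference point for the metrics above, topic diversity (TD) is commonly defined as the fraction of unique words among all topics' top words, so 1.0 means no topic shares a word with another. A minimal sketch of that common definition (``eva._diversity`` may differ in details):

.. code-block:: python

    def topic_diversity(top_words):
        # top_words: one space-separated string of top words per topic
        words = [w for topic in top_words for w in topic.split()]
        return len(set(words)) / len(words)

    print(topic_diversity(["space satellite", "windows files"]))  # 1.0
    print(topic_diversity(["space satellite", "space files"]))    # 0.75
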
Test new documents
--------------------
.. code-block:: python

    import torch
    from topmost import Preprocess

    new_docs = [
        "This is a new document about space, including words like space, satellite, launch, orbit.",
        "This is a new document about Microsoft Windows, including words like windows, files, dos."
    ]

    # parse new documents into bag-of-words with the dataset's vocabulary
    preprocess = Preprocess()
    new_parsed_docs, new_bow = preprocess.parse(new_docs, vocab=dataset.vocab)

    # infer their doc-topic distributions
    new_theta = trainer.test(torch.as_tensor(new_bow.toarray(), device=device).float())

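
``Preprocess.parse`` above returns a sparse bag-of-words matrix aligned to the training vocabulary. The same alignment can be illustrated with scikit-learn's ``CountVectorizer`` and a fixed vocabulary (a standalone sketch, not TopMost's implementation; the small ``vocab`` list stands in for ``dataset.vocab``):

.. code-block:: python

    from sklearn.feature_extraction.text import CountVectorizer

    vocab = ["space", "satellite", "orbit", "windows", "files"]
    vectorizer = CountVectorizer(vocabulary=vocab)

    bow = vectorizer.transform([
        "a document about space, satellite, launch, orbit",
        "a document about windows and files",
    ])
    # each row counts vocabulary words only; out-of-vocabulary words are dropped
    print(bow.toarray())
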
Installation
============
Stable release
----------------
To install TopMost, run this command in the terminal:
.. code-block:: console

    $ pip install topmost

This is the preferred method to install TopMost, as it will always install the most recent stable release.
From
