STREAM
A Python package, presented at ACL 2024, for seamless topic modeling, topic evaluation, and topic visualization. Ideal for text analysis, natural language processing (NLP), and research in the social sciences, STREAM simplifies the extraction, interpretation, and visualization of topics from large, complex datasets.
📘Documentation | 🛠️Installation | Models | 🤔Report Issues
<h1 style="text-align: center;">STREAM: Simplified Topic Retrieval, Exploration, and Analysis Module</h1>
<h3 style="text-align: center;">- Topic Modeling Made Easy in Python -</h3>
<p>We present STREAM, a Simplified Topic Retrieval, Exploration, and Analysis Module for User-Friendly and Interactive Topic Modeling and Visualization. Our paper can be found <a href="https://aclanthology.org/2024.acl-short.41.pdf">here</a>.</p>
<h2> Table of Contents </h2>

- 🏃 Quick Start
- 🚀 Installation
- 📦 Available Models
- 📊 Available Metrics
- 🗂️ Available Datasets
- 🔧 Usage
- 📜 Citation
- 📝 License
## 🏃 Quick Start
Get started with STREAM in just a few lines of code:
```python
from stream_topic.models import KmeansTM
from stream_topic.utils import TMDataset

# Load and preprocess a bundled benchmark dataset
dataset = TMDataset()
dataset.fetch_dataset("BBC_News")
dataset.preprocess(model_type="KmeansTM")

# Fit the model and inspect the extracted topics
model = KmeansTM()
model.fit(dataset, n_topics=20)

topics = model.get_topics()
print(topics)
```
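Under the hood, `KmeansTM` follows a cluster-then-extract recipe: k-means on document embeddings, followed by class-based TF-IDF (c-TF-IDF) to pick each cluster's top words. As a rough illustration of the c-TF-IDF weighting step only (not STREAM's actual implementation), here is a stdlib-only sketch; the toy corpus, the cluster assignments, and the `ctfidf` helper are invented for the example:

```python
import math
from collections import Counter

def ctfidf(cluster_docs):
    """Class-based TF-IDF sketch: treat each cluster's documents as one long
    document, then weight each word by its in-cluster frequency times an
    inverse-frequency term over the average cluster length."""
    cluster_tf = {c: Counter(w for doc in docs for w in doc.split())
                  for c, docs in cluster_docs.items()}
    avg_words = sum(sum(tf.values()) for tf in cluster_tf.values()) / len(cluster_tf)
    scores = {}
    for c, tf in cluster_tf.items():
        total = sum(tf.values())
        scores[c] = {
            # count/total = term frequency within the cluster;
            # the log term down-weights words common across all clusters
            w: (count / total)
               * math.log(1 + avg_words / sum(t[w] for t in cluster_tf.values()))
            for w, count in tf.items()
        }
    return scores

# Toy clusters standing in for k-means output on real documents
clusters = {
    0: ["stocks fell sharply", "markets and stocks rallied"],
    1: ["the team won the match", "a late goal won the final"],
}
scores = ctfidf(clusters)
top0 = max(scores[0], key=scores[0].get)
print(top0)  # -> "stocks"
```

The repeated word "stocks" outscores words that occur once, so it surfaces as the cluster's top word, which is the behavior the real c-TF-IDF extraction relies on.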
## 🚀 Installation
You can install STREAM directly from PyPI or from the GitHub repository:
- **PyPI (Recommended):**

  ```bash
  pip install stream-topic
  ```

- **GitHub:**

  ```bash
  pip install git+https://github.com/AnFreTh/STREAM.git
  ```
- **Download necessary NLTK resources:**

  To download all NLTK resources required by some models, simply run:

  ```python
  import nltk

  def ensure_nltk_resources():
      # nltk.data.find expects a category-prefixed path, so each entry
      # pairs the lookup path with the downloadable resource name.
      resources = [
          ("corpora/stopwords", "stopwords"),
          ("corpora/wordnet", "wordnet"),
          ("tokenizers/punkt_tab", "punkt_tab"),
          ("corpora/brown", "brown"),
          ("taggers/averaged_perceptron_tagger", "averaged_perceptron_tagger"),
      ]
      for path, resource in resources:
          try:
              nltk.data.find(path)
          except LookupError:
              try:
                  print(f"Downloading NLTK resource: {resource}")
                  nltk.download(resource)
              except Exception as e:
                  print(f"Failed to download {resource}: {e}")

  ensure_nltk_resources()
  ```
- **Install requirements for add-ons:**

  To use STREAM's visualizations, simply run:

  ```bash
  pip install stream-topic[plotting]
  ```

  For BERTopic, run:

  ```bash
  pip install stream-topic[hdbscan]
  ```

  For DCTE:

  ```bash
  pip install stream-topic[dcte]
  ```

  For the experimental features:

  ```bash
  pip install stream-topic[experimental]
  ```
## 📦 Available Models
STREAM offers a variety of neural and non-neural topic models, and we are continuously incorporating new ones. If you wish to contribute your own model, or would like another model incorporated, please raise an issue with the required information. Currently, the following models are implemented:
<div align="center" style="width: 100%;"> <table style="margin: 0 auto;"> <thead> <tr> <th><strong>Name</strong></th> <th><strong>Implementation</strong></th> </tr> </thead> <tbody> <tr> <td><a href="https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf?ref=http://githubhelp.com">LDA</a></td> <td>Latent Dirichlet Allocation</td> </tr> <tr> <td><a href="https://www.nature.com/articles/44565">NMF</a></td> <td>Non-negative Matrix Factorization</td> </tr> <tr> <td><a href="https://arxiv.org/abs/2004.14914">WordCluTM</a></td> <td>Tired of topic models?</td> </tr> <tr> <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">CEDC</a></td> <td>Topics in the Haystack</td> </tr> <tr> <td><a href="https://arxiv.org/pdf/2212.09422.pdf">DCTE</a></td> <td>Human in the Loop</td> </tr> <tr> <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">KMeansTM</a></td> <td>Simple Kmeans followed by c-tfidf</td> </tr> <tr> <td><a href="https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=b3c81b523b1f03c87192aa2abbf9ffb81a143e54">SomTM</a></td> <td>Self organizing map followed by c-tfidf</td> </tr> <tr> <td><a href="https://ieeexplore.ieee.org/abstract/document/10066754">CBC</a></td> <td>Coherence based document clustering</td> </tr> <tr> <td><a href="https://arxiv.org/pdf/2403.03737">TNTM</a></td> <td>Transformer-Representation Neural Topic Model</td> </tr> <tr> <td><a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00325/96463/Topic-Modeling-in-Embedding-Spaces">ETM</a></td> <td>Topic modeling in embedding spaces</td> </tr> <tr> <td><a href="https://arxiv.org/abs/2004.03974">CTM</a></td> <td>Combined Topic Model</td> </tr> <tr> <td><a href="https://arxiv.org/abs/2303.14951">CTMNeg</a></td> <td>Contextualized Topic Models with Negative Sampling</td> </tr> <tr> <td><a 
href="https://arxiv.org/abs/1703.01488">ProdLDA</a></td> <td>Autoencoding Variational Inference For Topic Models</td> </tr> <tr> <td><a href="https://arxiv.org/abs/1703.01488">NeuralLDA</a></td> <td>Autoencoding Variational Inference For Topic Models</td> </tr> <tr> <td><a href="https://arxiv.org/abs/2008.13537">NSTM</a></td> <td>Neural Topic Model via Optimal Transport</td> </tr> </tbody> </table> </div>

## 📊 Available Metrics
Since evaluating topic models, especially automatically, is notoriously difficult, STREAM implements numerous evaluation metrics. The intruder-based metrics in particular, while they may take some time to compute, have shown strong correlation with human evaluation.
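To illustrate the intruder idea: ISIM is the average cosine similarity between a topic's top-word embeddings and an intruder word's embedding, so a lower score suggests a more coherent topic. Below is a minimal stdlib sketch; the three-dimensional vectors and the `isim` helper are invented for illustration, while STREAM computes this over real word embeddings:

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def isim(topic_vecs, intruder_vec):
    # Average cosine similarity of the topic's top words to the intruder
    return sum(cosine(v, intruder_vec) for v in topic_vecs) / len(topic_vecs)

# Toy embeddings: a tight "finance" topic vs. an unrelated intruder word
topic = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.1), (0.95, 0.05, 0.0)]
intruder = (0.0, 0.1, 0.9)

score = isim(topic, intruder)
print(round(score, 3))  # low value: the intruder is far from the topic
```

A coherent topic keeps its top words far from any intruder, so a low ISIM like this one is the desirable outcome.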
<div align="center" style="width: 100%;"> <table style="margin: 0 auto;"> <thead> <tr> <th><strong>Name</strong></th> <th><strong>Description</strong></th> </tr> </thead> <tbody> <tr> <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">ISIM</a></td> <td>Average cosine similarity of top words of a topic to an intruder word.</td> </tr> <tr> <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">INT</a></td> <td>For a given topic and a given intruder word, Intruder Accuracy is the fraction of top words to which the intruder has the least similar embedding among all top words.</td> </tr> <tr> <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">ISH</a></td> <td>Calculates the shift in the centroid of a topic when an intruder word is replaced.</td> </tr> <tr> <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">Expressivity</a></td> <td>Cosine distance of topics to meaningless (stopword) embedding centroid.</td> </tr> <tr> <td><a href="https://link.springer.com/chapter/10.1007/978-3-030-80599-9_4">Embedding Topic Diversity</a></td> <td>Topic diversity in the embedding space.</td> </tr> <tr> <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">Embedding Coherence</a></td> <td>Cosine similarity between the centroid of the embeddings of the stopwords and the centroid of the topic.</td> </tr> <tr> <td><a href="https://aclanthology.org/E14-1056.pdf">NPMI</a></td> <td>Classical NPMI coherence computed on the source corpus.</td> </tr> </tbody> </table> </div>

## 🗂️ Available Datasets
To integrate custom datasets for modeling with STREAM, please follow the example notebook in the examples folder. For benchmarking new models, STREAM already includes the following datasets:
<div align="center" style="width: 100%;"> <table style="margin: 0 auto;"> <thead> <tr> <th>Name</th> <th># Docs</th> <th># Words</th> <th># Features</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>Spotify_most_popular</td> <td>5,860</td> <td>18,193</td>