
BERTopic

<img src="images/logo.png" width="35%" align="right" />

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
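The core c-TF-IDF idea — score words by how frequent they are within a topic relative to all topics — can be illustrated with a small self-contained sketch. The toy counts and the simplified weighting `tf * log(1 + A / f_x)` below are illustrative assumptions, not BERTopic's exact implementation:

```python
import numpy as np

# Toy term counts per topic (rows: topics, columns: words) -- made-up data
counts = np.array([
    [5, 1, 0],   # topic 0
    [0, 2, 6],   # topic 1
], dtype=float)

# Class-based term frequency: normalize counts within each topic
tf = counts / counts.sum(axis=1, keepdims=True)

# IDF over classes: log(1 + A / f_x), with A the average number of words
# per topic and f_x the total frequency of word x across all topics
A = counts.sum() / counts.shape[0]
f_x = counts.sum(axis=0)
idf = np.log(1 + A / f_x)

# Higher weight = more distinctive for that topic
ctfidf = tf * idf
```

Word 0 comes out as the most distinctive word for topic 0 and word 2 for topic 1, which is exactly the behavior that makes the resulting topic descriptions interpretable.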

BERTopic supports all kinds of topic modeling techniques:

<table> <tr> <td><a href="https://maartengr.github.io/BERTopic/getting_started/guided/guided.html">Guided</a></td> <td><a href="https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html">Supervised</a></td> <td><a href="https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html">Semi-supervised</a></td> </tr> <tr> <td><a href="https://maartengr.github.io/BERTopic/getting_started/manual/manual.html">Manual</a></td> <td><a href="https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html">Multi-topic distributions</a></td> <td><a href="https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html">Hierarchical</a></td> </tr> <tr> <td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html">Class-based</a></td> <td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html">Dynamic</a></td> <td><a href="https://maartengr.github.io/BERTopic/getting_started/online/online.html">Online/Incremental</a></td> </tr> <tr> <td><a href="https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html">Multimodal</a></td> <td><a href="https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html">Multi-aspect</a></td> <td><a href="https://maartengr.github.io/BERTopic/getting_started/representation/llm.html">Text Generation/LLM</a></td> </tr> <tr> <td><a href="https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html">Zero-shot <b>(new!)</b></a></td> <td><a href="https://maartengr.github.io/BERTopic/getting_started/merge/merge.html">Merge Models <b>(new!)</b></a></td> <td><a href="https://maartengr.github.io/BERTopic/getting_started/seed_words/seed_words.html">Seed Words <b>(new!)</b></a></td> </tr> </table>

Corresponding Medium posts can be found here, here, and here. For a more detailed overview, you can read the paper or see a brief overview.

Installation

Installation, with sentence-transformers, can be done using uv:

uv add bertopic

or with pip:

pip install bertopic

If you want to install BERTopic with other embedding models, you can choose one of the following:

# Choose an embedding backend
pip install bertopic[flair,gensim,spacy,use]

# Topic modeling with images
pip install bertopic[vision]

For a light-weight installation without transformers, UMAP and/or HDBSCAN (for training with Model2Vec or inference), see this tutorial.

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation or you can follow along with one of the examples below:

| Name | Link |
|---|---|
| Start Here - Best Practices in BERTopic | Open In Colab |
| 🆕 New! - Topic Modeling on Large Data (GPU Acceleration) | Open In Colab |
| 🆕 New! - Topic Modeling with Llama 2 🦙 | Open In Colab |
| 🆕 New! - Topic Modeling with Quantized LLMs | Open In Colab |
| Topic Modeling with BERTopic | Open In Colab |
| (Custom) Embedding Models in BERTopic | Open In Colab |
| Advanced Customization in BERTopic | Open In Colab |
| (semi-)Supervised Topic Modeling with BERTopic | Open In Colab |
| Dynamic Topic Modeling with Trump's Tweets | Open In Colab |
| Topic Modeling arXiv Abstracts | Kaggle |

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access all of the topics together with their topic representations:

>>> topic_model.get_topic_info()

| Topic | Count | Name |
|---|---|---|
| -1 | 4630 | -1_can_your_will_any |
| 49 | 693 | 49_windows_drive_dos_file |
| 32 | 466 | 32_jesus_bible_christian_faith |
| 2 | 441 | 2_space_launch_orbit_lunar |
| 22 | 381 | 22_key_encryption_keys_encrypted |
| ... | ... | ... |

The -1 topic refers to all outlier documents and is typically ignored. Each word in a topic describes the underlying theme of that topic and can be used to interpret it. Next, let's take a look at the most frequent topic that was generated, topic 49:

>>> topic_model.get_topic(49)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
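Since topic -1 collects outliers, a common first step after fitting is to separate those documents out before further analysis. A minimal sketch, where the `topics` list is a made-up stand-in for the assignments returned by `fit_transform`:

```python
from collections import Counter

# Hypothetical topic assignments, one per document; -1 marks outliers
topics = [49, -1, 32, 49, -1, 2, 49, 22]

# Per-topic frequencies, mirroring the Count column of get_topic_info()
counts = Counter(topics)

# Indices of the non-outlier documents
kept = [i for i, t in enumerate(topics) if t != -1]
```

Here `counts` reproduces the frequency table shown above in miniature, and `kept` gives the document indices you would keep when ignoring outliers.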

Using .get_document_info, we can also extract information at the document level, such as each document's assigned topic and probability, and whether it is a representative document for its topic:

>>> topic_model.get_document_info(docs)

| Document | Topic | Name | Top_n_words | Probability | ... |
|---|---|---|---|---|---|
| I am sure some bashers of Pens... | 0 | 0_game_team_games_season | game - team - games... | 0.200010 | ... |
| My brother is in the market for... | -1 | -1_can_your_will_any | can - your - will... | 0.420668 | ... |
| Finally you said what you dream... | -1 | -1_can_your_will_any | can - your - will... | 0.807259 | ... |
| Think! It's the SCSI card doing... | 49 | 49_windows_drive_dos_file | windows - drive - dos... | 0.071746 | ... |
| 1) I have an old Jasmine drive... | 49 | 49_windows_drive_dos_file | windows - drive - dos... | 0.038983 | ... |

🔥 Tip: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.

Fine-tune Topic Representations

In BERTopic, there are a number of different topic representations that we can choose from. They are all quite different from one another and give interesting perspectives and variations of topic representations. A great start is KeyBERTInspired, which for many users increases coherence and reduces the number of stopwords in the resulting topic representations:

from bertopic.representation import KeyBERTInspired

# Fine-tune your topic representations
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)
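The intuition behind this KeyBERT-style fine-tuning can be sketched without any models: candidate topic words are rescored by their embedding similarity to the topic itself, pushing stopword-like terms down. The vectors and words below are made-up toy assumptions, not BERTopic internals:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-d embeddings for a topic and its candidate words
topic_vec = np.array([1.0, 0.2, 0.0])
candidates = {
    "windows": np.array([0.9, 0.3, 0.1]),
    "the":     np.array([0.0, 1.0, 0.0]),  # stopword-like, far from the topic
    "drive":   np.array([0.8, 0.1, 0.2]),
}

# Rerank candidates by similarity to the topic embedding
ranked = sorted(candidates, key=lambda w: cosine(topic_vec, candidates[w]),
                reverse=True)
```

Content words that point in the same direction as the topic embedding rise to the top, while the stopword-like candidate drops to the bottom — the same effect KeyBERTInspired has on the c-TF-IDF keywords.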

However, you might
