OCTIS
OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
=========================================================
OCTIS : Optimizing and Comparing Topic Models is Simple!
=========================================================

.. |colab1| image:: https://colab.research.google.com/assets/colab-badge.svg
    :target: https://colab.research.google.com/github/MIND-Lab/OCTIS/blob/master/examples/OCTIS_LDA_training_only.ipynb
    :alt: Open In Colab

.. |colab2| image:: https://colab.research.google.com/assets/colab-badge.svg
    :target: https://colab.research.google.com/github/MIND-Lab/OCTIS/blob/master/examples/OCTIS_Optimizing_CTM.ipynb
    :alt: Open In Colab

.. |twitter_silvia| image:: https://img.shields.io/twitter/follow/TerragniSilvia?style=social
    :target: https://twitter.com/intent/follow?screen_name=TerragniSilvia
    :alt: Follow TerragniSilvia on Twitter

.. |twitter_betta| image:: https://img.shields.io/twitter/follow/FersiniE?style=social
    :target: https://twitter.com/intent/follow?screen_name=FersiniE
    :alt: Follow FersiniE on Twitter

.. image:: https://img.shields.io/pypi/v/octis.svg
    :target: https://pypi.python.org/pypi/octis

.. image:: https://github.com/MIND-Lab/OCTIS/workflows/Python%20package/badge.svg
    :target: https://github.com/MIND-Lab/OCTIS/actions

.. image:: https://readthedocs.org/projects/octis/badge/?version=latest
    :target: https://octis.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://img.shields.io/github/contributors/MIND-Lab/OCTIS
    :target: https://github.com/MIND-Lab/OCTIS/graphs/contributors/
    :alt: Contributors

.. image:: https://img.shields.io/badge/License-MIT-blue.svg
    :target: https://lbesson.mit-license.org/
    :alt: License

.. image:: https://img.shields.io/github/stars/mind-lab/OCTIS?logo=github
    :target: https://github.com/mind-lab/OCTIS/stargazers
    :alt: Github Stars

.. image:: https://pepy.tech/badge/octis/month
    :target: https://pepy.tech/project/octis
    :alt: Monthly Downloads

.. image:: https://colab.research.google.com/assets/colab-badge.svg
    :target: https://colab.research.google.com/github/MIND-Lab/OCTIS/blob/master/examples/OCTIS_Optimizing_CTM.ipynb
    :alt: Open In Colab

.. image:: https://github.com/MIND-Lab/OCTIS/blob/master/logo.png?raw=true
    :width: 100
    :alt: Logo

OCTIS (Optimizing and Comparing Topic models Is Simple) aims at training, analyzing and comparing
Topic Models, whose optimal hyperparameters are estimated by means of a Bayesian Optimization approach. This work has been accepted to the demo track of EACL2021. Click to read the paper_!

.. contents:: Table of Contents
    :depth: 2

*********
Install
*********

You can install OCTIS with the following command: ::

    pip install octis

You can find the requirements in the requirements.txt file.
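
A quick way to confirm the installation from Python is sketched below, using only the standard library. The helper name ``installed_version`` is illustrative and not part of OCTIS.

.. code-block:: python

    from importlib import metadata


    def installed_version(package):
        """Return the installed version string of ``package``, or None if it is absent."""
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            return None


    version = installed_version("octis")
    if version is None:
        print("octis is not installed; run: pip install octis")
    else:
        print(f"octis {version} is installed")
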
***************
Main Features
***************

- Preprocess your own dataset or use one of the already-preprocessed benchmark datasets
- Well-known topic models (both classical and neural)
- Evaluate your model using different state-of-the-art evaluation metrics
- Optimize the models' hyperparameters for a given metric using Bayesian Optimization
- Python library for advanced usage or simple web dashboard for starting and controlling the optimization experiments

************************
Examples and Tutorials
************************

To easily understand how to use OCTIS, we invite you to try our tutorials out :)

+---------------------------------------------------------------------------+----------+
| Name                                                                      | Link     |
+===========================================================================+==========+
| How to build a topic model and evaluate the results (LDA on 20Newsgroups) | |colab1| |
+---------------------------------------------------------------------------+----------+
| How to optimize the hyperparameters of a neural topic model (CTM on M10)  | |colab2| |
+---------------------------------------------------------------------------+----------+

Some tutorials on Medium:

Two guides on how to use OCTIS with practical examples:

* `A beginner's guide to OCTIS vol. 1`_ by `Emil Rijcken`_
* `A beginner's guide to OCTIS vol. 2`_ by `Emil Rijcken`_

A tutorial on topic modeling on song lyrics:

* `OCTIS - The Future of Topic Modeling`_ by `Nicolas Pogeant`_

.. _Emil Rijcken: https://emilrijcken.medium.com/
.. _A beginner's guide to OCTIS vol. 1: https://towardsdatascience.com/a-beginners-guide-to-octis-optimizing-and-comparing-topic-models-is-simple-590554ec9ba6
.. _A beginner's guide to OCTIS vol. 2: https://towardsdatascience.com/a-beginners-guide-to-octis-vol-2-optimizing-topic-models-1214e58be1e5
.. _OCTIS - The Future of Topic Modeling: https://medium.com/mlearning-ai/octis-the-future-of-topic-modeling-45ef8cd66089
.. _Nicolas Pogeant: https://medium.com/@npogeant

****************************
Datasets and Preprocessing
****************************

Load a preprocessed dataset
============================

You can load one of the already-preprocessed datasets as follows:

.. code-block:: python

    from octis.dataset.dataset import Dataset

    dataset = Dataset()
    dataset.fetch_dataset("20NewsGroup")

Just use one of the dataset names listed below. Note: it is case-sensitive!

Available Datasets
-------------------

+--------------+--------------+--------+---------+----------+----------+
|Name in OCTIS | Source       | # Docs | # Words | # Labels | Language |
+==============+==============+========+=========+==========+==========+
| 20NewsGroup  | 20Newsgroup_ | 16309  | 1612    | 20       | English  |
+--------------+--------------+--------+---------+----------+----------+
| BBC_News     | BBC-News_    | 2225   | 2949    | 5        | English  |
+--------------+--------------+--------+---------+----------+----------+
| DBLP         | DBLP_        | 54595  | 1513    | 4        | English  |
+--------------+--------------+--------+---------+----------+----------+
| M10          | M10_         | 8355   | 1696    | 10       | English  |
+--------------+--------------+--------+---------+----------+----------+
| DBPedia_IT   | DBPedia_IT_  | 4251   | 2047    | 5        | Italian  |
+--------------+--------------+--------+---------+----------+----------+
| Europarl_IT  | Europarl_IT_ | 3613   | 2000    | NA       | Italian  |
+--------------+--------------+--------+---------+----------+----------+

.. _20Newsgroup: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
.. _BBC-News: https://github.com/MIND-Lab/OCTIS
.. _DBLP: https://dblp.org/rec/conf/ijcai/PanWZZW16.html?view=bibtex
.. _M10: https://dblp.org/rec/conf/ijcai/PanWZZW16.html?view=bibtex
.. _DBPedia_IT: https://www.dbpedia.org/resources/ontology/
.. _Europarl_IT: https://www.statmt.org/europarl/

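
Because dataset names are case-sensitive, it can help to normalize user input before fetching. The following is a minimal sketch in plain Python. ``resolve_dataset_name`` is a hypothetical helper, not an OCTIS function, and the name list is copied by hand from the table above.

.. code-block:: python

    # Names copied from the "Name in OCTIS" column of the table above.
    AVAILABLE_DATASETS = [
        "20NewsGroup", "BBC_News", "DBLP", "M10", "DBPedia_IT", "Europarl_IT",
    ]


    def resolve_dataset_name(name):
        """Return the exact spelling of ``name`` from the table, ignoring case.

        Hypothetical helper for illustration; raises ValueError for unknown names.
        """
        by_lowercase = {known.lower(): known for known in AVAILABLE_DATASETS}
        try:
            return by_lowercase[name.lower()]
        except KeyError:
            options = ", ".join(AVAILABLE_DATASETS)
            raise ValueError(f"unknown dataset {name!r}; expected one of: {options}")


    print(resolve_dataset_name("bbc_news"))

The resolved name can then be passed to :code:`dataset.fetch_dataset(...)`.
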
Load a Custom Dataset
======================

Otherwise, you can load a custom preprocessed dataset in the following way:

.. code-block:: python

    from octis.dataset.dataset import Dataset

    dataset = Dataset()
    dataset.load_custom_dataset_from_folder("../path/to/the/dataset/folder")

Make sure that the dataset is in the following format:

* corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated with the document (optional).
* vocabulary: a .txt file where each line represents a word of the vocabulary.

The partition can be "train" for the training partition, "test" for the testing partition, or "val" for the validation partition. An example dataset can be found here: sample_dataset_.

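
The format described above can be produced with a few lines of standard-library Python. This is only a sketch: the file names ``corpus.tsv`` and ``vocabulary.txt``, the absence of a header row, and the helper ``write_custom_dataset`` are assumptions made for illustration, so check them against the sample dataset before relying on them.

.. code-block:: python

    import csv
    import tempfile
    from pathlib import Path

    ALLOWED_PARTITIONS = {"train", "test", "val"}


    def write_custom_dataset(folder, rows):
        """Write (document, partition, label) rows and a matching vocabulary file."""
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        vocabulary = set()
        # Assumed file name: corpus.tsv (tab-separated, no header row).
        with open(folder / "corpus.tsv", "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            for document, partition, label in rows:
                if partition not in ALLOWED_PARTITIONS:
                    raise ValueError(f"unknown partition: {partition!r}")
                writer.writerow([document, partition, label])
                vocabulary.update(document.split())
        # Assumed file name: vocabulary.txt (one word per line).
        vocab_text = "\n".join(sorted(vocabulary)) + "\n"
        (folder / "vocabulary.txt").write_text(vocab_text, encoding="utf-8")
        return folder


    rows = [
        ("topic models are fun", "train", "nlp"),
        ("bayesian optimization works", "test", "ml"),
    ]
    dataset_folder = write_custom_dataset(tempfile.mkdtemp(), rows)
    print(sorted(path.name for path in dataset_folder.iterdir()))

The resulting folder path is what would be passed to :code:`load_custom_dataset_from_folder()`.
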
Disclaimer
-----------

Similarly to `TensorFlow Datasets`_ and HuggingFace's `nlp`_ library, we just downloaded and prepared public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license and to cite the right owner of the dataset.

If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, please get in touch through a GitHub issue.

If you're a dataset owner and wish to include your dataset in this library, please get in touch through a GitHub issue.

Preprocess a Dataset
============================
To preprocess a dataset, import the :code:`Preprocessing` class and use the :code:`preprocess_dataset` method.

.. code-block:: python

    import os
    import string
    from octis.preprocessing.preprocessing import Preprocessing

    os.chdir(os.path.pardir)

    # Initialize preprocessing
    preprocessor = Preprocessing(vocabulary=None, max_features=None,
                                 remove_punctuation=True, punctuation=string.punctuation,
                                 lemmatize=True, stopword_list='english',
                                 min_chars=1, min_words_docs=0)
    # Preprocess the raw documents and their labels
    dataset = preprocessor.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt')
    # Save the preprocessed dataset
    dataset.save('hello_dataset')

For more details on the preprocessing see the preprocessing demo example in the examples folder.

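
To give a rough feel for what options such as ``remove_punctuation``, ``min_chars`` and ``min_words_docs`` control, here is a toy sketch in plain Python. It is not OCTIS's implementation: it skips lemmatization, uses a tiny hand-written stopword set, lowercases the text, and assumes that ``min_chars`` is a minimum token length and ``min_words_docs`` a minimum document length, both inclusive.

.. code-block:: python

    import string

    # Tiny illustrative stopword set; a real English list is much longer.
    STOPWORDS = {"the", "a", "is", "of", "and", "to"}


    def simple_preprocess(documents, min_chars=1, min_words_docs=0,
                          remove_punctuation=True, punctuation=string.punctuation):
        """Toy preprocessing: lowercase, strip punctuation, filter tokens and documents."""
        # Map every punctuation character to a space so words stay separated.
        table = str.maketrans(punctuation, " " * len(punctuation))
        cleaned = []
        for document in documents:
            text = document.lower()  # lowercasing is an assumption of this sketch
            if remove_punctuation:
                text = text.translate(table)
            tokens = [
                token for token in text.split()
                if token not in STOPWORDS and len(token) >= min_chars
            ]
            # Drop documents that became too short after filtering.
            if len(tokens) >= min_words_docs:
                cleaned.append(tokens)
        return cleaned


    docs = ["The cat, the hat!", "A is of", "Topic models: optimize and compare."]
    print(simple_preprocess(docs, min_chars=3, min_words_docs=2))

In this example the second document consists only of stopwords, so it is removed entirely once ``min_words_docs`` is set to 2.
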
*****************************
Topic Models and Evaluation
*****************************
Train a model
==============
To build a model, load a preprocessed dataset, set the model hyperparameters and use :code:`train_model()` to train the model.

.. code-block:: python

    from octis.dataset.dataset import Dataset
    from octis.models.LDA import LDA

    # Load a dataset
    dataset = Dataset()
    dataset.load_custom_dataset_from_folder("dataset_folder")

    model = LDA(num_topics=25)  # Create model
