=========================================================
OCTIS : Optimizing and Comparing Topic Models is Simple!
=========================================================

.. |colab1| image:: https://colab.research.google.com/assets/colab-badge.svg
    :target: https://colab.research.google.com/github/MIND-Lab/OCTIS/blob/master/examples/OCTIS_LDA_training_only.ipynb
    :alt: Open In Colab

.. |colab2| image:: https://colab.research.google.com/assets/colab-badge.svg
    :target: https://colab.research.google.com/github/MIND-Lab/OCTIS/blob/master/examples/OCTIS_Optimizing_CTM.ipynb
    :alt: Open In Colab

.. |twitter_silvia| image:: https://img.shields.io/twitter/follow/TerragniSilvia?style=social
    :target: https://twitter.com/intent/follow?screen_name=TerragniSilvia
    :alt: Follow TerragniSilvia on Twitter

.. |twitter_betta| image:: https://img.shields.io/twitter/follow/FersiniE?style=social
    :target: https://twitter.com/intent/follow?screen_name=FersiniE
    :alt: Follow FersiniE on Twitter

.. image:: https://img.shields.io/pypi/v/octis.svg
    :target: https://pypi.python.org/pypi/octis

.. image:: https://github.com/MIND-Lab/OCTIS/workflows/Python%20package/badge.svg
    :target: https://github.com/MIND-Lab/OCTIS/actions

.. image:: https://readthedocs.org/projects/octis/badge/?version=latest
    :target: https://octis.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://img.shields.io/github/contributors/MIND-Lab/OCTIS
    :target: https://github.com/MIND-Lab/OCTIS/graphs/contributors/
    :alt: Contributors

.. image:: https://img.shields.io/badge/License-MIT-blue.svg
    :target: https://lbesson.mit-license.org/
    :alt: License

.. image:: https://img.shields.io/github/stars/mind-lab/OCTIS?logo=github
    :target: https://github.com/mind-lab/OCTIS/stargazers
    :alt: Github Stars

.. image:: https://pepy.tech/badge/octis/month
    :target: https://pepy.tech/project/octis
    :alt: Monthly Downloads

.. image:: https://colab.research.google.com/assets/colab-badge.svg
    :target: https://colab.research.google.com/github/MIND-Lab/OCTIS/blob/master/examples/OCTIS_Optimizing_CTM.ipynb
    :alt: Open In Colab

.. image:: https://github.com/MIND-Lab/OCTIS/blob/master/logo.png?raw=true
    :width: 100
    :alt: Logo

OCTIS (Optimizing and Comparing Topic models Is Simple) aims at training, analyzing and comparing Topic Models, whose optimal hyperparameters are estimated by means of a Bayesian Optimization approach. This work has been accepted to the demo track of EACL2021. Click to read the paper_!

.. contents:: Table of Contents
    :depth: 2

***********
Install
***********

You can install OCTIS with the following command::

    pip install octis

You can find the requirements in the requirements.txt file.

***************
Main Features
***************

* Preprocess your own dataset or use one of the already-preprocessed benchmark datasets
* Well-known topic models (both classical and neural)
* Evaluate your model using different state-of-the-art evaluation metrics
* Optimize the models' hyperparameters for a given metric using Bayesian Optimization
* Python library for advanced usage or simple web dashboard for starting and controlling the optimization experiments

************************
Examples and Tutorials
************************

To easily understand how to use OCTIS, we invite you to try our tutorials out :)

+---------------------------------------------------------------------------+----------+
| Name                                                                      | Link     |
+===========================================================================+==========+
| How to build a topic model and evaluate the results (LDA on 20Newsgroups) | |colab1| |
+---------------------------------------------------------------------------+----------+
| How to optimize the hyperparameters of a neural topic model (CTM on M10)  | |colab2| |
+---------------------------------------------------------------------------+----------+

Some tutorials on Medium:

Two guides on how to use OCTIS with practical examples:

* `A beginner's guide to OCTIS vol. 1`_ by `Emil Rijcken`_
* `A beginner's guide to OCTIS vol. 2`_ by `Emil Rijcken`_

A tutorial on topic modeling on song lyrics:

* `OCTIS - The Future of Topic Modeling`_ by `Nicolas Pogeant`_

.. _Emil Rijcken: https://emilrijcken.medium.com/
.. _A beginner's guide to OCTIS vol. 1: https://towardsdatascience.com/a-beginners-guide-to-octis-optimizing-and-comparing-topic-models-is-simple-590554ec9ba6
.. _A beginner's guide to OCTIS vol. 2: https://towardsdatascience.com/a-beginners-guide-to-octis-vol-2-optimizing-topic-models-1214e58be1e5
.. _OCTIS - The Future of Topic Modeling: https://medium.com/mlearning-ai/octis-the-future-of-topic-modeling-45ef8cd66089
.. _Nicolas Pogeant: https://medium.com/@npogeant

****************************
Datasets and Preprocessing
****************************

Load a preprocessed dataset
============================

You can load one of the already preprocessed datasets as follows:

.. code-block:: python

    from octis.dataset.dataset import Dataset

    dataset = Dataset()
    dataset.fetch_dataset("20NewsGroup")

Just use one of the dataset names listed below. Note: it is case-sensitive!

Available Datasets
-------------------

+--------------+--------------+--------+---------+----------+----------+
|Name in OCTIS | Source       | # Docs | # Words | # Labels | Language |
+==============+==============+========+=========+==========+==========+
| 20NewsGroup  | 20Newsgroup_ | 16309  | 1612    | 20       | English  |
+--------------+--------------+--------+---------+----------+----------+
| BBC_News     | BBC-News_    | 2225   | 2949    | 5        | English  |
+--------------+--------------+--------+---------+----------+----------+
| DBLP         | DBLP_        | 54595  | 1513    | 4        | English  |
+--------------+--------------+--------+---------+----------+----------+
| M10          | M10_         | 8355   | 1696    | 10       | English  |
+--------------+--------------+--------+---------+----------+----------+
| DBPedia_IT   | DBPedia_IT_  | 4251   | 2047    | 5        | Italian  |
+--------------+--------------+--------+---------+----------+----------+
| Europarl_IT  | Europarl_IT_ | 3613   | 2000    | NA       | Italian  |
+--------------+--------------+--------+---------+----------+----------+

.. _20Newsgroup: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
.. _BBC-News: https://github.com/MIND-Lab/OCTIS
.. _DBLP: https://dblp.org/rec/conf/ijcai/PanWZZW16.html?view=bibtex
.. _M10: https://dblp.org/rec/conf/ijcai/PanWZZW16.html?view=bibtex
.. _DBPedia_IT: https://www.dbpedia.org/resources/ontology/
.. _Europarl_IT: https://www.statmt.org/europarl/
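
Because the names are case-sensitive, a small lookup can map whatever was typed to the exact spelling in the table above. This is only a minimal sketch: ``resolve_dataset_name`` is a hypothetical helper written for illustration and is not part of OCTIS.

.. code-block:: python

    # Names copied from the "Name in OCTIS" column of the table above.
    AVAILABLE_DATASETS = (
        "20NewsGroup", "BBC_News", "DBLP", "M10", "DBPedia_IT", "Europarl_IT",
    )


    def resolve_dataset_name(name):
        """Return the exact OCTIS spelling for a name typed in any letter case.

        Hypothetical helper, not an OCTIS function.
        """
        by_lowercase = {known.lower(): known for known in AVAILABLE_DATASETS}
        try:
            return by_lowercase[name.lower()]
        except KeyError:
            raise ValueError(
                f"Unknown dataset {name!r}; expected one of {AVAILABLE_DATASETS}"
            ) from None


    canonical = resolve_dataset_name("bbc_news")
    # The resolved name could then be passed on, e.g.:
    # dataset.fetch_dataset(canonical)

Failing early with the list of valid names is usually easier to debug than a download error caused by a misspelled name.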

Load a Custom Dataset
======================

Otherwise, you can load a custom preprocessed dataset in the following way:

.. code-block:: python

    from octis.dataset.dataset import Dataset

    dataset = Dataset()
    dataset.load_custom_dataset_from_folder("../path/to/the/dataset/folder")

Make sure that the dataset is in the following format:

* corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated with the document (optional).
* vocabulary: a .txt file where each line represents a word of the vocabulary

The partition can be "train" for the training partition, "test" for the testing partition, or "val" for the validation partition. An example dataset can be found here: sample_dataset_.
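
The format above can be produced with a few lines of standard-library Python. The sketch below is only an illustration: the file names ``corpus.tsv`` and ``vocabulary.txt``, the helper ``write_custom_dataset``, and the toy documents are assumptions rather than something this README specifies, so compare the result with the sample dataset before relying on it.

.. code-block:: python

    import csv
    import tempfile
    from pathlib import Path

    ALLOWED_PARTITIONS = {"train", "test", "val"}


    def write_custom_dataset(folder, rows, vocabulary):
        """Write a tab-separated corpus file and a one-word-per-line vocabulary.

        rows: (document, partition, label) tuples. Hypothetical helper.
        """
        rows = list(rows)
        for _, partition, _ in rows:
            if partition not in ALLOWED_PARTITIONS:
                raise ValueError(f"Unexpected partition: {partition!r}")

        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)

        # File names are assumed; no header row is written.
        with open(folder / "corpus.tsv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerows(rows)

        (folder / "vocabulary.txt").write_text(
            "\n".join(vocabulary) + "\n", encoding="utf-8"
        )
        return folder


    rows = [
        ("topic model evaluation", "train", "nlp"),
        ("bayesian optimization search", "val", "ml"),
        ("topic coherence metric", "test", "nlp"),
    ]
    # Build the vocabulary from the words that actually occur in the documents.
    vocabulary = sorted({word for doc, _, _ in rows for word in doc.split()})
    folder = write_custom_dataset(
        Path(tempfile.mkdtemp()) / "my_dataset", rows, vocabulary
    )

The resulting folder path is what you would then hand to ``load_custom_dataset_from_folder``.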

Disclaimer
-----------


Similarly to `TensorFlow Datasets`_ and HuggingFace's `nlp`_ library, we just downloaded and prepared public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license and to cite the right owner of the dataset.

If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, please get in touch through a GitHub issue.

If you're a dataset owner and wish to include your dataset in this library, please get in touch through a GitHub issue.

Preprocess a Dataset
============================

To preprocess a dataset, import the preprocessing class and use the preprocess_dataset method.

.. code-block:: python


    import os
    import string
    from octis.preprocessing.preprocessing import Preprocessing
    os.chdir(os.path.pardir)

    # Initialize preprocessing
    preprocessor = Preprocessing(vocabulary=None, max_features=None, 
                                 remove_punctuation=True, punctuation=string.punctuation,
                                 lemmatize=True, stopword_list='english',
                                 min_chars=1, min_words_docs=0)
    # preprocess
    dataset = preprocessor.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt')

    # save the preprocessed dataset
    dataset.save('hello_dataset')


For more details on the preprocessing see the preprocessing demo example in the examples folder.
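
If your raw documents live in Python lists, they first need to be written to the two input files used above. The sketch below assumes one document per line in the documents file and one label per line, in the same order, in the labels file; that layout, the helper name ``write_raw_corpus``, and the toy texts are assumptions made for illustration.

.. code-block:: python

    import tempfile
    from pathlib import Path


    def write_raw_corpus(folder, documents, labels):
        """Write documents and labels to line-aligned text files.

        Hypothetical helper; returns the two file paths.
        """
        if len(documents) != len(labels):
            raise ValueError("documents and labels must have the same length")

        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)

        # A newline inside a document would split it across rows,
        # so collapse every run of whitespace into a single space.
        cleaned = [" ".join(doc.split()) for doc in documents]

        docs_path = folder / "corpus.txt"
        labels_path = folder / "labels.txt"
        docs_path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
        labels_path.write_text("\n".join(labels) + "\n", encoding="utf-8")
        return docs_path, labels_path


    documents = [
        "Topic models find themes\nin large text collections.",
        "Bayesian optimization tunes hyperparameters.",
    ]
    labels = ["nlp", "ml"]
    docs_path, labels_path = write_raw_corpus(tempfile.mkdtemp(), documents, labels)

The two returned paths are the values you would pass as ``documents_path`` and ``labels_path``.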


*****************************
Topic Models and Evaluation
*****************************

Train a model
==============

To build a model, load a preprocessed dataset, set the model hyperparameters and use :code:`train_model()` to train the model.

.. code-block:: python

    from octis.dataset.dataset import Dataset
    from octis.models.LDA import LDA

    # Load a dataset
    dataset = Dataset()
    dataset.load_custom_dataset_from_folder("dataset_folder")

    model = LDA(num_topics=25)  # Create model
    model_output = model.train_model(dataset)  # Train the model
