SummerTime - Text Summarization Toolkit for Non-experts

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

The library architecture is as follows:

NOTE: SummerTime is in active development. Comments are highly encouraged; please open an issue or reach out to any of the team members.

Installation and setup

Install from PyPI (recommended)

# install extra dependencies first
pip install pyrouge@git+https://github.com/bheinzerling/pyrouge.git
pip install en_core_web_sm@https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl

# install summertime from PyPI
pip install summertime

Local `pip` installation

Alternatively, to enjoy the most recent features, you can install from the source:

git clone git@github.com:Yale-LILY/SummerTime
# move into the cloned repository before the editable install
cd SummerTime
pip install -e .

Setup `ROUGE` (when using evaluation)

export ROUGE_HOME=/usr/local/lib/python3.7/dist-packages/summ_eval/ROUGE-1.5.5/
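
The variable can also be set from Python for the current process before running evaluation. The following is a minimal sketch: the helper name `configure_rouge_home` is hypothetical and not part of SummerTime, and the path is the one from the shell example above, which will likely differ on your machine.

```python
import os


def configure_rouge_home(path: str) -> bool:
    """Set ROUGE_HOME for the current process and report whether the path exists.

    Hypothetical helper for illustration; not part of SummerTime.
    """
    os.environ["ROUGE_HOME"] = path
    # A missing directory usually means the path needs adjusting for this machine.
    return os.path.isdir(path)


# Path copied from the shell example above; adjust it to your environment.
found = configure_rouge_home(
    "/usr/local/lib/python3.7/dist-packages/summ_eval/ROUGE-1.5.5/"
)
if not found:
    print("ROUGE_HOME is set, but the directory does not exist on this machine.")
```

Checking that the directory exists up front tends to surface a wrong path earlier than waiting for an evaluation run to fail.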

Quick Start

The following example imports the `model` module, initializes the default model, and summarizes a sample document.

from summertime import model

sample_model = model.summarizer()
documents = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. 
    The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected 
    by the shutoffs which were expected to last through at least midday tomorrow."""
]
sample_model.summarize(documents)

# ["California's largest electricity provider has turned off power to hundreds of thousands of customers."]

Also, please run our Colab notebook for a more hands-on demo and more examples.

Models

Supported Models

SummerTime supports different models (e.g., TextRank, BART, Longformer) as well as model wrappers for more complex summarization tasks (e.g., JointModel for multi-doc summarization, BM25 retrieval for query-based summarization). Several multilingual models are also supported (mT5 and mBART).

| Models | Single-doc | Multi-doc | Dialogue-based | Query-based | Multilingual |
| --------- | :------------------: | :------------------: | :------------------: | :------------------: | :------------------: |
| BartModel | :heavy_check_mark: | | | | |
| BM25SummModel | | | | :heavy_check_mark: | |
| HMNetModel | | | :heavy_check_mark: | | |
| LexRankModel | :heavy_check_mark: | | | | |
| LongformerModel | :heavy_check_mark: | | | | |
| MBartModel | :heavy_check_mark: | | | | 50 languages (full list here) |
| MT5Model | :heavy_check_mark: | | | | 101 languages (full list here) |
| TranslationPipelineModel | :heavy_check_mark: | | | | ~70 languages |
| MultiDocJointModel | | :heavy_check_mark: | | | |
| MultiDocSeparateModel | | :heavy_check_mark: | | | |
| PegasusModel | :heavy_check_mark: | | | | |
| TextRankModel | :heavy_check_mark: | | | | |
| TFIDFSummModel | | | | :heavy_check_mark: | |
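
One way to use the table when picking a model is to treat it as a lookup from task to candidate models. The sketch below transcribes the table into a plain dictionary. `MODEL_TASKS` and `models_for` are hypothetical names for illustration and are not part of SummerTime.

```python
from typing import Dict, List, Set

# Capabilities transcribed from the table above (language counts omitted).
MODEL_TASKS: Dict[str, Set[str]] = {
    "BartModel": {"single-doc"},
    "BM25SummModel": {"query-based"},
    "HMNetModel": {"dialogue-based"},
    "LexRankModel": {"single-doc"},
    "LongformerModel": {"single-doc"},
    "MBartModel": {"single-doc", "multilingual"},
    "MT5Model": {"single-doc", "multilingual"},
    "TranslationPipelineModel": {"single-doc", "multilingual"},
    "MultiDocJointModel": {"multi-doc"},
    "MultiDocSeparateModel": {"multi-doc"},
    "PegasusModel": {"single-doc"},
    "TextRankModel": {"single-doc"},
    "TFIDFSummModel": {"query-based"},
}


def models_for(*tasks: str) -> List[str]:
    """Return the model names whose capabilities include every requested task."""
    wanted = set(tasks)
    return sorted(name for name, caps in MODEL_TASKS.items() if wanted <= caps)


print(models_for("query-based"))
# ['BM25SummModel', 'TFIDFSummModel']
print(models_for("single-doc", "multilingual"))
# ['MBartModel', 'MT5Model', 'TranslationPipelineModel']
```

Because the dictionary is a manual copy of the table, it can drift out of date as models are added.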

To see all supported models, run:

from summertime.model import SUPPORTED_SUMM_MODELS
print(SUPPORTED_SUMM_MODELS)

Import and initialization:

from summertime import model

# To use a default model
default_model = model.summarizer()    

# Or a specific model
bart_model = model.BartModel()
pegasus_model = model.PegasusModel()
lexrank_model = model.LexRankModel()
textrank_model = model.TextRankModel()

Users can easily access documentation to assist with model selection:

default_model.show_capability()
pegasus_model.show_capability()
textrank_model.show_capability()

To use a model for summarization, simply run:

documents = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. 
    The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected 
    by the shutoffs which were expected to last through at least midday tomorrow."""
]

default_model.summarize(documents)
# or 
pegasus_model.summarize(documents)

All models can be initialized with the following optional arguments:

def __init__(self,
         trained_domain: str=None,
         max_input_length: int=None,
         max_output_length: int=None,
         ):

All models will implement the following methods:

def summarize(self,
  corpus: Union[List[str], List[List[str]]],
  queries: List[str]=None) -> List[str]:

def show_capability(cls) -> None:
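
To make the shape of this interface concrete, here is a minimal, self-contained sketch of a class that follows the same signatures. `LeadSentenceModel` is a hypothetical toy: it does not subclass any SummerTime base class, and treating the length limits as character counts is an assumption made only for this example.

```python
from typing import List, Optional, Union


class LeadSentenceModel:
    """Toy summarizer that returns the first sentence of each input instance.

    Hypothetical class for illustration; not part of SummerTime.
    """

    def __init__(
        self,
        trained_domain: Optional[str] = None,
        max_input_length: Optional[int] = None,
        max_output_length: Optional[int] = None,
    ):
        self.trained_domain = trained_domain
        self.max_input_length = max_input_length
        self.max_output_length = max_output_length

    def summarize(
        self,
        corpus: Union[List[str], List[List[str]]],
        queries: Optional[List[str]] = None,  # accepted for compatibility, unused here
    ) -> List[str]:
        summaries = []
        for instance in corpus:
            # A nested list is treated as several texts belonging to one instance.
            text = instance if isinstance(instance, str) else " ".join(instance)
            text = " ".join(text.split())  # collapse newlines and repeated spaces
            if self.max_input_length is not None:
                # Assumption: lengths are counted in characters in this toy model.
                text = text[: self.max_input_length]
            first_sentence = text.split(". ")[0].rstrip(".") + "."
            if self.max_output_length is not None:
                first_sentence = first_sentence[: self.max_output_length]
            summaries.append(first_sentence)
        return summaries

    @classmethod
    def show_capability(cls) -> None:
        print(f"{cls.__name__}: extractive, single-document, returns the lead sentence.")


toy_model = LeadSentenceModel()
print(toy_model.summarize(["The aim is to reduce the risk of wildfires. Shutoffs may last a day."]))
# ['The aim is to reduce the risk of wildfires.']
```

The method returns one summary string per input instance, which matches the `List[str]` return type in the signature above.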

Datasets

Datasets supported

SummerTime supports summarization datasets across different domains (e.g., CNNDM dataset - news article corpus, Samsum - dialogue corpus, QM-Sum - query-based dialogue corpus, MultiNews - multi-document corpus, ML-sum - multi-lingual corpus, PubMedQa - medical domain, Arxiv - science papers domain, among others).

| Dataset | Domain | # Examples | Src. length | Tgt. length | Query | Multi-doc | Dialogue | Multi-lingual |
|-----------------|---------------------|-------------|-------------|-------------|--------------------|--------------------|--------------------|-------------------------------------------|
| ArXiv | Scientific articles | 215k | 4.9k | 220 | | | | |
| CNN/DM(3.0.0) | News | 300k | 781 | 56 | | | | |
| MlsumDataset | Multi-lingual News | 1.5M+ | 632 | 34 | | :heavy_check_mark: | | German, Spanish, French, Russian, Turkish |
| Multi-News | News | 56k | 2.1k | 263.8 | | :heavy_check_mark: | | |
| SAMSum | Open-domain | 16k | 94 | 20 | | | :heavy_check_mark: | |
| Pubmedqa | Medical | 272k | 244 | 32 | :heavy_check_mark: | | | |
| QMSum | Meetings | 1k | 9.0k | 69.6 | :heavy_check_mark: | | :heavy_check_mark: | |
| ScisummNet | Scientific articles | 1k | 4.7k | 150 | | | | |
| SummScreen | TV shows | 26.9k | 6.6k | 337.4 | | | :heavy_check_mark: | |
| XSum | News | 226k | 431 | 23.3 | | | | |
| XLSum | News | 1.35m | ??? | ??? | | | | 45 languages (see documentation) |
| MassiveSumm | News | 12m+ | ??? | ??? | | | | 78 languages (see Multilingual Summarization section of README for details) |
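
The source and target lengths in the table give a rough sense of how aggressively each dataset compresses its input, which may help when matching a dataset to a model's input and output limits. The sketch below computes the ratio for a few rows. The dictionary and function names are hypothetical, and the lengths are in whatever unit the table reports.

```python
# (source length, target length) pairs transcribed from the table above.
# Values such as 4.9k are rounded in the table, so the ratios are approximate.
DATASET_LENGTHS = {
    "ArXiv": (4900, 220),
    "CNN/DM(3.0.0)": (781, 56),
    "SAMSum": (94, 20),
    "QMSum": (9000, 69.6),
    "XSum": (431, 23.3),
}


def compression_ratio(src_length: float, tgt_length: float) -> float:
    """Average source length divided by average target (summary) length."""
    return src_length / tgt_length


# Print datasets from the most to the least compressive.
for name, (src, tgt) in sorted(
    DATASET_LENGTHS.items(), key=lambda item: compression_ratio(*item[1]), reverse=True
):
    print(f"{name:<15} {compression_ratio(src, tgt):6.1f}x")
```

Among these rows, QMSum has by far the highest ratio (roughly 129x) and SAMSum the lowest (under 5x).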

To see all supported datasets, run:

from summertime import dataset
