SummerTime
An open-source text summarization toolkit for non-experts. EMNLP'2021 Demo
SummerTime - Text Summarization Toolkit for Non-experts
<p align="left"> <a href="https://github.com/Yale-LILY/SummerTime/actions"> <img alt="CI" src="https://github.com/Yale-LILY/SummerTime/workflows/CI/badge.svg?event=push&branch=main"> </a> <a href="https://github.com/allenai/allennlp/blob/main/LICENSE"> <img alt="License" src="https://img.shields.io/github/license/Yale-LILY/SummerTime.svg?color=blue&cachedrop"> </a> <a href="https://colab.research.google.com/drive/19tPdBgaJ4_QjSiFyoxtpnFGW4OG1gTec?usp=sharing"> <img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"> </a> <br/> </p>

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.
The library architecture is as follows:
<p align="center"> <img src="https://raw.githubusercontent.com/Yale-LILY/SummerTime/main/docs/img/architecture.png" width="50%"> </p>

NOTE: SummerTime is in active development. Helpful comments are highly encouraged; please open an issue or reach out to any of the team members.
Installation and setup
Install from PyPI (recommended)
# install extra dependencies first
pip install pyrouge@git+https://github.com/bheinzerling/pyrouge.git
pip install en_core_web_sm@https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
# install summertime from PyPI
pip install summertime
Local pip installation
Alternatively, to get the most recent features, you can install from source:
git clone git@github.com:Yale-LILY/SummerTime
# run the editable install from inside the cloned repository
cd SummerTime
pip install -e .
Set up ROUGE (when using evaluation)
export ROUGE_HOME=/usr/local/lib/python3.7/dist-packages/summ_eval/ROUGE-1.5.5/
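The path in the `export` line above is tied to one particular environment (a Python 3.7 `dist-packages` layout), so it may differ on your machine. The sketch below is one way to look the directory up instead of hard-coding it. It assumes, as that path suggests, that the `ROUGE-1.5.5` directory sits inside the installed `summ_eval` package; `find_rouge_home` is a hypothetical helper written for this example, not part of SummerTime or `summ_eval`.

```python
import importlib.util
import os
from typing import Optional


def find_rouge_home(package: str = "summ_eval",
                    subdir: str = "ROUGE-1.5.5") -> Optional[str]:
    """Return <package dir>/<subdir>/ if it exists on this machine, else None."""
    try:
        # find_spec locates a top-level package without importing it.
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    for package_dir in spec.submodule_search_locations:
        candidate = os.path.join(package_dir, subdir)
        if os.path.isdir(candidate):
            return candidate + os.sep
    return None


rouge_home = find_rouge_home()
if rouge_home is None:
    print("summ_eval (or its ROUGE-1.5.5 directory) was not found in this environment")
else:
    # Paste the printed line into your shell or shell profile.
    print(f"export ROUGE_HOME={rouge_home}")
```

If the helper prints nothing useful, the assumption about where `ROUGE-1.5.5` lives probably does not hold for your installation, and the path has to be set by hand.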
Quick Start
The following example imports the `model` module, initializes the default model, and summarizes a sample document.
from summertime import model
sample_model = model.summarizer()
documents = [
""" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions.
The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected
by the shutoffs which were expected to last through at least midday tomorrow."""
]
sample_model.summarize(documents)
# ["California's largest electricity provider has turned off power to hundreds of thousands of customers."]
For a more hands-on demo and further examples, please run our Colab notebook (linked from the "Open In Colab" badge at the top).
Models
Supported Models
SummerTime supports a range of models (e.g., TextRank, BART, Longformer) as well as model wrappers for more complex summarization tasks (e.g., JointModel for multi-doc summarization, BM25 retrieval for query-based summarization). Several multilingual models are also supported (mT5 and mBART).
| Models | Single-doc | Multi-doc | Dialogue-based | Query-based | Multilingual |
| --------- | :------------------: | :------------------: | :------------------: | :------------------: | :------------------: |
| BartModel | :heavy_check_mark: | | | | |
| BM25SummModel | | | | :heavy_check_mark: | |
| HMNetModel | | | :heavy_check_mark: | | |
| LexRankModel | :heavy_check_mark: | | | | |
| LongformerModel | :heavy_check_mark: | | | | |
| MBartModel | :heavy_check_mark: | | | | 50 languages (full list here) |
| MT5Model | :heavy_check_mark: | | | | 101 languages (full list here) |
| TranslationPipelineModel | :heavy_check_mark: | | | | ~70 languages |
| MultiDocJointModel | | :heavy_check_mark: | | | |
| MultiDocSeparateModel | | :heavy_check_mark: | | | |
| PegasusModel | :heavy_check_mark: | | | | |
| TextRankModel | :heavy_check_mark: | | | | |
| TFIDFSummModel | | | | :heavy_check_mark: | |
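As a rough illustration of how the capability table can guide model selection, the sketch below transcribes the table into a plain dictionary and filters it by task. The dictionary `MODEL_TASKS` and the helper `models_for` are hypothetical names made up for this example; they are copied by hand from the table and are not read from, or provided by, the library.

```python
from typing import Dict, List, Set

# Capability table transcribed from the README; not queried from SummerTime itself.
MODEL_TASKS: Dict[str, Set[str]] = {
    "BartModel": {"single-doc"},
    "BM25SummModel": {"query-based"},
    "HMNetModel": {"dialogue-based"},
    "LexRankModel": {"single-doc"},
    "LongformerModel": {"single-doc"},
    "MBartModel": {"single-doc", "multilingual"},
    "MT5Model": {"single-doc", "multilingual"},
    "TranslationPipelineModel": {"single-doc", "multilingual"},
    "MultiDocJointModel": {"multi-doc"},
    "MultiDocSeparateModel": {"multi-doc"},
    "PegasusModel": {"single-doc"},
    "TextRankModel": {"single-doc"},
    "TFIDFSummModel": {"query-based"},
}


def models_for(*tasks: str) -> List[str]:
    """Return the names of models whose capabilities include every requested task."""
    wanted = set(tasks)
    return sorted(name for name, caps in MODEL_TASKS.items() if wanted <= caps)


print(models_for("query-based"))
# ['BM25SummModel', 'TFIDFSummModel']
print(models_for("single-doc", "multilingual"))
# ['MBartModel', 'MT5Model', 'TranslationPipelineModel']
```

Because the dictionary is a manual copy, it can drift from the library; `SUPPORTED_SUMM_MODELS` and `show_capability()` remain the authoritative sources.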
To see all supported models, run:
from summertime.model import SUPPORTED_SUMM_MODELS
print(SUPPORTED_SUMM_MODELS)
Import and initialization:
from summertime import model
# To use a default model
default_model = model.summarizer()
# Or a specific model
bart_model = model.BartModel()
pegasus_model = model.PegasusModel()
lexrank_model = model.LexRankModel()
textrank_model = model.TextRankModel()
Each model can print documentation describing its capabilities, which helps with model selection:
default_model.show_capability()
pegasus_model.show_capability()
textrank_model.show_capability()
To use a model for summarization, simply run:
documents = [
""" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions.
The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected
by the shutoffs which were expected to last through at least midday tomorrow."""
]
default_model.summarize(documents)
# or
pegasus_model.summarize(documents)
All models can be initialized with the following optional arguments:
def __init__(self,
             trained_domain: str=None,
             max_input_length: int=None,
             max_output_length: int=None,
             ):
All models will implement the following methods:
def summarize(self,
              corpus: Union[List[str], List[List[str]]],
              queries: List[str]=None) -> List[str]:

def show_capability(cls) -> None:
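To make the shared interface concrete, here is a minimal, self-contained sketch of a class that follows the constructor and method signatures listed above. `LeadSentenceModel` is a hypothetical toy written for this example; it does not exist in SummerTime and does not inherit from any SummerTime base class. Two assumptions are made purely for illustration: the length limits are counted in whitespace-separated words, and a multi-document instance is handled by joining its documents.

```python
from typing import List, Optional, Union


class LeadSentenceModel:
    """Hypothetical toy summarizer mirroring the interface above; not part of SummerTime."""

    def __init__(self,
                 trained_domain: Optional[str] = None,
                 max_input_length: Optional[int] = None,
                 max_output_length: Optional[int] = None):
        self.trained_domain = trained_domain
        self.max_input_length = max_input_length
        self.max_output_length = max_output_length

    def summarize(self,
                  corpus: Union[List[str], List[List[str]]],
                  queries: Optional[List[str]] = None) -> List[str]:
        # `queries` is accepted for interface compatibility but ignored by this toy model.
        summaries = []
        for instance in corpus:
            # A multi-document instance is a list of strings; join it into one text.
            text = " ".join(instance) if isinstance(instance, list) else instance
            if self.max_input_length is not None:
                # Assumption: lengths are counted in whitespace-separated words.
                text = " ".join(text.split()[: self.max_input_length])
            # "Summarize" by keeping only the first sentence.
            first = text.strip().split(". ")[0].rstrip(".") + "."
            if self.max_output_length is not None:
                first = " ".join(first.split()[: self.max_output_length])
            summaries.append(first)
        return summaries

    @classmethod
    def show_capability(cls) -> None:
        print(f"{cls.__name__}: single-doc and multi-doc input, first sentence only.")


toy_model = LeadSentenceModel()
print(toy_model.summarize(["First point. Second point."]))
# ['First point.']
LeadSentenceModel.show_capability()
```

The point of the sketch is the shape of the calls: `summarize` takes a list of instances and returns one summary string per instance, and `show_capability` is callable on the class without an instance.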
Datasets
Datasets supported
SummerTime supports summarization datasets across a range of domains, e.g., CNN/DM (news articles), SAMSum (dialogue), QMSum (query-based dialogue), Multi-News (multi-document), MLSum (multilingual), PubMedQA (medical), and ArXiv (scientific papers), among others.
| Dataset | Domain | # Examples | Src. length | Tgt. length | Query | Multi-doc | Dialogue | Multi-lingual |
|-----------------|---------------------|-------------|-------------|-------------|--------------------|--------------------|--------------------|-------------------------------------------|
| ArXiv | Scientific articles | 215k | 4.9k | 220 | | | | |
| CNN/DM(3.0.0) | News | 300k | 781 | 56 | | | | |
| MlsumDataset | Multi-lingual News | 1.5M+ | 632 | 34 | | :heavy_check_mark: | | German, Spanish, French, Russian, Turkish |
| Multi-News | News | 56k | 2.1k | 263.8 | | :heavy_check_mark: | | |
| SAMSum | Open-domain | 16k | 94 | 20 | | | :heavy_check_mark: | |
| Pubmedqa | Medical | 272k | 244 | 32 | :heavy_check_mark: | | | |
| QMSum | Meetings | 1k | 9.0k | 69.6 | :heavy_check_mark: | | :heavy_check_mark: | |
| ScisummNet | Scientific articles | 1k | 4.7k | 150 | | | | |
| SummScreen | TV shows | 26.9k | 6.6k | 337.4 | | | :heavy_check_mark: | |
| XSum | News | 226k | 431 | 23.3 | | | | |
| XLSum | News | 1.35M | ??? | ??? | | | | 45 languages (see documentation) |
| MassiveSumm | News | 12M+ | ??? | ??? | | | | 78 languages (see Multilingual Summarization section of README for details) |
To see all supported datasets, run:
from summertime import dataset
