Summarus
Models for automatic abstractive summarization
Abstractive and extractive summarization models, mostly for the Russian language. Built on top of AllenNLP.
You can also check out the mBART-based Russian summarization model on Hugging Face: mbart_ru_sum_gazeta
Based on the following papers:
- SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents
- Get To The Point: Summarization with Pointer-Generator Networks
- Self-Attentive Model for Headline Generation
- Text Summarization with Pretrained Encoders
- Multilingual Denoising Pre-training for Neural Machine Translation
Contacts
- Telegram: @YallenGusev
Prerequisites
pip install -r requirements.txt
Commands
train.sh
Script for training a model, based on the AllenNLP 'train' command.
| Argument | Required | Description                                      |
|:---------|:---------|--------------------------------------------------|
| -c       | true     | path to file with configuration                  |
| -s       | true     | path to directory where model will be saved      |
| -t       | true     | path to train dataset                            |
| -v       | true     | path to val dataset                              |
| -r       | false    | recover from checkpoint                          |
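The options above can be assembled programmatically when training is launched from another script. The following is a minimal sketch, not part of the repository: the helper name `build_train_command` and all paths are hypothetical, and it assumes that `-r` is a switch that takes no value.

```python
import shlex
from typing import List


def build_train_command(config_path: str, model_dir: str, train_path: str,
                        val_path: str, recover: bool = False) -> List[str]:
    """Assemble the argument list for train.sh from the documented options."""
    command = [
        "./train.sh",
        "-c", config_path,  # path to file with configuration
        "-s", model_dir,    # directory where the model will be saved
        "-t", train_path,   # path to train dataset
        "-v", val_path,     # path to val dataset
    ]
    if recover:
        # Assumption: -r is a flag without an argument.
        command.append("-r")
    return command


# Hypothetical paths, for illustration only.
command = build_train_command(
    "configs/pgn.json", "models/pgn", "data/train.jsonl", "data/val.jsonl"
)
print(shlex.join(command))
# To actually start training: subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.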
predict.sh
Script for model evaluation. The test dataset should have the same format as the train dataset.
| Argument | Required | Default | Description                                                      |
|:---------|:---------|:--------|:-----------------------------------------------------------------|
| -t       | true     |         | path to test dataset                                             |
| -m       | true     |         | path to tar.gz archive with model                                |
| -p       | true     |         | name of Predictor                                                |
| -c       | false    | 0       | CUDA device                                                      |
| -L       | true     |         | Language ("ru" or "en")                                          |
| -b       | false    | 32      | size of a batch with test examples to run simultaneously         |
| -M       | false    |         | path to meteor.jar for Meteor metric                             |
| -T       | false    |         | tokenize gold and predicted summaries before metrics calculation |
| -D       | false    |         | save temporary files with gold and predicted summaries           |
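Since the test dataset has to match the format of the train dataset, a quick sanity check before running predict.sh can save a failed run. The sketch below is only an illustration under assumptions: it presumes both files are JSONL (one JSON object per line), and the field names `text` and `title` in the demonstration are hypothetical rather than taken from the repository.

```python
import json
import os
import tempfile
from typing import Set


def jsonl_fields(path: str, limit: int = 100) -> Set[str]:
    """Collect the field names used by the first `limit` records of a JSONL file."""
    fields: Set[str] = set()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            if line.strip():
                fields.update(json.loads(line).keys())
    return fields


def same_format(train_path: str, test_path: str) -> bool:
    """Return True if both files expose the same set of record fields."""
    return jsonl_fields(train_path) == jsonl_fields(test_path)


# Demonstration on temporary files; the field names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.jsonl")
    test_path = os.path.join(tmp, "test.jsonl")
    samples = [
        (train_path, {"text": "a b c", "title": "a"}),
        (test_path, {"text": "d e f", "title": "d"}),
    ]
    for path, record in samples:
        with open(path, "w", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    formats_match = same_format(train_path, test_path)
    print(formats_match)
```

The check compares field names only; it does not validate field types or contents.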
summarus.util.train_subword_model
Script for subword model training.
| Argument          | Default | Description                                                        |
|:------------------|:--------|:-------------------------------------------------------------------|
| --train-path      |         | path to train dataset                                              |
| --model-path      |         | path to directory where generated subword model will be saved      |
| --model-type      | bpe     | type of subword model, see sentencepiece                           |
| --vocab-size      | 50000   | size of the resulting subword model vocabulary                     |
| --config-path     |         | path to file with configuration for DatasetReader (with parse_set) |
Headline generation
- First paper: Importance of Copying Mechanism for News Headline Generation
- Slides: Importance of Copying Mechanism for News Headline Generation
- Second paper: Advances of Transformer-Based Models for News Headline Generation
Dataset splits:
- RIA original dataset: https://github.com/RossiyaSegodnya/ria_news_dataset
- RIA train/val/test: https://www.dropbox.com/s/rermx1r8lx9u7nl/ria.tar.gz
- RIA dataset preprocessed for mBART: https://www.dropbox.com/s/iq2ih8sztygvz0m/ria_data_mbart_512_200.tar.gz
- Lenta original dataset: https://github.com/yutkin/Lenta.Ru-News-Dataset
- Lenta train/val/test: https://www.dropbox.com/s/v9i2nh12a4deuqj/lenta.tar.gz
- Lenta dataset preprocessed for mBART: https://www.dropbox.com/s/4oo8jazmw3izqvr/lenta_mbart_data_512_200.tar.gz
- Telegram train dataset with split: https://www.dropbox.com/s/ykqk49a8avlmnaf/ru_all_split.tar.gz
- Telegram test dataset with multiple references: https://github.com/dialogue-evaluation/Russian-News-Clustering-and-Headline-Generation/blob/main/data/headline_generation/headline_generation_answers.jsonl
Models:
Prediction script:
./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru
Results
Train dataset: RIA, test dataset: RIA
| Model                     | R-1-f | R-2-f | R-L-f | BLEU  |
|:--------------------------|:------|:------|:------|:------|
| ria_copynet_10kk          | 40.0  | 23.3  | 37.5  | -     |
| ria_pgn_24kk              | 42.3  | 25.1  | 39.6  | -     |
| ria_mbart                 | 42.8  | 25.5  | 39.9  | -     |
| First Sentence            | 24.1  | 10.6  | 16.7  | -     |
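For readers unfamiliar with the column names: R-1-f, R-2-f and R-L-f are ROUGE-1, ROUGE-2 and ROUGE-L F-measures, shown in the table multiplied by 100. The sketch below is a simplified illustration of these metrics on pre-tokenized text, not the scoring code used by predict.sh; it assumes whitespace tokenization and a single reference, and the function names are made up for this example.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(overlap: int, pred_total: int, gold_total: int) -> float:
    """Harmonic mean of precision and recall, 0.0 when there is no overlap."""
    if overlap == 0 or pred_total == 0 or gold_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / gold_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f(pred: List[str], gold: List[str], n: int) -> float:
    """ROUGE-N F-measure based on clipped n-gram overlap."""
    pred_counts, gold_counts = ngrams(pred, n), ngrams(gold, n)
    overlap = sum((pred_counts & gold_counts).values())
    return f_measure(overlap, sum(pred_counts.values()), sum(gold_counts.values()))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f(pred: List[str], gold: List[str]) -> float:
    """ROUGE-L F-measure based on the longest common subsequence."""
    return f_measure(lcs_length(pred, gold), len(pred), len(gold))


pred = "the cat sat on the mat".split()
gold = "the cat lay on the mat".split()
print(rouge_n_f(pred, gold, 1), rouge_n_f(pred, gold, 2), rouge_l_f(pred, gold))
```

Established ROUGE implementations differ in tokenization, stemming and how ROUGE-L is aggregated over sentences, so scores from this sketch are not directly comparable with the numbers in the table.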
Train dataset: RIA, eval dataset: Lenta
| Model                     | R-1-f | R-2-f | R-L-f | BLEU  |
|:--------------------------|:------|:------|:------|:------|
| ria_copynet_10kk          | 25.6  | 12.3  | 23.0  | -     |
| ria_pgn_24kk              | 26.4  | 12.3  | 24.0  | -     |
| ria_mbart                 | 30.3  | 14.5  | 27.1  | -     |
| First Sentence            | 25.5  | 11.2  | 19.2  | -     |
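The "First Sentence" rows are a baseline that uses the opening sentence of the article as the headline. A rough sketch of such a baseline is shown below. It assumes a naive rule (a sentence ends at `.`, `!` or `?` followed by whitespace or the end of the text); the repository may well use a proper sentence tokenizer instead, and the function name is hypothetical.

```python
import re


def first_sentence(text: str) -> str:
    """Return the text up to and including the first sentence-ending mark."""
    stripped = text.strip()
    # Assumption: '.', '!' or '?' followed by whitespace or end of text ends a sentence.
    match = re.search(r"[.!?](?=\s|$)", stripped)
    return stripped[:match.end()] if match else stripped


article = "Markets rose on Monday. Analysts expect more gains."
print(first_sentence(article))
```

This rule breaks on abbreviations and initials, which is why real pipelines usually rely on a language-specific sentence splitter.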
Summarization - CNN/DailyMail
Dataset splits:
- CNN/DailyMail jsonl dataset: https://www.dropbox.com/s/35ezpg78rtukkgh/cnn_dm_jsonl.tar.gz
Models:
Prediction script:
./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R
Results:
| Model                     | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|:--------------------------|:------|:------|:------|:-------|:-----|
| cnndm_pgn_25kk            | 38.5  | 16.5  | 33.4  | 17.6   | -    |
Summarization - Gazeta, Russian news dataset
- Paper: Dataset for Automatic Summarization of Russian News
- Gazeta dataset: https://github.com/IlyaGusev/gazeta
- Usage examples:
Models:
- gazeta_pgn_7kk
- gazeta_pgn_7kk_cov.tar.gz
- gazeta_pgn_25kk
- gazeta_pgn_words_13kk.tar.gz
- gazeta_summarunner_3kk
Prediction scripts:
./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T
External models:
Results:
| Model                     | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|:--------------------------|:------|:------|:------|:-------|:-----|
| gazeta_pgn_7kk            | 29.4  | 12.7  | 24.6  | 21.2   | 9.0  |
| gazeta_pgn_7kk_cov        | 29.8  | 12.8  | 25.4  | 22.1   | 10.1 |
| gazeta_pgn_25kk           | 29.6  | 12.8  | 24.6  | 21.5   | 9.3  |
| gazeta_pgn_words_13kk     | 29.4  | 12.6  | 24.4  | 20.9   | 8.9  |
| gazeta_summarunner_3kk    | 31.6  | 13.7  | 27.1  | 26.0   | 11.5 |
| gazeta_mbart              | 32.6  | 14.6  | 28.2  | 25.7   | 12.4 |
| gazeta_mbart_lower        | 32.7  | 14.7  | 28.3  | 25.8   | 12.5 |
Demo
python demo/server.py --include-package summarus --model-dir <model_dir> --host <host> --port <port>
Citations
Headline generation (PGN):
@article{Gusev2019headlines,
author={Gusev, I.O.},
title={Importance of copying mechanism for news headline generation},
journal={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
year={2019},
volume={2019-May},
number={18},
pages={229--236}
}
Headline generation (transformers):
@InProceedings{Bukhtiyarov2020headlines,
author={Bukhtiyarov, Alexey and Gusev, Ilya},
title="Advances of Transformer-Based Models for News Headline Generation",
booktitle="Artificial Intelligence and Natural Language",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages={54-
