MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes

Code for the ACL 2022 paper on long-document extractive summarization: MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes.

<a href="https://colab.research.google.com/github/nianlonggu/MemSum/blob/main/Training_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
Set Up Environment

Note: Unless otherwise stated, the following commands should be run from the repository's root directory.

Create a conda environment (named, e.g., memsum) and activate it:

```bash
conda create -n memsum python=3.10
source activate memsum
```

Install PyTorch (GPU version):

```bash
pip install torch torchvision torchaudio
```

Install the remaining dependencies via pip:

```bash
pip install -r requirements.txt
```
Download Datasets and Pretrained Model Checkpoints
Download All Datasets Used in the Paper
```python
import os
import wget

for dataset_name in ["arxiv", "pubmed", "gov-report"]:
    print(dataset_name)
    os.makedirs("data/" + dataset_name, exist_ok=True)
    # the datasets are stored on the Hugging Face Hub
    for split in ["train", "val", "test"]:
        url = f"https://huggingface.co/datasets/nianlong/long-doc-extractive-summarization-{dataset_name}/resolve/main/{split}.jsonl"
        wget.download(url, out="data/" + dataset_name)
```
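Each line of these .jsonl files is one JSON object; as used by the evaluation code later in this README, every record carries at least a `text` field (the document as a list of sentences) and a `summary` field (the gold summary as a list of sentences). A minimal sketch of parsing and checking one such record (the record contents here are made up):

```python
import json

# a made-up record in the same shape the evaluation code expects:
# "text" is the document's sentence list, "summary" the gold summary
line = '{"text": ["first sentence .", "second sentence ."], "summary": ["a gold sentence ."]}'
record = json.loads(line)

assert isinstance(record["text"], list)
assert isinstance(record["summary"], list)
print(len(record["text"]))  # prints 2
```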
Download Pretrained Model Checkpoints
The trained MemSum model checkpoints are stored on the Hugging Face Hub:

```python
from huggingface_hub import snapshot_download

# download the pretrained GloVe word embeddings (200-dimensional)
snapshot_download('nianlong/memsum-word-embedding', local_dir="model/word_embedding")

# download the model checkpoint trained on the arXiv dataset
snapshot_download('nianlong/memsum-arxiv-summarization', local_dir="model/memsum-arxiv")

# download the model checkpoint trained on the PubMed dataset
snapshot_download('nianlong/memsum-pubmed-summarization', local_dir="model/memsum-pubmed")

# download the model checkpoint trained on the Gov-Report dataset
snapshot_download('nianlong/memsum-gov-report-summarization', local_dir="model/memsum-gov-report")
```
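Before moving on, it can help to confirm the downloads landed where the loading code below expects them. A small sanity-check sketch (the directory list mirrors the local_dir arguments above; missing_checkpoints is a hypothetical helper, not part of MemSum):

```python
import os

def missing_checkpoints(paths):
    # return the expected checkpoint directories that do not exist yet
    return [p for p in paths if not os.path.isdir(p)]

expected = [
    "model/word_embedding",
    "model/memsum-arxiv",
    "model/memsum-pubmed",
    "model/memsum-gov-report",
]
print(missing_checkpoints(expected))  # [] once all downloads have finished
```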
Testing Pretrained Model on a Given Dataset
For example, the following code tests the performance of the full MemSum model. Before running it, make sure the current working directory is the repository root MemSum/, so that the src package (containing summarizer.py) is importable.
```python
from src.summarizer import MemSum
from tqdm import tqdm
from rouge_score import rouge_scorer
import json
import numpy as np

rouge_cal = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeLsum'], use_stemmer=True)

memsum_arxiv = MemSum("model/memsum-arxiv/model.pt",
                      "model/word_embedding/vocabulary_200dim.pkl",
                      gpu=0, max_doc_len=500)
memsum_pubmed = MemSum("model/memsum-pubmed/model.pt",
                       "model/word_embedding/vocabulary_200dim.pkl",
                       gpu=0, max_doc_len=500)
memsum_gov_report = MemSum("model/memsum-gov-report/model.pt",
                           "model/word_embedding/vocabulary_200dim.pkl",
                           gpu=0, max_doc_len=500)

test_corpus_arxiv = [json.loads(line) for line in open("data/arxiv/test.jsonl")]
test_corpus_pubmed = [json.loads(line) for line in open("data/pubmed/test.jsonl")]
test_corpus_gov_report = [json.loads(line) for line in open("data/gov-report/test.jsonl")]
```
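The one-liners above leave the file handles to the garbage collector; an equivalent loader with an explicit context manager (stdlib only, same resulting list of dicts) looks like this:

```python
import json
import os
import tempfile

def load_jsonl(path):
    # parse one JSON object per line, closing the file when done
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# demo on a throwaway two-line file
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write('{"text": ["a ."], "summary": ["s ."]}\n')
    f.write('{"text": ["b ."], "summary": ["t ."]}\n')
corpus = load_jsonl(demo_path)
print(len(corpus))  # prints 2
```

`load_jsonl("data/pubmed/test.jsonl")` would then be a drop-in replacement for the corresponding list comprehension.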
Evaluation on ROUGE
```python
def evaluate(model, corpus, p_stop, max_extracted_sentences, rouge_cal):
    scores = []
    for data in tqdm(corpus):
        gold_summary = data["summary"]
        extracted_summary = model.extract(
            [data["text"]],
            p_stop_thres=p_stop,
            max_extracted_sentences_per_document=max_extracted_sentences
        )[0]
        score = rouge_cal.score("\n".join(gold_summary), "\n".join(extracted_summary))
        scores.append([score["rouge1"].fmeasure, score["rouge2"].fmeasure, score["rougeLsum"].fmeasure])
    return np.asarray(scores).mean(axis=0)
```
```python
evaluate(memsum_arxiv, test_corpus_arxiv, 0.5, 5, rouge_cal)
# 100%|██████████| 6440/6440 [08:00<00:00, 13.41it/s]
# array([0.47946925, 0.19970128, 0.42075852])

evaluate(memsum_pubmed, test_corpus_pubmed, 0.6, 7, rouge_cal)
# 100%|██████████| 6658/6658 [09:22<00:00, 11.84it/s]
# array([0.49260137, 0.22916328, 0.44415123])

evaluate(memsum_gov_report, test_corpus_gov_report, 0.6, 22, rouge_cal)
# 100%|██████████| 973/973 [04:33<00:00,  3.55it/s]
# array([0.59445629, 0.28507926, 0.56677073])
```
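evaluate returns the per-dataset mean of the per-document (ROUGE-1, ROUGE-2, ROUGE-Lsum) F-scores; the three arrays above are exactly that column-wise average. A tiny illustration of the aggregation step with made-up scores (pure Python, no numpy needed):

```python
# made-up per-document (ROUGE-1, ROUGE-2, ROUGE-Lsum) F-scores
scores = [
    [0.50, 0.20, 0.40],
    [0.30, 0.10, 0.20],
]
# column-wise mean, i.e. the same reduction as np.asarray(scores).mean(axis=0)
means = [round(sum(col) / len(col), 4) for col in zip(*scores)]
print(means)  # [0.4, 0.15, 0.3]
```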
Summarization Examples
Given a document represented as a list of sentences, e.g.:

```python
document = test_corpus_pubmed[0]["text"]
```
We can summarize this document extractively by:
```python
extracted_summary = memsum_pubmed.extract(
    [document],
    p_stop_thres=0.6,
    max_extracted_sentences_per_document=7
)[0]
extracted_summary
```

```
['more specifically , we found that pd patients with anxiety were more impaired on the trail making test part b which assessed attentional set - shifting , on both digit span tests which assessed working memory and attention , and to a lesser extent on the logical memory test which assessed memory and new verbal learning compared to pd patients without anxiety . taken together ,',
 'this study is the first to directly compare cognition between pd patients with and without anxiety .',
 'results from this study showed selective verbal memory deficits in rpd patients with anxiety compared to rpd without anxiety , whereas lpd patients with anxiety had greater attentional / working memory deficits compared to lpd without anxiety .',
 'given that research on healthy young adults suggests that anxiety reduces processing capacity and impairs processing efficiency , especially in the central executive and attentional systems of working memory [ 26 , 27 ] , we hypothesized that pd patients with anxiety would show impairments in attentional set - shifting and working memory compared to pd patients without anxiety .',
 'the findings confirmed our hypothesis that anxiety negatively influences attentional set - shifting and working memory in pd .',
 'seventeen pd patients with anxiety and thirty - three pd patients without anxiety were included in this study ( see table 1 ) .']
```
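Two knobs control the output: extraction stops once the model's predicted stop probability exceeds p_stop_thres, or when max_extracted_sentences_per_document sentences have been selected, whichever comes first. A toy stand-in for that stopping rule (made-up probabilities; the real policy, which rescores all remaining sentences at every step, is not reproduced here):

```python
def simulate_stopping(stop_probs, p_stop_thres, max_sentences):
    # keep selecting while the stop probability stays below the threshold
    # and the sentence budget is not exhausted (toy stand-in for MemSum)
    picked = []
    for i, p in enumerate(stop_probs):
        if p > p_stop_thres or len(picked) >= max_sentences:
            break
        picked.append(i)
    return picked

print(simulate_stopping([0.1, 0.3, 0.7, 0.9], p_stop_thres=0.6, max_sentences=7))  # [0, 1]
```

Raising p_stop_thres therefore tends to lengthen summaries, while max_extracted_sentences_per_document acts as a hard cap.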
We can also get the indices of the extracted sentences in the original document:
```python
extracted_summary_batch, extracted_indices_batch = memsum_pubmed.extract(
    [document],
    p_stop_thres=0.6,
    max_extracted_sentences_per_document=7,
    return_sentence_position=1
)
extracted_summary_batch[0]  # same six sentences as shown above
```
```python
extracted_indices_batch[0]
# [50, 48, 70, 14, 49, 16]
```
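Since these indices refer to sentence positions in the original document (and appear in extraction order, not sorted, as the example above shows), the summary can be reconstructed, or re-ordered into document order, directly from them. A small sketch with hypothetical data:

```python
# hypothetical document and extraction indices
document = ["sentence 0 .", "sentence 1 .", "sentence 2 .", "sentence 3 ."]
indices = [2, 0]  # hypothetical output of extract(..., return_sentence_position=1)

extracted = [document[i] for i in indices]
in_document_order = [document[i] for i in sorted(indices)]
print(extracted)          # ['sentence 2 .', 'sentence 0 .']
print(in_document_order)  # ['sentence 0 .', 'sentence 2 .']
```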
Training MemSum
Please refer to the documentation Training_Pipeline.md for the complete pipeline for training MemSum on a custom dataset.
You can also run the training pipeline directly on Google Colab: <a href="https://colab.research.google.com/github/nianlonggu/MemSum/blob/main/Training_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>