Textsummarization
Text summarization with supervised learning and atomic events
Install / Use
/learn @kashgupta/TextsummarizationREADME
<span style="color:#1EAAF1">Computational Linguistics</span>
<span style="color:#1EAAF1">Extractive Text Summarization</span>
Aditya Kashyap, Garvit Gupta, Jialin Wang, Ignacio Arranz, Kashish Gupta and Shiva Suri
25th April, 2018
A more readable Google Docs version of this document can be found at https://docs.google.com/document/d/1S_8wigOa9RhUFnq2uvQJ6bjre7rBkhEJw7ew-KdFEM8/edit?usp=sharing
<span style="color:#1EAAF1">Table of Contents</span>
- Title & Authors
- Table of Contents
- Abstract
- Introduction
- Literature Review
- Experimental Design
- Experimental Results
- Conclusion
- Acknowledgements
<span style="color:#1EAAF1">Abstract</span>
This paper discusses our attempts to achieve a strong ROUGE-2 score for extractive text summarization by harnessing several existing methods. Namely, we implemented and combined atomic event extraction, word frequency, matrix decomposition (i.e. factorization into canonical form), and finally two different supervised models. We found that achieving an initial ROUGE-2 score of almost 5% was fairly straightforward (we were able to reach this score by crudely extracting the first sentence of each article): for a long time, the best performing extractive summarizers were those which just picked the first n sentences! However, improving performance proved to be a challenging task given the intrinsic complexity of a summary. While state-of-the-art ROUGE-2 bigram f-scores for extractive text summarization are roughly 11%, the highest ROUGE-2 score we obtained was 9.7%, with a ROUGE-1 score of over 32%.
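ROUGE-2 compares the bigrams of a candidate summary against those of a reference summary. A minimal sketch of the bigram F-score computation (a simplification for illustration, not the official ROUGE toolkit, which adds stemming and other options):

```python
from collections import Counter

def bigrams(tokens):
    """Return a multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge_2_f(candidate, reference):
    """Bigram-overlap F1 between a candidate and a reference summary.

    Both arguments are lists of tokens. This is a simplified version of
    ROUGE-2, not the official implementation.
    """
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped bigram matches
    if not cand or not ref:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_2_f("the cat sat on the mat".split(), "the cat sat on a mat".split())` shares 3 of 5 bigrams each way, giving an F1 of 0.6.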
<span style="color:#1EAAF1">Introduction</span>
A. Our task
Our task was to find the n most important sentences in a given paper, i.e. those containing the most significant and greatest number of concepts with respect to the rest of the paper. Thus, an initial objective was to determine a metric that identified concepts. After we were able to establish this baseline, we challenged ourselves to mathematically clean this data (e.g. matrix normalization & decomposition), and to continually expand upon our definition of a ‘concept’ until we found our best performing model.
Our model’s inputs were the documents we would like to summarize along with the target number of sentences. For example, given this text and a target of three sentences, a gold standard summary produces the following results.
Figure: I/O Extraction Pipeline

B. Problem Definition
The average human being reads about 200 words per minute, limiting how much knowledge they can consume in any given day. Additionally, when reading an article, paper, blog, etc., it is hard to determine which parts of the text are important and which parts can be omitted. Using extractive summarization first (and abstractive methods if time permits), we would like to summarize text for the following use cases:
● News articles
● Academic papers
● Group chats
● Legal documents
● Financial statements
Of course, the use cases get more specialized as the list proceeds, so we would start by creating a generalized text summarizer, then allow it to categorize text by field and summarize differently for different fields. For example, financial statements all have similar structure, as do legal documents, so the parser would find common structure between different documents and summarize them in prespecified ways. Additionally, we would like to extend feature gathering well beyond the generalized use case. For example, we would optically look for bolded words, words that appear as part of a figure or equation, and words that appear in a title or subheading.
C. Motivations
Ultimately, we chose this final project due to its potential to be turned into a product and marketed. Text summarization has countless applications which can be easily realized (e.g. a web application or Chrome extension), and we all agreed that our model-turned-tool would be very useful in our daily lives. Furthermore, while abstractive text summarization seemed out of scope given modern computational linguistics, and moreover the inevitable difficulties that would arise during testing, an extractive approach proved tractable given our growing and relevant expertise in natural language processing (e.g. tf-idf matrices, [cosine] similarity measurements, named-entity recognition, part-of-speech tagging, paraphrasing, text simplification, neural nets, etc.).
<span style="color:#1EAAF1">Literature Review</span>
Papers:
- Text summarization via hidden Markov models - John M. Conroy, Dianne P. O'Leary
- Event-Based Extractive Summarization - Elena Filatova, Vasileios Hatzivassiloglou
- Using Lexical Chains for Text Summarization - Regina Barzilay, Michael Elhadad
- Extractive Summarization Using Supervised and Semi-supervised Learning - Kam-Fai Wong, Mingli Wu, Wenjie Li
- Single Document Extractive Text Summarization using Genetic Algorithms (2014) - Nilardi Chatterji, Amol Mittal, Shubham Goyal
Text summarization via hidden Markov models
https://www.researchgate.net/publication/224890789_Text_summarization_via_hidden_Markov_models
The paper presents two new methods:
- QR Matrix Decomposition
- Hidden Markov Models
QR Matrix Decomposition
The first method summarizes a document by seeking its main ideas:
● SRA’s NameTagTM [Kru95] is used to recognize named entities
● WordNet is used to associate synonyms
Once the terms are defined, a term-sentence matrix is formed. The job of the automatic summarization system is to choose a small subset of these vectors to cover the main ideas (terms) in the document. The QR Matrix Decomposition iteratively chooses the vector with the largest weight. Then the weights of the remaining sentences are updated to emphasize ideas not contained in the first sentence.
- Parse the document for terms using the Brill tagger [Bri93] (in this approach, terms are equivalent to ideas)
- Create a term-sentence matrix A: if word i appears in sentence j, the value of Aij should be 1; if it doesn’t, it should be 0.
- Algorithm does as follows (looping):
- Measure the norm (Euclidean length) of each sentence’s column, and choose the sentence with the highest value.
- Remove that sentence (column) from A
- Subtract 1 from all the words (rows) from A that belong to the sentence above (this is called decomposition)
- At each stage of the algorithm, the next sentence that we choose is the one with the largest norm.
- Create weighting factors of sentences by their position in the document:
- Nonzeros in column j of the weighted matrix Aw are (g·exp(−8j/n) + t) times those in A.
- n is the number of sentences; g and t are parameters to be determined.
- Using a training (development) set of documents, the parameter t is determined so that the weighting function above has a tail of approximately the same height as the histogram of the distribution of summary sentences.
- Given t, we determine g and the percentage of the document to capture by maximizing F1 = 2r/(kh + km), where kh is the length of the human summary, km is the length of the machine-generated summary, and r is the number of sentences they share in common.
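The greedy selection loop described above can be sketched as follows, assuming a binary term-by-sentence matrix has already been built (the function and variable names are ours, not the paper's):

```python
import numpy as np

def greedy_select(A, k):
    """Greedy sentence selection in the spirit of the QR-decomposition
    method described above.

    A is a binary term-by-sentence matrix (terms as rows, sentences as
    columns); k is the number of sentences to extract. At each step the
    column with the largest Euclidean norm is chosen, then the terms it
    covers are decremented in the remaining columns so that later picks
    favor ideas not yet covered.
    """
    A = A.astype(float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(A, axis=0)
        norms[chosen] = -1.0          # never re-pick a chosen sentence
        j = int(np.argmax(norms))
        if norms[j] <= 0:
            break                     # nothing left to cover
        chosen.append(j)
        covered = A[:, j] > 0         # terms present in the chosen sentence
        A[covered, :] = np.maximum(A[covered, :] - 1.0, 0.0)
    return chosen
```

The position weighting above could be folded in by scaling column j of A by (g·exp(−8j/n) + t) before running the loop.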
Hidden Markov Models
The HMM has two kinds of states, corresponding to summary sentences and non-summary sentences.
Features are:
- Position of the sentence in the document
- Position of the sentence within its paragraph
- Number of terms in the sentence -> log(number of terms + 1)
- How likely the terms are, given a baseline of terms -> log(Pr(terms in i | baseline))
- How likely the terms are, given the document terms -> log(Pr(terms in i | document))
The model has 2s+1 states, with s summary states and s+1 non-summary states. The model has 2s free parameters (the arrows). The probability of transition between summary states 2j and 2j+2 is the number of times sentence j+1 directly followed summary sentence j in the training documents, divided by the number of documents.
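The transition estimate described above can be sketched as follows, assuming training documents are given as sequences of summary/non-summary labels (a simplification of the full 2s+1-state model; the representation of `documents` is our assumption):

```python
from collections import Counter

def summary_transition_probs(documents, s):
    """Estimate the summary-to-summary transition probabilities described
    above: the probability of moving from summary state j to summary state
    j+1 is the number of training documents in which summary sentence j+1
    directly followed summary sentence j, divided by the number of documents.

    `documents` is a list of label sequences, one per document, where each
    label is True for a summary sentence and False otherwise. Returns a
    list of s-1 probabilities, one per consecutive summary-state pair.
    """
    counts = Counter()
    for labels in documents:
        # positions of summary sentences, in document order
        positions = [i for i, is_summary in enumerate(labels) if is_summary]
        for j in range(len(positions) - 1):
            if positions[j + 1] == positions[j] + 1:  # directly adjacent
                counts[j] += 1
    n_docs = len(documents)
    return [counts[j] / n_docs for j in range(s - 1)]
```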

The model has three parts:
- p, the initial state distribution
- M, the Markov transition matrix
- B, the collection of multivariate normal distributions associated with each state.
Finally, there are two methods for extracting summaries with k sentences. The first simply chooses the sentences with the maximum posterior probability of being a summary sentence. The second uses the QR decomposition described above to remove redundancies.
Event-Based Extractive Summarization
http://www.aclweb.org/anthology/W04-
Filatova and Hatzivassiloglou describe the identification of concepts in the text as an important step in text summarization and propose a model of concepts based on atomic events. Atomic events are information about relevant named entities and the relationships between them: specifically, a pair of named entities connected by a verb or action-indicating noun. Each atomic event is given an importance score based on the relative frequency of its relation and connector. They compare their event-based algorithm with an algorithm that uses the (weighted) words that appear in the text (weighted by presence in titles or headers, appearing in the first sentence, proximity to words like "important" or "significant", etc.).
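As a sketch of the scoring idea, assuming atomic events have already been extracted as (entity, connector, entity) triples (the named-entity extraction step is omitted, and the paper's exact normalization may differ):

```python
from collections import Counter

def score_events(triples):
    """Score pre-extracted atomic events by relative frequency.

    Each triple is (entity_a, connector, entity_b). Following the idea
    above, an event's score is the relative frequency of its entity pair
    among all pairs, times the relative frequency of its connector among
    the connectors seen with that pair.
    """
    pair_counts = Counter((a, b) for a, _, b in triples)
    conn_counts = Counter(((a, b), c) for a, c, b in triples)
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (pair, conn), n in conn_counts.items():
        rel_freq = pair_counts[pair] / total_pairs   # how prominent the pair is
        conn_freq = n / pair_counts[pair]            # how typical this connector is
        scores[(pair[0], conn, pair[1])] = rel_freq * conn_freq
    return scores
```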
Filatova and Hatzivassiloglou model the extractive summarization problem as breaking the text into units, extracting the concepts covered by each unit, and choosing the fewest units that cover all the most important concepts. They try several methods of selecting text units: a basic method that picks units covering many concepts (static greedy), one that tries to reduce redundancy among the selected units (adaptive greedy), and one that weighs more highly the units covering the most important concepts over the less important ones (modified adaptive greedy). This study used sentences as text units, with atomic events and weighted words as the competing concept representations.
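The adaptive greedy method can be sketched as a weighted-coverage loop (the data representation, names, and tie-breaking here are our assumptions, not the paper's exact formulation):

```python
def adaptive_greedy(units, weights, k):
    """Adaptive-greedy unit selection in the spirit described above.

    `units` maps a unit id (e.g. a sentence index) to the set of concepts
    it covers; `weights` maps each concept to its importance. At each step
    the unit whose not-yet-covered concepts carry the most total weight is
    chosen, which reduces redundancy relative to the static greedy method
    (static greedy would keep counting already-covered concepts).
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for uid, concepts in units.items():
            if uid in chosen:
                continue
            gain = sum(weights[c] for c in concepts - covered)
            if gain > best_gain:
                best, best_gain = uid, gain
        if best is None:
            break  # no remaining unit adds coverage
        chosen.append(best)
        covered |= units[best]
    return chosen
```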
