sumgram

sumgram (see blogpost) is a tool that summarizes a collection of text documents by generating the most frequent "sumgrams" (conjoined ngrams) in the collection. Sumgrams are higher-order ngrams (e.g., "world health organization") generated by conjoining lower-order ngrams (e.g., "world health" and "health organization"). Unlike conventional ngram generators, which split multi-word proper nouns, sumgram tries to avoid such splits by applying two algorithms: pos_glue_split_ngrams and mvg_window_glue_split_ngrams. These algorithms enable sumgram to include sumgrams of different ngram classes (bigrams, trigrams, k-grams, etc.) in the same summary, instead of limiting the summary to a single ngram class (e.g., bigrams).
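The idea of conjoining lower-order ngrams can be illustrated with a minimal sketch. This is not sumgram's pos_glue_split_ngrams or mvg_window_glue_split_ngrams algorithm; the conjoin() helper is hypothetical and only glues two ngrams whose tokens overlap.

```python
def conjoin(left, right):
    """Glue two ngrams when a suffix of `left` equals a prefix of `right`.

    Hypothetical helper for illustration only; returns None if there is no overlap.
    """
    left_toks, right_toks = left.split(), right.split()
    # Try the longest possible overlap first
    for size in range(min(len(left_toks), len(right_toks)), 0, -1):
        if left_toks[-size:] == right_toks[:size]:
            return ' '.join(left_toks + right_toks[size:])
    return None


print(conjoin('world health', 'health organization'))  # world health organization
print(conjoin('world health', 'ebola virus'))          # None
```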

In Fig. 1, a conventional bigram generator split the six-gram "centers for disease control and prevention" (stopwords removed) into three bigrams ("centers disease," "disease control," and "control prevention"). sumgram detected these split ngrams and "glued" them back together.
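The split described above can be reproduced with a small sketch of a conventional bigram generator. The STOPWORDS set and the ngrams_without_stopwords() function are illustrative assumptions, not sumgram's stopword list or code.

```python
# Tiny illustrative stopword list (an assumption, not sumgram's default list)
STOPWORDS = {'for', 'and', 'the', 'of'}


def ngrams_without_stopwords(text, n=2):
    """Drop stopwords, then slide a window of n tokens over what is left."""
    tokens = [tok for tok in text.lower().split() if tok not in STOPWORDS]
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams_without_stopwords('centers for disease control and prevention')
print(bigrams)  # ['centers disease', 'disease control', 'control prevention']
```

The proper noun no longer appears as a unit in the output, which is the problem sumgram's gluing step is meant to address.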

Fig. 1: Comparison of the top 20 bigrams (first column), top 20 six-grams (second column), and top 20 sumgrams (conjoined ngrams, third column) generated by sumgram for a collection of documents about the 2014 Ebola Virus Outbreak. Proper nouns of more than two words (e.g., "centers for disease control and prevention") are split when generating bigrams; sumgram strives to remedy this. Generating six-grams instead surfaces non-salient six-grams. <img src="pics/sumgrams_ebola.png" alt="ebola virus ngrams vs sumgrams" style="width: 50%;"/>

Fig. 2: Comparison of the top 20 bigrams (first column), top 20 six-grams (second column), and top 20 sumgrams (conjoined ngrams, third column) generated by sumgram for a collection of documents about Hurricane Harvey. Proper nouns of more than two words (e.g., "federal emergency management agency") are split when generating bigrams; sumgram strives to remedy this. Generating six-grams instead surfaces non-salient six-grams. <img src="pics/sumgrams_harvey.png" alt="hurricane harvey ngrams vs sumgrams" style="width: 50%;"/>

Citing Project

A publication related to this project appeared in the proceedings of the ACM Conference on Hypertext and Social Media 2018 (Read the PDF). Please cite it as follows:

Nwala, Alexander C., Michele C. Weigle, and Michael L. Nelson. "Bootstrapping web archive collections from social media." In Proceedings of ACM Conference on Hypertext and Social Media, pp. 64-72. 2018.

@inproceedings{ht-2018:nwala,
  author    = {Nwala, Alexander C and Weigle, Michele C and Nelson, Michael L},
  title     = {{Bootstrapping Web Archive Collections from Social Media}},
  booktitle = {Proceedings of ACM Conference on Hypertext and Social Media (HT 2018)},
  series    = {HT '18},
  year      = {2018},
  month     = {jul},
  location  = {Baltimore, Maryland, USA},
  pages     = {64--72},
  numpages  = {9},
  url       = {https://doi.org/10.1145/3209542.3209560},
  doi       = {10.1145/3209542.3209560},
  isbn      = {9781450354271},
  publisher = {ACM},
  address   = {New York, NY, USA}
}

Installation

Just type

$ pip install sumgram

OR

$ git clone https://github.com/oduwsdl/sumgram.git
$ cd sumgram; pip install .; cd ..; rm -rf sumgram;

OR install/run within a Docker container

$ docker run -it --rm --name Sumgram -v "$PWD":/usr/src/myapp -w /usr/src/myapp python:3.7-stretch bash
$ pip install sumgram

OR install/run in a locally built Docker image

$ git clone https://github.com/oduwsdl/sumgram.git
$ cd sumgram
$ docker build -t wsdl/sumgram .
$ cd ..; rm -rf sumgram;
$ docker run --rm -it -v "$PWD":/data/ wsdl/sumgram

OR install/run from Dockerhub: coming soon

Usage

Basic usage:

  • $ sumgram path/to/collection/of/text/files/, e.g., sumgram tests/unit/sample_cols/harvey
  • $ sumgram single_file.txt, e.g., sumgram tests/unit/sample_cols/harvey/single_file.txt
  • $ sumgram https://www.example.com/news/article-1.html https://www.example.com/news/article-2.html
  • $ sumgram path/to/collection/ file2.txt file3.txt https://www.example.com/news/article-1.html
  • $ cat path/to/collection/of/text/files/*.txt | sumgram -

Python script usage:

Command-line options can be set from Python by adding the corresponding key to the params dictionary passed to get_top_sumgrams(). The key is the option name without its leading dashes and with the remaining dashes replaced by underscores, for example:

params = {}
params['sentences_rank_count'] = 20  #For command line argument --sentences-rank-count
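The same transformation can be written as a small helper. This is a sketch under the assumption that long options map to keys as in the example above; cli_option_to_param_key() is a hypothetical name and is not part of sumgram, and short flags such as -t are not covered.

```python
def cli_option_to_param_key(option):
    """Convert a long option such as '--sentences-rank-count' to a params key.

    Hypothetical helper: strips leading dashes and replaces the rest with underscores.
    """
    return option.lstrip('-').replace('-', '_')


params = {}
params[cli_option_to_param_key('--sentences-rank-count')] = 20
# On the command line --add-stopwords takes a comma-separated string;
# in the params dictionary the value is a list of strings.
params[cli_option_to_param_key('--add-stopwords')] = ['image']
print(params)  # {'sentences_rank_count': 20, 'add_stopwords': ['image']}
```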

The following Python script illustrates the use of sumgram by calling the get_top_sumgrams() function on a small in-memory collection.

import json
from sumgram.sumgram import get_top_sumgrams

doc_lst = [
    {'id': 0, 'text': 'The eye of Category 4 Hurricane Harvey is now over Aransas Bay. A station at Aransas Pass run by the Texas Coastal Observing Network recently reported a sustained wind of 102 mph with a gust to 132 mph. A station at Aransas Wildlife Refuge run by the Texas Coastal Observing Network recently reported a sustained wind of 75 mph with a gust to 99 mph. A station at Rockport reported a pressure of 945 mb on the western side of the eye.'},
    {'id': 1, 'text': 'Eye of Category 4 Hurricane Harvey is almost onshore. A station at Aransas Pass run by the Texas Coastal Observing Network recently reported a sustained wind of 102 mph with a gust to 120 mph.'},
    {'id': 2, 'text': 'Hurricane Harvey has become a Category 4 storm with maximum sustained winds of 130 mph. Sustained hurricane-force winds are spreading onto the middle Texas coast.'}
]

# Use 'add_stopwords' to supply additional stopwords that are not in the default
# stopwords list (https://github.com/oduwsdl/sumgram/blob/0224fc9d54034a25e296dd1c43c09c76244fc3c2/sumgram/util.py#L31)
params = {
    'top_sumgram_count': 10,
    'add_stopwords': ['image'],
    'no_rank_sentences': True,
    'title': 'Top sumgrams for Hurricane Harvey text collection'
}

ngram = 2
sumgrams = get_top_sumgrams(doc_lst, ngram, params=params)
with open('sumgrams.json', 'w') as outfile:
    json.dump(sumgrams, outfile)
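When the documents live on disk rather than in memory, the list of {'id': ..., 'text': ...} dictionaries used above can be built from a directory of text files. The following is a minimal sketch; load_doc_lst() is a hypothetical helper, not a sumgram function, and it assumes UTF-8 files with a .txt extension.

```python
from pathlib import Path


def load_doc_lst(directory):
    """Build a list of {'id', 'text'} dicts from the *.txt files in `directory`.

    Hypothetical helper; ids are assigned in sorted filename order.
    """
    doc_lst = []
    for path in sorted(Path(directory).glob('*.txt')):
        # Undecodable bytes are skipped instead of raising an error
        text = path.read_text(encoding='utf-8', errors='ignore')
        doc_lst.append({'id': len(doc_lst), 'text': text})
    return doc_lst


# Possible usage (requires sumgram to be installed):
# from sumgram.sumgram import get_top_sumgrams
# doc_lst = load_doc_lst('tests/unit/sample_cols/harvey')
# sumgrams = get_top_sumgrams(doc_lst, 2, params={'top_sumgram_count': 10})
```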

Examples (see sample collection tests/unit/sample_cols/harvey):

Generate top 10 (t = 10) sumgrams for the Archive-It Ebola Virus Collection:

$ sumgram -t 10 cols/ebola/
 rank  sumgram                                              DF   DF-Rate
  1    in west africa                                       50    0.35 
  2    liberia and sierra leone                             46    0.33 
  3    ebola virus                                          44    0.31 
  4    ebola outbreak                                       41    0.29 
  5    public health                                        40    0.28 
  6    the centers for disease control and prevention       23    0.16 
  7    the united states                                    23    0.16 
  8    the world health organization                        22    0.16 
  9    ebola patients                                       20    0.14 
  10   health workers                                       20    0.14 
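The DF and DF-Rate columns are not defined in this README. The sketch below assumes that DF is the number of documents containing the sumgram and that DF-Rate is DF divided by the collection size. The document_frequency() function and the three sample documents are illustrative assumptions, and the naive substring match is not how sumgram itself counts.

```python
def document_frequency(ngram, docs):
    """Count the documents that contain `ngram` (case-insensitive substring match)."""
    ngram = ngram.lower()
    return sum(1 for doc in docs if ngram in doc.lower())


# Hypothetical three-document collection
docs = [
    'Hurricane Harvey is now over Aransas Bay.',
    'Eye of Hurricane Harvey is almost onshore near Aransas Pass.',
    'Sustained winds are spreading onto the middle Texas coast.',
]

df = document_frequency('hurricane harvey', docs)
df_rate = round(df / len(docs), 2)
print(df, df_rate)  # 2 0.67
```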

Generate top 10 (t = 10) sumgrams for the Archive-It Hurricane Harvey Collection:

$ sumgram -t 10 cols/harvey/
 rank  sumgram                                              DF   DF-Rate
  1    hurricane harvey                                     20    0.47 
  2    tropical storm harvey                                10    0.23 
  3    2017 houston transtar inc.                           9     0.21 
  4    2017. photo                                          9     0.21 
  5    corpus christi                                       9     0.21 
  6    image 28 of                                          9     0.21 
  7    image 29 of                                          9     0.21 
  8    image 30 of                                          9     0.21 
  9    image 31 of                                          9     0.21 
  10   image 32 of                                          9     0.21

This collection has many images, and the term "image" might obscure more salient ngrams, so let's rerun the command and treat "image" as a stopword (--add-stopwords="image"). As seen below, this change exposes more salient bigrams such as "buffalo bayou" and "coast guard". The argument of --add-stopwords is a comma-separated string of stopwords (e.g., "image, photo, image of"). Use this parameter to add domain-specific stopwords that are not in sumgram's default stopwords list.

$ sumgram -t 10 --add-stopwords="image" cols/harvey/
 rank  sumgram                                              DF   DF-Rate
  1    hurricane harvey                                     20    0.47 
  2    tropical storm harvey                                10    0.23 
  3    2017 houston transtar inc.                           9     0.21 
  4    2017. photo                                          9     0.21 
  5    corpus christi                                       9     0.21 
  6    texas photo                                          9     0.21 
  7    27, 2017                                             8     0.19 
  8    buffalo bayou     