Article Summarizer

This project implements a custom algorithm to extract the most important sentences and keywords from Spanish and English news articles.

It was fully developed in Python and it is inspired by similar projects seen on Reddit news subreddits that use the term frequency–inverse document frequency (tf–idf).

The 3 most important files are:

scraper.py : A Python script that performs web scraping on a given HTML source, it extracts the article title, date and body.
summary.py : A Python script that applies a custom algorithm to a string of text and extracts the top ranked sentences and words.
bot.py : A Reddit bot that checks a subreddit for its latest submissions. It manages a list of already processed submissions to avoid duplicates.

Requirements

This project uses the following Python libraries

spaCy : Used to tokenize the article into sentences and words.
PRAW : Makes the use of the Reddit API very easy.
Requests : To perform HTTP get requests to the articles urls.
BeautifulSoup : Used for extracting the article text.
html5lib : This parser got better compatibility when used with BeautifulSoup.
tldextract : Used to extract the domain from an url.
wordcloud : Used to create word clouds with the article text.

After installing the spaCy library you must install a language model to be able to tokenize the article.

For Spanish you can run this one:

python -m spacy download es_core_news_sm

For other languages please check the following link: https://spacy.io/usage/models

Reddit Bot

The bot is simple in nature, it uses the PRAW library which is very straightforward to use. The bot polls a subreddit every 10 minutes to get its latest submissions.

It first detects if the submission hasn't already been processed and then checks if the submission url is in the whitelist. This whitelist is currently curated by myself.

If the post and its url passes both checks then a process of web scraping is applied to the url, this is where things start getting interesting.

Before replying to the original submission it checks the percentage of the reduction achieved, if it's too low or too high it skips it and moves to the next submission.

Web Scraper

Currently in the whitelist there are already more than 300 different websites of news articles and blogs. Creating specialized web scrapers for each one is simply not feasible.

The second best thing to do is to make the scraper as accurate as possible.

We start the web scraper on the usual way, with the Requests and BeautifulSoup libraries.

with requests.get(article_url) as response:
    
    if response.encoding == "ISO-8859-1":
        response.encoding = "utf-8"

    html_source = response.text

for item in ["</p>", "</blockquote>", "</div>", "</h2>", "</h3>"]:
    html_source = html_source.replace(item, item+"\n")

soup = BeautifulSoup(html_source, "html5lib")

Very few times I got encoding issues caused by an incorrect encoding guess. To avoid this issue I force Requests to decode with utf-8.

Now that we have our article parsed into a soup object we will start by extracting the title and published time.

I used similar methods to extract both values, I first check the most common tags and fallback to the next common alternatives.

Not all websites expose their published date, we sometimes end with an empty string.

article_title = soup.find("title").text.replace("\n", " ").strip()

# If our title is too short we fallback to the first h1 tag.
if len(article_title) <= 5:
    article_title = soup.find("h1").text.replace("\n", " ").strip()

article_date = ""

# We look for the first meta tag that has the word 'time' in it.
for item in soup.find_all("meta"):

    if "time" in item.get("property", ""):

        clean_date = item["content"].split("+")[0].replace("Z", "")
        
        # Use your preferred time formatting.
        article_date = "{:%d-%m-%Y a las %H:%M:%S}".format(
            datetime.fromisoformat(clean_date))
        break

# If we didn't find any meta tag with a datetime we look for a 'time' tag.
if len(article_date) <= 5:
    try:
        article_date = soup.find("time").text.strip()
    except:
        pass

When extracting the text from different tags I often got the strings without separation. I implemented a little hack to add new lines to each tag that usually contains text. This significantly improved the overall accuracy of the tokenizer.

My original idea was to only accept websites that used the <article> tag. It worked ok for the first websites I tested, but I soon realized that very few websites use it and those who use it don't use it correctly.

article = soup.find("article").text

When accessing the .text property of the <article> tag I noticed I was also getting the JavaScript code. I backtracked a bit and removed all tags which could add noise to the article text.

[tag.extract() for tag in soup.find_all(
        ["script", "img", "ul", "time", "h1", "h2", "h3", "iframe", "style", "form", "footer", "figcaption"])]


# These class names/ids are known to add noise or duplicate text to the article.
noisy_names = ["image", "img", "video", "subheadline",
                "hidden", "tract", "caption", "tweet", "expert"]

for tag in soup.find_all("div"):

    tag_id = tag["id"].lower()

    for item in noisy_names:
        if item in tag_id:
            tag.extract()

The above code removed most captions, which usually repeat what's inside in the article.

After that I applied a 3 step process to get the article text.

First I checked all <article> tags and grabbed the one with the longest text.

article = ""

for article_tag in soup.find_all("article"):

    if len(article_tag.text) >= len(article):
        article = article_tag.text

That worked fine for websites that properly used the <article> tag. The longest tag almost always contains the main article.

But that didn't quite worked as expected, I noticed poor quality on the results, sometimes I was getting excerpts for other articles.

That's when I decided to add the fallback, lnstead of only looking for the <article> tag I will be looking for <div> and <section> tags with commonly used id's.

# These names commonly hold the article text.
common_names = ["artic", "summary", "cont", "note", "cuerpo", "body"]

# If the article is too short we look somewhere else.
if len(article) <= 650:

    for tag in soup.find_all(["div", "section"]):

        tag_id = tag["id"].lower()

        for item in common_names:
            if item in tag_id:
                # We guarantee to get the longest div.
                if len(tag.text) >= len(article):
                    article = tag.text

That increased the accuracy quite a bit, I repeated the code but instead of the id attribute I was also looking for the class attribute.

# The article is still too short, let's try one more time.
if len(article) <= 650:

    for tag in soup.find_all(["div", "section"]):

        tag_class = "".join(tag["class"]).lower()

        for item in common_names:
            if item in tag_class:
                # We guarantee to get the longest div.
                if len(tag.text) >= len(article):
                    article = tag.text

Using all the previous methods greatly increased the overall accuracy of the scraper. In some cases I used partial words that share the same letters in English and Spanish (artic -> article/articulo). The scraper was now compatible with all the urls I tested.

We make a final check and if the article is still too short we abort the process and move to the next url, otherwise we move to the summary algorithm.

Summary Algorithm

This algorithm was designed to work primarily on Spanish written articles. It consists on several steps:

Reformat and clean the original article by removing all whitespaces.
Make a copy of the original article and remove all common used words from it.
Split the copied article into words and score each word.
Split the original article into sentences and score each sentence using the scores from the words.
Take the top 5 sentences and top 5 words and return them in chronological order.

Before starting out we need to initialize the spaCy library.

NLP = spacy.load("es_core_news_sm")

That line of code will load the Spanish model which I use the most. If you are using another language please refer to the Requirements section so you know how to install the appropriate model.

Clean the Article

When extracting the text from the article we usually get a lot of whitespaces, mostly from line breaks (\n).

We split the text by that character, then strip all whitespaces and join it again. This is not strictly required to do but helps a lot while debugging the whole process.

Remove Common and Stop Words

At the top of the script we declare the path of the stop words text files. These stop words will be added to a set, guaranteeing no duplicates.

I also added a list with some Spanish and English words that are not stop words but they don't add anything substantial to the article. My personal preference was to hard code them in lowercase form.

Then I added a copy of each word in uppercase and title form. Which means the set will be 3 times the original size.

with open(ES_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

with open(EN_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

extra_words = list()

for word in COMMON_WORDS:
    extra_words.append(word.title())
    extra_words.append(word.upper())

for word in extra_words:
    COMMON_WORDS.add(word)

Scoring Words

Before starting tokenizing our words we must first p

Summarizer

Install / Use

README