Text summarization is one of famous NLP application which had been researched a lot and still at its nascent stage compared to manual summarization. In simple terms, the objective is to condense unstructured text of an article into a summary automatically. There are two types of summarization techniques.

  1. Extractive Summarization
  2. Abstractive Summarization

Extractive Summarization

Extractive summarization is extracting the most important sentences from whole document which would coherently represent the document. The number of sentences picked may depend on the compression ratio of summary. There could be multiple approaches depending on the summarization application. For example,

  • For summarizing the reviews, we may want to pick highly positive and negative sentences as summary.
  • Another approach could be picking only the sentences which contains some object/entity (Company name, person, dates, location etc).
  • Also, if we want to summarize a report or an article, we may want to pick the important or say main sentences from the article. Here, the problem of summarization converges to the problem “How do we know the main/important sentences (Text Rank algorithm)”. This problem is very similar to “How does google searches for important web pages from so many web pages having similar content” (Page Rank algorithm). It is an unsupervised approach.

Abstractive Summarization

This is a hard problem. We want to produce a summary which corresponds to how a professional writer would summarize an article. Abstractive summarizers are generative models as it may come up with a summary containing new words (as opposed to picking existing sentences). Also, an abstractive summarization requires supervised learning procedure to learn how a professionally supervised summary looks like given long articles. There are lot of deep learning models which had been researched for this purpose. Interested readers can check the links here, here and here. Still, it is an active area of research.

This blog post demonstrate extractive summarization using TextRank algorithm (similar to pagerank) in an unsupervised manner. The underlying metric for calculation of ranking is similarity matrix which is similarity of each sentence with all the other sentence in article. We use glove vectors to calculating similarity scores among sentences.

1. Imports

I downloaded the pretrained glove vectors (1.9M vocab words, 300 dim) from here.

import os
import re
import string
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.metrics.pairwise import cosine_similarity

gloveFile = "../glove.42B.300d.txt"

2. Custom Functions: Preprocessing

Below are the 3 python functions for:

  • removing non-ascii characters,
  • preprocessing and cleaning stop words, punctuation marks and digits
  • creating and loading word to glove embedding dictionary.
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def rem_ascii(s):
    return "".join([c for c in s if ord(c) < 128 ])

# Cleaning the text sentences so that punctuation marks, stop words and digits are removed.
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    processed = re.sub(r"\d+","",punc_free)
    return processed

def loadGloveModel(gloveFile):
    word_embeddings = {}
    f = open(gloveFile, encoding='utf-8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
    return word_embeddings

word_embeddings = loadGloveModel(gloveFile)

The above steps are important as we want to only choose important words of sentence for similarity scoring. Readers can check the vocabulary size of this glove model below (1.9M).

print("Vocab Size = ",len(word_embeddings))
Vocab Size =  1917494

Next, we chose a news article which talks about blocking of customer debit cards of certain banks from year 2020 to meet international payment standards. We used NLTK sentence tokenizer to tokenize the article paragraph into sentences.

text = """SBI, PNB, HDFC, ICICI,  Debit Card Customers: If you are using a bank debit card which doesn't have EMV (Europay, Mastercard and Visa), then you may have to face a problem during money withdrawal from the ATM after 31st December 2019 as your debit card may be blocked from 1st january 2020. According to the Reserve Bank of India (RBI) guidelines, all Indian banks need to replace the magnetic debit card of their customers with a new EMV card. This debit card replacement is mandatory as it is aimed at meeting the international payment standards. Hence, SBI, PNB, HDFC Bank, ICICI Bank or any other bank customers who are using a magnetic debit card are advised to replace their debit card otherwise they will have to face difficulty in money withdrawal from the ATM. The RBI guidelines say all Indian banks will have to replace all magnetic chip-based debit cards with EMV and PIN-based cards by 31st December 2019. Keeping in view of the continuing online frauds on magnetic stripe cards, the RBI has proposed to deactivate them by 31st December 2019. So, all magnetic chip-based debit cards will be deactivated from 1st January 2020 (irrespective of the validity of the existing magnetic SBI debit cards). All banks are sending messages to their customers via various means asking them to replace their magnetic chip-based debit card by a new EMV debit card. The SBI warned its debit cardholders through a tweet citing, "Apply now to change your Magnetic Stripe Debit Cards to the more secure EMV Chip and PIN-based SBI Debit card at your home branch by 31st December 2019. Safeguard yourself with guaranteed authenticity, greater security for online payments and added security against fraud". So, the SBI has made it clear that SBI debit card will be blocked if it is a magnetic card. In fact, the SBI has already started deactivating the SBI cards of those SBI accounts in which PAN or Form 60 is not updated."""
sentences = sent_tokenize(text)
cleaned_texts = [rem_ascii(clean(sentence)) for sentence in sentences]
Text Article

3. Sentence Embeddings from Glove

Idea is very simple. We simply calculate the average of embeddings of all the words present the sentence. We have already removed stop words, digits etc which do not contribute to similarity scoring.

sentence_vectors = []
for i in cleaned_texts:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((dim,))) for w in i.split()])/(len(i.split())+0.001)
        v = np.zeros((dim,))

4. Similarity Matrix

We have 300 dimensional vector for each sentence of article now. We create a similarity matrix which keeps cosine distance of each sentences to every other sentence. It is obvious that the matrix is symmetric in nature. Below codes produces matrix and graph to display how a similarity matrix would look like.

sim_mat = np.zeros([len(cleaned_texts), len(cleaned_texts)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,dim),sentence_vectors[j].reshape(1,dim))[0,0]
sim_mat = np.round(sim_mat,3)

# Creating the network graph
nx_graph = nx.from_numpy_array(sim_mat)
plt.figure(figsize=(10, 10))
pos = nx.spring_layout(nx_graph)
nx.draw(nx_graph, with_labels=True, font_weight='bold')
Similarity Matrix
Similarity Graph (NetworkX  pypi)

5. Text Rank Algorithm

Text rank algorithm is inspired from the famous Google’s PageRank algorithm. The underlying assumption of PageRank algorithm is that more important websites are likely to receive more links from other websites. It counts the number and quality of links to a page to determine a rough estimate of how important the website is. An illustrative image (taken from wikipedia) showing the importance of webpage can be seen below.

PageRank illustration

Similar to PageRank, The underlying assumption of text rank algorithm is that the summary sentences are similar (or linked) to most of the other sentences. Essentially, it runs PageRank on similarity graph designed in previous section. It builds a graph using some set of text sentences as vertices. Edges are based on measure of semantic similarity (glove embedding based) between the text sentence vertices.

scores = nx.pagerank(nx_graph)
12 sentences: Scores from PageRank

6. Result

As we got the ranked score of all the 12 sentences in the article. There are simply two steps which had been implemented next.

  1. Sort the sentences in decreasing order of their score. (Also, preserve the index of sentences).
  2. Choose half of the sentences from top (Compression ratio 0.5) and sort it again based on their increasing indices. This will ensure that summary is coherent with original article (picked sentences are in order).
ranked_sentences = sorted(((scores[i],i) for i,s in enumerate(sentences)), reverse=True)
arranged_sentences = sorted(ranked_sentences[0:int(len(sentences)*0.5)], key=lambda x:x[1])
print("\n".join([sentences[x[1]] for x in arranged_sentences]))
Summarized Text (Compression = 50%)

At the End

The application of text summarization could be limitless. The blog-post demonstrates one technique to do so. Also, this is not enough as it only picks the main (kind of) sentences. Abstractive summarization is a way to go but it’s at early stage and an activate area of research. As far as sentence based extractive summarization is concerned, The similarity measure among sentences could be one of the various metrics available. For example,

  • tf-idf score
  • word2vec based similarity
  • doc2vec (sentence to vec)
  • word embeddings from deep learning models like ELMO or BERT based.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy machine learning 🙂