Automatic summarization using Text Rank

Automatic summarization is the technique of shortening a text document with a computer program to create a summary that keeps the major ideas of the original content.

There are two types of automatic summarization: extraction and abstraction. Extractive methods work by selecting existing words, phrases, or sentences from the original text to form the summary. Abstractive methods, on the other hand, use natural language processing techniques to create a summary similar to what a human being would write.

The approach described here is extractive. The basic steps are:

  • Split content into paragraphs.
  • From each paragraph choose the most suitable sentence, i.e. the sentence with the highest rank.
  • Join all top ranked sentences to form the summary.

Let’s look at each step in detail.

Split content into paragraphs

Just as the words in a sentence are separated by whitespace and lines are separated by a new line ‘\n’, paragraphs are separated by a double new line ‘\n\n’.
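The split can be sketched in a few lines. This is a minimal sketch, not the code from the original project: the function name splitIntoParagraphs is made up for illustration, and it assumes that any blank line (including one that contains stray spaces) marks a paragraph break.

```javascript
// Hypothetical helper: split raw text into paragraphs on blank lines.
function splitIntoParagraphs(content) {
  return content
    .split(/\n\s*\n/)             // a blank line separates two paragraphs
    .map((p) => p.trim())         // drop leading/trailing whitespace
    .filter((p) => p.length > 0); // ignore empty chunks
}

console.log(splitIntoParagraphs("First paragraph.\n\nSecond paragraph."));
// → [ 'First paragraph.', 'Second paragraph.' ]
```

Using a regular expression instead of a literal ‘\n\n’ also copes with three or more consecutive new lines.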

From each paragraph choose the most suitable sentence

So what is the most suitable sentence? Well, it is the one with the highest rank. The sentence with the highest rank is the one that shares the most words with the other sentences in its paragraph.
E.g.
His name is Bibhuti. And he is a person who writes code. Also, he write a blog. I code in Java Script too.
Here, the second sentence has the highest rank because it shares the most words with the other sentences, i.e. “is”, “he”, “a”, “code”.

To determine the rank, a dictionary (hash table) is needed for each paragraph. Each key is a sentence, and its value is the number of words that sentence shares with the other sentences in the paragraph.

The intersection is counted as:
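One way to count the intersection is sketched below. It is an assumption, not necessarily how the original project does it: words are lowercased, punctuation is ignored, and each shared word is counted once. The names tokenize and countIntersection are made up for this example.

```javascript
// Hypothetical helper: lowercase a sentence and pull out its words.
function tokenize(sentence) {
  return sentence.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Count the distinct words that appear in both sentences.
function countIntersection(sentenceA, sentenceB) {
  const wordsA = new Set(tokenize(sentenceA));
  const wordsB = new Set(tokenize(sentenceB));
  let shared = 0;
  for (const word of wordsA) {
    if (wordsB.has(word)) shared += 1;
  }
  return shared;
}

console.log(
  countIntersection("And he is a person who writes code.", "Also, he write a blog.")
);
// → 2 ("he" and "a")
```

Note that this matches whole words only, so “writes” and “write” are treated as different words.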

But first let’s make a 2D array that stores, for each paragraph, the intersection of every sentence with every other sentence.
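A minimal sketch of that 2D array, using the example paragraph from earlier. The function names and the whole-word, case-insensitive matching are assumptions made for illustration.

```javascript
// Count the distinct words two sentences share (case-insensitive).
function countIntersection(sentenceA, sentenceB) {
  const toWords = (s) => new Set(s.toLowerCase().match(/[a-z0-9']+/g) || []);
  const wordsA = toWords(sentenceA);
  const wordsB = toWords(sentenceB);
  return [...wordsA].filter((word) => wordsB.has(word)).length;
}

// matrix[i][j] holds the number of words sentence i shares with sentence j.
function buildIntersectionMatrix(sentences) {
  return sentences.map((a) => sentences.map((b) => countIntersection(a, b)));
}

const exampleSentences = [
  "His name is Bibhuti.",
  "And he is a person who writes code.",
  "Also, he write a blog.",
  "I code in Java Script too.",
];

console.log(buildIntersectionMatrix(exampleSentences));
// → [ [ 4, 1, 0, 0 ], [ 1, 8, 2, 1 ], [ 0, 2, 5, 0 ], [ 0, 1, 0, 6 ] ]
```

The diagonal is just each sentence compared with itself, so it is not useful for ranking and can be ignored when the scores are summed.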

Then convert the 2D array into a dictionary. Here the check i == j skips the comparison of a sentence with itself. The score of a sentence is calculated by summing its intersections with every other sentence in the paragraph.
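The conversion might look like the sketch below. The name buildScores is hypothetical, and the matrix values are the shared-word counts for the example paragraph under whole-word, case-insensitive matching (an assumption of this sketch).

```javascript
// Hypothetical helper: turn an intersection matrix into { sentence: score }.
function buildScores(sentences, matrix) {
  const scores = {};
  for (let i = 0; i < sentences.length; i++) {
    let score = 0;
    for (let j = 0; j < sentences.length; j++) {
      if (i === j) continue; // skip a sentence compared with itself
      score += matrix[i][j];
    }
    scores[sentences[i]] = score;
  }
  return scores;
}

const sampleSentences = [
  "His name is Bibhuti.",
  "And he is a person who writes code.",
  "Also, he write a blog.",
  "I code in Java Script too.",
];
// Illustrative counts; the diagonal is ignored by buildScores.
const sampleMatrix = [
  [4, 1, 0, 0],
  [1, 8, 2, 1],
  [0, 2, 5, 0],
  [0, 1, 0, 6],
];

console.log(buildScores(sampleSentences, sampleMatrix));
```

Because the sentence itself is the key, two identical sentences in one paragraph would share a single entry.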

The dictionary for the above example looks like:

From the dictionary, choose the sentence with the maximum rank, i.e. “And he is a person who writes code.”
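Picking the winner is a simple scan over the dictionary. In this sketch the name pickBestSentence is made up, and the scores are what whole-word, case-insensitive matching gives for the example paragraph; the original project may produce different numbers.

```javascript
// Hypothetical helper: return the sentence with the highest score.
// On a tie, the sentence that was added to the dictionary first wins.
function pickBestSentence(scores) {
  let best = null;
  let bestScore = -Infinity;
  for (const [sentence, score] of Object.entries(scores)) {
    if (score > bestScore) {
      best = sentence;
      bestScore = score;
    }
  }
  return best;
}

const sampleScores = {
  "His name is Bibhuti.": 1,
  "And he is a person who writes code.": 4,
  "Also, he write a blog.": 2,
  "I code in Java Script too.": 1,
};

console.log(pickBestSentence(sampleScores));
// → And he is a person who writes code.
```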

In a similar manner, the top ranked sentence can be chosen from every paragraph, and together they form the summary. The finished application is here.
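Putting the steps together, a whole summarizer can be sketched as follows. This is an illustration under stated assumptions, not the code from the repository: all function names are made up, sentences are split naively on “.”, “!” and “?”, words are matched case-insensitively, and the chosen sentences are joined with a single space.

```javascript
// Split text into paragraphs on blank lines.
function splitIntoParagraphs(content) {
  return content.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
}

// Naive sentence splitter: text up to the next ".", "!" or "?".
function splitIntoSentences(paragraph) {
  const matches = paragraph.match(/[^.!?]+[.!?]*/g) || [];
  return matches.map((s) => s.trim()).filter(Boolean);
}

// Count the distinct words two sentences share (case-insensitive).
function countIntersection(sentenceA, sentenceB) {
  const toWords = (s) => new Set(s.toLowerCase().match(/[a-z0-9']+/g) || []);
  const wordsA = toWords(sentenceA);
  const wordsB = toWords(sentenceB);
  return [...wordsA].filter((word) => wordsB.has(word)).length;
}

// Return the sentence that shares the most words with the others.
function bestSentence(paragraph) {
  const sentences = splitIntoSentences(paragraph);
  let best = "";
  let bestScore = -1;
  sentences.forEach((sentence, i) => {
    let score = 0;
    sentences.forEach((other, j) => {
      if (i !== j) score += countIntersection(sentence, other);
    });
    if (score > bestScore) {
      best = sentence;
      bestScore = score;
    }
  });
  return best;
}

// One top ranked sentence per paragraph, joined into the summary.
function summarize(content) {
  return splitIntoParagraphs(content).map(bestSentence).filter(Boolean).join(" ");
}

const sampleText =
  "His name is Bibhuti. And he is a person who writes code. " +
  "Also, he write a blog. I code in Java Script too.\n\n" +
  "Cats sleep a lot. Dogs and cats play a lot. Dogs bark.";

console.log(summarize(sampleText));
// → And he is a person who writes code. Dogs and cats play a lot.
```

A paragraph with only one sentence simply contributes that sentence, since there is nothing to compare it with.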

The source code is on GitHub at https://github.com/bibhuticoder/summaryjs.
