Text summarization using NLP

Published in Analytics Vidhya, Apr 18, 2020

Text summarization is the process of generating a short, fluent, and, most importantly, accurate summary of a longer text document. The main idea behind automatic text summarization is to find a short subset of the most essential information in the full text and present it in a human-readable format. As online textual data grows, automatic text summarization methods can be very helpful because more useful information can be read in less time.

Why automatic text summarization?

Summaries reduce reading time.
When researching documents, summaries make the selection process easier.
Automatic summarization improves the effectiveness of indexing.
Automatic summarization algorithms can be less biased than human summarizers.
Personalized summaries are useful in question-answering systems as they provide personalized information.
Using automatic or semi-automatic summarization systems enables commercial abstract services to increase the number of text documents they are able to process.

Types of summarization:

Based on input type:

Single Document, where the input length is short. Many of the early summarization systems dealt with single-document summarization.
Multi-Document, where the input can be arbitrarily long.

Based on the purpose:

Generic, where the model makes no assumptions about the domain or content of the text to be summarized and treats all inputs as homogeneous. The majority of the work that has been done revolves around generic summarization.
Domain-specific, where the model uses domain-specific knowledge to form a more accurate summary. For example, summarizing research papers of a specific domain, biomedical documents, etc.
Query-based, where the summary only contains information that answers natural language questions about the input text.

Based on output type:

Extractive, where important sentences are selected from the input text to form a summary. Most summarization approaches today are extractive in nature.
Abstractive, where the model forms its own phrases and sentences to offer a more coherent summary, like what a human would generate. This approach is definitely more appealing, but much more difficult than extractive summarization.

How to do text summarization

Text cleaning
Word tokenization
Word-frequency table
Sentence tokenization
Sentence scoring
Summarization

Text cleaning:

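# !pip install -U spacy
# !python -m spacy download en_core_web_sm
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

# spaCy's built-in English stop words
stopwords = list(STOP_WORDS)

# Load the small English pipeline and parse the input
# (`text` is the string defined in the Input section below)
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)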

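Word tokenization and word-frequency table:

tokens = [token.text for token in doc]
print(tokens)

# Treat the newline character as punctuation too
punctuation = punctuation + '\n'

# Count every token that is neither a stop word nor punctuation.
# Note: keys keep the token's original casing.
word_frequencies = {}
for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1
print(word_frequencies)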
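The counting step can also be sketched without spaCy. The version below is only an illustration under simplifying assumptions: it tokenizes with a regular expression instead of spaCy, lowercases every token, and uses a tiny hypothetical stop-word set (`STOP_WORDS` here is not spaCy's list). The helper name `word_frequencies` is likewise made up for this example.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set for illustration only;
# spaCy's STOP_WORDS is much larger.
STOP_WORDS = {"the", "a", "is", "on", "and", "of", "to"}

def word_frequencies(text):
    """Count lowercase word tokens that are not stop words."""
    # Keep runs of letters and apostrophes; punctuation is dropped by the regex.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tok for tok in tokens if tok not in STOP_WORDS)

freqs = word_frequencies("The cat sat on the mat. The cat is happy.")
print(freqs)
```

Lowercasing before counting means "Cat" and "cat" share one entry, which keeps later lookups by lowercase token consistent.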

Frequency normalization and sentence tokenization:

# Scale every count by the highest count so weights fall in (0, 1]
max_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / max_frequency
print(word_frequencies)

# Split the parsed document into sentences
sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)
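To make the normalization concrete, here is a small worked example on a hand-written table. The counts are invented for illustration, and `normalize` is a hypothetical helper name; it builds a new dictionary rather than updating the table in place.

```python
def normalize(frequencies):
    """Scale counts so the most frequent word has weight 1.0."""
    if not frequencies:
        return {}
    max_frequency = max(frequencies.values())
    return {word: count / max_frequency for word, count in frequencies.items()}

# Made-up counts, not taken from the article's input text
counts = {"tennis": 4, "players": 2, "friends": 1}
weights = normalize(counts)
print(weights)  # {'tennis': 1.0, 'players': 0.5, 'friends': 0.25}
```

After this step every weight is relative to the most frequent word, so the absolute length of the document no longer affects the scale of the scores.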

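Sentence scoring:

# A sentence's score is the sum of the normalized frequencies of its words.
# Note: the lookup uses the lowercased token, while the table built above
# is keyed by the original casing, so words that occur only in capitalized
# form (e.g. names) are not matched.
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
print(sentence_scores)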
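The scoring rule (add up the weights of the words a sentence contains) can be illustrated on plain strings. This is a minimal sketch, assuming sentences are already split and the weights are keyed by lowercase word; `score_sentences` and the sample data are hypothetical.

```python
import re

def score_sentences(sentences, weights):
    """Sum the weights of the known words in each sentence."""
    scores = {}
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        score = sum(weights.get(word, 0.0) for word in words)
        # Sentences with no known words are left out of the table
        if score > 0:
            scores[sentence] = score
    return scores

# Made-up weights and sentences for illustration
weights = {"tennis": 1.0, "players": 0.5, "friends": 0.25}
sentences = [
    "Tennis players travel a lot.",
    "Friends matter.",
    "It rained.",
]
scores = score_sentences(sentences, weights)
print(scores)
```

Because scores are plain sums, longer sentences tend to score higher. Dividing by sentence length is one possible variation, though it is not what the code in this article does.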

Summarization:

from heapq import nlargest

# Keep roughly 30% of the sentences
select_length = int(len(sentence_tokens) * 0.3)

# Pick the highest-scoring sentences
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)

# Join the selected sentences into the final summary string
final_summary = [sent.text for sent in summary]
summary = ' '.join(final_summary)
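`nlargest` returns sentences ordered by score, not by their position in the text. The sketch below shows the same selection on plain strings and then, as an optional variation that is not part of the code above, restores document order. `select_summary`, the `ratio` parameter, and the `max(1, ...)` guard are assumptions made for this example.

```python
from heapq import nlargest

def select_summary(sentences, scores, ratio=0.3):
    """Keep the top `ratio` of sentences, restored to document order."""
    # max(1, ...) is an assumption so short inputs still yield one sentence
    select_length = max(1, int(len(sentences) * ratio))
    # Iterating a dict yields its keys; key=scores.get ranks them by score
    top = set(nlargest(select_length, scores, key=scores.get))
    return " ".join(s for s in sentences if s in top)

# Made-up sentences and scores for illustration
sentences = ["A opens.", "B is a key point.", "C adds detail.", "D is the key point."]
scores = {
    "A opens.": 0.2,
    "B is a key point.": 1.1,
    "C adds detail.": 0.4,
    "D is the key point.": 1.5,
}
result = select_summary(sentences, scores, ratio=0.5)
print(result)  # B is a key point. D is the key point.
```

Keeping the original order usually makes an extractive summary easier to follow, since each selected sentence still comes after the context that preceded it.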

Input:

text = """Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: ‘I don’t really hide any feelings too much.I think everyone knows this is my job here. When I’m on the courts or when I’m on the court playing, I’m a competitor and I want to beat every single person whether they’re in the locker room or across the net.So I’m not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.I’m a pretty competitive girl. I say my hellos, but I’m not sending any players flowers as well. Uhm, I’m not really friendly or close to many players.I have not a lot of friends away from the courts.’ When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men’s tour than the women’s tour? ‘No, not at all.I think just because you’re in the same sport doesn’t mean that you have to be friends with everyone just because you’re categorized, you’re a tennis player, so you’re going to get along with tennis players.I think every person has different interests. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life.I think everyone just thinks because we’re tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do.There are so many other things that we’re interested in, that we do.’"""

Output (final summary):

I think just because you’re in the same sport doesn’t mean that you have to be friends with everyone just because you’re categorized, you’re a tennis player, so you’re going to get along with tennis players. Maria Sharapova has basically no friends as tennis players on the WTA Tour. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life. I think everyone just thinks because we’re tennis players So I’m not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. When she said she is not really close to a lot of players, is that something strategic that she is doing?

For the complete code, check out my repo:

https://github.com/anoopsingh1996/Text_Summarization