Text Summarization on Wikipedia pages using NLP

sangeetha natarajan · Published in Analytics Vidhya · 5 min read · Jul 30, 2021

This article covers a practical implementation of both extractive and abstractive text summarization on Wikipedia pages using Python.

Why Text Summarization?

Photo by Glen Carrie on Unsplash

With ever-growing digital content in today's automated world, how cool would it be if, instead of reading huge chunks of text in the form of news articles, papers, or ebooks, we could grab all of their content in just a few meaningful sentences?

This is where Text Summarization comes into play! Let us dig deeper into this below.

Text Summarization:

Text Summarization is an NLP technique used to create short, meaningful collections of text called summaries from text resources such as articles, books, research papers, or even a webpage.

Types of Text Summarization Techniques:

Based on the way the summary is created, text summarization can be classified into two types, namely:

  1. Extractive Summarization: In Extractive summarization, the most important sentences are chosen from the entire text data and are listed together as a summary.
  2. Abstractive Summarization: An abstractive summarization approach is more complex than extractive summarization. Here the summarizer first understands the main concepts of a document, and then generates new sentences which are not seen in the original document.

Extractive Summarization:

Let's see a few of the commonly used extractive summarization techniques.

TextRank Algorithm:

The TextRank algorithm first splits the text into sentences and builds a similarity matrix that holds the pairwise similarity between sentences. The matrix is then converted into a graph, with sentences as vertices and similarity scores as edge weights, and the sentences are ranked over this graph (in the spirit of PageRank). The top-ranked sentences are listed as the summary.
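
To make the idea concrete, here is a minimal TextRank sketch (an illustration of the technique, not gensim's exact implementation) that builds a TF-IDF cosine-similarity graph over sentences and ranks them with PageRank. It assumes scikit-learn and networkx are installed:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

def textrank_summary(sentences, top_n=2):
    # Similarity matrix: pairwise cosine similarity of TF-IDF sentence vectors
    sim_matrix = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    # Graph with sentences as vertices and similarity scores as edge weights
    scores = nx.pagerank(nx.from_numpy_array(sim_matrix))
    # Keep the top-ranked sentences, in their original order
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return [sentences[i] for i in top]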

Now let's see how to run the TextRank algorithm in Python using gensim.

# Importing the gensim package and summarizer
# (gensim.summarization was removed in gensim 4.0, so this requires gensim < 4.0)
import gensim
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
import wikipedia

Now let's get the Wikipedia content for “Microsoft”.

# Get wiki content.
wikisearch = wikipedia.page("Microsoft")
wikicontent = wikisearch.content
print(wikicontent)
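
As a side note, if a page title is ambiguous the wikipedia library raises a DisambiguationError; this small optional guard (not part of the original walkthrough) keeps the script from crashing:

# Optional: fall back to the first suggested page if the title is ambiguous
try:
    wikisearch = wikipedia.page("Microsoft")
except wikipedia.DisambiguationError as e:
    wikisearch = wikipedia.page(e.options[0])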

Now let's summarize with the TextRank algorithm by creating a summary that is 1% of the original content (ratio=0.01).

# Summary by 1% of the original content
summary_ratio = summarize(wikicontent, ratio = 0.01)
print("summary by ratio")
print(summary_ratio)

Let's summarize by the count of words. Here we have used 200 words to create the summary.

# Summary by word count
summary_wordcount = summarize(wikicontent, word_count = 200)
print("summary by word count")
print(summary_wordcount)
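
Since the keywords helper was also imported from gensim.summarization above, we can additionally pull out the most important keywords from the same content:

# Top 10 keywords from the Wikipedia content
print(keywords(wikicontent, words=10))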

Now let's try another extractive summarizer, LSA (Latent Semantic Analysis).

LSA (Latent Semantic Analysis):

The LSA method extracts semantically significant sentences using information such as which words are used together and which common words appear in different sentences. A high number of common words among sentences indicates that the sentences are semantically related. These semantically significant sentences are then listed together as a summary.
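
Under the hood, LSA builds a term-sentence matrix and applies singular value decomposition (SVD) to uncover latent topics. Here is a minimal sketch of that scoring idea (an illustration, not sumy's exact implementation), assuming scikit-learn is installed:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_sentence_scores(sentences, n_topics=2):
    # Term-sentence matrix weighted by TF-IDF
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # SVD projects each sentence onto the latent topics
    topic_weights = TruncatedSVD(n_components=n_topics).fit_transform(tfidf)
    # Score each sentence by its overall strength across the topics
    return np.linalg.norm(topic_weights, axis=1)

Sentences with the highest scores would be kept for the summary.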

Now let's implement LSA on the same Wikipedia content using the sumy library.

# Importing the parser, tokenizer, and LSA summarizer from sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

# Parsing the Wikipedia content fetched earlier
my_parser = PlaintextParser.from_string(wikicontent, Tokenizer("english"))

# Creating the LSA summarizer and producing a 5-sentence summary
lsa_summarizer = LsaSummarizer()
lsa_summary = lsa_summarizer(my_parser.document, 5)
for sentence in lsa_summary:
    print(sentence)

Although the above extractive summaries are easy to implement and helpful in highlighting the key sentences of a long text document, some of them might be hard to understand without adequate context.

In such cases where the context of the document is to be considered in generating meaningful and concise summaries, we could go for Abstractive Summarization methods.

Abstractive Summarization:

In abstractive summarization, new sentences that best describe the entire document are generated and listed as a summary.

In this article, we will discuss BART transformers pre-trained on CNN news data.

BART Transformers:

BART (Bidirectional and Auto-Regressive Transformers) uses a standard seq2seq machine translation architecture with a bidirectional encoder (similar to BERT) and a left-to-right decoder (similar to GPT). The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.
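
As a toy illustration of the text-infilling objective (not BART's actual pretraining code), replacing a random span of tokens with a single mask token looks like this:

import random

def text_infill(tokens, mask_token="<mask>", span_len=3):
    # Replace a random span of tokens with one mask token, so the model
    # must also learn how many tokens are missing from the span
    start = random.randrange(0, len(tokens) - span_len + 1)
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

print(text_infill("Microsoft was founded by Bill Gates and Paul Allen".split()))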

In BART, input texts are first passed through the bidirectional encoder, i.e. the BERT-like encoder, which looks at the text from left-to-right and right-to-left. The encoder output is then used by the autoregressive decoder, which predicts the output based on the encoder input and the output tokens predicted so far.

More details about BART can be found in the references below.

To apply BART to the above data, let's first install Hugging Face Transformers. This library runs on top of PyTorch and TensorFlow and can be used for a variety of language tasks like summarization, question answering, etc.
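
If it is not already installed, the library is available from PyPI:

pip install transformers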

# importing BART
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

Now, let's load the BART model and tokenizer,

# Loading the model and tokenizer for bart-large-cnn
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

Now let's feed the input text to the tokenizer.batch_encode_plus() method and generate the summary ids. Passing truncation=True clips the article to BART's 1024-token input limit.

inputs = tokenizer.batch_encode_plus([wikicontent], return_tensors='pt', truncation=True)
# num_beams enables beam search; early_stopping has an effect only with beam search
summary_ids = model.generate(inputs['input_ids'], num_beams=4, early_stopping=True)

Let's now decode the summary ids and print the summary output.

# Decoding and printing the summary
bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(bart_summary)

And the summary output looks as below,

Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800. It rose to dominate the personal computer operating system market with MS-DOS in the mid-1980s, followed by Microsoft Windows. The company's 1986 initial public offering (IPO) created three billionaires and an estimated 12,000 millionaires among Microsoft employees. In 2018 Microsoft reclaimed its position as the most valuable publicly traded company in the world. As of 2020, Microsoft has the third-highest global brand valuation.

Short, precise, and looking much better than the previous summaries, right? Do let me know your comments and feedback below.
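
As an aside, the tokenize-generate-decode steps above can also be wrapped into a single call with the Transformers summarization pipeline (a convenience sketch; the max_length and min_length values are just example settings):

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# truncation=True clips inputs beyond the model's 1024-token limit
print(summarizer(wikicontent, max_length=130, min_length=30, truncation=True)[0]['summary_text'])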

Note: You can find the entire code in this GitHub Repo.

References:

  1. Abstractive Summarization on COVID-19 Publications with BART. http://cs230.stanford.edu/projects_spring_2020/reports/38866168.pdf
  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/pdf/1810.04805.pdf
  3. BartForConditionalGeneration documentation. https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration
