Text Summarization Guide: Exploratory Data Analysis on Text Data

Eduardo Muñoz · Published in The Startup · 12 min read · Dec 29, 2020

Part 1 of a series about text summarization using machine learning techniques

Photo by Romain Vignes on Unsplash

While working on my capstone project for the Machine Learning Engineer Nanodegree at Udacity, I studied the problem of text summarization in some depth. For that reason, I am going to write a series of articles about it, from the definition of the problem to some approaches to solve it, showing basic implementations and algorithms and describing and testing more advanced techniques. The series will span several posts over the next few weeks or months.
I will also take advantage of powerful tools like Amazon SageMaker containers, hyperparameter tuning, transformers and Weights & Biases logging to show you how to use them to improve and evaluate the performance of the models.

As a summary, some of the posts will introduce concepts like:

  • Exploratory Data Analysis for text, to dive deeper into the features of the text and its distribution of words.
  • Extractive solutions: using a simple function from a popular library, gensim, and a sentence clustering algorithm.
  • Abstractive summarization using LSTMs and the attention mechanism.
  • The Pointer-Generator network, an extension of the encoder-decoder and a mix between extractive and abstractive approaches.
  • The Transformer model, which extends the attention concept into a first solution.
  • Advanced transformer models like T5 or BART from Hugging Face's fantastic transformers library.
  • And more.

Problem Statement

Text summarization is a challenging problem these days. It can be defined as the technique of shortening a long piece of text to create a coherent and fluent short summary containing only the main points of the document.

But what is a summary? It is a “text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s). Summarization clearly involves both these still poorly understood processes, and adds a third (condensation, abstraction, generalization)” [3]. Or, as described in [4], text summarization is “the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks)”.

At this moment, it is a very active field of research and state-of-the-art solutions are still not as successful as we might expect.

Exploratory Data Analysis

We can read this definition on Wikipedia: “exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods”. This step is absolutely necessary and has a huge impact on the final result of a model: the analysis will tell us what type of transformations we need to apply to our data.

When we face a machine learning problem, we must begin by analyzing the input dataset, seeking to identify its characteristics and anomalies. In this post we will explore the dataset, analyze the main features and characteristics of the data and visualize some figures to help us understand what our dataset looks like and how we need to process it so our algorithms perform better.

We should spend enough time diving into the data to extract information, or even knowledge, about the scope of our task and an intuition of how to handle and transform it. We will apply some common techniques for data exploration and visualization.

At the end of the post you can find a link to the notebook with the code and figures we are going to describe.

Data Definition

For this text summarization problem, we will use a dataset from Kaggle called Inshorts News Data. Inshorts is a news service that provides short summaries of news from around the web, scraping articles from The Hindu, The Indian Times and The Guardian. The dataset contains the headline and a short text for about 55,000 news items, along with their source.

An example:

  • Text: “TV news anchor Arnab Goswami has said he was told he could not do the programme two days before leaving Times Now. 18th November was my last day, I was not allowed to enter my own studio, Goswami added. When you build an institution and are not allowed to enter your own studio, you feel sad, the journalist further said”.
  • Summary: “Was stopped from entering my own studio at Times Now: Arnab”

Code to load and prepare the dataset:
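A minimal sketch of this step with Pandas, assuming the Kaggle CSV is named news_summary.csv and uses Latin-1 encoding (both assumptions):

```python
import pandas as pd

# Assumed file name and encoding for the Kaggle "Inshorts News Data" CSV
df = pd.read_csv("news_summary.csv", encoding="latin-1")

# Quick overview of the raw data
print(df.shape)
df.describe(include="all")
```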

Descriptive statistics

First of all, we extract the basic statistics: count of rows, unique rows, frequencies, etc. These attributes of the text data will tell us whether we need to remove repeated rows or how many rows contain null values in any column.

Output from a Pandas Dataframe describe method

We can observe that there are two variables describing the text of the news:

  • Short: this is the longest variable and contains the text of the news item.
  • Headline: this is a summary or highlights composed of one or two sentences.

In our project we will work with the Short variable as the text feature and the Headline variable will be our target summary.

Reading the previous table, we can observe that there are some rows with a null or repeated value in the variable Short, so our first step is to drop those rows. Our problem does not need any variables other than the full text of the article and the corresponding summary, so we can remove Source, Date, etc.
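A short sketch of that clean-up, assuming the column names shown in the table above:

```python
# Drop rows with null values and rows whose Short text is repeated
df = df.dropna(subset=["Short", "Headline"])
df = df.drop_duplicates(subset="Short").reset_index(drop=True)

# Keep only the columns we need: the full text and the target summary
df = df[["Short", "Headline"]]
print(df.shape)
```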

Exploring relevant features in the data

The dataset contains only the two columns of interest — summary and text. In this section we will create some additional features that provide relevant information about the composition of our texts. The following list explains different ideas for creating new features:

Statistical Count Features from headlines and text that we are going to explore:

  • Sentence Count — Total number of sentences in the text
  • Word Count — Total number of words in the text
  • Character Count — Total number of characters in the text excluding spaces
  • Sentence density — Number of sentences relative to the number of words
  • Word Density — Average length of the words used in the text
  • Punctuation Count — Total number of punctuation marks used in the text
  • Stopwords Count — Total number of common stopwords in the text

Then, we calculate these features on our dataset (a Pandas Dataframe containing the source text and the target summary).
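A possible implementation of these features, as a sketch: the helper name and the text_/summary_ prefixes are mine, and it relies on the NLTK tokenizers and stopword list:

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download("punkt"); nltk.download("stopwords")  # required on the first run
stop_words = set(stopwords.words("english"))

def add_count_features(df, col, prefix):
    """Add the statistical count features described above for one text column."""
    df[prefix + "_sent_count"] = df[col].apply(lambda t: len(sent_tokenize(t)))
    df[prefix + "_word_count"] = df[col].apply(lambda t: len(word_tokenize(t)))
    df[prefix + "_char_count"] = df[col].apply(lambda t: len(t.replace(" ", "")))
    df[prefix + "_sent_density"] = df[prefix + "_sent_count"] / (df[prefix + "_word_count"] + 1)
    df[prefix + "_word_density"] = df[prefix + "_char_count"] / (df[prefix + "_word_count"] + 1)
    df[prefix + "_punct_count"] = df[col].apply(
        lambda t: sum(ch in string.punctuation for ch in t))
    df[prefix + "_stopword_count"] = df[col].apply(
        lambda t: sum(w.lower() in stop_words for w in word_tokenize(t)))
    return df

df = add_count_features(df, "Short", "text")
df = add_count_features(df, "Headline", "summary")
```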

Analyze the feature distributions on the text variable

Now that we have calculated the new features, we can analyze the descriptive statistics to identify the main insights on the data distribution and outliers.
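As a quick sketch, using the feature names introduced in the snippet above:

```python
text_features = ["text_sent_count", "text_word_count", "text_char_count",
                 "text_word_density", "text_punct_count", "text_stopword_count"]
# Descriptive statistics of the new count features on the source text
df[text_features].describe()
```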

Image by author

Summarizing:

  • 3 sentences, 60 words and 311 characters per row are the mean values and they are very close to the median values.
  • Standard deviations are quite small.
  • We observe that the maximum number of sentences (9) and characters (more than 400) are far from the mean values, indicating that there are some records with out-of-range values or outliers.

The next code defines functions that plot figures to help us visualize the features:
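A minimal sketch of such a plotting helper (the function name is mine) using Matplotlib:

```python
import matplotlib.pyplot as plt

def plot_feature_histograms(df, features, bins=30):
    """Plot one histogram per feature to inspect its distribution."""
    fig, axes = plt.subplots(1, len(features), figsize=(5 * len(features), 4))
    for ax, feature in zip(axes, features):
        ax.hist(df[feature].dropna(), bins=bins)
        ax.set_title(feature)
    plt.tight_layout()
    plt.show()

plot_feature_histograms(df, ["text_sent_count", "text_word_count", "text_char_count"])
```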

Let’s dive into these features to get a better understanding by plotting some figures:

Image by author

Above we have plotted the histograms of our features. The number of rows with outlier values is small; we could consider removing them, but it does not look like a big deal. The word count looks like a left-skewed distribution, with 75% of rows in the range of 55–60 words, and the character count follows a normal distribution. We do not identify strange examples or data distributions.

Analyze the feature distributions on the summary variable

We repeat the previous step for the summary field:
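Reusing the hypothetical helper and feature names from the sketches above:

```python
# Same histograms, this time for the summary features
plot_feature_histograms(df, ["summary_sent_count", "summary_word_count", "summary_char_count"])
```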

Image by author
Image by author

Now, the distributions look normal and no anomalies are observed:

  • The distributions of words and sentences are close to the mean values and the standard deviations are relatively small.
  • Most summaries are composed of 1 sentence and the number of words is very close to 7–8.
  • The number of characters is mostly between 40 and 50.
  • There are only 1 or 2 records with large values; outliers are not a problem and we can remove them.

Categorizing and POS tagging words

Another group of features we can inspect in text data are the Part-Of-Speech (POS) tags:

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories.

Natural Language Processing with Python, by S. Bird, E. Klein and E. Loper[1]

Our goal in the next section is to identify the POS tags and analyze their distribution across the dataset. Every word in a text is tagged as a noun, determiner, adjective, adverb, etc. We might observe some interesting behavior, but that is not frequent; usually, the distribution of the tags is coherent and depends on the domain or context.

To help us with this task, the NLTK library provides a function, pos_tag, which receives a list of words as input and returns the part-of-speech tag of every word.

Then we can plot a histogram to check the distribution:
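A sketch of how this could look with NLTK; the counting helper is mine, and the tagger model has to be downloaded the first time:

```python
import matplotlib.pyplot as plt
import nltk
from collections import Counter
from nltk.tokenize import word_tokenize

# nltk.download("averaged_perceptron_tagger")  # required on the first run

def pos_tag_distribution(texts):
    """Count the POS tags over a collection of texts."""
    counter = Counter()
    for text in texts:
        counter.update(tag for _, tag in nltk.pos_tag(word_tokenize(text)))
    return counter

tag_counts = pos_tag_distribution(df["Short"])

# Bar plot of the most frequent tags
tags, counts = zip(*tag_counts.most_common(20))
plt.figure(figsize=(12, 4))
plt.bar(tags, counts)
plt.title("POS tag distribution in the source texts")
plt.show()
```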

Image by author

Check for Unknown words

It is very common for unknown words to appear in our texts and summaries; consequently, we should analyze them and will probably have to define how to deal with them. Most of the unknown words are names, surnames, locations or even misspelled words, which we have to decide whether or not to correct.

In order to search for these words, we need a vocabulary to compare against. In this case, we use the GloVe embeddings, checking whether our words are included in their vocabulary.

We can show the distribution of unknown words in our texts to get a quick insight into their relevance:
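A sketch of this check, assuming a local copy of the GloVe vectors (the file name glove.6B.100d.txt is an assumption):

```python
from nltk.tokenize import word_tokenize

def load_glove_vocab(path="glove.6B.100d.txt"):
    """Collect the vocabulary (first token of each line) from a GloVe embeddings file."""
    with open(path, encoding="utf-8") as f:
        return {line.split(" ", 1)[0] for line in f}

glove_vocab = load_glove_vocab()

def count_unknown_words(text, vocab):
    """Count tokens that are not present in the embedding vocabulary."""
    return sum(w.lower() not in vocab for w in word_tokenize(text))

df["text_unknown_count"] = df["Short"].apply(lambda t: count_unknown_words(t, glove_vocab))
df["summary_unknown_count"] = df["Headline"].apply(lambda t: count_unknown_words(t, glove_vocab))
df[["text_unknown_count", "summary_unknown_count"]].describe()
```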

Image by author
  • In our source texts the mean is 0.8 and 75% of the texts are below 1. We can conclude that they are not an issue.
  • In the case of the summary variable, unknown words are not present.
  • There are only 1 or 2 examples with unknown words; we can remove them or just ignore them.

Use of stopwords and punctuations

Now that we have a more accurate vision of the composition of our texts, we need to analyze the use of stopwords and punctuation. This analysis will indicate whether these “special” types of tokens should be removed or transformed when we train our models.

As we did previously, we rely on the NLTK library, which provides a list of stopwords for English texts, so we can look for them in our dataset. Now, let's explore the histograms of the stopword and punctuation counts to get a better intuition about the texts we are going to work with.
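These counts were already computed in the feature sketch above, so we can simply plot their histograms with the same hypothetical helper:

```python
plot_feature_histograms(df, ["text_stopword_count", "text_punct_count"])
```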

Image by author

The figures look normal and do not show hidden or unexpected patterns.

Most frequent terms and Wordclouds

The domain or context of our texts will determine the most frequent words; therefore, it is important to verify what those words are, identify the domains and confirm that they are the expected ones.

“A Wordcloud (or Tag cloud) is a visual representation of text data. It displays a list of words, the importance of each being shown with font size or color (the bigger, the more frequent). This format is useful for quickly perceiving the most relevant terms on a document or set of documents.”

Wikipedia

We will draw the wordclouds for the source texts and the summaries and compare whether they are very similar; this will allow us to check that the relevant concepts have been correctly extracted into the summaries.

Build the wordclouds for the source texts and the summaries:
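A sketch using the wordcloud package (the helper name and the styling parameters are mine):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

def show_wordcloud(texts, title):
    """Draw a wordcloud over all the texts of a column."""
    wc = WordCloud(width=800, height=400, background_color="white",
                   stopwords=STOPWORDS).generate(" ".join(texts))
    plt.figure(figsize=(12, 6))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

show_wordcloud(df["Short"], "Source texts")
show_wordcloud(df["Headline"], "Summaries")
```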

Image by author

As might be expected, they both contain almost the same words: india, first, day, time. There are some exceptions, but they both share the same domain.

Topic Modelling

In our next step, we will learn how to identify which topics are discussed in our texts; this process is called topic modelling. “In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents” (Wikipedia). Using an unsupervised technique, this method tries to find semantic structures in the texts in order to group related words into a topic representation.

In particular, we will cover Latent Dirichlet Allocation (LDA), a widely used topic modelling technique, and we will apply it to convert our set of source texts into a set of topics.

There are several existing algorithms you can use to perform topic modeling. The most common are Latent Semantic Analysis (LSA/LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).

[5], “Topic Modeling in Python: Latent Dirichlet Allocation (LDA)” by Shashank Kapadia.

But first, we need to transform our text data into a format that can serve as input to the LDA model. We convert each text to a vector representation where each word is replaced by an integer. In this case we will apply the CountVectorizer method, which represents each document by the number of occurrences of each word of the corpus (our group of texts). The result is a document-term matrix where the most frequent words are assigned higher values.
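A sketch of the vectorization step with Scikit-learn (the min_df threshold is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Build the document-term matrix; English stopwords are removed and very rare
# words (appearing in fewer than 5 documents) are dropped
vectorizer = CountVectorizer(stop_words="english", min_df=5)
dtm = vectorizer.fit_transform(df["Short"])
print(dtm.shape)  # (number of documents, vocabulary size)
```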

Image by author

The Scikit-learn library provides an implementation of LDA that returns, for each topic, the tokens it includes. In our example, we define 10 topics to discover and show the 8 main elements or words in each one:
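A sketch of this step (get_feature_names_out assumes a recent Scikit-learn version; older ones expose get_feature_names):

```python
from sklearn.decomposition import LatentDirichletAllocation

# Fit LDA with 10 topics, as in the post
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(dtm)

# Print the 8 most relevant words of each topic
feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-8:][::-1]]
    print(f"Topic {idx}: {', '.join(top_words)}")
```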

Image by author

In our dataset, all the news items are closely related and include terms like said, india, people. It is hard to clearly identify the topics, but, for example, number 4 looks like it refers to politicians and number 3 to a sports competition.

Visualizing the topic modelling results

Visualizing the topics will help us interpret them. The pyLDAvis library plots an interactive figure where each circle corresponds to a topic: the size of the circle represents its importance in the texts and the distance between circles indicates how similar they are. You can select a topic to display its most relevant words and the frequency with which each word appears in the topic and in the corpus.

The relevance metric parameter λ distinguishes between words that are exclusive to the topic (closer to 0) and words with a high probability of appearing in the selected topic (closer to 1). Playing around with this parameter can help us assign a “name” to each topic.
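A sketch of the visualization in a notebook, assuming the classic pyLDAvis Scikit-learn adapter (renamed to pyLDAvis.lda_model in recent releases):

```python
import pyLDAvis
import pyLDAvis.sklearn  # pyLDAvis.lda_model in recent versions

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda, dtm, vectorizer)
panel
```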

Image by author

In the figure shown, we can select a topic and review which relevant words it includes and their frequency in the dataset. It is easy to compare topics and search for anomalies or points worth mentioning.

All the code is publicly available in a Jupyter Notebook in my GitHub repository.

I hope you find this post and the concepts explained interesting and useful.

References

[1]- Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, 2019.

[3]- Hovy, E. H. Automated Text Summarization. In R. Mitkov (ed), The Oxford Handbook of Computational Linguistics, chapter 32, pages 583–598. Oxford University Press, 2005

[4]- Mani, I., House, D., Klein, G., et al. The TIPSTER SUMMAC Text Summarization Evaluation. In Proceedings of EACL, 1999.

[5]- Shashank Kapadia, “Topic Modeling in Python: Latent Dirichlet Allocation (LDA)”, Medium post, 2019.

[6]- Susan Li, “Topic Modelling in Python with NLTK and Gensim”, Medium post, 2018.
