Abstractive Text Summarization with Natural Language Processing

Mars Xiang · Published in The Startup · 8 min read · Jun 6, 2020

We read books, newspapers, articles, emails, and magazines every day. You are reading an article right now. There is no denying that text in all forms plays a huge role in our lives. However, almost all the text we read is stretched out far longer than it needs to be.

This paragraph,

I recently visited your company, and I was disgusted by the quality of your grass. As soon as I came near your building, I noticed that the grass was yellow. And brown. There were weeds everywhere, certain parts were overgrown, and others were cut too short. I don’t know how hundreds of people stand to walk past your building every day. I’m serious. Please do something about the grass outside your building, or your company will not be successful.

can be summarized into just three words:

Fix your grass.

By leveraging the power of natural language processing, text data can be summarized into concise and accurate segments of the original, capturing the main idea, while being short and easy to read.

The Basics of Summarization

General, accurate, and robust automatic text summarization would improve efficiency and work speed throughout the world. It is already being put to use in applications such as media monitoring, financial research, medical cases, and legal contract analysis.

Text summarization can be split into two main types:

  • Extractive summarization chooses specific sentences from the text to compile a summary.
  • Abstractive summarization generates a summary in the computer’s own words.

Along with these, there are numerous subcategories; a few are listed here:

  • Single-document summarization condenses a single piece of text, while multi-document summarization analyzes a collection of texts and creates a summary that generalizes their content.
  • Indicative summarization captures the general meaning of the text, while informative summarization includes all the fine details.

Sequences

Sequential data is data that takes the form of a list of varying length. Sequences can be difficult for traditional neural networks to process, since they carry a notion of order and their length may vary.

For example, consider the lyrics of a song, a sequence of words. The idea of an order means that certain words naturally come “before” others. It is easy to remember the words in the normal order, but much harder to recall the lyrics backwards.

In the real world, sequences can be any kind of data that varies in length and has a general notion of order. Some examples are texts, audio recordings, and video recordings.

Additionally, we may want to use sequences in the input, output, or even both, in a machine learning application.

However, it is challenging to perform calculations on them with normal neural networks. A fixed set of input nodes cannot capture the idea of order, and we do not know in advance how many nodes will be needed to represent a sequence.
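As a minimal sketch of the length problem (the token IDs and the zero-padding workaround here are illustrative assumptions, not something the article prescribes), sequences of different lengths cannot be stacked into one fixed-size input matrix without extra work:

```python
import numpy as np

# Three hypothetical sequences, e.g. sentences already encoded as integer token IDs.
sequences = [
    [4, 17, 9],
    [12, 3],
    [8, 21, 5, 2, 16],
]

# Their lengths differ, so they cannot be stacked directly into one fixed-size matrix.
# One common workaround is to pad every sequence with zeros up to the longest length.
max_len = max(len(seq) for seq in sequences)
padded = np.zeros((len(sequences), max_len), dtype=int)
for i, seq in enumerate(sequences):
    padded[i, :len(seq)] = seq

print(padded)
# [[ 4 17  9  0  0]
#  [12  3  0  0  0]
#  [ 8 21  5  2 16]]
```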

Sequential Networks

New network architectures were developed a few decades ago to deal with sequential data.

RNNs

Recurrent neural networks are a type of network in which the layers are used recurrently, or repeatedly, meaning every layer is the same. The network takes in a part of the sequence at each time step and performs a calculation on it. Specifically, at each time step, it uses the previous time step’s hidden state and a new part of the input sequence to make a new output. This is then passed to the next time step, along with the next part of the sequence.

For some time step, say step 2, the model takes in the vector from the previous hidden layer (hidden 1) and the current input (input 2) to make a hidden result (hidden 2) and an output (output 2). The hidden result and the output are the same vector. The operation inside the hidden layer is just a matrix multiplication followed by an activation function.

For each hidden layer, the weights and bias are the same. Hidden 1, 2, and 3 all use the same parameters, so we can train this for any sequence length and keep reusing the layers. The only difference between each hidden layer is that it receives different inputs, namely the previous hidden layer and the current input subsequence. The first hidden layer usually receives a vector of zeros as its hidden-layer input.

Many outputs are created, and in different applications we can choose whether or not to use them. Note that the output for a certain time step is exactly the same vector that is fed to the next time step as input.

If we change the orientation of the diagram slightly, the unrolled network is actually very similar to a normal neural network.

The differences between a normal neural network and a recurrent neural network are that new inputs are constantly fed in, the outputs of the hidden layers are only sometimes used, and the layers are all the same.
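As a rough sketch of that idea (hypothetical dimensions and random weights, not the article’s exact model), a simple RNN applies the same weights at every time step:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3                      # hypothetical dimensions

W_x = rng.normal(size=(hidden_size, input_size))    # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))   # hidden-to-hidden weights
b = np.zeros(hidden_size)                           # shared bias

def rnn_step(h_prev, x_t):
    # One time step: combine the previous hidden state with the current input,
    # then apply an activation function. The same weights are reused at every step.
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

sequence = rng.normal(size=(5, input_size))         # a sequence of 5 input vectors
h = np.zeros(hidden_size)                           # the first hidden input is all zeros
outputs = []
for x_t in sequence:
    h = rnn_step(h, x_t)                            # the hidden result doubles as the output
    outputs.append(h)
```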

LSTMs

The issue with recurrent neural networks is that it is hard for them to remember information over a long period of time. The information they may want to keep mixes with new information after each time step and becomes very diluted.

Since the previous hidden layer makes up only half of each new input, the proportion of older information becomes exponentially smaller as time steps pass.

An animation by Michael Phi illustrates this concept very well: the contribution of the first input word becomes exponentially smaller through time.

The long short-term memory (LSTM) network is a type of recurrent neural network with the added ability to choose what is important to remember and what it should forget. This makes it useful for both long-term and short-term memory.

Figure: an LSTM hidden layer, which takes the previous memory cell (c_{t-1}), the previous hidden state (h_{t-1}), and the current input (x_t), and outputs a new hidden state (h_t) and a new memory cell (c_t). Source: Guillaume Chevalier, own work, CC BY 4.0, via Wikipedia.

The difference between the RNN and the LSTM is the memory cell. This is where important memory is stored for a long period of time. The memory cell is a vector that has the same dimension as the hidden layer’s output.

Unlike the hidden state, which is overwritten at every time step, the memory cell is only changed according to very strict rules.

The left side of the LSTM layer represents changes to the memory cell.
  1. First, the previous hidden layer’s output and the current input are passed to a layer with a sigmoid activation function, to determine how much the memory cell should forget of its existing value.
  2. Second, the previous hidden layer and the current input are passed to a layer with a hyperbolic tangent activation function, to determine new candidate values for the memory cell.
  3. Finally, the previous hidden layer and the current input are passed to a layer with a sigmoid activation function, to determine how much of the candidates are integrated into the memory cell.

Note that the layers that decide what to forget and what to add are sigmoid layers, which output a number between 0 and 1. Since sigmoid is capable of outputting numbers very close to 0 and 1, it is very possible that memory is completely replaced.

Also note that the candidates are decided using the tanh function, which outputs a number between -1 and 1.

The right side of the LSTM layer represents how the memory cell shapes the output of the whole layer. After the memory cell has been updated, it is passed through a tanh function and scaled by a final sigmoid (output) gate to produce the layer’s hidden output.
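Putting the gates together, here is a rough numpy sketch of a single LSTM step (hypothetical dimensions, random weights, and no bias terms; a real layer would learn all of these):

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 4, 3                      # hypothetical dimensions

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, each acting on the previous hidden state
# concatenated with the current input.
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden_size, hidden_size + input_size))
                      for _ in range(4))

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)           # forget gate: how much of the old memory to keep
    i = sigmoid(W_i @ z)           # input gate: how much of the candidates to add
    c_tilde = np.tanh(W_c @ z)     # candidate values for the memory cell
    c = f * c_prev + i * c_tilde   # update the memory cell under strict rules
    o = sigmoid(W_o @ z)           # output gate: how much of the memory to expose
    h = o * np.tanh(c)             # new hidden state, which is also the layer's output
    return h, c

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):        # a sequence of 5 input vectors
    h, c = lstm_step(h, c, x_t)
```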

The LSTM network is proficient at holding on to long term information, since it can decide when to remember information, and when to forget it.

Computers Suck at English: They Are Only Good at Math

For a normal neural network to function, we must pass in some vectors as inputs, and expect some vectors as outputs. How can we do that when dealing with sequences of English text?

The answer, created in 2013 by Google, was an approach called Word2vec, which, unsurprisingly, mapped words to vectors. Continuous bag of words (CBOW) trains a model to predict a word from its surrounding context, while skip-gram trains it to predict the surrounding context from a word. In both cases, words that appear in similar contexts end up with similar vectors.

The vectors of similar words, like “poodles” and “beagles” would be very close together, and different words, like “of” and “math” would be far apart.

Another study by Stanford University in 2014, GloVe, proposed a similar idea, but this time built the vectors from how frequently words appear together: words that often co-occur should be close together, and words that rarely co-occur should be far apart.

A mapping of words to vectors is called a word embedding. Word embeddings let us perform numerical operations on all kinds of text, such as comparisons and arithmetic.
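For example, here is a small sketch of training and querying word vectors with the gensim library (assuming gensim 4.x; the toy corpus and parameters are made up for illustration, and a corpus this small will not produce meaningful vectors):

```python
from gensim.models import Word2Vec

# A tiny toy corpus: each document is a list of tokens.
corpus = [
    ["poodles", "are", "friendly", "dogs"],
    ["beagles", "are", "curious", "dogs"],
    ["math", "is", "the", "study", "of", "numbers"],
]

# sg=0 trains with continuous bag of words, sg=1 with skip-gram.
model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1, epochs=50)

vector = model.wv["poodles"]        # each word now maps to a 50-dimensional vector

# Cosine similarity between word vectors: related words should score closer to 1.
print(model.wv.similarity("poodles", "beagles"))
print(model.wv.similarity("poodles", "math"))
```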

Abstractive Text Summarizer

Combining the power of word embeddings and RNNs or LSTMs, we can transform a sequence of text just like a neural network transforms a vector.

  • To build a text summarizer, we first use word embeddings to map the words of the input sequence to a sequence of vectors.
  • Then, we use an encoder-decoder structure to capture the meaning of the passage: two separate RNNs or LSTMs are trained, one to encode the sequence into a single matrix or vector, and one to decode that matrix or vector into a transformed sequence of words.
  • Lastly, we convert the sequence of vectors output by the decoder back into words using the word embeddings.

To build an abstractive text summarizer, we would give the model labelled examples in which the correct output is a summary of the input.

However, this method generalizes to transforming any sequence of text into another sequence of text. If we wanted to build a translator, for example, we would label each training example with its translated text instead of a summary.
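A rough Keras sketch of this encoder-decoder structure is shown below (the vocabulary size and layer dimensions are hypothetical, and a real summarizer would also need tokenization, padding, and a training loop):

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, hidden_dim = 10000, 128, 256   # hypothetical sizes

# Encoder: embed the input article and compress it into the LSTM's final states.
encoder_inputs = layers.Input(shape=(None,), dtype="int32")
enc_emb = layers.Embedding(vocab_size, embed_dim)(encoder_inputs)
_, state_h, state_c = layers.LSTM(hidden_dim, return_state=True)(enc_emb)

# Decoder: generate the summary one word at a time, starting from the encoder's states.
decoder_inputs = layers.Input(shape=(None,), dtype="int32")
dec_emb = layers.Embedding(vocab_size, embed_dim)(decoder_inputs)
dec_out, _, _ = layers.LSTM(hidden_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])

# Project each decoder step back onto the vocabulary to choose the next summary word.
outputs = layers.Dense(vocab_size, activation="softmax")(dec_out)

model = tf.keras.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

During training, the decoder would be fed the reference summary shifted by one position; for a translator, the same structure would simply be trained on translated sentences instead of summaries.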

Summary

  • A good text summarizer would improve productivity in all fields, and would be able to transform large amounts of text data into something readable by humans.
  • RNNs are similar to normal neural networks, except they reuse their hidden layers, and are given a new part of the input sequence at each time step.
  • LSTMs are special RNNs that are able to store memory for long periods of time by using a memory cell, which can remember or forget information when necessary.
  • Word embeddings capture the general usage of a word based on its context and frequency, allowing us to perform math on words.
  • An abstractive text summarizer would use an encoder and a decoder, surrounded by word embedding layers.
