Text Summarisation with Pointer Generator Networks

Szymon Palucha
Jun 9, 2022


In this post I want to explain the details from “Get To The Point: Summarization with Pointer Generator Networks”. This is an older paper from 2017 but it has some very interesting ideas which caught my attention and I want to take the opportunity to explain them here.

I will assume basic knowledge of simple neural networks and machine learning concepts but I will explain all the necessary background needed to understand this particular paper. The paper will build on the original sequence-to-sequence models with attention (pre-transformer era), and I will be using visualisations from Jay Alammar’s post: Visualizing a Neural Machine Translation Model, which I highly recommend as supplementary reading! I will also use equations from Andrew Ng’s great Sequence Models course, which I also recommend for better understanding.

Introduction to Text Summarisation

Text summarisation is about extracting the most important and relevant information from a piece of text. For instance, summarising a conversation between multiple people, or summarising news articles.

There are two types of summarisation techniques: extractive and abstractive. Extractive summarisation simply copies parts of the original text. Abstractive summarisation generates a new summary, often using words not present in the original text. In this post we will focus on Pointer Generator Networks, an adaptation of the original abstractive text summarisation methods.

Sequence to Sequence Models

The input to a text summarisation model will be a piece of text. The output will be the summary, another piece of text. In order to encode a piece of text you first have to tokenise it. The tokens can be individual words or subwords. For now let's assume each token is a word. We therefore have a sequence of words as both input and output, and these models are called (many-to-many) sequence-to-sequence models.
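As a toy illustration of tokenisation (a deliberately minimal sketch; real models use trained subword tokenisers rather than whitespace splitting):

```python
# Toy word-level tokenisation: split on whitespace and map each word to an
# integer id in a vocabulary. Real systems use trained (sub)word tokenisers.
text = "germany emerge victorious in win against argentina"
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
token_ids = [vocab[word] for word in text.split()]
print(token_ids)  # one integer id per word
```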

Seq2seq models are made up of two components, an encoder and a decoder. The encoder takes all the input words and converts them into a context vector encoding the original text. The decoder takes the context vector and generates the output text: the summary.

Source: Jay Alammar’s “Visualizing a Neural Machine Translation Model” Post. [1]

Before 2017 (when the Transformer architecture was introduced) a Recurrent Neural Network (RNN) was used for both the encoder and decoder. Let’s dive into how they work!

RNNs

First, let’s do a quick recap on a standard neural network:

Let W₁, b₁ and W₂, b₂ be the weights and biases for layer 1 and 2 respectively. Then the hidden layer/state has an activation a₁ and output y^ given by the following equations:

where g₁ is usually a tanh/ReLU/GeLU and g₂ a sigmoid/softmax activation function.

In the standard neural network above you would pass in the whole text to the input layer. RNNs instead process words sequentially one at a time.

Source: here

We tokenise the input text into a sequence of words. We denote the word at time step t (the tth word) as x<t>. At each time step the RNN takes as input x<t> and the previous hidden state. For instance, the equations for the RNN at time step 1 are given by:

The only difference is that we now have an extra term Wₐa<0>, which allows us to use the hidden state of the previous time step. Wₐ is an extra learnable matrix. Note that, since at time step 1 we don't have any previous hidden state, we initialise a<0> as a vector of zeros.

And in general, at time step t we have the following equations:
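To make the update concrete, here is a minimal numpy sketch of one RNN time step (all dimensions and weights below are made up for illustration): the hidden state a<t> mixes the current input x<t> with the previous hidden state a<t-1>, and the output y^<t> is a softmax over the vocabulary.

```python
import numpy as np

def rnn_step(x_t, a_prev, W1, Wa, b1, W2, b2):
    """One RNN time step: new hidden state a<t> and output distribution y^<t>."""
    a_t = np.tanh(W1 @ x_t + Wa @ a_prev + b1)     # hidden state, mixing input and a<t-1>
    logits = W2 @ a_t + b2
    logits = logits - logits.max()                 # for numerical stability
    y_hat = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary
    return a_t, y_hat

# Made-up sizes: 50-dim word embeddings, 64-dim hidden state, 3-word vocabulary.
emb, hid, vocab_size = 50, 64, 3
rng = np.random.default_rng(0)
W1, Wa, b1 = rng.normal(size=(hid, emb)), rng.normal(size=(hid, hid)), np.zeros(hid)
W2, b2 = rng.normal(size=(vocab_size, hid)), np.zeros(vocab_size)

a = np.zeros(hid)                       # a<0> is initialised to zeros
for x_t in rng.normal(size=(3, emb)):   # three dummy word embeddings
    a, y_hat = rnn_step(x_t, a, W1, Wa, b1, W2, b2)
print(y_hat)                            # a probability distribution over the 3-word vocabulary
```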

Why would we use an RNN?

  • The inputs and outputs can be of different lengths, whereas a standard NN needs a fixed input size, so you would have to pad every input to some maximum length.
  • A standard NN doesn't share features learnt across different positions of the text (e.g. “Harry” in position 1 vs “Harry” in position 10 should have a similar meaning).

The intuition is that information about the previous words is retained and gets passed through the network and used later on to make predictions.

[Later it was found that looking at the words that come after the current word is also helpful, and the bidirectional RNN was introduced. Plain RNNs also suffer from forgetting words at the beginning of a long sentence, and GRUs/LSTMs later came into use. Both BiRNNs and LSTMs actually date back to 1997!]

Embeddings

Each word x<t> is represented by an embedding vector. Embedding vectors have the property that words similar in meaning are close to each other in the vector space. For instance, the Word2Vec embedding has the property that,

I.e. the difference between the embedding vectors for queen and king is approximately equal to the difference between the embedding vectors for woman and man. This difference could, for instance, capture the difference in gender between the two pairs.
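As a toy check of this property (the 3-dimensional vectors below are made up purely for illustration; real Word2Vec embeddings have hundreds of dimensions):

```python
import numpy as np

# Made-up 3-dimensional embeddings, chosen only to illustrate the analogy.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.2, 0.8]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# king - man + woman should land close to queen.
target = emb["king"] - emb["man"] + emb["woman"]
closest = max(emb, key=lambda w: cosine(emb[w], target))
print(closest)  # "queen" with these toy vectors
```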

The Decoder

Just like the encoder, the decoder is also an RNN.

In the same way as before at each time step the RNN takes in the previous decoder hidden state a<t-1> and some input x<t>. There are three main differences from the encoder:

  1. a<0> is now the final hidden state from the encoder (known as the context vector) which has all the encoded information about the input text.
  2. The input x<1> is the embedding vector for the <EOS> (end of sentence) token, and x<t> = y^<t-1>, i.e. the output at time step t-1 is fed in as the input for time step t.
  3. The decoder continues generating words until the <EOS> token is generated.

Decoder Outputs

The output of the decoder at each time step is a vector of logits which gets turned into a vector of probabilities via the softmax function. This is our prediction y^<t> and it is a vector of size equal to the vocabulary size. The vocabulary is all the words that the model knows and can possibly predict. Let’s do an example with a vocabulary containing just three words: cat, dog and mouse:

In this example the model predicts “cat”, “dog” and “mouse” with a probability of 0.2, 0.75, 0.05 respectively. This can be viewed as a probability distribution over the vocabulary, Pvocab. The labels are given by the vector y with a 1 in the position corresponding to the target word. Hence, taking the prediction with the highest probability, the model is predicting the word correctly.
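As a minimal sketch of this step (the logits are made up so that the softmax roughly reproduces the probabilities above):

```python
import numpy as np

vocab = ["cat", "dog", "mouse"]
logits = np.array([1.0, 2.3, -0.4])               # made-up raw decoder outputs

probs = np.exp(logits) / np.exp(logits).sum()     # softmax turns logits into probabilities
print(dict(zip(vocab, probs.round(2).tolist())))  # {'cat': 0.2, 'dog': 0.75, 'mouse': 0.05}
```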

The vocabulary tends to contain around 50k words (~30k tokens for BERT), so the actual vectors are a lot larger, as seen below:

Loss function

To define the error of the model during training we use the cross entropy loss. Recall, in multiclass classification with C labels the loss on each training example is defined as:

The loss on each word can be considered as a classification problem with C=vocabulary size classes. For the example in (8) the loss would be

Note that as the probability assigned to the correct label tends to 1, the loss tends to 0, indicating that the model is getting more accurate.

In reality, y is a very sparse vector, all 0s with a single 1 at the position of the correct word in the vocabulary, so the loss reduces to -log(y^) evaluated at that position.

The overall loss for the generated text can be calculated as the average of all the individual word losses.
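A short sketch of the loss calculation (the predicted distributions and target indices below are made up):

```python
import numpy as np

def cross_entropy(y_hat, target_idx):
    """Per-word loss: -log of the probability assigned to the correct word."""
    return -np.log(y_hat[target_idx])

# One prediction per generated word; probabilities and targets are made up.
predictions = [np.array([0.2, 0.75, 0.05]),   # correct word "dog" (index 1)
               np.array([0.6, 0.3, 0.1])]     # correct word "cat" (index 0)
targets = [1, 0]

losses = [cross_entropy(p, t) for p, t in zip(predictions, targets)]
print(np.mean(losses))  # overall loss = average of the per-word losses
```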

Decoding Strategies

One way to decode is to always select the word with the highest probability in the output. But there are other ways too, so let’s have a look!

First we need to formalise what the model is actually computing. Given an input text x and an output text y, the model predicts the probability P(y|x). More precisely, if Tx is the number of words in the input and Ty is the number of words in the output, the model predicts the probability:

The prediction of the first token is given by P(y<1>|x) and the probability of the first two tokens is given by

using the chain rule of probabilities P(A∩B) = P(A|B)P(B) = P(B|A)P(A). For the whole sequence this becomes:

And the model tries to predict the sequence of words y which maximises this probability:

Now let’s look at the different decoding strategies!

Greedy Search

Greedy search simply selects the word with the highest probability as its next word

Source: “Patrick von Platen’s post: How to generate text”. [3]

Starting from the word “The”, the algorithm greedily chooses the next word of highest probability, “nice”, and so on, so that the final generated word sequence is (“The”, “nice”, “woman”) with an overall probability of 0.5 × 0.4 = 0.2.

However, this algorithm misses high-probability words hidden behind a low-probability word: here we could instead have predicted (“The”, “dog”, “has”), with a higher overall probability of 0.4 × 0.9 = 0.36.

Greedy search also causes the model to start repeating itself. (Ever tried to break the next word prediction on your phone?)
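Here is a minimal sketch of greedy decoding over the toy example above. Only the probabilities quoted in the text (0.5 for “nice”, 0.4 for “dog”, 0.4 for “woman”, 0.9 for “has”) are taken from [3]; the remaining words and values are made up to complete the table.

```python
# Toy conditional next-word distributions (partly made up, see above).
next_word_probs = {
    ("The",):        {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"):  {"has": 0.9, "drives": 0.05, "turns": 0.05},
}

def greedy_decode(start=("The",), steps=2):
    """Always pick the single most likely next word."""
    sequence, prob = start, 1.0
    for _ in range(steps):
        options = next_word_probs[sequence]
        word = max(options, key=options.get)
        prob *= options[word]
        sequence = sequence + (word,)
    return sequence, prob

print(greedy_decode())  # (('The', 'nice', 'woman'), 0.2)
```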

Beam Search

Beam search keeps the num_beams most likely hypotheses at each time step and eventually chooses the hypothesis with the overall highest probability. Using an example with num_beams=2:

Source: “Patrick von Platen’s post: How to generate text”. [3]

At time step 1 besides the most likely hypothesis (“The”, “nice”) it also keeps track of the second most likely one (“The”, “dog”). At time step 2 it can then find (“The”, “dog”, “has”) as the highest probability sequence.

Note: unlike breadth-first search or depth-first search, beam search runs much faster, but it is not guaranteed to find the exact maximum of P(y|x).
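A minimal beam search sketch over the same toy table used in the greedy sketch (num_beams=2), scoring each hypothesis by the product of its conditional probabilities, exactly as in the chain-rule factorisation above:

```python
# Same toy next-word distributions as in the greedy search sketch.
next_word_probs = {
    ("The",):        {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"):  {"has": 0.9, "drives": 0.05, "turns": 0.05},
}

def beam_search(start=("The",), num_beams=2, steps=2):
    """Keep the num_beams most probable hypotheses at every step."""
    beams = [(start, 1.0)]  # (sequence, probability of the whole sequence)
    for _ in range(steps):
        candidates = []
        for sequence, prob in beams:
            for word, p in next_word_probs.get(sequence, {}).items():
                # Chain rule: multiply in the conditional probability of the next word.
                candidates.append((sequence + (word,), prob * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

print(beam_search())  # best hypothesis: ('The', 'dog', 'has') with probability ~0.36
```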

Other methods to improve decoding include n-gram blocking (removing options that would create an n-gram that already exists), length normalisation (so that the model does not favour short outputs), and various sampling methods.

Attention Mechanism

The attention mechanism was introduced in 2014 to help improve translations of long sentences. Machine translation models take a sequence of words in one language and output a sequence of words in another language, so it is very similar to summarisation.

Let’s look at an example of translating the French sentence “Je suis étudiant” into English. So far we have seen:

Source: Jay Alammar’s “Visualizing a Neural Machine Translation Model” Post. [1]

In the above model, the context vector turned out to be a bottleneck when decoding long sentences. The attention mechanism was introduced to allow the model to focus on the relevant parts of the input sequence as needed.

Let’s look at what happens at the first decoder step after introducing attention. Now instead of just passing in the last encoder hidden state, we pass all of them. Attention weights, α, are calculated for each encoder hidden state showing the importance of each state and hence the importance of each word (each hidden state is most associated with a certain token in the input). Using t’ to denote the input (French) sentence time step, and t for the output (English) time step, we define

These weights sum to one so they can be considered as probabilities.

The new context vector is calculated by a weighted sum of the hidden states with the attention weights:

In the below visualisation time step 4 is actually the first decoder time step (as there are 3 time steps in the encoder in this example) and hᵢ is an alternative notation to a<t’> for the encoder hidden states (i.e. h1 = a<1'>, h2=a<2'>, h3=a<3'>).

Source: Jay Alammar’s “Visualizing a Neural Machine Translation Model” Post. [1]

Recall the original RNN equations:

For the decoder a<0> is the last hidden state of the encoder and x<1> is the embedding of the <EOS> token. We now additionally use the extra attention context vector (16) and we concatenate it with the current decoder hidden state:

The same process continues for all other time steps in the decoder with the attention weights and hence the context vector being recalculated at each time step. In general, the context vector at time step t is given by

And the attention weights are found as follows

So effectively, to find the attention weights, you pass the encoder hidden states and the current decoder state through another feed-forward neural network, which learns how to calculate the weights.
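As a sketch of this calculation (additive, Bahdanau-style scoring; all sizes and weights below are made up, and the exact scoring function in a given paper may differ):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

def attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Score each encoder hidden state against the current decoder state,
    softmax the scores into weights, and take the weighted sum as the context."""
    scores = np.array([v @ np.tanh(W_dec @ decoder_state + W_enc @ h)
                       for h in encoder_states])
    alphas = softmax(scores)                          # attention weights, sum to 1
    context = (alphas[:, None] * encoder_states).sum(axis=0)
    return context, alphas

# Made-up sizes: 3 input words, 64-dim hidden states, 32-dim attention space.
rng = np.random.default_rng(0)
hid, att = 64, 32
encoder_states = rng.normal(size=(3, hid))            # a<1'>, a<2'>, a<3'>
decoder_state = rng.normal(size=hid)                  # current decoder hidden state
W_dec = rng.normal(size=(att, hid))
W_enc = rng.normal(size=(att, hid))
v = rng.normal(size=att)

context, alphas = attention(decoder_state, encoder_states, W_dec, W_enc, v)
print(alphas, alphas.sum())                           # one weight per input word, summing to 1
```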

The following is a visualisation of the whole decoding process with attention:

Source: Jay Alammar’s “Visualizing a Neural Machine Translation Model” Post. [1]

[Recall that in our notation we would have the encoder hidden states h1 = a<1'>, h2 = a<2'>, h3 = a<3'>, the decoder hidden states h4 = a<1>, h5 = a<2>, and the context vectors C4 = c<1>, C5 = c<2>.]

Attention can be visualised conceptually in the following diagram:

Source: “Neural Machine Translation By Jointly Learning to Align And Translate”, 2014

You can see how the model paid attention correctly when outputting “European Economic Area”. In French, the order of these words is reversed (“zone économique européenne”) compared to English. The other words in the sentence are in a similar order.

So far we have just been talking about text translation models so let’s get to the point and talk about text summarisation specifically. Let’s see how attention is used in summarisation:

Source: “Get To The Point: Summarisation with Pointer-Generator Networks”, 2017 [5]

In the baseline Seq2Seq model with attention the model may attend to relevant words in the source text to generate novel words, e.g. to produce the novel word “beat” in the abstractive summary “Germany beat Argentina 2–0” the model may attend to the words “victorious” and “win” in the source text.

Summarisation with Pointer Generator Networks

Now we are finally ready to discuss the paper!

It was found that the usual sequence to sequence models with attention exhibit certain undesirable behaviours such as inaccurately reproducing factual details, an inability to deal with out-of-vocabulary (OOV) words, and repeating themselves.

The pointer-generator network introduced in this paper facilitates copying words from the source text via pointing which improves handling of OOV words, while retaining the ability to generate words. A coverage vector is used to track and control coverage of the source document which proves effective in eliminating repetition.

The following is an example of the problems just described:

Source: “Get To The Point: Summarisation with Pointer-Generator Networks”, 2017. [5]

Pointer Generator Network

This solves the problem of OOV words.

Note: the model vocabulary typically contains some special tokens; one example is the <EOS> token and another is <UNK>. A plain seq2seq model can only deal with an OOV word by predicting the <UNK> token, e.g. when <UNK> receives the highest probability.

Source: “Get To The Point: Summarisation with Pointer-Generator Networks”, 2017

For each decoder time step a generation probability p_gen ∈ [0,1] is calculated, which weighs the probability of generating words from the vocabulary versus copying words from the source text. The vocabulary distribution and the attention distribution are weighted and summed to obtain the final distribution, from which we make our prediction.

The intuition: when we are decoding and want to predict an OOV word, wouldn't it be nice if we could instead copy a word from the source text? This is exactly what we can do: with probability 1 - p_gen we copy a word, based on the attention distribution. The copy distribution is obtained by summing, for each word, the attention weights over all source positions where that word appears.

We define the extended vocabulary as the union of the vocabulary and all words appearing in the source document. The probability distribution over the extended vocabulary is now:

If w is OOV then Pvocab will be zero, and we can instead copy the word from the source document using the copy distribution. This fixes the problem of having to return <UNK> tokens. Similarly, if w does not appear in the source document, Pcopy will be zero and we generate a word from the vocabulary as normal.

Note that p_gen is calculated at each decoder time step based on the context vector c<t>, the decoder hidden state a<t> and the input x<t> as follows:

where the vectors w_c, w_a, w_x and the scalar b_ptr are learnable parameters and σ is the sigmoid activation function.
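As a sketch of the mixture over the extended vocabulary (the vocabulary, distributions and p_gen value below are all made up; in the real model p_gen comes from the sigmoid expression above):

```python
import numpy as np

def final_distribution(p_vocab, alphas, source_words, vocab, p_gen):
    """Pointer-generator mixture: P(w) = p_gen * P_vocab(w)
    + (1 - p_gen) * (total attention on source positions where w appears)."""
    extended = list(vocab) + [w for w in source_words if w not in vocab]
    probs = {w: 0.0 for w in extended}
    for w, p in zip(vocab, p_vocab):
        probs[w] += p_gen * p                     # generation part
    for w, a in zip(source_words, alphas):
        probs[w] += (1.0 - p_gen) * a             # copy part, covers OOV source words
    return probs

# Made-up example: "2-0" is out of vocabulary but appears in the source text.
vocab = ["germany", "beat", "argentina", "<UNK>"]
p_vocab = np.array([0.5, 0.3, 0.15, 0.05])        # made-up vocabulary distribution
source_words = ["germany", "win", "2-0", "against", "argentina"]
alphas = np.array([0.1, 0.2, 0.5, 0.1, 0.1])      # made-up attention weights
p_gen = 0.4                                       # made-up; normally recomputed per time step

probs = final_distribution(p_vocab, alphas, source_words, vocab, p_gen)
print(max(probs, key=probs.get))                  # "2-0" can now be predicted directly
```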

Coverage Mechanism

This solves the problem of repeating words in the summary. We now create a coverage vector to keep track of which words we have already attended to. The idea is to use this to penalise the model so that it doesn't attend to the same words in the source text over and over again. We define the coverage vector as the sum of the attention distributions over all previous decoder time steps:

Intuitively, d<t> is an (unnormalised) distribution over the source document words that represents the degree of coverage those words have received from the attention mechanism so far. Note that d<1> is a zero vector, because on the first time step none of the source document has been covered.

We then define an extra coverage loss which gets added to the loss function:

Intuition: if the coverage vector d shows that we have already attended to a particular word, then α needs to be small to minimise the loss, and a small α means we don't re-attend to that word. On the other hand, if d is low then we are safe to have a high α and attend to this word, as the minimum will still be d and the loss will be low.

Note that the coverage vector is also used to calculate the attention weights, by adding an extra term to equation (21):
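A small sketch of the coverage bookkeeping (the attention distributions below are made up): re-attending to a source position that is already covered produces a large coverage loss. In the full model the coverage vector would also be fed into the attention scoring at each step, as the extra term just mentioned.

```python
import numpy as np

# Made-up attention distributions for three decoder steps over a 4-word source text.
attention_per_step = [
    np.array([0.7, 0.1, 0.1, 0.1]),
    np.array([0.6, 0.2, 0.1, 0.1]),   # attends to word 0 again -> gets penalised
    np.array([0.1, 0.1, 0.7, 0.1]),
]

coverage = np.zeros(4)                 # d<1> = 0: nothing covered yet
for t, alphas in enumerate(attention_per_step, start=1):
    cov_loss = np.minimum(alphas, coverage).sum()   # sum_i min(alpha_i<t>, d_i<t>)
    print(f"step {t}: coverage loss = {cov_loss:.2f}")
    coverage += alphas                 # d<t+1> = d<t> + alpha<t>
```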

Results

The results were promising:

  • The model became better at not repeating words:
Source: “Get To The Point: Summarisation with Pointer-Generator Networks”, 2017 [5]
  • The ROUGE score increased on the CNN/Daily Mail dataset:
Source: “Get To The Point: Summarisation with Pointer-Generator Networks”, 2017. [5]

However, they did find that the model became more extractive as a result, and was not as good at generating abstractive summaries:

Source: “Get To The Point: Summarisation with Pointer-Generator Networks”, 2017. [5]

Further Work

On the CNN/Daily Mail dataset we can see on Papers with Code that there were some further improvements using pointer networks:

Source: Papers with Code.

However, in general and across different datasets, Transformers are leading the field, with models like BART and PEGASUS regularly achieving state-of-the-art results, but this is a topic for another post!

Source: Papers with Code, shows transformer models leading the field.

References/ Further Reading:

  1. Attention model: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/.
  2. Embeddings with Word2Vec: https://jalammar.github.io/illustrated-word2vec/.
  3. Generating sequences in the decoder: https://huggingface.co/blog/how-to-generate
  4. Sequence models course: https://www.coursera.org/learn/nlp-sequence-models?specialization=deep-learning
  5. The original Summarisation with Pointer Generator Networks paper: https://arxiv.org/pdf/1704.04368.pdf.

Thanks for reading!

Notes: I used this mp4 to gif converter to insert gifs. I did the equations using Overleaf Latex editor and pasted them as images.
