How to build a text summarizer using ELMo and spaCy

David Ardagh
Published in Auquan · 4 min read · Nov 27, 2019

This article refers to a problem in Auquan’s UK Data Science Varsity and Asia Invitational competition. You can view the problems and code along here: links.auquan.com/DSV

To see this article with code visit: www.auquan.com/community/tutorials/…

Introduction

There are several approaches to building text summarizers. Broadly, they can be split into two groups defined by the type of output they produce:

  1. Extractive, where important sentences are selected from the input text to form a summary. Most summarization approaches today are extractive in nature.
  2. Abstractive, where the model forms its own phrases and sentences to offer a more coherent summary, like what a human would generate. This approach is definitely more appealing, but much more difficult than extractive summarization.

From a practical point of view, abstractive summaries require an extra leap of faith for people to adopt. Not only do readers have to accept that the summarizer has identified the key sentences, but also that it has then created a summary that conveys that information without mistakes. When high-value decisions are made on the back of this summary, it makes sense to minimise this risk.

For this reason, we’ll be looking at an extractive approach based on this paper from the University of Texas. Our pipeline is going to consist of the following steps:

  1. Preparation: install packages and load the data
  2. Encoding: turn each sentence into an ELMo embedding
  3. Clustering: group similar sentence embeddings together
  4. Summarising: pick one representative sentence per cluster

Let’s look at each of these steps.

We use similar techniques to this for our own tools

Step one: Preparation

These initial bits just download the packages we’re going to need and get the data for our problem (if you’re not doing one of these competitions, you might want to skip this step).

Firstly, we’re going to turn on auto-reload for this notebook so that any changes to the external libraries get automatically included here.

%load_ext autoreload
%autoreload 2

Next, here is a list of the non-standard packages we’re using:

Uncomment any of these to install the packages you don’t have already:
# !pip install tensorflow==1.14
# !pip install tensorflow_hub
# !pip install -U wget
# !pip install -U spacy

spaCy is an industrial-grade NLP library; we’re going to use one of its pre-trained models to help split our sample text into sentences. We’re using the English core model trained on web text, medium size (en_core_web_md), so the code is pretty self-explanatory.
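As a minimal sketch, assuming you’ve already downloaded the model with python -m spacy download en_core_web_md:

import spacy

# Load spaCy's medium English web model; we only need its
# sentence segmentation for this pipeline
nlp = spacy.load('en_core_web_md')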

And the rest:
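The exact import block isn’t reproduced here, but based on what the rest of the pipeline uses, it would look something like this:

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub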

Finally, we’re ready to import the data and get started. We’ve just copied all of this straight out of the template file for the problem. If you want to understand what each part is specifically doing, check out the getting started section of the problem page.

The last bit of preparation is downloading the data and reading it into the summarizer:
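The real download code comes from the competition template, so the snippet below is only an illustrative sketch: the URL and filename are placeholders, not the actual competition endpoints.

import wget

# Placeholder URL - substitute the one from the problem template
# wget.download('https://example.com/dsv_documents.zip')

# Read one document in as plain text (placeholder filename)
with open('sample_document.txt', 'r', encoding='utf-8') as f:
    text = f.read()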

Step two: Encoding the data

As you might be aware from the title, we are using ELMo as our encoder. Developed in 2018 by AllenNLP, ELMo goes beyond traditional embedding techniques: it uses a deep, bi-directional LSTM model to create word representations.

Rather than a dictionary of words and their corresponding vectors, ELMo analyses words within the context in which they are used. This is important because a word like “sick” may have entirely opposite meanings depending on the context. It is also character-based, allowing the model to form representations of out-of-vocabulary words.

This means that the way ELMo is used is quite different from word2vec or fastText. Rather than having a dictionary ‘look-up’ of words and their corresponding vectors, ELMo instead creates vectors on-the-fly by passing text through the deep learning model.

We’ll be using ELMo to create embeddings for the text.

Let’s start by working with just one document, parsing it into sentences using spaCy.
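Something along these lines, assuming text holds the document we read in during preparation:

# Run the spaCy pipeline and pull out the sentence spans
doc = nlp(text)
sentences = [sent.text.strip() for sent in doc.sents]
print(f'{len(sentences)} sentences found')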

We’re going to use TensorFlow Hub, as it’s slightly easier to do NLP projects with. The ELMo model is fully pre-trained and includes all the layers, vectors and weights that we will need.
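With TF1 and tensorflow_hub, loading it is a one-liner. The module URL below is the standard ELMo release on TF Hub:

# Load the pre-trained ELMo module (weights included)
elmo = hub.Module('https://tfhub.dev/google/elmo/2', trainable=False)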

Now we need to create the embeddings for our text.
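Here we ask ELMo for its ‘default’ output, which is a fixed mean-pooling of the contextual word vectors, i.e. one 1024-dimensional vector per sentence:

# Build the graph node that maps each sentence string to a
# single 1024-dimensional embedding
embeddings_op = elmo(sentences, signature='default', as_dict=True)['default']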

Then we start up TensorFlow to do its magic.
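In TF1 the graph above is lazy, so nothing has been computed yet; a session run produces the actual numbers:

with tf.Session() as sess:
    # Initialise the module's variables and lookup tables
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    embeddings = sess.run(embeddings_op)

print(embeddings.shape)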

We should see a shape of X by 1024, with X representing the number of sentences in the text and 1024 being the number of dimensions ELMo creates for each sentence.

Step three: Clustering

At this point we have all our sentences represented as embeddings. Clustering will group sentences that are similar, and from those groups we will be able to create our final summary.

The number of clusters we create here will be the number of sentences we’ll have in our summary. Currently, the approach here is pretty basic and that will limit the quality of our final summary. It would be wise to experiment with different clustering approaches and numbers of clusters to see how that affects your final results.

For our simple approach, we’ve just used a K-means clustering algorithm with 10 clusters, giving a 10-sentence summary. Do you think this would be good for small documents? What about large ones?
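A minimal version of that step, fitting K-means on the ELMo embeddings (with the cluster count capped so very short documents don’t break it):

from sklearn.cluster import KMeans

# 10 clusters -> 10 summary sentences; fewer for short documents
n_clusters = min(10, len(sentences))
kmeans = KMeans(n_clusters=n_clusters, random_state=0)
kmeans.fit(embeddings)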

Step four: Summarising

The final step is to take these clusters and create our summary. To do this we’re going to identify the center of each cluster and then find the sentence embedding that is closest to that point. These sentences are then concatenated and returned as the text summary.
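One way to implement this with scikit-learn is to find the embedding nearest each cluster centre and then re-order the chosen sentences by their position in the original document:

from sklearn.metrics import pairwise_distances_argmin_min

# Index of the sentence whose embedding is closest to each centroid
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)

# Restore original document order so the summary reads naturally
summary = ' '.join(sentences[i] for i in sorted(set(closest)))
print(summary)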

One thing to bear in mind here: if you have some very unusual sentences, e.g. ones containing an HTML link, they are likely to end up in their own cluster and get pulled into the summary, even though they are not actually useful there.

This problem can be found on Auquan’s Quant Quest: links.auquan.com/DSV

There you go.

Originally published at https://auquan.com.

David Ardagh
Cornish born and working in a Fintech in London (how original). I try to make big things simple.