Convolutions in NLP — Paris NLP Meetup


CBTW
L’Actualité Tech — Blog CBTW
8 min read · May 25, 2018


NLP Meetup — Photo by Alessandro La Becca

What is it about?

Every couple of months, the Paris NLP (Natural Language Processing) meetup gathers students, researchers, engineers and others around « […] NLP and text mining techniques, research, and applications, […] both traditional and modern NLP approaches, from hand-designed rules to machine learning & deep learning ».

Mercari Challenge

In a previous blog post, Jean-Baptiste talked about deep learning applied to NLP in order to tackle regression problems. As you can see in the figure below, this problem is far from trivial: similar descriptions can correspond to very different prices.

NLP Meetup

As Jean-Baptiste explained in his previous blog posts, different models were tried to make the prediction:

NLP Meetup

Temporal vs. Spatial Information

Unfortunately, most of a product's price relies on just a few keywords, each of which indicates what could be seen as a minimal price per feature:

NLP Meetup

Since item descriptions behave like unordered lists of features, no major improvement could be expected from deeper models, hence the following results:

NLP Meetup

These results can be explained by the lack of information carried by temporality. When the input data behaves like an unordered list of tokens, a deep model that is able to grasp temporal information will struggle to learn anything more than individual feature weights (i.e. a weight for each token). This kind of representation can be called spatial, as opposed to temporal.

Those results were obtained on the Mercari challenge and were further validated on leboncoin. This time, the target was the price of various vehicles and the only input was the description of the item — just as with the Mercari challenge.

NLP Meetup

Again, the price of a car can be learned as a sum of various features (car brand, car options…). And again, no improvement was obtained by using deep models.

Since we were dealing with unordered sequences of items (brands, options, and keyword-like features), little or no information was added by temporality; the deep models basically learned an individual weight for each token — like any TFIDF-based model would do.
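For reference, this kind of TFIDF baseline fits in a few lines of scikit-learn. The snippet below is only an illustration of the idea, not the actual Mercari or leboncoin pipeline; the descriptions, prices and hyper-parameters are made up. The point is simply: an unordered token representation feeding a linear model, i.e. one learnt weight per token.

```python
# Illustrative TFIDF baseline: unordered token representation + linear model,
# i.e. one learnt weight per token. Data and hyper-parameters are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

descriptions = ["leather seats gps low mileage", "small scratch on the left door"]
prices = [15300.0, 7900.0]  # hypothetical targets

baseline = make_pipeline(
    TfidfVectorizer(),   # bag of tokens: word order is discarded
    Ridge(alpha=1.0),    # linear regression: one weight per token
)
baseline.fit(descriptions, prices)
print(baseline.predict(["low mileage gps leather seats"]))
```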

When temporality matters

Given the previous results, we asked ourselves the following questions: can we highlight specific cases where temporality brings more information? And what about regression?

These two questions were answered by applying deep learning methods to a new problem: predicting the usefulness of an Amazon review, expressed as a vote ratio between 0 (useless) and 1 (helpful).

NLP Meetup

Unlike in the previous examples, a spatial representation of this text contains less information than its temporal counterpart; or at least that is the assumption we made. The example below shows the information loss caused by unordered representations:

NLP Meetup

*BOW: Bag of Words

In the figure above, although the first sentence can still be understood, we can see the limits of unordered representations: with long sentences, whole texts, or whenever the information lies in turns of phrase, temporality — and thus deep models, as opposed to TFIDF-based models — will prove useful.
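To make this loss of information concrete, here is a toy illustration (the sentences are made up): once word order is discarded, two sentences with opposite meanings end up with exactly the same bag-of-words vector.

```python
# Two order-sensitive sentences collapse to the same bag-of-words vector.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the battery is great but the screen is disappointing",
    "the screen is great but the battery is disappointing",
]
bow = CountVectorizer().fit_transform(sentences).toarray()
print((bow[0] == bow[1]).all())  # True: the spatial representation cannot tell them apart
```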

Convolutions in NLP

As some of you may already know, convolutional models used in image classification exhibit meaningful behavior: classification is performed by comparing patches of the input image to simple shapes — called kernels or filters — through convolutions. Those simple shapes can be further composed, with stacked convolutions, to extract more abstract figures and classify the image. We will build the same kind of model to tackle our regression problem on Amazon reviews (with a single convolutional layer — you will see why in a minute).

At this point, we assumed that deep models would perform better than TFIDF-based ones because of the temporal nature of the information. We also wanted to build a deep model that meets the following specifications:

  • The overall architecture must be simple
  • The model will use convolutions; we want those convolutions to be as meaningful as they are when used in image classification
  • We must be able to accurately explain the model's predictions

We then came up with the following architecture:

NLP Meetup

Let's decompose this figure step by step, from left to right:

  • Each word is represented by an embedding (in two dimensions in our example)
  • The whole review is a sequence of embeddings
  • Different convolution filters of width 5 (or 3 in some implementations) are applied to the sequence, producing as many activation signals as there are filters (typically 1024)
  • A high activation of a filter at some point in the sentence means it has detected a precise, ordered sequence of embeddings
  • For each convolution kernel, only the maximum activation of its signal is kept
  • Those maxima are then used to perform the regression itself, through a simple weighted sum and a sigmoid activation

This means that the regression is performed by measuring how strongly our sequence matches some learnt embedding patterns.
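As a rough sketch, this architecture could be written as follows in Keras. Only the 1024 filters of width 5, the global max pooling and the final weighted sum with a sigmoid come from the description above; the vocabulary size, the embedding dimension and the loss are assumptions.

```python
# Sketch of the architecture described above; hyper-parameters marked as
# assumptions were not specified in the talk.
from tensorflow.keras import layers, models

vocab_size = 50_000   # assumption
embed_dim = 128       # assumption (2 in the toy figure, larger in practice)

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),     # each word -> an embedding vector
    layers.Conv1D(filters=1024, kernel_size=5),  # 1024 filters of width 5 slide over the sequence
    layers.GlobalMaxPooling1D(),                 # keep only the maximum activation of each filter
    layers.Dense(1, activation="sigmoid"),       # weighted sum + sigmoid -> usefulness in [0, 1]
])
model.compile(optimizer="adam", loss="mse")      # regression on the vote ratio (assumed loss)
```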

Also bear in mind that we can translate the learnt sequences of embeddings back to actual words by assigning each of them the closest word of our vocabulary in the embedding space.

Sadly, again, no improvement was made compared to simple TFIDF-based models:

NLP Meetup

Convolution Interpretability

Nevertheless, we now have at our disposal a model that can explain its predictions not merely with token weights but with sequence matches. Moreover, each kernel can be labelled positive or negative depending on the weight associated with it in the last layer of our model. We can now translate each embedding learnt by the most positive and the most negative filters to obtain the following figure:

NLP Meetup

Each column represents a column of the filter (i.e. a learnt embedding), and each row shows respectively the first, second and third closest words to that learnt embedding.

This means that if we take one word per column, we produce a sequence of words that yields a high activation with the positive kernel.
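The translation itself is just a nearest-neighbour lookup in the embedding space. Below is a simplified sketch — the function name and the weight shapes assume the hypothetical Keras model above — that looks up, for each of the five columns of a kernel, the closest vocabulary words.

```python
# For each column of a learnt kernel, find the closest words in the embedding space.
# Shapes assume the Keras sketch above: Conv1D weights are (width, embed_dim, n_filters).
import numpy as np

def kernel_to_words(kernel, embeddings, index_to_word, top_k=3):
    # kernel: (width, embed_dim) weights of one filter
    # embeddings: (vocab_size, embed_dim) embedding matrix of the model
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    columns = []
    for col in kernel:
        sims = emb_norm @ (col / np.linalg.norm(col))   # cosine similarity with every word
        closest = np.argsort(-sims)[:top_k]             # top_k closest words for this column
        columns.append([index_to_word[i] for i in closest])
    return columns
```

With the sketch above, the kernel of filter k would be model.layers[1].get_weights()[0][:, :, k], and the sign of the corresponding weight in the final Dense layer is what makes a kernel "positive" or "negative".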

Similarly, the same output can be produced but this time with the negative kernel:

NLP Meetup

To sum up, we are no longer learning individual weights for individual words, as we did with spatial representations: we now learn typical sequences — or patterns — that match certain sequences of embeddings.

Thanks to this architecture, we can precisely highlight which sequence of the input text triggered either the positive or the negative kernel (of course, all the kernels are used to perform the regression):

NLP Meetup

Our model gave a really low score to the sentence above (around 0.11) because the maximum activation of the negative kernel was so high in comparison to the positive one.
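Locating the triggering sequence boils down to taking the argmax of a filter's activation signal and reading the corresponding window of words. A minimal sketch, again assuming the shapes of the hypothetical model above:

```python
# Find which window of a review most strongly triggers a given filter.
import numpy as np

def strongest_window(token_ids, embeddings, kernel, index_to_word, width=5):
    # token_ids: word indices of one review; kernel: (width, embed_dim) filter weights
    seq = embeddings[token_ids]                          # (seq_len, embed_dim)
    activations = [np.sum(seq[i:i + width] * kernel)     # convolution = sliding dot product
                   for i in range(len(token_ids) - width + 1)]
    start = int(np.argmax(activations))                  # position of the maximum activation
    words = [index_to_word[t] for t in token_ids[start:start + width]]
    return words, activations[start]
```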

The same goes with the following review:

NLP Meetup

The model classified it as really useful (around 0.89). Notice how overlapping sequences that differ by only one word can trigger very different kernels: this tends to prove that no specific vocabulary was learnt, only embedding patterns.

Final word and a last experiment

Although our model didn’t perform as good as we first expected, we managed to produce an interpretable deep model. Each prediction can be explained as tagging different chunk of the review with different filters.

Another benefit of this model is that the kernels can be used in different ways once they are learnt:

  • They can be used to explain a specific prediction
  • They can be used to understand the dataset
  • They can be used in transfer learning tasks to shape a new embedding set according to some filters
  • And… to perform backpropagation all the way back to the inputs (the word embeddings) given a fixed label!

Yes, we can apply NLP deep-dream methods to our inputs: the idea of the second part of the meetup was to mimic as many characteristics of 2D convolutional models as possible. We managed to make our kernels as meaningful as they are in image classifiers. Another characteristic of those 2D models (although it is not specific to image classifiers) is that they can produce highly oneiric images by backpropagating all the way to the input image. Here is an example of a famous painting that was deep-dreamed:

NLP Meetup

This time, the output of the model and its parameters are fixed; only the input can be learnt. The results are often… surprising, to say the least.

So we did the same with our inputs: we fixed the parameters and the expected output, and only the inputs could be modified. This produced a series of sentences where each word embedding was transformed — moved around the embedding space — to get as close as possible to an optimal solution. Of course, this backpropagation only moved the embeddings closer to the filters, to maximise the sequence of dot products, but it still gave funny results. The following sentence will be « morphed » into a negative one by learning new input embeddings (the weights of the model are frozen and the expected prediction is 0):

NLP Meetup
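A rough sketch of this last experiment, assuming the hypothetical Keras model above: the weights are frozen, the input embeddings become the trainable variable, and plain gradient descent pushes the prediction toward 0.

```python
# "Deep-dream" the inputs: freeze the model, make the input embeddings the variable,
# and nudge them until the predicted usefulness goes to 0.
import tensorflow as tf

def dream_to_negative(model, initial_embeddings, steps=100, lr=0.1):
    # initial_embeddings: (1, seq_len, embed_dim) embeddings of the original sentence
    x = tf.Variable(initial_embeddings, dtype=tf.float32)
    head = tf.keras.Sequential(model.layers[1:])  # Conv1D -> GlobalMaxPooling1D -> Dense
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(head(x)))  # target prediction: 0
        grad = tape.gradient(loss, x)
        x.assign_sub(lr * grad)   # only the inputs move; the model weights stay frozen
    return x.numpy()  # map each new embedding back to its closest vocabulary word to read the "dream"
```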

The author:

Daoud
Data Scientist & NLP specialist
