Convolutions in NLP — Paris NLP Meetup


CBTW
L’Actualité Tech — Blog CBTW
8 min read · May 25, 2018


NLP Meetup — Photo by Alessandro La Becca

What is it about?

Every couple of months, the Paris NLP (Natural Language Processing) meetup gathers students, researchers, engineers and others around « […] NLP and text mining techniques, research, and applications, […] both traditional and modern NLP approaches, from hand-designed rules to machine learning & deep learning ».

Mercari Challenge

In a previous blog post, Jean-Baptiste talked about deep learning applied to NLP in order to tackle regression problems. As you can see in the figure below, this problem is far from trivial: similar descriptions can correspond to very different prices.

NLP Meetup

As Jean-Baptiste explained in his previous blog posts, different models were tried to make the prediction:

NLP Meetup

Temporal vs. Spatial Information

Unfortunately, most of a product's price relies on just a few keywords, each of which indicates what could be seen as a minimal price per feature:

NLP Meetup

Since item descriptions behave like unordered lists of features, no major improvement could be expected from deeper models, hence the following results:

NLP Meetup

These results can be explained by the lack of information carried by temporality. When the input data behaves like an unordered list of tokens, a deep model that is able to grasp temporal information will struggle to learn anything more than individual feature weights (i.e. a weight for each token). This kind of representation can be called spatial, as opposed to temporal.

Those results were obtained on the Mercari challenge and were further validated on leboncoin. This time, the target was the price of various vehicles and the only input was the description of the item — just as with the Mercari challenge.

NLP Meetup

Again, the price of a car can be learned as a sum of various features (car brand, car options…). And again, no improvement was obtained by using deep models.

Since we were dealing with unordered sequences of items (brands, options, and keyword-like features), little or no information was added by temporality; the deep models basically learned an individual weight for each token — like any TFIDF-based model would do.
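For reference, this kind of TFIDF baseline fits in a few lines of scikit-learn. The snippet below is only an illustration of the idea, not the actual Mercari or leboncoin pipeline; the descriptions, prices and hyper-parameters are made up. The point is simply: an unordered token representation feeding a linear model, i.e. one learnt weight per token.

```python
# Illustrative TFIDF baseline: unordered token representation + linear model,
# i.e. one learnt weight per token. Data and hyper-parameters are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

descriptions = ["leather seats gps low mileage", "small scratch on the left door"]
prices = [15300.0, 7900.0]  # hypothetical targets

baseline = make_pipeline(
    TfidfVectorizer(),   # bag of tokens: word order is discarded
    Ridge(alpha=1.0),    # linear regression: one weight per token
)
baseline.fit(descriptions, prices)
print(baseline.predict(["low mileage gps leather seats"]))
```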

When temporality matters

Given the previous results, we asked ourselves the following questions: can we highlight specific cases where temporality brings more information? And what about regression?

These two questions were answered by applying deep learning methods to a new problem: predicting the usefulness of an Amazon review, expressed as a vote ratio between 0 (useless) and 1 (helpful).

NLP Meetup

Unlike in the previous examples, a spatial representation of this text contains less information than its temporal counterpart; or at least that is the assumption we made. The example below shows the information loss caused by unordered representations:

NLP Meetup

*BOW: Bag of Words

In the figure above, although the first sentence can still be understood, we can see the limits of unordered representations: with long sentences, whole texts, or whenever the information lies in turns of phrase, temporality — and thus deep models, as opposed to TFIDF-based models — will prove useful.
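To make this loss of information concrete, here is a toy illustration (the sentences are made up): once word order is discarded, two sentences with opposite meanings end up with exactly the same bag-of-words vector.

```python
# Two order-sensitive sentences collapse to the same bag-of-words vector.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the battery is great but the screen is disappointing",
    "the screen is great but the battery is disappointing",
]
bow = CountVectorizer().fit_transform(sentences).toarray()
print((bow[0] == bow[1]).all())  # True: the spatial representation cannot tell them apart
```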

Convolutions in NLP

As some of you may already know, convolutional models used in image classification exhibit meaningful behavior: classification is performed by comparing patches of the input image to simple shapes — called kernels or filters — through convolutions. Those simple shapes can be further composed, with stacked convolutions, to extract more abstract figures and classify the image. We will build the same kind of model to tackle our regression problem on Amazon reviews (with a single convolutional layer — you will see why in a minute).

At this point, we assumed that deep models would perform better than TFIDF-based ones because of the temporal nature of the information. We also wanted to build a deep model that meets the following specifications:

  • The overall architecture must be simple
  • The model will use convolutions; we want those convolutions to be as meaningful as they are when used in image classification
  • We must be able to accurately explain the model's predictions

We then came up with the following architecture:

NLP Meetup

Let's decompose this figure step by step, from left to right:

  • Each word is represented by an embedding (in two dimensions in our example)
  • The whole review is a sequence of embeddings
  • Different convolution filters of width 5 (or 3 in some implementations) are applied to the sequence, producing as many activation signals as there are filters (typically 1024)
  • A high activation of a filter at some point in the sentence means it has detected a precise, ordered sequence of embeddings
  • For each convolution kernel, only the maximum activation of its signal is kept
  • Those maxima are then used to perform the regression itself, through a simple weighted sum and a sigmoid activation

This means that the regression is performed by measuring how strongly our sequence matches some learnt embedding patterns.
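As a rough sketch, this architecture could be written as follows in Keras. Only the 1024 filters of width 5, the global max pooling and the final weighted sum with a sigmoid come from the description above; the vocabulary size, the embedding dimension and the loss are assumptions.

```python
# Sketch of the architecture described above; hyper-parameters marked as
# assumptions were not specified in the talk.
from tensorflow.keras import layers, models

vocab_size = 50_000   # assumption
embed_dim = 128       # assumption (2 in the toy figure, larger in practice)

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),     # each word -> an embedding vector
    layers.Conv1D(filters=1024, kernel_size=5),  # 1024 filters of width 5 slide over the sequence
    layers.GlobalMaxPooling1D(),                 # keep only the maximum activation of each filter
    layers.Dense(1, activation="sigmoid"),       # weighted sum + sigmoid -> usefulness in [0, 1]
])
model.compile(optimizer="adam", loss="mse")      # regression on the vote ratio (assumed loss)
```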

Also bear in mind that we can translate the learnt sequences of embeddings back to actual words by assigning each of them the closest word of our vocabulary in the embedding space.

Sadly, again, no improvement was made compared to simple TFIDF-based models:

NLP Meetup

Convolution Interpretability

Nevertheless, we now have at our disposal a model that can explain its predictions not merely with token weights but with sequence matches. Moreover, each kernel can be labelled positive or negative depending on the weight associated with it in the last layer of our model. We can now translate each embedding learnt by the most positive and the most negative filters to obtain the following figure:

NLP Meetup

Each column represents a column of the filter (i.e. a learnt embedding), and each row shows respectively the first, second and third closest words to that learnt embedding.

This means that if we take one word per column, we produce a sequence of words that yields a high activation with the positive kernel.
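The translation itself is just a nearest-neighbour lookup in the embedding space. Below is a simplified sketch — the function name and the weight shapes assume the hypothetical Keras model above — that looks up, for each of the five columns of a kernel, the closest vocabulary words.

```python
# For each column of a learnt kernel, find the closest words in the embedding space.
# Shapes assume the Keras sketch above: Conv1D weights are (width, embed_dim, n_filters).
import numpy as np

def kernel_to_words(kernel, embeddings, index_to_word, top_k=3):
    # kernel: (width, embed_dim) weights of one filter
    # embeddings: (vocab_size, embed_dim) embedding matrix of the model
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    columns = []
    for col in kernel:
        sims = emb_norm @ (col / np.linalg.norm(col))   # cosine similarity with every word
        closest = np.argsort(-sims)[:top_k]             # top_k closest words for this column
        columns.append([index_to_word[i] for i in closest])
    return columns
```

With the sketch above, the kernel of filter k would be model.layers[1].get_weights()[0][:, :, k], and the sign of the corresponding weight in the final Dense layer is what makes a kernel "positive" or "negative".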

Similarly, the same output can be produced but this time with the negative kernel:

NLP Meetup

To sum up, we are no longer learning individual weights for individual words, as we did with spatial representations: we now learn typical sequences — or patterns — that match certain sequences of embeddings.

Thanks to this architecture, we can precisely highlight which sequence of the input text triggered either the positive or the negative kernel (of course, all the kernels are used to perform the regression):

NLP Meetup

Our model gave a really low score to the sentence above (around 0.11) because the maximum activation of the negative kernel was so high in comparison to the positive one.
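Locating the triggering sequence boils down to taking the argmax of a filter's activation signal and reading the corresponding window of words. A minimal sketch, again assuming the shapes of the hypothetical model above:

```python
# Find which window of a review most strongly triggers a given filter.
import numpy as np

def strongest_window(token_ids, embeddings, kernel, index_to_word, width=5):
    # token_ids: word indices of one review; kernel: (width, embed_dim) filter weights
    seq = embeddings[token_ids]                          # (seq_len, embed_dim)
    activations = [np.sum(seq[i:i + width] * kernel)     # convolution = sliding dot product
                   for i in range(len(token_ids) - width + 1)]
    start = int(np.argmax(activations))                  # position of the maximum activation
    words = [index_to_word[t] for t in token_ids[start:start + width]]
    return words, activations[start]
```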

The same goes with the following review:

NLP Meetup

The model classified it as really useful (around 0.89). Notice how overlapping sequences that differ by only one word can trigger very different kernels: this tends to prove that no specific vocabulary was learnt, only embedding patterns.

Final word and a last experiment

Although our model didn’t perform as good as we first expected, we managed to produce an interpretable deep model. Each prediction can be explained as tagging different chunk of the review with different filters.

Another benefit of this model is that the kernels can be used in different ways once they are learnt:

  • They can be used to explain a specific prediction
  • They can be used to understand the dataset
  • They can be used in transfer learning tasks to shape a new embedding set according to some filters
  • And… to perform backpropagation all the way back to the inputs (the word embeddings) given a fixed label!

Yes, we can apply NLP deep-dream methods to our inputs: the idea of the second part of the meetup was to mimic as many characteristics of 2D convolutional models as possible. We managed to make our kernels as meaningful as they are in image classifiers. Another characteristic of those 2D models (although it is not specific to image classifiers) is that they can produce highly oneiric images by backpropagating all the way to the input image. Here is an example of a famous painting that was deep-dreamed:

NLP Meetup

This time, the output of the model and its parameters are fixed; only the input can be learnt. The results are often… surprising, to say the least.

So we did the same with our inputs: we fixed the parameters and the expected output, and only the inputs could be modified. This produced a series of sentences where each word embedding was transformed — moved around the embedding space — to get as close as possible to an optimal solution. Of course, this backpropagation only moved the embeddings closer to the filters, to maximise the sequence of dot products, but it still gave funny results. The following sentence will be « morphed » into a negative one by learning new input embeddings (the weights of the model are frozen and the expected prediction is 0):

NLP Meetup
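A rough sketch of this last experiment, assuming the hypothetical Keras model above: the weights are frozen, the input embeddings become the trainable variable, and plain gradient descent pushes the prediction toward 0.

```python
# "Deep-dream" the inputs: freeze the model, make the input embeddings the variable,
# and nudge them until the predicted usefulness goes to 0.
import tensorflow as tf

def dream_to_negative(model, initial_embeddings, steps=100, lr=0.1):
    # initial_embeddings: (1, seq_len, embed_dim) embeddings of the original sentence
    x = tf.Variable(initial_embeddings, dtype=tf.float32)
    head = tf.keras.Sequential(model.layers[1:])  # Conv1D -> GlobalMaxPooling1D -> Dense
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(head(x)))  # target prediction: 0
        grad = tape.gradient(loss, x)
        x.assign_sub(lr * grad)   # only the inputs move; the model weights stay frozen
    return x.numpy()  # map each new embedding back to its closest vocabulary word to read the "dream"
```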

The author:

Daoud
Data Scientist & NLP specialist
