Deep Learning for Natural Language Processing — Part III

Wilder Rodrigues
Applied Artificial Intelligence
Jan 17, 2018

It’s been a month since I wrote the first part of this series. There, I shared the bit I know about word vector representations, some related techniques and how to work with word2vec to analyse word similarities. In the second part, published one week after the first, I introduced some Deep Learning concepts, neural network architectures and details concerning regularisation, optimisation and activation functions. Before writing this third part, however, I took a detour and wrote about how and where to get our models running, with some Terraform scripts to automate the provisioning of AWS GPU instances. That appendix, as I consider it, can be found here.

But today it’s all about what’s next on the list: Convolutional Networks. We have already seen that the one-hot word representations used in traditional Machine Learning cannot match the power of vectorisation. We have also seen that adding an Embedding layer to our deep network got us to 93%, measured by the Area Under the ROC Curve. But what would CNNs bring to the table? In the coming sections, we will explore CNNs and their layers, and see what they can do for the same problem we worked on in Part II of this series.

Convolutional [Neural] Networks

When professor Yann LeCun, inspired by Hubel & Wiesel (1962) and by Fukushima’s Neocognitron (1982), presented the Convolutional Network architecture at NIPS 1989, along with the improvements it brought to the field, it was clear that his paper was going to change the way people did, and would keep doing, Machine Learning for decades to come.

Improving Artificial Intelligence, or more specifically Machine Learning, was going to be difficult with the existing network architectures. Image recognition, for instance, was slow due to heavy matrix multiplications: imagine taking a vector of millions of pixels and multiplying it by a matrix with millions of weights. That couldn’t scale well. Drawing on that inspiration from Neuroscience, Yann LeCun et al. modelled the Convolutional Network architecture, where the first layer plays the role of the simple cells in the visual cortex, detecting local features. This is all done by discrete convolution operations.

In the deeper layers of the network, the pooling layers mimic the work of the complex cells by pooling the outputs of the simple cells; this is also called sub-sampling. In between the convolution and pooling operations, a non-linearity is applied. Then the same process repeats: convolutional layer, non-linearity, pooling layer.

A Convolutional Network.
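To make that repeated pattern concrete, here is a minimal Keras sketch of such a stack. The filter counts, kernel sizes and the 64x64 greyscale input are illustrative choices, not taken from any particular model:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

# Illustrative stack: convolution, non-linearity (ReLU), pooling, repeated twice
model = Sequential()
model.add(Conv2D(16, (3, 3), activation='relu', input_shape=(64, 64, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))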

How does it all work?

Before I explain how those operations work, it is important to understand which parameters we have at hand. With fully connected networks, deep or not, we did not have to look much at the layers themselves, only at the linear operations used to update the weights, the regularisation and optimisation parameters, and the activation functions. With CNNs, however, there is a bunch of other parameters to look at.

Filters

As shown in the image above, the feature map works as a filter applied over the image being analysed. Its outcome then goes through a max, average or L2-norm pooling operation, which sub-samples it: the window over which the aggregation is computed produces a representation roughly half the size of its input.

The number of filters used determines the third dimension of the output. The equation for computing the output dimensions will be given further down in the story.

Kernel

The window that hovers over the input is also known as the kernel. The idea is that the kernel hovers over a certain area and applies the convolution operation there. The image below depicts how this operation works in a more self-explanatory way.

The first cell on the top-left corner of the output will be the result of the following operation:

Convolutional operation 1.
2*3+6*1+3*-1+3*4+6*0+4*0+7*4+9*2+8*3 = 91
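To make the arithmetic concrete, the same element-wise multiply-and-sum can be reproduced in NumPy, with the window and kernel values taken from the operation above:

import numpy as np

# 3x3 input window and kernel from convolutional operation 1
window = np.array([[2, 6, 3],
                   [3, 6, 4],
                   [7, 9, 8]])
kernel = np.array([[3, 1, -1],
                   [4, 0, 0],
                   [4, 2, 3]])

print(np.sum(window * kernel))  # 91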

Once this is done, the kernel will move a certain amount of positions, given by the number of strides, and will compute the value of the next cell in the output. If we use 1 as the number of strides, the result will be like depicted in the image below:

Convolutional operation 2.
3*3+6*1+4*-1+7*4+9*0+8*0+1*4+6*2+2*3 = 61

When the kernel reaches the far right of the input, it moves down by the number of strides and starts over again from the far left.
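Putting the sliding together, a naive (deliberately un-optimised) sketch of that loop could look like the following, assuming a square, single-channel input; the random 6x6 image is just for illustration:

import numpy as np

def convolve2d(image, kernel, stride=1):
    k = kernel.shape[0]
    out = (image.shape[0] - k) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):              # move down one row of outputs...
        for j in range(out):          # ...after sweeping from left to right
            window = image[i*stride:i*stride+k, j*stride:j*stride+k]
            result[i, j] = np.sum(window * kernel)
    return result

image = np.random.randint(0, 10, size=(6, 6))
kernel = np.array([[3, 1, -1], [4, 0, 0], [4, 2, 3]])
print(convolve2d(image, kernel).shape)  # (4, 4): a 6x6 input, 3x3 kernel, stride 1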

But how are the dimensions of the output computed, and why is there a number 16 in red in the image? We will look into that after we cover the other parameters.

Strides

From what has been explained above, you might already have an idea of how the stride works. It is an important parameter: when used wrongly, some of the information in the image might be lost. For instance, if you use a small kernel with a large number of strides, your feature detector might not work well, because part of the input is never taken into account.

Padding

Padding helps to avoid shrinking the output and throwing away information from the edges. It adds extra cells around the input, which helps the network compute and extract features from the edges of the input.

We can use valid padding, which results in a different output dimension (usually shrinking the input), or same padding, which makes sure that the output has the same height and width as the input.
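A quick way to see the difference is to compare output shapes in Keras; the 64x64 single-channel input and the 16 filters below are illustrative choices:

from keras.models import Sequential
from keras.layers import Conv2D

for pad in ('valid', 'same'):
    m = Sequential()
    m.add(Conv2D(16, (3, 3), padding=pad, input_shape=(64, 64, 1)))
    print(pad, m.output_shape)

# valid -> (None, 62, 62, 16): the output shrinks
# same  -> (None, 64, 64, 16): the output keeps the input height and width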

One Equation to Rule Them All

But how does it all fit together? How do all those parameters come into play? Let’s now see how to calculate the output dimensions given the input and filter dimensions, without forgetting strides and padding.

Let’s assume the following parameters:

  • Filters = f = 32;
  • Filter dimensions = k = 3;
  • Padding = p = 2;
  • Strides = s = 2;
  • Height & width = n = 64.

The output dimensions are then given by (with the division rounded down):

(((n + 2 . p - k) / s) + 1) X (((n + 2 . p - k) / s) + 1) X f

Which means:

((64 + 2 . 2 - 3) / 2) + 1 = 33
((64 + 2 . 2 - 3) / 2) + 1 = 33
32

The final output would be: 33x33x32.

As you can see from the equation above, the third dimension is given by the number of filters used in the convolution operation. We can do the same for the data in the images of convolutional operations 1 and 2. Remember, though, that there we had a stride of 1 and no padding. The calculation goes as depicted below:

((6 + 2 . 0 - 3) / 1) + 1 = 4
((6 + 2 . 0 - 3) / 1) + 1 = 4
16

The final output would be: 4x4x16.

That 16 there is the same 16 in red you saw in the images above: the third dimension of the output is given by the number of filters.
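As a sanity check, the same equation can be written as a small Python helper and applied to both examples above:

def conv_output_shape(n, k, f, s=1, p=0):
    # Height/width of the output; the division is rounded down
    side = ((n + 2 * p - k) // s) + 1
    return (side, side, f)

print(conv_output_shape(n=64, k=3, f=32, s=2, p=2))  # (33, 33, 32)
print(conv_output_shape(n=6, k=3, f=16, s=1, p=0))   # (4, 4, 16)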

What should you take away from all this? When working with images, think about the image dimensions for the first convolutional layer, and about the following inputs, before deciding what your parameters will look like. As a rule of thumb, the kernel dimensions are most of the time 3x3 (or just 3, if you use Keras), and the number of filters usually follows powers of two: 8, 16, 32 and so on. Padding is usually kept at its default, valid, which means no padding at all. If you look into Residual Networks, though, you will see a lot of same padding.

Another important detail: the third dimension of the kernel is given by the number of channels in the images. If you are using coloured images, the third dimension will be 3, representing the RGB channels; for greyscale images, it is 1.
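You can verify this by inspecting the kernel weights of a first convolutional layer: in Keras, with channels-last data, the kernel tensor has shape (height, width, channels, filters). The 64x64 RGB input below is illustrative:

from keras.models import Sequential
from keras.layers import Conv2D

m = Sequential()
m.add(Conv2D(32, (3, 3), input_shape=(64, 64, 3)))  # RGB input: 3 channels
print(m.layers[0].get_weights()[0].shape)           # (3, 3, 3, 32)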

By now you should know how some of the inner parts of a convolutional network work. Let’s now get to some more details concerning the problem we are working on here.

Do You Remember?

Do you still remember the goal of this story after so many details concerning Convolutional Networks?

Although I demonstrated the convolution operations using images, we will be focusing on text. As you might have noticed, contrary to fully connected neural networks, convolutional networks do not need the input data to be unrolled into a vector. Instead, the filters are applied on top of the input images as they are, respecting their dimensions.

Text data does not work exactly in the way explained above, but it still works with convolutional networks; we just need a different layer configuration: a one-dimensional convolutional layer.

Conv1D

When using CNNs to solve NLP problems, one will typically apply Conv1D instead of the Conv2D shown above. As the images above depict, the convolution works over the two spatial dimensions of an image. With text it is slightly different: first, the sentences are padded or truncated to a certain maximum length; second, the words are encoded as vectors of a given embedding dimension; third, the kernel convolves along the length dimension, covering the full embedding dimension at each step.

Conv1D Convolution.
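As a small sketch of the idea (using the same hyper-parameter values we will define below): each review becomes a max_review_length x n_dim matrix after the Embedding layer, and Conv1D then slides a kernel of k_conv words over the length dimension:

from keras.models import Sequential
from keras.layers import Embedding, Conv1D

m = Sequential()
m.add(Embedding(5000, 64, input_length=200))  # output: (None, 200, 64)
m.add(Conv1D(256, 3, activation='relu'))      # output: (None, 198, 256)
print(m.output_shape)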

The Problem

We will now look into a sentiment analysis problem based on the Large Movie Review Dataset, which was published in an ACL 2011 paper by Andrew Maas et al. It provides a set of 25,000 highly polar movie reviews for training, and 25,000 for testing.

As you might recall, that’s the same problem we have worked on in the Part II of this series.

Don’t Forget

If you can’t follow the code about to be shown below, or the hyper-parameters, cost function, optimiser, etc., please refer to Part II, where all of it is explained.

The bit I know about CNNs has already been covered in the sections above; from now on it will be only code.

It’s Mud Time

Yes, let’s get our hands dirty!

Import Dependencies

import keras
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout, Activation
from keras.layers import Embedding, Conv1D, SpatialDropout1D, GlobalMaxPool1D
from keras.callbacks import ModelCheckpoint
import os
import sklearn.metrics
from sklearn.metrics import roc_auc_score
import pandas as pd
import matplotlib.pyplot as plt

Hyper-parameters

output_dir = 'model_output/conv'  # where the model checkpoints will be saved

epochs = 4                # training epochs
batch_size = 128          # mini-batch size
n_dim = 64                # dimensionality of the word embeddings
n_unique_words = 5000     # vocabulary size
n_words_to_skip = 50      # most frequent words to skip (not used in the code below)
max_review_length = 200   # reviews are padded / truncated to this length
pad_type = trunc_type = 'pre'
drop_embed = 0.2          # dropout applied to the embedding layer
n_dense = 256             # neurons in the dense layer
dropout = 0.2             # dropout applied after the dense layer
n_conv = 256              # number of convolutional filters
k_conv = 3                # kernel size, in words

Load and Preprocess the Data

(X_train, y_train), (X_valid, y_valid) = imdb.load_data(num_words=n_unique_words)

X_train = pad_sequences(X_train, maxlen=max_review_length,
                        padding=pad_type, truncating=trunc_type, value=0)
X_valid = pad_sequences(X_valid, maxlen=max_review_length,
                        padding=pad_type, truncating=trunc_type, value=0)

Design the Convolutional Network Architecture

model = Sequential()
model.add(Embedding(n_unique_words, n_dim, input_length=max_review_length))
model.add(SpatialDropout1D(drop_embed))
model.add(Conv1D(n_conv, k_conv, activation='relu'))
model.add(GlobalMaxPool1D())
model.add(Dense(n_dense, activation='relu'))
model.add(Dropout(dropout))
model.add(Dense(1, activation='sigmoid'))

Check the Model Summary

print(model.summary())

Create Model Checkpoint

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

modelcheckpoint = ModelCheckpoint(filepath=output_dir+'/weights.{epoch:02d}.hdf5')
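As a side note, instead of saving one file per epoch, one could keep only the best weights seen so far. A possible variation (not what is used in the rest of this story; best_checkpoint is just a hypothetical name) would be:

best_checkpoint = ModelCheckpoint(filepath=output_dir+'/weights.best.hdf5',
                                  monitor='val_loss', save_best_only=True)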

Compile the Model

model.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])

Train the Model

model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, 
verbose=1, validation_split=.20,
callbacks=[modelcheckpoint])

Load the Best Weights and Predict

# In my case it was the fourth one: 'weights.04.hdf5'
model.load_weights(output_dir+'/weights.04.hdf5')
y_hat = model.predict_proba(X_valid)

Plot a Histogram based on the Predictions

plt.hist(y_hat)
_ = plt.axvline(x=0.5, color='orange')
Histogram of the predictions.

Calculate the ROC AUC Score

pct_auc = roc_auc_score(y_valid, y_hat) * 100
print('{:0.2f}'.format(pct_auc))

ROC AUC: 95.37

Plot the Area Under the Curve

fpr, tpr, _ = sklearn.metrics.roc_curve(y_valid, y_hat)
roc_auc = sklearn.metrics.auc(fpr, tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
ROC with 95% under the curve.

With our CNN model, we went from roughly 93% to 95% area under the ROC curve, compared to the deep neural network model created in Part II of this series.

Acknowledgements

Thanks again for your time and please give me some feedback, either about the content or concerning any mistake I might have made.

In Part IV, I will explore vanilla Recurrent Neural Networks and Long Short-Term Memory networks. Stay tuned!
