How to implement CNN for NLP tasks like Sentence Classification

Rajat Newatia · Published in Saarthi.ai · May 27, 2019

The aim of this article is to provide a general understanding of Convolutional Neural Networks (CNNs) and of their use in Natural Language Processing (NLP), demonstrated by performing sentence classification on the ‘Yelp’ review dataset. (The accuracy obtained on this dataset is not very high and is not the main concern of the article.)

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks most commonly applied to analyzing visual imagery. For instance, CNNs are used for applications such as image classification, facial recognition and object detection (Wikipedia).

Most recently, however, Convolutional Neural Networks have also found prevalence in tackling NLP tasks such as Sentence Classification, Text Classification, Sentiment Analysis, Text Summarization, Machine Translation and Answer Selection.

General architecture of a Convolutional Neural Network:

A convolutional neural network is composed of “convolutional” layers and “downsampling” or “subsampling” layers.

  • Convolutional layers comprise neurons that scan their input for patterns.
  • Downsampling layers, or “pooling” layers, are often placed after convolutional layers in a ConvNet, mainly to reduce the feature map dimensionality for computational efficiency, which can in turn improve actual performance.
  • Typically the two kinds of layers alternate, but that is not necessarily always the case.
  • This is followed by an MLP with one or more layers (the fully connected layers).
Figure 1 : General architecture of a CNN performing digit recognition on a set of handwritten characters

How does the CNN architecture work?

When we think about images as the input, a computer has to deal with a 2-D matrix of numbers (pixel values), and therefore we need some way to detect features in this matrix.

A deep learning CNN model will pass this matrix through a series of convolutional layers with filters (kernels), ReLU layers (or some other activation function), pooling layers and fully connected (FC) layers, and finally apply an activation function such as sigmoid or softmax to classify an object with probabilistic values between 0 and 1.
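As a concrete illustration of this pipeline (not part of the Yelp example that follows; the input shape and layer sizes are arbitrary assumptions), a minimal Keras model for 28 x 28 grayscale digit images could look like this:

from keras.models import Sequential
from keras import layers

# Illustrative only: convolution -> ReLU -> pooling -> flatten -> fully connected -> softmax
model = Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))  # probabilities over the 10 digits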

Convolutional Layer

A convolutional layer can be thought of as composed of a series of “maps”, called “feature maps” or “activation maps”. Each activation map has two components:

  • A linear map, obtained by convolution over maps in the previous layer (each linear map has, associated with it, a learnable filter or kernel)
  • An activation that operates on the output of the convolution

All the maps in a given layer contribute to each convolution. Let’s consider the contribution of a single map:

Consider a 5 x 5 image with binary pixel values, i.e. 0 or 1, and a 3 x 3 filter matrix as shown below.

Figure 2 : A filter can be thought of as just a perceptron, with weights and a bias

The underlying map values are multiplied by the corresponding “filter” values, i.e. in a component-wise manner, and the products are added along with the bias term to produce the convolved feature map for the next layer.

Figure 3 : Scanning an image with a “filter”
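A minimal NumPy sketch of this scan (stride 1, no padding), using an example image and filter of our own (the values in the figure may differ):

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
bias = 0

# Slide the filter over the image, multiply component-wise, sum, and add the bias
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel) + bias
print(out)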

In the above example we used a stride (the number of pixels the filter shifts over the input matrix) of 1, but we can also move more than 1 pixel at a time. For instance, if we used a stride of 2 we would get the following result:

Figure 5 : Scanning with a stride of 2

In the absence of any active downsampling, and with a stride of 1, the size of the output map should ideally be equal to that of the input. For such cases we use a technique called “zero padding”, which ensures the result of the convolution is the same size as the original image.
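As a quick check of the sizes involved: for an input of width W, a filter of width F, padding P and stride S, the output width is (W − F + 2P) / S + 1. With W = 5, F = 3, P = 0 and S = 1 this gives 3, as in the example above, while choosing P = 1 (zero padding) keeps the output at 5.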

Pooling reduces the amount of computation, and the number of parameters in the following layers, when the feature maps are large. “Spatial pooling” or “downsampling” reduces the dimensionality of each map but retains the important information. Spatial pooling can be of different types:

  • Mean Pooling
  • Max Pooling
  • P-norm Pooling
  • Sum Pooling

Mean pooling involves taking the mean of the elements from the rectified feature map. Similarly, max pooling takes the largest element from the map, and extracting the sum of all elements in the feature map is referred to as sum pooling.

Figure 6: Mean pooling
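A minimal NumPy sketch of 2 x 2 mean and max pooling with stride 2 (the feature map values are illustrative):

import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 5, 1],
                        [2, 1, 0, 3],
                        [1, 2, 4, 2]])

# Pool each non-overlapping 2 x 2 window down to a single value
mean_pooled = np.zeros((2, 2))
max_pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        window = feature_map[2*i:2*i+2, 2*j:2*j+2]
        mean_pooled[i, j] = window.mean()
        max_pooled[i, j] = window.max()
print(mean_pooled)
print(max_pooled)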

Finally, we flatten our matrix into a vector and feed it into a fully connected layer, like an MLP.

Figure 7 : Classification as FC layer

In the above diagram, the feature map matrix is converted into a vector (x1, x2, x3, …). With the fully connected layers, we combine these features together to create a model. Finally, we apply an activation function such as softmax or sigmoid to classify the outputs as digits (0, 1, 2, …, 9).
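For example, softmax turns the raw scores of the last fully connected layer into probabilities that sum to 1 (the numbers below are illustrative):

import numpy as np

logits = np.array([2.0, 1.0, 0.1])           # raw scores for three classes
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)                                  # roughly [0.66, 0.24, 0.10]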

Convolutional Neural Network for Sentence Classification

Now that we have a basic understanding of how a Convolutional Neural Network works, and we have seen its implementation in computer vision, we will discuss its scope in Natural Language Processing.

Just as images can be represented as an array of pixel values (float values), we can represent text as an array of vectors (each word mapped to a specific vector in a vector space built from the entire vocabulary) that can be processed with the help of a CNN. When we are working with sequential data, like text, we work with one-dimensional convolutions, but the idea and the application stay the same. We still want to pick up on patterns in the sequence, which become more complex with each added convolutional layer.
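To make the analogy concrete, a sentence of, say, 6 words with 50-dimensional embeddings becomes a 6 x 50 matrix, and a 1-D filter of width 3 spans all 50 dimensions while sliding over the word positions only (the shapes and random values below are illustrative):

import numpy as np

seq_len, embedding_dim, filter_width = 6, 50, 3

sentence = np.random.rand(seq_len, embedding_dim)   # one vector per word
kernel = np.random.rand(filter_width, embedding_dim)

# The filter slides along the word dimension only, producing one value per position
out = np.array([np.sum(sentence[i:i+filter_width] * kernel)
                for i in range(seq_len - filter_width + 1)])
print(out.shape)   # (4,)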

Here, we will be training a Convolutional Neural Network to perform sentence classification on a dataset containing reviews from “Yelp”, following this workflow:

1. Importing the data and preprocessing it into a desirable format (one we can work with) using pandas.

2. Using GloVe to obtain pre-trained word embeddings for our model.

3. Using Keras to train our data on a CNN architecture and evaluating the accuracy obtained on the validation set.

Image Reference : http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Dataset

Download the Sentiment Labelled Sentences Data Set from the UCI Machine Learning Repository. This data set includes labeled reviews from IMDb, Amazon, and Yelp. Each review is marked with a score of 0 for a negative sentiment or 1 for a positive sentiment.

Extract the archive into a dataset folder and load the data with pandas:

import os
import pandas as pd

# Base directory containing the extracted dataset
path = 'C:/Users/rajat/Desktop/dataset/sentiment labelled sentences'

filepath_dict = {'yelp':   os.path.join(path, 'yelp_labelled.txt'),
                 'amazon': os.path.join(path, 'amazon_cells_labelled.txt'),
                 'imdb':   os.path.join(path, 'imdb_labelled.txt')}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    # Add another column filled with the source name
    df['source'] = source
    df_list.append(df)

df = pd.concat(df_list)
print(df.head())

The result will be as follows:

                                            sentence  label source
0 Wow... Loved this place. 1 yelp
1 Crust is not good. 0 yelp
2 Not tasty and the texture was just nasty. 0 yelp
3 Stopped by during the late May bank holiday of... 1 yelp
4 The selection on the menu was great and so wer... 1 yelp

We will just be using the Yelp reviews to train our CNN model, and we will implement everything using Keras. (Keras is a deep learning and neural networks API by François Chollet, capable of running on top of TensorFlow (Google), Theano or CNTK (Microsoft).)

Using Scikit-Learn:

train_test_split: randomly splits the dataset into the required training (75%) and testing (25%) sets.

Using Keras:

Tokenizer utility class: vectorizes a text corpus into a list of integers. Each integer maps to a value in a dictionary that encodes the entire corpus, with the keys in the dictionary being the vocabulary terms themselves. We can add the parameter num_words, which sets the size of the vocabulary: only the num_words most common words will be kept.

The reviews are text sequences with different numbers of words. To counter this, we can use pad_sequences(), which simply pads the sequences of words with zeros. Additionally, we can add a maxlen parameter to specify how long the sequences should be; sequences that exceed that length are cut.

from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

df_yelp = df[df['source'] == 'yelp']

sentences = df_yelp['sentence'].values
y = df_yelp['label'].values

sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, y, test_size=0.25, random_state=1000)

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)

X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)

# Adding 1 because of the reserved 0 index
vocab_size = len(tokenizer.word_index) + 1

maxlen = 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
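To see what this preprocessing produces, you can print one review before and after these two steps (the exact integer indices depend on the fitted tokenizer):

print(sentences_train[2])   # the raw review text
print(X_train[2])           # the same review as a padded sequence of word indices
print(X_train.shape)        # (number of training reviews, 100)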

Using GloVe (Pretrained Word Embeddings)

To obtain word embeddings we will simply be using GloVe (Global Vectors for Word Representation), developed by the Stanford NLP Group. From its official documentation:

“GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.”
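Each line of a GloVe file holds a word followed by the components of its vector, separated by spaces, which is what the helper function below relies on. Schematically (the values here are made up):

hello 0.123 -0.456 0.789 ...   (embedding_dim numbers per line)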

import numpy as np

def create_embedding_matrix(filepath, word_index, embedding_dim):
    # Adding 1 again because of the reserved 0 index
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath) as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix

We can use this function to retrieve the embedding matrix:

embedding_dim = 50
embedding_matrix = create_embedding_matrix(
    'data/glove_word_embeddings/glove.6B.50d.txt',
    tokenizer.word_index, embedding_dim)
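Note that the model trained below does not actually pass this matrix to its Embedding layer, so its embeddings are learned from scratch. A minimal sketch of how the pretrained vectors could be plugged in (and optionally frozen), with embedding_dim matching the GloVe file (50 here):

from keras import layers

# Initialise the Embedding layer with the GloVe matrix instead of random weights
embedding_layer = layers.Embedding(vocab_size, embedding_dim,
                                   weights=[embedding_matrix],
                                   input_length=maxlen,
                                   trainable=False)   # set True to fine-tune the vectors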

Training our CNN model:

from keras.models import Sequential
from keras import layers

embedding_dim = 100

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    epochs=10,
                    validation_data=(X_test, y_test),
                    batch_size=10)

The result will be as follows:

Train on 750 samples, validate on 250 samples
Epoch 1/10
750/750 [==============================] - 4s 6ms/step - loss: 0.6872 - acc: 0.5707 - val_loss: 0.6732 - val_acc: 0.5920
Epoch 2/10
750/750 [==============================] - 3s 4ms/step - loss: 0.5317 - acc: 0.8133 - val_loss: 0.4984 - val_acc: 0.7800
Epoch 3/10
750/750 [==============================] - 3s 4ms/step - loss: 0.1780 - acc: 0.9600 - val_loss: 0.4802 - val_acc: 0.8120
Epoch 4/10
750/750 [==============================] - 3s 4ms/step - loss: 0.0434 - acc: 0.9893 - val_loss: 0.5176 - val_acc: 0.8160
Epoch 5/10
750/750 [==============================] - 3s 4ms/step - loss: 0.0087 - acc: 1.0000 - val_loss: 0.5563 - val_acc: 0.8080
Epoch 6/10
750/750 [==============================] - 3s 4ms/step - loss: 0.0037 - acc: 1.0000 - val_loss: 0.5948 - val_acc: 0.8160
Epoch 7/10
750/750 [==============================] - 3s 4ms/step - loss: 0.0020 - acc: 1.0000 - val_loss: 0.6172 - val_acc: 0.8200
Epoch 8/10
750/750 [==============================] - 3s 4ms/step - loss: 0.0014 - acc: 1.0000 - val_loss: 0.6360 - val_acc: 0.8160
Epoch 9/10
750/750 [==============================] - 3s 4ms/step - loss: 9.8691e-04 - acc: 1.0000 - val_loss: 0.6550 - val_acc: 0.8160
Epoch 10/10
750/750 [==============================] - 3s 4ms/step - loss: 7.5744e-04 - acc: 1.0000 - val_loss: 0.6729 - val_acc: 0.8160

We can see that the validation accuracy plateaus around 81–82%, and 90% seems like a tough threshold to cross with this dataset; a CNN might not be well equipped for it. The reasons for such a plateau might be that:

  • The number of training samples is not enough
  • The data does not generalize well
  • Not enough attention was paid to tuning the hyperparameters

In general an RNN is a more ‘natural’ approach, given that text is naturally sequential. However, RNNs are quite slow and fickle to train, and CNNs work best with large training sets where they are able to find generalizations. With more data and careful tuning of the hyperparameters, CNNs have been shown to give results close to the state of the art.

Conclusion

  1. With this article, I hope that you now have at least a basic understanding of what a Convolutional Neural Network is, and of the general architecture behind any CNN model.
  2. We also gained some insight into how this deep neural network architecture can be used in the context of NLP (we performed sentence classification on the Yelp review dataset).
  3. Just like sentence classification, CNNs can also be applied to other NLP tasks such as machine translation, sentiment classification, relation classification, text summarization, answer selection, etc.

If you have any questions, feel free to comment. I would be really happy to help you out. If this article helped you in any way, don’t be shy to click the clap button!
