Secrets behind the Convolutional Neural Networks and LSTM

Vydyula Akhil
Nov 2 · 10 min read

Text Classification Using Convolutional Neural Network (CNN) :

CNNs are a class of deep, feed-forward artificial neural networks (networks where connections between nodes do not form a cycle) that use a variation of multilayer perceptrons designed to require minimal preprocessing. They are inspired by the animal visual cortex.

I have taken reference from the Yoon Kim paper and this blog post by Denny Britz.

CNNs are generally used in computer vision; however, they have recently been applied to various NLP tasks, and the results were promising.

Let’s briefly see what happens when we use a CNN on text data, with the help of a diagram. The result of each convolution will fire when a special pattern is detected. By varying the size of the kernels and concatenating their outputs, you allow the network to detect patterns of multiple sizes (2, 3, or 5 adjacent words). Patterns could be expressions (word n-grams) like “I hate” or “very good”, and CNNs can identify them in a sentence regardless of their position.

Image Reference : http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

In this section, I use a simplified CNN to build the classifier. First, Beautiful Soup is used to remove HTML tags and some unwanted characters.

import re
from bs4 import BeautifulSoup

def clean_str(string):
    string = re.sub(r"\\", "", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()

texts = []
labels = []
for i in range(df.message.shape[0]):
    text = BeautifulSoup(df.message[i], "html.parser")
    texts.append(clean_str(str(text.get_text().encode())))
for i in df['class']:
    labels.append(i)

Here I have used the pre-trained GloVe 6B 100-dimensional word vectors (from the Stanford NLP group). From the official documentation:

“GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.”

For an unknown word, the following code will just randomise its vector. Below is a very simple Convolutional Architecture, using a total of 128 filters with size 5 and max pooling of 5 and 35, following the sample from this blog.
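A minimal sketch of how the GloVe vectors can be loaded into an embedding matrix, with random vectors left in place for out-of-vocabulary words (the function name `build_embedding_matrix` and its `word_index`/`embeddings_index` inputs are my own illustration, not the article’s exact code):

```python
import numpy as np

EMBEDDING_DIM = 100  # matches the GloVe 6B 100d vectors

def build_embedding_matrix(word_index, embeddings_index, dim=EMBEDDING_DIM):
    # Rows default to small random values, so any word missing from
    # GloVe (an "unknown word") simply keeps its random vector.
    matrix = np.random.uniform(-0.25, 0.25, (len(word_index) + 1, dim))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix
```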

from keras.models import Model
from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
l_cov1 = Conv1D(128, 5, activation='relu')(embedded_sequences)
l_pool1 = MaxPooling1D(5)(l_cov1)
l_cov2 = Conv1D(128, 5, activation='relu')(l_pool1)
l_pool2 = MaxPooling1D(5)(l_cov2)
l_cov3 = Conv1D(128, 5, activation='relu')(l_pool2)
l_pool3 = MaxPooling1D(35)(l_cov3)  # effectively global max pooling
l_flat = Flatten()(l_pool3)
l_dense = Dense(128, activation='relu')(l_flat)
preds = Dense(len(macronum), activation='softmax')(l_dense)

model = Model(sequence_input, preds)

Here is the architecture of the CNN Model

When to Apply a 1D CNN?

A CNN works well for identifying simple patterns within your data which will then be used to form more complex patterns within higher layers. A 1D CNN is very effective when you expect to derive interesting features from shorter (fixed-length) segments of the overall data set and where the location of the feature within the segment is not of high relevance.

This applies well to the analysis of time sequences of sensor data (such as gyroscope or accelerometer data). It also applies to the analysis of any kind of signal data over a fixed-length period (such as audio signals). Another application is NLP (although here LSTM networks are more promising since the proximity of words might not always be a good indicator for a trainable pattern).

What is the Difference Between a 1D CNN and a 2D CNN?

CNNs share the same characteristics and follow the same approach, no matter if it is 1D, 2D or 3D. The key difference is the dimensionality of the input data and how the feature detector (or filter) slides across the data:
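The difference in how the filter slides can be made concrete with plain NumPy (a toy illustration; the signal, image, and kernels here are arbitrary examples):

```python
import numpy as np

# A 1D filter slides along a single axis (e.g. time steps):
signal = np.array([1, 2, 3, 4, 5], dtype=float)
kernel_1d = np.array([1, 0, -1], dtype=float)  # 3-tap edge-like filter
out_1d = np.array([np.dot(signal[i:i + 3], kernel_1d)
                   for i in range(len(signal) - 2)])
# out_1d has length 5 - 3 + 1 = 3

# A 2D filter slides along two axes (e.g. image height and width):
image = np.arange(16, dtype=float).reshape(4, 4)
kernel_2d = np.ones((2, 2))
out_2d = np.array([[np.sum(image[i:i + 2, j:j + 2] * kernel_2d)
                    for j in range(3)] for i in range(3)])
# out_2d has shape (4 - 2 + 1, 4 - 2 + 1) = (3, 3)
```

The mechanics are identical in both cases; only the number of axes the window moves along changes.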

“1D versus 2D CNN” by Nils Ackermann is licensed under Creative Commons CC BY-ND 4.0

Problem Statement

In this article we will focus on time-sliced accelerometer sensor data coming from a smartphone carried by its users on their waist. Based on the accelerometer data of the x, y and z axes, the 1D CNN should predict the type of activity a user is performing (such as “Walking”, “Jogging” or “Standing”). You can find more information in my two other articles here and here. Each time interval of the data will look similar to this for the various activities.

Example time series from the accelerometer data

How to Construct a 1D CNN in Python?

There are many standard CNN models available. I picked one of the models described on the Keras website and modified it slightly to fit the problem depicted above. The following picture provides a high level overview of the constructed model. Each layer will be explained further.

“1D CNN Example” by Nils Ackermann is licensed under Creative Commons CC BY-ND 4.0

But let’s first take a look at the Python code used to construct this model. Running this code will result in the following deep neural network:

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
reshape_45 (Reshape) (None, 80, 3) 0
_________________________________________________________________
conv1d_145 (Conv1D) (None, 71, 100) 3100
_________________________________________________________________
conv1d_146 (Conv1D) (None, 62, 100) 100100
_________________________________________________________________
max_pooling1d_39 (MaxPooling (None, 20, 100) 0
_________________________________________________________________
conv1d_147 (Conv1D) (None, 11, 160) 160160
_________________________________________________________________
conv1d_148 (Conv1D) (None, 2, 160) 256160
_________________________________________________________________
global_average_pooling1d_29 (None, 160) 0
_________________________________________________________________
dropout_29 (Dropout) (None, 160) 0
_________________________________________________________________
dense_29 (Dense) (None, 6) 966
=================================================================
Total params: 520,486
Trainable params: 520,486
Non-trainable params: 0
_________________________________________________________________
None
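A model matching this summary can be sketched as follows (a reconstruction consistent with the layer shapes and parameter counts above, not necessarily the author’s exact code):

```python
from keras.models import Sequential
from keras.layers import (Reshape, Conv1D, MaxPooling1D,
                          GlobalAveragePooling1D, Dropout, Dense)

TIME_PERIODS, NUM_SENSORS, NUM_CLASSES = 80, 3, 6

model_m = Sequential()
# The iOS pipeline delivers a flat vector of length 240 (80 * 3),
# so the first layer restores the original 80 x 3 shape.
model_m.add(Reshape((TIME_PERIODS, NUM_SENSORS),
                    input_shape=(TIME_PERIODS * NUM_SENSORS,)))
model_m.add(Conv1D(100, 10, activation='relu'))
model_m.add(Conv1D(100, 10, activation='relu'))
model_m.add(MaxPooling1D(3))
model_m.add(Conv1D(160, 10, activation='relu'))
model_m.add(Conv1D(160, 10, activation='relu'))
model_m.add(GlobalAveragePooling1D())
model_m.add(Dropout(0.5))
model_m.add(Dense(NUM_CLASSES, activation='softmax'))
print(model_m.summary())
```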

Let’s dive into each layer and see what is happening:

  • Input data: The data has been preprocessed so that each data record contains 80 time slices (the data was recorded at a 20 Hz sampling rate, so each time interval covers four seconds of accelerometer readings). Within each time slice, the three accelerometer values for the x, y and z axes are stored. This results in an 80 x 3 matrix. Since I typically use the neural network within iOS, the data must be passed into the neural network as a flat vector of length 240. The first layer in the network must reshape it to the original 80 x 3 shape.
  • First 1D CNN layer: The first layer defines a filter (also called a feature detector) of height 10 (also called kernel size). Defining only one filter would allow the neural network to learn one single feature in the first layer. This might not be sufficient, so we define 100 filters, which allows us to train 100 different features on the first layer of the network. The output of the first layer is a 71 x 100 matrix. Each column of the output matrix holds the outputs of one single filter; with the defined kernel size and the length of the input, each filter produces 71 output values.
  • Second 1D CNN layer: The result from the first CNN will be fed into the second CNN layer. We will again define 100 different filters to be trained on this level. Following the same logic as the first layer, the output matrix will be of size 62 x 100.
  • Max pooling layer: A pooling layer is often used after a CNN layer in order to reduce the complexity of the output and prevent overfitting of the data. In our example we chose a pool size of three, so the output matrix of this layer is only a third of the size of the input matrix.
  • Third and fourth 1D CNN layer: Another sequence of 1D CNN layers follows in order to learn higher level features. The output matrix after those two layers is a 2 x 160 matrix.
  • Average pooling layer: One more pooling layer to further avoid overfitting. This time the average, not the maximum, of the values in the window is taken (here, of the two remaining time steps). The output matrix has a size of 1 x 160; per feature detector, only one value remains on this layer.
  • Dropout layer: The dropout layer randomly sets neuron outputs to zero during training. Since we chose a rate of 0.5, 50% of the neurons will be dropped. This operation makes the network less sensitive to small variations in the data, so it should further increase our accuracy on unseen data. The output of this layer is still a 1 x 160 matrix.
  • Fully connected layer with softmax activation: The final layer reduces the vector of height 160 to a vector of six, since we have six classes to predict (“Jogging”, “Sitting”, “Walking”, “Standing”, “Upstairs”, “Downstairs”). This reduction is done by another matrix multiplication. Softmax is used as the activation function; it forces all six outputs of the neural network to sum to one, so each output value represents the probability of one of the six classes.
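The output lengths quoted in these bullet points follow from simple arithmetic, which can be checked in a few lines of Python:

```python
def conv1d_out_len(n_in, kernel_size):
    # "valid" convolution with stride 1: the filter fits n - k + 1 times
    return n_in - kernel_size + 1

def pool1d_out_len(n_in, pool_size):
    # non-overlapping pooling; the remainder is discarded
    return n_in // pool_size

n = 80                      # time slices per record
n = conv1d_out_len(n, 10)   # first Conv1D  -> 71
n = conv1d_out_len(n, 10)   # second Conv1D -> 62
n = pool1d_out_len(n, 3)    # MaxPooling1D  -> 20
n = conv1d_out_len(n, 10)   # third Conv1D  -> 11
n = conv1d_out_len(n, 10)   # fourth Conv1D -> 2
print(n)  # 2, matching the 2 x 160 matrix in the summary
```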

Training and Testing the Neural Network

Here is the Python code to train the model, with a batch size of 400 and an 80/20 training/validation split.
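A self-contained sketch of such a training run (synthetic data and a smaller model stand in for the accelerometer set and the full network; only the batch size of 400 and the 80/20 split are taken from the article, and the epochs are cut to two to keep the sketch quick):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Reshape, Conv1D, GlobalAveragePooling1D, Dense

# Synthetic stand-in for the accelerometer set: flat 240-vectors
# (80 time slices x 3 axes) and one-hot labels for 6 activities.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(1000, 240)).astype('float32')
y_train = np.eye(6)[rng.integers(0, 6, size=1000)]

model_m = Sequential([
    Reshape((80, 3), input_shape=(240,)),
    Conv1D(32, 10, activation='relu'),
    GlobalAveragePooling1D(),
    Dense(6, activation='softmax'),
])
model_m.compile(loss='categorical_crossentropy', optimizer='adam',
                metrics=['accuracy'])

# Batch size 400 and an 80/20 train/validation split, as in the article.
history = model_m.fit(x_train, y_train, batch_size=400, epochs=2,
                      validation_split=0.2, verbose=0)
```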

The model reaches an accuracy of 97% for the training data.

...
Epoch 9/50
16694/16694 [==============================] - 16s 973us/step - loss: 0.0975 - acc: 0.9683 - val_loss: 0.7468 - val_acc: 0.8031
Epoch 10/50
16694/16694 [==============================] - 17s 989us/step - loss: 0.0917 - acc: 0.9715 - val_loss: 0.7215 - val_acc: 0.8064
Epoch 11/50
16694/16694 [==============================] - 17s 1ms/step - loss: 0.0877 - acc: 0.9716 - val_loss: 0.7233 - val_acc: 0.8040
Epoch 12/50
16694/16694 [==============================] - 17s 1ms/step - loss: 0.0659 - acc: 0.9802 - val_loss: 0.7064 - val_acc: 0.8347
Epoch 13/50
16694/16694 [==============================] - 17s 1ms/step - loss: 0.0626 - acc: 0.9799 - val_loss: 0.7219 - val_acc: 0.8107

Running it against the test data reveals an accuracy of 92%.

Accuracy on test data: 0.92
Loss on test data: 0.39

This is a good number considering that we used one of the standard 1D CNN models. Our model also scores well on precision, recall, and the f1-score.

              precision    recall  f1-score   support

           0       0.76      0.78      0.77       650
           1       0.98      0.96      0.97      1990
           2       0.91      0.94      0.92       452
           3       0.99      0.84      0.91       370
           4       0.82      0.77      0.79       725
           5       0.93      0.98      0.95      2397

 avg / total       0.92      0.92      0.92      6584

Here is a brief recap of what those scores mean: precision is the fraction of predictions for a class that were correct (TP / (TP + FP)), recall is the fraction of actual instances of a class that were found (TP / (TP + FN)), and the f1-score is the harmonic mean of the two.
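A report in the format shown above is typically produced with scikit-learn’s `classification_report` (a toy example here, not the article’s data, where `y_true` would be the test labels and `y_pred` the model’s argmax predictions):

```python
from sklearn.metrics import classification_report

# Toy labels standing in for the six activity classes (0-5)
y_true = [0, 1, 1, 2, 2, 2, 3, 4, 5, 5]
y_pred = [0, 1, 1, 2, 2, 1, 3, 4, 5, 5]

print(classification_report(y_true, y_pred))
```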

Long Short-Term Memory:

Long Short-Term Memory (LSTM) is one of the most widely used recurrent structures in sequence modeling. It uses gates to control information flow in the recurrent computations.

LSTM networks are very good at holding long-term memories. Whether a memory is retained or discarded depends on the data, and this preservation of long-term dependencies is handled by the gating mechanism: the network can store or release memory on the go through its gates.
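The gating mechanism can be made concrete with a few lines of NumPy implementing a single LSTM time step (a minimal sketch following the standard LSTM equations; `lstm_step` and its stacked-parameter layout are my own illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters of the
    four gates: input (i), forget (f), cell candidate (g), output (o)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # all gate pre-activations, shape (4n,)
    i = sigmoid(z[0:n])                 # input gate: what to write
    f = sigmoid(z[n:2 * n])             # forget gate: what to erase
    g = np.tanh(z[2 * n:3 * n])         # candidate cell values
    o = sigmoid(z[3 * n:4 * n])         # output gate: what to expose
    c = f * c_prev + i * g              # new cell state (the "memory")
    h = o * np.tanh(c)                  # new hidden state
    return h, c

# Tiny example: 3 input features, hidden size 2
rng = np.random.default_rng(1)
x = rng.normal(size=3)
h, c = np.zeros(2), np.zeros(2)
W, U, b = rng.normal(size=(8, 3)), rng.normal(size=(8, 2)), np.zeros(8)
h, c = lstm_step(x, h, c, W, U, b)
```

The forget gate `f` is what lets the cell state carry information across many time steps unchanged, which is why LSTMs hold long-term memories better than plain RNNs.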

There is some confusion about how LSTM models differ from MLPs, both in input requirements and in performance. One way to become more comfortable with LSTMs is to generate a data set that contains some lagged components, then build both an LSTM and a regular MLP model and compare their performance and behaviour.

First we generate the uni-dimensional input that both models will need.

# Load packages
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# Generate 2 sets of X variables
# LSTMs have unique 3-dimensional input requirements
seq_length = 5
X = [[i + j for j in range(seq_length)] for i in range(100)]
X_simple = [[i for i in range(4, 104)]]
X = np.array(X)
X_simple = np.array(X_simple)

Here is the LSTM-ready array with a shape of (100 samples, 5 time steps, 1 feature)

And the MLP-ready array has a shape of (100 samples, 1 feature). Note the key difference is the lack of time steps or sequence.

Next generate a simple lagged y-variable.

y = [[i + (i - 1) * .5 + (i - 2) * .2 + (i - 3) * .1 for i in range(4, 104)]]
y = np.array(y)

X_simple = X_simple.reshape((100, 1))
X = X.reshape((100, 5, 1))
y = y.reshape((100, 1))

This is what the y-array looks like.

So now we can see how the LSTM model is trying to find a pattern from the sequence [0, 1, 2, 3, 4] → 6, while the MLP is only focused on a pattern from [4] → 6.

Next we build the LSTM model.

model = Sequential()
model.add(LSTM(8, input_shape=(5, 1), return_sequences=False))  # True = many to many
model.add(Dense(2, kernel_initializer='normal', activation='linear'))
model.add(Dense(1, kernel_initializer='normal', activation='linear'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=2000, batch_size=5, validation_split=0.05, verbose=0)
scores = model.evaluate(X, y, verbose=1, batch_size=5)
print('Accuracy: {}'.format(scores[1]))

import matplotlib.pyplot as plt
predict = model.predict(X)
plt.plot(y, predict - y, 'C2')
plt.ylim(-3, 3)
plt.show()

Here we can see the LSTM model doing a fairly good job at prediction until the upper range. Normalization should address this.

And now for the MLP model with near-identical parameters.

model2 = Sequential()
model2.add(Dense(8, input_dim=1, activation='linear'))
model2.add(Dense(2, activation='linear'))
model2.add(Dense(1, activation='linear'))
model2.compile(loss='mse', optimizer='rmsprop', metrics=['accuracy'])
model2.fit(X_simple, y, epochs=2000, batch_size=5, validation_split=0.05, verbose=0)
scores2 = model2.evaluate(X_simple, y, verbose=1, batch_size=5)
print('Accuracy: {}'.format(scores2[1]))

The MLP model is virtually perfect. This is why they call them universal function approximators!

The likely reason the MLP outperformed the LSTM here is that the lag component only spanned three time steps. Some sources state that when the relationships span longer time frames, LSTMs tend to perform best.

Difference between CNN and RNN are as follows :

CNN:

  1. CNNs take a fixed-size input and generate fixed-size outputs.
  2. A CNN is a type of feed-forward artificial neural network, a variation of the multilayer perceptron designed to use minimal amounts of preprocessing.
  3. The connectivity pattern between a CNN’s neurons is inspired by the organization of the animal visual cortex, whose individual neurons are arranged so that they respond to overlapping regions tiling the visual field.
  4. CNNs are ideal for image and video processing.

RNN:

  1. RNNs can handle arbitrary input/output lengths.
  2. RNNs, unlike feed-forward neural networks, can use their internal memory to process arbitrary sequences of inputs.
  3. Recurrent neural networks use time-series information (i.e. what I spoke last will impact what I will speak next).
  4. RNNs are ideal for text and speech analysis.
