Unraveling a Keras model

Amanda Dsouza
3 min read · Nov 22, 2016


Keras is a great library for hands-on work with neural networks, and it has a ton of great examples that make it very easy to create ANNs and DNNs. So easy, in fact, that you could build one without knowing what’s going on.

I used the CNN model from this Keras blog post to create a simple sentiment analysis model. But to fully understand what I had just done, I had to dig a little deeper.

The basic model outlined in the post uses pre-trained word embeddings of the text to train a CNN for sentiment analysis. I have shown it below, with a few minor changes to the padding (border_mode=’same’), so that the convolution output size stays the same as its input (for simplicity).

from keras.models import Sequential
from keras.layers import Embedding, Convolution1D, MaxPooling1D, Flatten, Dense

model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    EMBEDDING_DIM,
                    weights=[embedding_matrix],
                    input_length=MAX_SEQUENCE_LENGTH,
                    trainable=False))
model.add(Convolution1D(64, 5, activation='relu', border_mode='same'))
model.add(MaxPooling1D(5))
model.add(Convolution1D(64, 5, activation='relu', border_mode='same'))
model.add(MaxPooling1D(5))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(len(self.labels_index), activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

First, to understand the intuition behind this model, you can refer to Kim et al.’s paper on using CNNs for sentence classification, as well as this Quid blog post. To summarize briefly, the convolution operation over N word embedding vectors (where N is the filter size) can be seen as checking for the presence or absence of N-grams in the input word vectors, and together with a max pooling layer it extracts the most salient features for training the fully connected layers.
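
To make the N-gram intuition concrete, here is a minimal sketch in plain NumPy (not Keras) of one convolution filter of size N sliding over a sentence of word embeddings. The toy numbers (6 words, 4-dimensional embeddings, N=3) are my own illustrative choices, not values from the model above.

```python
import numpy as np

# One sentence as a sequence of word embeddings, and one conv filter.
seq_len, emb_dim, n = 6, 4, 3
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((seq_len, emb_dim))
filt = rng.standard_normal((n, emb_dim))

# 'valid' 1D convolution: one activation per N-gram window,
# dotting the filter against each n x emb_dim patch of the sentence.
activations = np.array([
    np.sum(embeddings[i:i + n] * filt)
    for i in range(seq_len - n + 1)
])
print(activations.shape)  # (4,) -- one score per 3-gram in the sentence
```

Each activation scores how strongly one 3-gram of the sentence matches the pattern encoded by the filter, which is the “N-gram detector” reading of the convolution.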

Three aspects of convolution layers (used traditionally in image problems) are important to keep in mind, to understand how they have been extended to text.

  1. Filters are typically constructed to convolve over the dimensions that are continuous. In the case of an image, the input (pixels) is spatially continuous over both the width and height dimensions, but not depth. In the case of text (or other time series data) represented as a continuous sequence of words, the continuous dimension is that sequence (temporal) dimension. Further, if Word2Vec or GloVe word embedding vectors are used instead of words or one-hot encoded vectors, these vectors are continuous dense representations, and hence the filter can be constructed to also convolve over the embedding dimension.
  2. Images can have several channels, such as RGB or HSV. In the case of text, multiple channels could translate to different representations of the same input text, such as different word vectors for a word (as described in the Kim et al. paper).
  3. In the case of temporal data (text, time series…), the max-over-time pooling suggested by Collobert et al. is seen as more appropriate for some tasks. Max-over-time pooling does a MAX() operation over the time dimension, leaving the other dimensions unchanged. If the output of each temporal convolution filter is a SEQUENCE_LENGTH X FEATURES matrix, pooling reduces it to a 1 X FEATURES vector.
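
The third point can be sketched in one line of NumPy: take the max over the time (sequence) axis and leave the feature axis intact. The SEQUENCE_LENGTH of 10 here is an arbitrary toy value; the 64 features match the convolution layers above.

```python
import numpy as np

# Fake convolution output: SEQUENCE_LENGTH x FEATURES.
SEQUENCE_LENGTH, FEATURES = 10, 64
conv_out = np.random.default_rng(1).standard_normal((SEQUENCE_LENGTH, FEATURES))

# Max-over-time pooling: MAX() over the time axis only.
pooled = conv_out.max(axis=0)
print(pooled.shape)  # (64,) -- a 1 x FEATURES vector
```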

Coming back to the Keras model, the shape of the data after each layer is shown below. There were a few additional preprocessing steps I did on the input text, like removing stopwords and irrelevant characters, so you can think of the Input Text (first step) as cleaned text.

Embedding layer output shape: (None, 1152, 300)
Convolution layer #1 output shape: (None, 1152, 64)
Max pool layer #1 output shape: (None, 230, 64)
Convolution layer #2 output shape: (None, 230, 64)
Max pool layer #2 output shape: (None, 46, 64)
After flattening output shape: (None, 2944)
Fully connected layer output shape: (None, 64)
('None' holds the batch size)
1152 - Max sequence length
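
These numbers can be reproduced with a little arithmetic (plain Python, using the layer parameters from the model above): the border_mode=’same’ convolutions keep the sequence length unchanged, while MaxPooling1D(5), with its default stride of 5 and valid padding, yields floor((L − 5)/5) + 1 steps.

```python
def pool(length, size=5, stride=5):
    # Output length of a 1D max pool with 'valid' padding.
    return (length - size) // stride + 1

seq = 1152            # MAX_SEQUENCE_LENGTH
conv1 = seq           # 'same' convolution keeps length: 1152
pool1 = pool(conv1)   # 230
conv2 = pool1         # 230
pool2 = pool(conv2)   # 46
flat = pool2 * 64     # 46 time steps x 64 filters = 2944
print(pool1, pool2, flat)  # 230 46 2944
```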

Using the model’s output_shape property in Keras at each step, it becomes clearer how the input is transformed through the layers of the CNN. The Convolution1D layer convolves over the input sequence along the entire word vector.

Also, note that there is only one ‘channel’ in this architecture (the output of the embedding layer is a 3D tensor of shape (batch size, sequence length, embedding dimension)). To use more than one input channel, we might need to re-purpose the Convolution2D layer. [See this post.]

The rest of the architecture is straightforward, so I won’t go into it.

One last note on how great Keras is as a library and community. Just reading through some of the discussions gave me a lot of answers and pointers.
