How to implement Sentiment Analysis using word embeddings and Convolutional Neural Networks in Keras.

On the IMDB movie reviews dataset.

IMDB has released a dataset of 50,000 movie reviews classified into two categories: negative and positive. This is a typical binary sequence classification problem.

In this article, I will show how to implement a Deep Learning system for such sentiment analysis with ~87% accuracy. (State of the art is at 88.89% accuracy).


Keras is an abstraction layer on top of Theano and TensorFlow, meaning we don’t have to compute the input/output dimensions of the tensors between layers ourselves.

How to represent the words

Movie reviews are sequences of words. So first we need to encode them.

We map movie reviews to sequences of word embeddings. A word embedding is simply a vector that encodes multiple features of a word. In Word2Vec, the relative positions of these vectors capture semantic relationships between words. The classic illustration is the analogy:

king - man + woman = queen
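In Keras this word-to-vector mapping is done by an `Embedding` layer, which is learned together with the rest of the network. A minimal sketch (the vocabulary size, embedding dimension, and word indices below are illustrative, not from the article):

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# Map a vocabulary of 5,000 word indices to 50-dimensional vectors
# (both sizes are illustrative choices)
embedding = Embedding(input_dim=5000, output_dim=50)

# A "review" of 4 word indices becomes a 4x50 matrix of word vectors
review = np.array([[12, 845, 33, 6]])
vectors = embedding(review)
print(vectors.shape)
```

The embedding weights start random and are adjusted by backpropagation, so the vectors end up encoding whatever features help the classification task.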

After mapping every movie review to a sequence of word embeddings, we need to pad the sequences so they all have the same length, i.e. we add zeroes to the shorter sequences and truncate the longer ones.
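In Keras the padding is applied to the integer word indices before the embedding lookup. A sketch of the loading and padding step (the vocabulary size and sequence length are illustrative assumptions):

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative hyperparameters (not specified in the article)
vocab_size = 5000   # keep only the 5,000 most frequent words
maxlen = 400        # pad/truncate every review to 400 word indices

# Keras ships the IMDB reviews pre-encoded as integer word indices
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# Shorter reviews are zero-padded at the front, longer ones truncated
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
```

The result is two 25,000 × 400 integer matrices, one for training and one for testing, with labels 0 (negative) and 1 (positive).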

The model

Here we used a 3-layer convolutional neural network followed by 2 dense layers.

Why convolutional? Because it works. Convolutional layers are really powerful for extracting higher-level features in images, and quite amazingly, they work on most 2D problems, and even on 1D sequences like text. Another big reason that should convince you is training time: on this problem, CNNs train 50% to 60% faster than LSTMs.

The Keras model code:
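The article's original snippet is not reproduced here; what follows is a plausible sketch matching the description above (an embedding layer, three convolutional layers, and two dense layers, with dropout). All layer sizes and rates are illustrative assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Dropout, Conv1D,
                                     MaxPooling1D, GlobalMaxPooling1D, Dense)

vocab_size, maxlen = 5000, 400  # illustrative values, matching the padding step

model = Sequential([
    Input(shape=(maxlen,)),
    # Learned word embeddings (dimension is an illustrative choice)
    Embedding(vocab_size, 128),
    Dropout(0.25),
    # Three convolutional layers sliding across the word dimension
    Conv1D(64, 5, activation="relu"),
    MaxPooling1D(2),
    Conv1D(64, 5, activation="relu"),
    MaxPooling1D(2),
    Conv1D(64, 5, activation="relu"),
    GlobalMaxPooling1D(),
    # Two dense layers; a sigmoid output for the negative/positive label
    Dense(64, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])

# Binary cross-entropy, since this is a binary classification problem
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

Training is then a single `model.fit(x_train, y_train, validation_data=(x_test, y_test))` call.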

This model has ~7M trainable parameters and takes around 15 minutes to train on a MacBook Pro. (We used binary cross-entropy loss here because it is a binary classification problem.)

In this case the convolutional layers extract features horizontally, across multiple words, allowing the network to pick up higher-level patterns in writing style.

Dropout was necessary because otherwise the model was overfitting the training data (96% accuracy on training data, 84% on test data). Crippling the network with holes during training reinforces its generalization power; it forces the network to build new paths and extract new patterns.


After 20 short minutes of training, we get 86.6% accuracy (87% if you are lucky). This is another advantage of CNNs: they are extremely fast to train compared to LSTMs for the same result (in this case, at least).

[Figures: cumulative loss and accuracy on the training set]

State of the art is 88.89%.