Performance Analysis of Deep Learning Algorithms: Part 1

CNNs vs Capsule Networks on MNIST Dataset

Editorial @ TRN
The Research Nest
9 min read · Oct 27, 2018


Before you get started, we assume you are familiar with the basics of deep learning. If you are not, that’s alright; we will provide brief explanations at the relevant places, along with the implementation code.

In this series, we will first test various well-known algorithms and measure their performance over a wide range of datasets, before testing our own algorithms and approaches.

In this analysis, we will be using the famous MNIST dataset. The MNIST database consists of 60,000 samples of handwritten digits that can be used as a training set, plus another 10,000 test samples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centred in a fixed-size image.

It is a good database for those who want to try out learning techniques and pattern-recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

We will be using Keras with the TensorFlow backend. Keras provides the MNIST dataset built in; it can be accessed as follows:
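A minimal sketch of loading the dataset (assuming TensorFlow 2.x, where Keras ships as `tf.keras`; the first call downloads and caches the data):

```python
# Load MNIST directly from Keras; the arrays come back as NumPy uint8 arrays.
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)  # (60000, 28, 28)
print(y_train.shape)  # (60000,)
print(x_test.shape)   # (10000, 28, 28)
```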

Dataset Visualization:

  • x_train and y_train are the training data for our model. Validation will be done on x_test and y_test.
  • x_train = 60,000 images of handwritten digits, 28 x 28 pixels each.
  • y_train = 60,000 labels for the images in x_train.

This is how the first training image in x_train looks:

Code to visualise the data:

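A small matplotlib sketch for this visualisation (we reload the data here so the snippet is self-contained; the filename is our choice):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to open a window
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()

# Show the first training image with its label
plt.imshow(x_train[0], cmap="gray")
plt.title(f"Label: {y_train[0]}")
plt.savefig("first_digit.png")
```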

Every image in the MNIST dataset is represented as an array of numbers describing how dark each pixel is. For example, we might think of the digit 1 as a 2-D 28 x 28 matrix in which the dark pixels have high values.

Techniques used:

  1. Convolutional Neural Networks:

When it comes to machine learning, artificial neural networks perform very well compared with many other algorithms. Their neurons behave loosely like human neurons: they learn from the information we pass to them. Neural networks come in many forms and can be used for a wide variety of purposes. For example, we can use recurrent neural networks, more precisely LSTMs, for predicting sequences of words, while CNNs are generally used in image-recognition applications.

Architecture: A CNN has learning parameters similar to those of a simple neural network. The following layers are used in a CNN.

  • Input layer: This layer holds the image input of shape (height, width, 3), where the first two dimensions are the height and width of the input image and the third dimension is for the RGB channels. Generally, the input image is reshaped so that height = width. It can be thought of as a matrix like the one discussed in the data-visualisation part above.
  • Convolution Layer: This layer computes the output volume by taking the dot product between each filter and every patch of the image. Suppose we have an input image of shape (28, 28, 3) and 12 filters of size 3 (i.e. each filter has shape 3 x 3 x 3), with no padding. Each filter slides across the input image with a stride of 1, producing an output volume of shape (26, 26, 12). We can instead get an output volume of (28, 28, 12) by applying a padding of 1 to the borders of the input image.
  • Activation Layer: An activation function decides the final value of each neuron. After applying a convolution, the dot products may well be negative. To remove these negative values, many CNNs use the ReLU activation, f(x) = max(0, x), which sets all negative values to zero. Other common activation functions are sigmoid, 1/(1+e^-x), tanh, and leaky ReLU. The output volume keeps the same shape, e.g. (28, 28, 12).
  • Pooling Layer: Its function is to progressively reduce the spatial size of the representation, cutting down the number of parameters and the computation in the network. The pooling layer operates on each feature map independently. The most common approach is max-pooling; another is average pooling. Max-pooling takes the maximum value from each patch of the given filter size, while average pooling takes the average of all the values in the patch.
  • Fully Connected layer: This is a simple dense neural layer that takes the flattened input and outputs one value per class.
  • Dropout Layer: This layer is used to reduce overfitting during training.
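The shape arithmetic in the convolution-layer bullet can be checked directly in Keras (a sketch: 12 filters of size 3 x 3 applied to a dummy (28, 28, 3) input):

```python
import numpy as np
from tensorflow.keras import layers

x = np.zeros((1, 28, 28, 3), dtype="float32")  # batch of one RGB image

# "valid" = no padding, "same" = pad so the spatial size is preserved
no_pad = layers.Conv2D(12, kernel_size=3, padding="valid")(x)
padded = layers.Conv2D(12, kernel_size=3, padding="same")(x)

print(no_pad.shape)  # (1, 26, 26, 12)
print(padded.shape)  # (1, 28, 28, 12)
```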

A Quick Intuition:

Let’s start with a simple example: suppose we have an image of a cross. This cross can be expressed as a 3 x 3 matrix in which the diagonal elements are 1 and all the others are -1.

A convolutional neural network learns small features of the image first. For example, if a CNN is trained on human faces, the early layers of the network learn small features such as lines and curves. As it proceeds up the hierarchy, it starts learning complex features such as eyes, noses and lips. In the final layer, it predicts whether the image is a face or not.

So, let’s resume our example of a cross. One can say that convolution + ReLU + pooling = feature extraction. Let’s verify this on the example. Suppose we take a filter of size 2, as shown in the figure.

Iterating it over our input matrix and taking dot products, we get the matrix [(-4, 4), (4, -4)], as shown in the figure. This process is called convolution.

The negative values are then clipped to zero by the ReLU activation.

The features with a large dot product are then picked out by the max-pooling layer, which yields a “backslash”-like feature of the “cross” input, as shown.

It’s easy to see that we could use a different filter to get the “forward slash”. That’s why increasing the number of filters forces a CNN to learn more and more features.
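The worked example above can be reproduced in a few lines of NumPy. Note that the filter [[-1, 1], [1, -1]] is our assumption for the one in the (missing) figure, chosen so that the outputs match the matrix quoted above:

```python
import numpy as np

# 3x3 "cross": diagonal elements are 1, everything else is -1
image = np.array([[ 1, -1,  1],
                  [-1,  1, -1],
                  [ 1, -1,  1]])

kernel = np.array([[-1,  1],
                   [ 1, -1]])

# Convolution (cross-correlation) with stride 1 and no padding
out = np.array([[(image[i:i+2, j:j+2] * kernel).sum() for j in range(2)]
                for i in range(2)])
print(out)    # [[-4  4]
              #  [ 4 -4]]

relu = np.maximum(out, 0)  # ReLU: negatives clipped to zero
print(relu)   # [[0 4]
              #  [4 0]]

pooled = relu.max()        # 2x2 max-pooling over the whole feature map
print(pooled) # 4
```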

Program Code:
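The original code listing did not survive the export, so here is a minimal CNN in Keras consistent with the layers described above. This is a sketch, not the exact model used in the analysis; the layer sizes are our assumptions:

```python
from tensorflow.keras import layers, models

# A small CNN for 28x28 grayscale MNIST digits
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),  # convolution + ReLU activation
    layers.MaxPooling2D(2),                   # max-pooling
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dropout(0.5),                      # dropout to reduce overfitting
    layers.Dense(10, activation="softmax"),   # fully connected: one score per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Training is then a single call, e.g. `model.fit(x_train[..., None] / 255.0, y_train, epochs=5, validation_split=0.1)` with the arrays from `mnist.load_data()`.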

2. Capsule Networks

Owing to the complexity of the topic, we will only give an outline here of what a capsule network does.

Why Capsule Networks?

Let us give you an example. What a CNN does is extract the smallest features and then predict complex features from them. Suppose a CNN predicts a face. What features does a face have? Eyes, a nose, lips, and so on. What if we predict on an image where the positions of the eyes, nose and lips are scrambled? Will the network still call it a face? The answer is yes: a CNN does not consider the actual positions of the features. In that sense, a CNN is not robust.

The reason a CNN does not consider the positions of features is the max-pooling layer, which discards most of the instantiation parameters, such as pose (position, size, orientation), deformation, velocity, albedo, hue and texture.

That’s why capsule networks came into existence. Instead of keeping only the maximum value, a capsule considers all the values in the filter’s output and tweaks its routing coefficients according to how well its predictions agree with the capsules in the layer above.

The detailed architecture used in our analysis can be found in the paper by Sabour, Frosst and Hinton titled ‘Dynamic Routing Between Capsules’. We would highly recommend going through it.
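For a flavour of the maths, the paper’s “squash” non-linearity, which shrinks short capsule vectors toward zero and pushes long ones toward unit length, is easy to write down (a NumPy sketch; the epsilon is our addition for numerical safety):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash from 'Dynamic Routing Between Capsules':
    v = (|s|^2 / (1 + |s|^2)) * (s / |s|)."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

short = squash(np.array([0.1, 0.0]))   # short vector: shrunk toward zero
long_ = squash(np.array([10.0, 0.0]))  # long vector: pushed toward unit length
print(np.linalg.norm(short))  # ~0.0099
print(np.linalg.norm(long_))  # ~0.9901
```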

The layer-by-layer architecture summary (the output of model.summary()):

Program Code:

You can access the capsule-network code from the reference implementation by GitHub user XifengGuo: https://github.com/XifengGuo/CapsNet-Keras

Tabulation of Results:

After training for 50 epochs, the statistics are as follows:

Compared with the CNN, the modern CapsNet surpasses the older convolutional technique. This was expected, because CapsNet takes into account what a CNN ignores. CapsNet not only predicts but also reconstructs the given image in a clear, smooth form; here is an example.

The first five rows are the input digits and the next five rows are their reconstructed images. You can see that digits with breaks in them are automatically filled in after reconstruction. This is the power of CapsNet.

Following are the training-loss and the training- and validation-accuracy graphs for CapsNet. The curves indicate that the model is training properly.

Stay tuned as we explore more diverse algorithms on interesting datasets!

This analysis was carried out by Anshul Warade of The Research Nest’s R&D Team.

Clap and share if you liked this one. Do follow ‘The Research Nest’ for more insightful content.
