Convolutional Neural Network and Regularization Techniques with TensorFlow and Keras

Ahmad Omar Ahsan
intelligentmachines
Jun 5, 2020
From TensorFlow playground

This GIF shows how the neural network “learns” from its input. We don’t want the neural network to pick up unwanted patterns, nor do we want it to miss the obvious ones. We are going to learn about techniques that prevent neural networks from picking up unwanted patterns.

In this article, we are going to talk about CNNs and the regularization techniques available in the TensorFlow Keras API. First I will try to give you an intuitive sense of what a Convolutional Neural Network is, and then I will introduce the regularization techniques.

But before we begin

I am assuming that the readers of this article are familiar with the following:

  1. Neural Network
  2. TensorFlow basics

But even if you are not, don’t worry: I will try to make these concepts easy to follow.

Convolutional Neural Network

Unlike humans, a machine doesn’t see an image as it is. When a machine is given an image, what it sees is a matrix of numbers (pixel values) like the one below.

So you might ask how a machine can identify any object from these numbers. Let’s look at how we identify objects first.

If you ask someone to describe a bike, that person will say that the bike has a handle, two tires, a visible engine, a seat for the rider, etc. When these parts are combined in a particular manner, you get a bike. Now you could say that’s not enough: how do you know that a tire is a tire? It’s simple: a tire has a circular shape, it is black in color, etc. These features are what make a tire a tire. Similarly, each component of the bike has some distinctive features, and these features make up the component.

A machine detects features from the input image in a similar way. First, it tries to identify edges. From the edges it identified, it tries to form shapes, e.g. a tire. By combining those shapes it ends up detecting a bike.

A CNN recognizes edges in earlier layers and more complex forms in later layers. Source: https://goo.gl/1KsWvF

So a CNN takes pixels as input, identifies edges, combines those edges to get shapes, and finally combines those shapes to get the object. To extract these features, a filter (also called a kernel) is slid over the image and multiplied with the pixels in the following manner.

As you can see, the filter moves across the image pixel by pixel, multiplying and summing, and storing the results in the output. This particular filter is used for detecting vertical edges in an image.
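Here is a minimal sketch of this idea (not from the original notebook; the toy image and filter values are made up for illustration). It applies a hand-made vertical-edge filter to a tiny image with tf.nn.conv2d:

import numpy as np
import tensorflow as tf

# A toy 6x6 grayscale "image": bright on the left, dark on the right.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=np.float32)

# A classic 3x3 vertical-edge filter (illustrative values).
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=np.float32)

# tf.nn.conv2d expects 4D tensors: (batch, height, width, channels) for the
# image and (height, width, in_channels, out_channels) for the filter.
output = tf.nn.conv2d(
    image.reshape(1, 6, 6, 1),
    kernel.reshape(3, 3, 1, 1),
    strides=1,
    padding="VALID",
)

# The 4x4 output has large values exactly where the vertical edge is.
print(output[0, :, :, 0].numpy())

The output is large exactly where the bright-to-dark transition sits, which is how this filter “detects” the vertical edge.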

Now, in order to detect the edges of the objects in your training set, the filters in your convolutional neural network have to be trained so that they end up with the correct values.

So, like any other deep learning algorithm, at each step the CNN tries to predict what your image contains. Then it calculates the loss between the actual output and the predicted output. If the loss is high, it updates the values of your parameters, in this case the values of your filters.

So a conv layer is basically your input being multiplied with a filter to give an output.

Pooling

Pooling is a technique that is used after a conv layer. To reduce the amount of computation, we often use pooling to shrink the spatial size of the output from the previous layer in a CNN.

There are two types of pooling:

  1. Max pooling: only the maximum value within each m*m window is kept. In this way, only the strongest features are retained
  2. Average pooling: takes the average value of each m*m window. With this method of pooling we keep the average value of the features

Here is a GIF that shows how max-pooling works

In the case of average pooling, the average value is taken
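A small sketch showing both pooling layers in Keras (the feature-map values are made up for illustration):

import numpy as np
import tensorflow as tf

# A toy 4x4 single-channel feature map, shaped (batch, height, width, channels).
x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [1, 2, 9, 8],
              [0, 3, 4, 7]], dtype=np.float32).reshape(1, 4, 4, 1)

# Max pooling keeps the largest value in each 2x2 window.
max_pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
# Average pooling keeps the mean of each 2x2 window.
avg_pooled = tf.keras.layers.AveragePooling2D(pool_size=(2, 2))(x)

print(max_pooled[0, :, :, 0].numpy())  # [[6. 5.] [3. 9.]]
print(avg_pooled[0, :, :, 0].numpy())  # [[3.5 2. ] [1.5 7. ]]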

Dense layers

Dense layers are nothing more than layers of nodes or neurons. Toward the end of the network you flatten your multidimensional feature maps into a single vector and feed it into one or more of these dense layers. At the last layer, the number of nodes depends on the number of classes you have, and these nodes give you the probability of each class for the image.

So to sum it up a convolutional neural network is basically like this:

Image -> Conv layer -> Pool layer -> Conv layer -> Pool layer -> Flatten -> Dense -> Dense -> Output

Regularization

Imagine that you have a model that has learned to detect pictures of cats, and the model you have defined has a lot of parameters. You will notice that the training loss is minimal; however, when you deploy your model to detect a picture of a cat in a setting different from your training set, you will notice that it cannot recognize the cat very well.

This happens because your model fit the training set too closely. We use the term ‘overfitting’ to describe models that perform extremely well on the training set but fail to generalize to a test set (a set of images that your model has not seen before).

Even with cute, high-quality pictures of the cat on the left, your model fails to recognize the cat on the right

One way to prevent overfitting is to use regularization. Regularization is a method that controls model complexity. In this example, the images have certain features that help the model identify them as a cat, like a cat’s whiskers, ears, eyes, etc. Each feature is assigned a certain weight. If there are a lot of features, there will be a large number of weights, which makes the model prone to overfitting. Regularization reduces the influence of these weights: they end up having less impact on the loss function, which measures the error between the actual label and the predicted label.

There are several regularization techniques available and we are going to discuss some of them below

  1. L1 regularization
  2. L2 regularization
  3. Dropout

The best way to understand is to work on a data set and get your hands dirty. Let’s start coding.

What we are going to do …

We are going to create a model that detects rock, paper, and scissors. There are plenty of datasets available online, and I am going to use the rock paper scissors dataset that’s available on Kaggle.

Some images from the rock paper scissor dataset

Here is my notebook to help you follow along: Rock, paper and scissor classifier

Let’s first import all the libraries and packages that we are going to be using
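The original notebook’s import list may differ slightly; a minimal set that covers everything used in this article looks like this:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout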

You need a function to plot the loss and accuracy of the model that was trained so that you can observe the change graphically.
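The notebook has its own version of this helper; here is a minimal sketch of such a show_history function, assuming matplotlib is imported as above and the model is compiled with the 'accuracy' metric:

def show_history(history):
    """Plot training/validation loss and accuracy from a Keras History object."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    ax1.plot(history.history["loss"], label="training loss")
    ax1.plot(history.history["val_loss"], label="validation loss")
    ax1.set_xlabel("epoch")
    ax1.set_ylabel("loss")
    ax1.legend()

    ax2.plot(history.history["accuracy"], label="training accuracy")
    ax2.plot(history.history["val_accuracy"], label="validation accuracy")
    ax2.set_xlabel("epoch")
    ax2.set_ylabel("accuracy")
    ax2.legend()

    plt.show()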

Keras preprocessing has a class called ImageDataGenerator. It generates batches of tensor image data with real-time data augmentation. With it, you can add any form of data augmentation you want and specify the validation split. Here I used horizontal flip, vertical flip, height shift, and rescaling for augmentation. The reason you might use flips is that a picture of a hand can be taken from any angle. By using flips you help your model generalize better to different scenarios; if it were to train only on horizontal images of hands, say, it might not recognize hands in a vertical orientation.
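A sketch of the generators (the exact augmentation values, image size, batch size, and dataset path below are assumptions, not the notebook’s exact settings):

datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # scale pixel values to [0, 1]
    horizontal_flip=True,
    vertical_flip=True,
    height_shift_range=0.1,
    validation_split=0.2,     # hold out part of the data for validation
)

train_generator = datagen.flow_from_directory(
    "rps/",                   # hypothetical path to the rock paper scissors images
    target_size=(224, 224),   # matches the input shape of the models below
    batch_size=32,
    class_mode="categorical",
    subset="training",
)

validation_generator = datagen.flow_from_directory(
    "rps/",
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
    subset="validation",
)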

Now that we are done with preprocessing, let’s define our models. We are going to create four models: one without regularization and three with the regularization techniques mentioned above. Then we are going to observe the loss and accuracy of each model.

Model architecture

We are going to use a model without regularization first. The model has 4 conv-pool layers and 2 dense layers.
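A sketch of that model; the layer sizes are chosen to match the summary shown below (224x224x3 input images, three output classes):

# Unregularized baseline model.
model = Sequential([
    Conv2D(64, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(512, activation="relu"),
    Dense(3, activation="softmax"),   # one output node per class
])

model.summary()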

In the code, you can see Conv2D, MaxPooling2D, Flatten, and Dense. Let me explain what those functions do.

  1. Conv2D: performs a 2-dimensional convolution on your images. It takes a tensor (a matrix with more than 2 dimensions) as input and gives a convolved tensor as output. Its parameters are the number of filters, the filter dimensions, an optional regularizer, the activation function, and, if this is the first layer, the shape of the input.
  2. MaxPooling2D: performs max pooling on 2D data. As a parameter, it only needs the size of the pooling window. It reduces the spatial size of the feature map.
  3. Dense: simply describes a layer of neurons. You need to provide the number of units and the activation function.
  4. Flatten: flattens the data. The output of a conv layer has 2 or more dimensions, so Flatten converts it into a one-dimensional vector.

Here is the summary of the model

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 222, 222, 64) 1792
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 111, 111, 64) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 109, 109, 64) 36928
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 54, 54, 64) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 52, 52, 128) 73856
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 26, 26, 128) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 24, 24, 128) 147584
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 12, 12, 128) 0
_________________________________________________________________
flatten (Flatten) (None, 18432) 0
_________________________________________________________________
dense (Dense) (None, 512) 9437696
_________________________________________________________________
dense_1 (Dense) (None, 3) 1539
=================================================================
Total params: 9,699,395
Trainable params: 9,699,395
Non-trainable params: 0
_________________________________________________________________

From the summary of the model, you want to identify which layers have the most parameters. From the snippet above we can see that dense, conv2d_3, and conv2d_2 have the most parameters, so I am going to apply regularization to those three layers.

We use model.compile() to specify which loss function and optimizer we will be using and which metric we want to observe. Once you have compiled your model, initiate training with model.fit().
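A sketch of those two calls (the optimizer choice and generator names follow the setup above and may differ from the original notebook):

model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)

history = model.fit(
    train_generator,
    validation_data=validation_generator,
    epochs=20,
)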

The history variable stores information about your model’s training. Once you initiate training with model.fit(), you will observe that the TensorFlow Keras API trains and validates for a number of steps, printing your training loss and accuracy as well as the validation loss and accuracy.

Train for 69 steps, validate for 14 steps
Epoch 1/20
69/69 [==============================] - 229s 3s/step - loss: 1.3479 - accuracy: 0.5238 - val_loss: 0.4981 - val_accuracy: 0.8146
Epoch 2/20
69/69 [==============================] - 215s 3s/step - loss: 0.4871 - accuracy: 0.8158 - val_loss: 0.2581 - val_accuracy: 0.8993
Epoch 3/20
69/69 [==============================] - 213s 3s/step - loss: 0.2690 - accuracy: 0.9168 - val_loss: 0.1773 - val_accuracy: 0.9245
Epoch 4/20
69/69 [==============================] - 213s 3s/step - loss: 0.1974 - accuracy: 0.9397 - val_loss: 0.1367 - val_accuracy: 0.9565
Epoch 5/20
69/69 [==============================] - 212s 3s/step - loss: 0.2294 - accuracy: 0.9397 - val_loss: 0.0882 - val_accuracy: 0.9840

Then call your show_history function to observe the loss and accuracy graphs for both training and validation. For the model with no regularization, the graphs look something like this.

Now that you know how we are going to code the models and display their architecture, let’s get down to the theory and code of the regularized models.

L1 regularization

For L1 regularization you have to add the following value to the cost function L(x,y) of your model.
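Written out (this is the standard form of an L1-regularized cost, with L(x, y) as the base loss and theta as the parameters):

J(\theta) = L(x, y) + \lambda \sum_{i} |\theta_i|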

Loss function with regularization where lambda is the hyperparameter

Here theta represents the parameters that you multiply with your input to get a prediction. In L1 regularization you take the absolute value of every parameter and sum them up. Lambda is a hyperparameter that is set before initiating the training.

If you increase the value of lambda, the parameters will become smaller, because the L1 penalty term will penalize large parameter values.

However, one downside of L1 regularization is that you end up with sparse parameters: most of the parameter values will be zero or close to zero.

You can add the L1 regularizer to layers such as Conv2D by specifying the kernel_regularizer argument.

This is the code snippet of a model with L1 regularizer.
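A sketch of such a model: the same architecture as before, with an L1 kernel regularizer (lambda = 0.01) on the three layers identified earlier:

l1 = tf.keras.regularizers.l1(0.01)

model_l1 = Sequential([
    Conv2D(64, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu", kernel_regularizer=l1),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu", kernel_regularizer=l1),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(512, activation="relu", kernel_regularizer=l1),
    Dense(3, activation="softmax"),
])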

As you can see, we have passed tf.keras.regularizers.l1() as the kernel_regularizer of the Conv2D and Dense layers and set lambda to 0.01. To compile and train the model, use the same code as before.

After training for 20 epochs (loops over the data), we observe a very strange accuracy graph.

Loss and accuracy graph of training and validation after applying L1 regularization

You can see that loss is very high and the accuracy of the model is very low. There are a few possible causes of these values.

  1. We have a dataset with only 2188 images, which is very small, so our model didn’t have enough data for training.
  2. L1 regularization makes the parameters theta sparse, which means that the values are mostly zeros.

The dots in the sparse matrix are zeros

  3. We have added too many regularizers, which could result in the model ‘underfitting’ the data.

There are 3 ways to improve the performance of the model

  1. Use more data
  2. Use a different regularization technique
  3. Use fewer regularizers

L2 regularization

The formula for L2 regularization, where lambda is the regularization hyperparameter
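Written out in the same notation (standard form, with a squared-error base loss as described below):

J(\theta) = \sum_{j} (y_j - \hat{y}_j)^2 + \lambda \sum_{i} \theta_i^2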

In L2 regularization we take the sum of all the parameters squared and add it to the squared difference between the actual outputs and the predictions. As with L1, if you increase the value of lambda, the values of the parameters will decrease, because L2 penalizes the parameters. The difference is that the weights will not be sparse, and we get much better accuracy compared to L1.
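The code changes only in the regularizer object; a sketch of the L2 model (again assuming lambda = 0.01 on the same three layers):

l2 = tf.keras.regularizers.l2(0.01)

model_l2 = Sequential([
    Conv2D(64, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu", kernel_regularizer=l2),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu", kernel_regularizer=l2),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(512, activation="relu", kernel_regularizer=l2),
    Dense(3, activation="softmax"),
])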

tf.keras.regularizers.l2() denotes the L2 regularizer. Train using the same steps as before. After 20 epochs the graphs look like this.

Almost as good as the model without regularization

Since the parameters were not sparse, the model did not underfit, and we get low loss and high accuracy, which is exactly what we want.

How exactly do L1 and L2 penalize?

This has to do with backpropagation. Backpropagation is another topic in itself, but the basic intuition is simple.

At every step your model makes predictions. It finds the difference between the actual output and the prediction. If the difference is not at a minimum, the model must make another prediction that is closer to the actual output. To make different predictions, the model must have different parameters that are better than the previous ones. The parameters are updated by taking the gradient of the loss function at each step.
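For L2 regularization, for example, the penalty contributes its own gradient to the standard update rule, shrinking the weights at every step (alpha is the learning rate; this is the textbook form, not taken from the article’s figure):

\theta_i := \theta_i - \alpha \left( \frac{\partial L}{\partial \theta_i} + 2 \lambda \theta_i \right)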

The formula for gradient descent with regularization

This is the way the regularizers penalize the parameters of the model.

Dropout

As you can see, the nodes are being ignored randomly at each iteration with a probability of 0.5

Dropout is another regularization technique that is widely used in models. What it does is randomly drop some of the nodes in the layers of your neural network during training.

So the benefit of using dropout is that no node in the network will be assigned high parameter values; as a result, the parameter values will be dispersed across nodes, and the output of the current layer will not depend on a single node.

The model architecture is like this:
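A sketch of such a dropout model; where exactly the Dropout layers sit is an assumption (here, before each dense layer), since the original snippet is not reproduced:

model_dropout = Sequential([
    Conv2D(64, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dropout(0.2),   # randomly zeroes 20% of the flattened activations during training
    Dense(512, activation="relu"),
    Dropout(0.2),
    Dense(3, activation="softmax"),
])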

tf.keras.layers.Dropout(0.2) randomly drops each unit of its input with a probability of 0.2 during training.

After training and visualizing with the above code the graph looks like this:

The losses are very low and the accuracy of the dropout model is high, and the parameters are not sparse. So if you are planning to use regularization on your models, go for dropout.

Usually, dropout is placed only on the fully connected (dense) layers, because they have the greatest number of parameters and are therefore likely to co-adapt excessively, causing overfitting.

Improving the performance of the model

There are a few ways to improve the performance of the models:

  1. Use fewer conv layers. I have used 4 conv layers; you can opt for two.
  2. Use regularization only on the layers with the most parameters.
  3. Collect more data.

Congratulations on making it this far. I hope that by reading this article you understand regularization better.

If you want to learn more about regularization and other tuning techniques I recommend this playlist

This is the 2nd course of the Deep Learning Specialization, where Andrew Ng explains hyperparameter tuning and how to apply it practically.

Further Reading

  1. Data set used: Rock paper scissors.

  2. To experiment with regularization visually, go to the TensorFlow playground
