DeepLearning series: Convolutional Neural Networks

Michele Cavaioni
Machine Learning bites
Feb 23, 2018 · 9 min read

In this blog, I will explain the details of Convolutional Neural Networks (CNNs or ConvNets), which have proven to be very effective in areas such as image recognition and classification.

Finally, I will move on to Residual and Inception networks, which help overcome issues related to training very deep networks.

CONVOLUTIONAL NEURAL NETWORK:

Computer vision is an exciting field, which has evolved quickly thanks to deep learning. Researchers in this area have been experimenting with many neural-network architectures and algorithms, which have influenced other fields as well.

In computer vision, images are the training data of a network, and the input features are the pixels of an image. These features can get really big. For example, a 1-megapixel image has 3 million input features (1,000 x 1,000 pixels x 3 color channels). Pass this through a neural network with just 1,000 hidden units, and the first weight matrix alone has 3 billion parameters!
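A quick sanity check of that arithmetic (plain Python, my own illustration):

```python
# Feature count for a 1-megapixel RGB image
height, width, channels = 1000, 1000, 3
input_features = height * width * channels       # 3,000,000 input features

# One fully connected layer of 1,000 hidden units
hidden_units = 1000
weights = input_features * hidden_units          # 3,000,000,000 parameters
print(f"{input_features:,} features, {weights:,} weights")
```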

These numbers are too big to be managed, but, luckily, we have the perfect solution: Convolutional neural networks (ConvNets).

There are 3 types of layers in a convolutional network:

  • Convolution (CONV)
  • Pooling (POOL)
  • Fully connected (FC)

CONVOLUTION LAYER

A “convolution” is one of the building blocks of the Convolutional network. The primary purpose of a “convolution” in the case of a ConvNet is to extract features from the input image.

Every image can be represented as a matrix of pixel values. An image from a standard digital camera will have three channels — red, green and blue. You can imagine those as three 2d-matrices stacked over each other (one for each color), each having pixel values in the range 0 to 255.

Applying a convolution to an image is like running a filter of a certain dimension and sliding it on top of the image. That operation translates into an element-wise multiplication between the two matrices, followed by a sum of the products. The result of this computation forms a single element of the output matrix.

Let’s review this via an example, where we want to apply a filter (kernel) to detect vertical edges from a 2D original image.

As the filter slides over the image, the values of 1 in the kernel pick up brightness, the values of -1 pick up darkness, and the zeros ignore the pixels in between, which makes vertical edges stand out in the output.

In the above example, I used a stride of 1, meaning the filter moves horizontally and vertically by one pixel.
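To make the sliding-filter operation concrete, here is a minimal NumPy sketch (the 6x6 toy image and the loop-based implementation are my own illustration, not from the original post) applying a 3x3 vertical-edge kernel with a stride of 1:

```python
import numpy as np

# Toy 6x6 grayscale image: bright (10) on the left half, dark (0) on the right
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Vertical-edge kernel: +1 picks up brightness, -1 darkness, 0 the middle
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

def convolve2d(img, k, stride=1):
    f = k.shape[0]
    out_size = (img.shape[0] - f) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = img[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * k)  # element-wise multiply, then sum
    return out

print(convolve2d(image, kernel))
```

The non-zero columns in the 4x4 output mark exactly where the bright left half meets the dark right half, i.e. the vertical edge.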

In this example the values of the filter were decided in advance. In a convolutional neural network, however, the goal is to learn the values of the filters: we treat them as parameters, which the network learns using backpropagation.

You might be wondering how to calculate the output size, based on the filter dimensions and the way we slide it through the image. I will get to the formula, but first I want to introduce a bit of terminology.

You saw in the earlier example how the filter moved with a stride of 1 and covered the whole image from edge to edge. This is what is called a “valid” convolution, since the filter stays within the borders of the image. However, one problem quickly arises: the pixels on the edges are “touched” by the filter less often than the pixels in the middle, so we throw away some information related to those positions. Furthermore, the output shrinks on every convolution; that can be intentional, but with a small input image we run out of pixels quickly.

A solution to those setbacks is the use of “padding”. Before applying a convolution, we pad the image with zeros all around its border, which lets the filter slide over the edge pixels and keeps the output the same size as the input. The result of padding in the previous example will be:

Padding will result in a “same” convolution.

I talked about “stride”, which is essentially how many pixels the filter shifts over the original image. Great, so now I can introduce the formula to quickly calculate the output size, knowing the filter size (f), stride (s), pad (p), and input size (n):

output size = floor((n + 2p - f) / s) + 1

Keep in mind that the filter size is usually an odd value, and if the fraction above is not an integer you should round it down (which is what the floor indicates).
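In code, the output-size computation (with integer division handling the rounding down) can be sketched as:

```python
def conv_output_size(n, f, s=1, p=0):
    """Output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # "valid": 6x6 input, 3x3 filter -> 4
print(conv_output_size(6, 3, p=1))   # "same": pad p = (f-1)/2 keeps 6
print(conv_output_size(7, 3, s=2))   # stride 2 -> 3
```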

The previous example was on a 2D matrix, but I mentioned earlier that images are composed of three channels (R-red, G-green, B-blue). Therefore the input is a volume, a stack of three matrices, which forms a depth identified by the number of channels.

If we apply only one filter the result would be:

where the cube filter of 27 parameters now slides on top of the cube of the input image.

So far we have applied only one filter at a time, but we can apply multiple filters to detect several different features. This brings us to a crucial concept for building convolutional neural networks: each filter produces its own 2D output, and we can stack them all together to create an output volume, such as:

Therefore, in general terms we have:

(with nc’ as the number of filters, each detecting a different feature)
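A rough NumPy sketch of this stacking (random values and a loop-based implementation, my own illustration): ten 3x3x3 filters slide over a 6x6x3 input volume, and each produces one 4x4 slice of the output volume.

```python
import numpy as np

n, nc = 6, 3             # 6x6 RGB input volume
f, num_filters = 3, 10   # ten 3x3x3 filters

image = np.random.rand(n, n, nc)
filters = np.random.rand(num_filters, f, f, nc)

out_n = n - f + 1        # valid convolution, stride 1, no padding
output = np.zeros((out_n, out_n, num_filters))
for k in range(num_filters):              # one 2D slice per filter
    for i in range(out_n):
        for j in range(out_n):
            patch = image[i:i+f, j:j+f, :]        # a 3x3x3 cube of the input
            output[i, j, k] = np.sum(patch * filters[k])
print(output.shape)  # (4, 4, 10)
```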

One-layer of a convolutional neural network

The final step that takes us to a convolutional neural layer is to add the bias and a non-linear function.

Remember that the parameters involved in one layer are independent of the size of the input image.

So let’s consider, for example, that we have 10 filters of size 3x3x3 in one layer of a neural network. Each filter has 27 (3x3x3) weights + 1 bias => 28 parameters.

Therefore, the total amount of parameters in the layer is 280 (10x28).
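That parameter count is easy to verify, and note that no term in it depends on the input image size:

```python
num_filters, f, nc = 10, 3, 3          # ten 3x3x3 filters
params_per_filter = f * f * nc + 1     # 27 weights + 1 bias = 28
total_params = num_filters * params_per_filter
print(total_params)  # 280
```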

Deep Convolutional Network

We are now ready to build a complete deep convolutional neural network.

The following architecture depicts a simple example of that:

_ _ _ _ _ _ _

POOLING LAYERS

There are two types of pooling layers: max and average pooling.

Max pooling

We define a spatial neighborhood (a filter), and as we slide it through the input, we take the largest element within the region covered by the filter.

Average pooling

As the name suggests, it retains the average of the values encountered within the filter.

One thing worth noting is that a pooling layer does not have any parameters to learn. Of course, we still have hyper-parameters to select: the filter size and the stride (it’s common not to use any padding).
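Both pooling variants can be sketched with one small NumPy function (the toy 4x4 input is my own illustration):

```python
import numpy as np

def pool2d(img, f=2, stride=2, mode="max"):
    out_n = (img.shape[0] - f) // stride + 1
    out = np.zeros((out_n, out_n))
    for i in range(out_n):
        for j in range(out_n):
            patch = img[i*stride:i*stride+f, j*stride:j*stride+f]
            # max pooling keeps the largest value, average pooling the mean
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 8, 9, 4],
              [3, 2, 1, 0]], dtype=float)
print(pool2d(x, mode="max"))   # [[6, 5], [8, 9]]
print(pool2d(x, mode="avg"))   # [[3.5, 2.5], [5.0, 3.5]]
```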

_ _ _ _ _ _ _

FULLY CONNECTED LAYER

A fully connected layer acts like a “standard” single neural network layer, where you have a weight matrix W and bias b.

We can see its application in the following example of a Convolutional Neural Network. This network is inspired by the LeNet-5 network:

It’s common that, as we go deeper into the network, the sizes (nh, nw) decrease, while the number of channels (nc) increases.

Another common pattern you can see in neural networks is to have CONV layers, one or more, followed by a POOL layer, and then again one or more CONV layers followed by a POOL layer and, at the end, a few FC layers followed by a Softmax.
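As an illustration of that shrinking pattern, here is a hypothetical LeNet-5-style shape trace (the specific filter sizes here are assumptions for the example, computed with the standard output-size formula):

```python
def conv_out(n, f, s=1, p=0):
    return (n + 2 * p - f) // s + 1

n = 32                       # 32x32 input image
n = conv_out(n, f=5)         # CONV 5x5           -> 28
n = conv_out(n, f=2, s=2)    # POOL 2x2, stride 2 -> 14
n = conv_out(n, f=5)         # CONV 5x5           -> 10
n = conv_out(n, f=2, s=2)    # POOL 2x2, stride 2 -> 5
print(n)  # flatten the 5x5xnc volume, then FC layers and a Softmax
```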

When choosing the right hyper-parameters (f, s, p, ..), look at the literature and choose an architecture that was successfully used and that can apply to your application. There are several “classic” networks, such as LeNet, AlexNet, VGG, …

I won’t go into details for each one, but you can easily find them online.

These networks are normally used in transfer learning, where we reuse the weights of an existing trained network and replace only the output unit, since training such a big network from scratch would otherwise take a long time.

_______________

RESIDUAL NETWORKS (RESNETS)

Very deep neural networks are difficult to train because of vanishing and exploding gradients. A solution is the use of “skip connections”, which take the activation from one layer and feed it to another layer much deeper in the network. Skip connections are the building blocks of a Residual Network (ResNet).

A residual block is sketched as:

which is computed as:

a[l+2] = g(z[l+2] + a[l])

Therefore, with a residual block, instead of taking only the regular “main” path, we take a[l] and add it to a later layer, before applying the ReLU non-linearity.

A Residual Network is composed of these residual blocks stacked together to form a deep network. In practice, a “plain”, regular network becomes harder to train as it gets deeper, and the training error eventually starts to increase. With a Residual Network, instead, we can train a very deep network while the training error keeps decreasing.
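A minimal NumPy sketch of a residual block (random weights and dimensions are my own illustration; the non-linearity g is ReLU):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    # Main path: linear -> ReLU -> linear
    a1 = relu(W1 @ a_l + b1)
    z2 = W2 @ a1 + b2
    # Skip connection: add a[l] BEFORE the final non-linearity
    return relu(z2 + a_l)

rng = np.random.default_rng(0)
d = 4
a_l = rng.random(d)
W1, W2 = rng.random((d, d)), rng.random((d, d))
b1, b2 = np.zeros(d), np.zeros(d)
print(residual_block(a_l, W1, b1, W2, b2))
```

If W2 and b2 shrink toward zero, the block simply passes a[l] through, which is why adding residual blocks does not hurt training the way extra plain layers can.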

_____________

1x1 CONVOLUTIONS (Network-in-Network)

We saw in the previous example of convolutional neural networks that applying a pooling layer shrinks the dimensions nh and nw. By applying a 1x1 convolution, instead, we can shrink the number of channels and therefore save on computation. Furthermore, it adds non-linearity to the network.

Let’s see the following example:

Here, the 1x1 convolution looks at each of the 36 (6x6) positions, takes the element-wise product between the 32 numbers in the input volume and the 32 numbers in the filter, sums them, and then applies a ReLU non-linearity. If we apply nc’ 1x1 convolutional filters, then the output will be, in our example, 6x6xnc’.
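A loop-based NumPy sketch of this (the input channel count of 32 follows the example; shrinking to 8 output channels and the random values are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, nc_in, nc_out = 6, 6, 32, 8
x = rng.random((h, w, nc_in))            # 6x6x32 input volume
filters = rng.random((nc_out, nc_in))    # eight 1x1x32 filters

out = np.zeros((h, w, nc_out))
for k in range(nc_out):
    # at every position: a dot product across the 32 channels, then ReLU
    out[:, :, k] = np.maximum(0, np.sum(x * filters[k], axis=-1))
print(out.shape)  # (6, 6, 8) -- channels shrunk from 32 to 8
```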

_______________

INCEPTION NETWORK

When designing a layer for a ConvNet we need to pick many things: the number of filters, the type of layer (pooling, conv, etc..),… what if we didn’t have to choose, but get them all?

Yes, it sounds pretty greedy!

That’s what the Inception network does. It uses all these options and stacks them up:

The problem of applying the above architecture is the computational cost.

For example, looking at the computational cost of the 5x5 filter we have:

We have 32 filters, and each filter is 5x5x192. The output size is 28x28x32.

So, the total number of multiplications it computes is:

(28*28*32) * (5*5*192) = 120 million!

Fortunately, there is a way to reduce the number of computations above. It comes with a little help from our friend described earlier, the 1x1 convolution.

The interposed 1x1 convolution reduces the total computational cost by roughly a factor of 10, since we have:

(28*28*16) * (1*1*192) + (28*28*32) * (5*5*16) = 12.4 million
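Both costs are easy to verify:

```python
# Direct 5x5 convolution: a 28x28x32 output, each value costing 5x5x192 multiplications
direct = (28 * 28 * 32) * (5 * 5 * 192)

# With a 1x1 convolution shrinking 192 channels down to 16 first
bottleneck = (28 * 28 * 16) * (1 * 1 * 192) + (28 * 28 * 32) * (5 * 5 * 16)

print(f"{direct:,} vs {bottleneck:,}")  # roughly 120.4M vs 12.4M multiplications
```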

Therefore, an Inception network is built by interposing 1x1 convolutions before the larger filters and stacking these modules together.

This blog is based on Andrew Ng’s lectures at DeepLearning.ai


Passionate about AI, ML, DL, and Autonomous Vehicle tech. CEO of CritiqueMatch.com, a platform that helps writers and bloggers to connect and exchange feedback.