Convolutional Neural Network (CNN)

Raycad
10 min read · Nov 14, 2017


Convolutional neural networks were inspired by biological processes: their connectivity pattern between neurons resembles the organization of the animal visual cortex.

1. What is a Neuron?

In nature, neurons have a number of dendrites (inputs), a cell nucleus (processor) and an axon (output).
- Neurons are the basic unit of a neural network.
- They can be connected together, or used to gate connections between other neurons.
A neuron is like a function: it takes a few inputs and returns an output. When the neuron activates, it accumulates all of its incoming inputs, and if the total goes over a certain threshold it fires a signal through the axon.
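As a sketch, this "accumulate, then fire above a threshold" behaviour can be written as a tiny function. The weights, bias, and threshold here are made-up illustrative values, not part of any real model:

```python
# A hypothetical single neuron: weighted sum of inputs plus a bias,
# passed through a step activation that "fires" above a threshold.
def neuron(inputs, weights, bias, threshold=0.0):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias  # accumulate inputs
    return 1 if total > threshold else 0  # fire, or stay silent

print(neuron([1.0, 0.5], [0.6, 0.4], bias=-0.5))  # 0.6 + 0.2 - 0.5 = 0.3 > 0 → fires (1)
print(neuron([0.1, 0.1], [0.6, 0.4], bias=-0.5))  # 0.06 + 0.04 - 0.5 < 0 → silent (0)
```

Learning, in this picture, means adjusting the weights and bias so the neuron fires on the right inputs.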

The important thing about neurons is that they can learn.

2. What is a Neural Network?

A Neural Network is put together by hooking together many of our simple “neurons,” so that the output of a neuron can be the input of another. It consists of an input layer, multiple hidden layers, and an output layer. Every node in one layer is connected to every node in the next layer.

The main component of a CNN is a convolutional layer. Its job is to detect important features in the image pixels. Layers close to the input will learn to detect simple features such as edges and color gradients, whereas deeper layers will combine simple features into more complex features. Finally, dense layers at the top of the network will combine very high level features and produce classification predictions.

3. Training a Neural Network
Target: Determine the weights of the network.
Method: Use gradient descent to minimize the error function.
Deep Learning: Apply a neural network with multiple hidden layers.
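A minimal sketch of this method: fit a single weight w so that the prediction w*x matches a target y, by repeatedly stepping opposite the gradient of the squared-error function. The learning rate and data are illustrative assumptions:

```python
# Gradient descent on the error function E(w) = (w*x - y)^2 for one weight.
def train(x, y, lr=0.1, steps=100):
    w = 0.0
    for _ in range(steps):
        error = w * x - y      # difference between prediction and target
        grad = 2 * error * x   # dE/dw of the squared error
        w -= lr * grad         # step opposite the gradient
    return w

print(train(2.0, 6.0))  # converges toward w = 3, since 3 * 2.0 = 6.0
```

Real networks do the same thing simultaneously for millions of weights, with the gradients computed by backpropagation (section 4.7).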

4. Concepts

4.1. Input/Output Volumes

CNNs are usually applied to image data. Every image is a matrix of pixel values. With colored images, particularly RGB (Red, Green, Blue)-based images, the presence of separate color channels (3 in the case of RGB images) introduces an additional ‘depth’ field to the data, making the input 3-dimensional. Hence, for a given RGB image of size, say 255×255 (Width x Height) pixels, we’ll have 3 matrices associated with each image, one for each of the color channels. Thus, the image in its entirety constitutes a 3-dimensional structure called the Input Volume (255x255x3).
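As a sketch, such a volume is just a 3-dimensional array indexed by height, width, and channel. The sizes below are tiny illustrative stand-ins for the 255x255x3 example above:

```python
# An input volume as a nested list of shape Height x Width x Depth,
# where Depth = 3 holds the R, G, B channel values for each pixel.
height, width, depth = 4, 4, 3
volume = [[[0 for _ in range(depth)] for _ in range(width)] for _ in range(height)]

print(len(volume), len(volume[0]), len(volume[0][0]))  # 4 4 3
```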

4.2. Features
A feature is a distinct and useful observation or pattern obtained from the input data that aids in performing the desired image analysis. The CNN learns these features from the input images; typically, patterns that recur across the data gain prominence as features.

4.3. Filters (Convolution Kernels or Feature Detector)
- A filter (or kernel) is an integral component of the layered architecture.
- It refers to an operator applied to the entirety of the image such that it transforms the information encoded in the pixels.
- The kernels are then convolved with the input volume to obtain so-called ‘activation maps’.
- Activation maps indicate ‘activated’ regions, i.e. regions where features specific to the kernel have been detected in the input.
- The depth of the filter is the same as the depth of the input feature map, while its width and height are smaller.
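The points above can be sketched as a minimal 'valid' 2D convolution on a single channel (strictly a cross-correlation, as in most CNN libraries): slide the kernel over the image and sum the element-wise products at each position, producing an activation map. The image and kernel values are illustrative assumptions:

```python
# Slide a kernel over an image; each output cell is the sum of
# element-wise products between the kernel and the covered patch.
def convolve2d(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge detector: it activates where values change left-to-right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
print(convolve2d(image, kernel))  # [[0, 2, 0], [0, 2, 0]]
```

The activation map lights up (value 2) exactly along the edge between the dark and bright halves, which is the "detected feature" the bullet points describe.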

4.4. Receptive Field

It is impractical to connect all neurons with all possible regions of the input volume. It would lead to too many weights to train, and produce too high a computational complexity.

Thus, instead of connecting each neuron to all possible pixels, we specify a 2-dimensional region called the ‘receptive field’ (say of size 5×5 units) extending through the entire depth of the input (5x5x3 for a 3-colour-channel input), within which the encompassed pixels are fully connected to the neurons of the layer. It is over these small regions that the network layer cross-sections (each consisting of several neurons, called ‘depth columns’) operate and produce the activation map.

4.5. Activation Layer
The activation layer applies an activation function that decides the final value of a neuron. Suppose a cell should ideally have the value 1 but instead has 0.85; since a probability of exactly 1 is never achieved in a CNN, an activation function can sharpen it, e.g. map values greater than 0.7 to 1 and the rest to 0. In this way one can obtain an image with sharp features. A common choice in practice is ReLU, which passes positive values through unchanged and clips negative values to 0.
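A sketch of two activation choices, the hard threshold described above and ReLU (the activation used in the layer patterns of section 6); the 0.7 threshold is the illustrative value from the text:

```python
# Hard threshold: everything above the cutoff becomes 1, the rest 0.
def step(x, threshold=0.7):
    return 1 if x > threshold else 0

# ReLU: negatives are clipped to 0, positives pass through unchanged.
def relu(x):
    return max(0.0, x)

print(step(0.85))  # 1: above the 0.7 threshold
print(relu(-2.0))  # 0.0: negative, clipped
print(relu(0.85))  # 0.85: positive, unchanged
```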

4.6. Pooling Layer

- The pooling layer is usually placed after the Convolutional layer. Its primary utility lies in reducing the spatial dimensions (Width x Height) of the Input Volume for the next Convolutional Layer. It does not affect the depth dimension of the Volume.
- The operation performed by this layer is also called ‘down-sampling’, as the reduction of size leads to loss of information as well. However, such a loss is beneficial for the network for two reasons:

— The decrease in size leads to less computational overhead for the upcoming layers of the network.
— It works against over-fitting.
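The down-sampling described above can be sketched as a 2x2 max-pooling with stride 2 on a single channel: each output value keeps only the largest value in its 2x2 window, halving width and height while the depth stays untouched. The feature-map values are illustrative:

```python
# 2x2 max-pooling with stride 2: keep the maximum of each 2x2 window.
def max_pool_2x2(channel):
    out = []
    for i in range(0, len(channel) - 1, 2):
        row = []
        for j in range(0, len(channel[0]) - 1, 2):
            window = [channel[i][j], channel[i][j + 1],
                      channel[i + 1][j], channel[i + 1][j + 1]]
            row.append(max(window))
        out.append(row)
    return out

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]]
print(max_pool_2x2(feature_map))  # [[4, 2], [2, 8]]
```

A 4x4 map shrinks to 2x2: three quarters of the values are discarded, which is exactly the lossy but beneficial reduction the bullets describe.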

4.7. Backpropagation
Backpropagation is the process by which we try to bring the error down. By error, I mean the difference between the prediction y’ and the target y. This helps the weights w fit the data set that we gave to the network. We perform backpropagation using gradient descent, which tries to bring the error value close to zero.
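As a sketch of one backpropagation step for a single linear neuron y’ = w*x + b with squared error E = (y’ − y)^2: the chain rule gives the gradients dE/dw and dE/db, and gradient descent nudges both toward lower error. Data and learning rate are illustrative assumptions:

```python
# One backpropagation + gradient-descent step for a single linear neuron.
def backprop_step(w, b, x, y, lr=0.1):
    y_pred = w * x + b
    error = y_pred - y      # the difference between y' and y
    dw = 2 * error * x      # dE/dw via the chain rule
    db = 2 * error          # dE/db via the chain rule
    return w - lr * dw, b - lr * db

w, b = 0.0, 0.0
for _ in range(200):
    w, b = backprop_step(w, b, x=1.0, y=2.0)
print(round(w + b, 4))  # the prediction w*1 + b approaches the target 2.0
```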

4.8. Fully Connected Layer
At the end of convolution and pooling layers, networks generally use fully-connected layers in which each pixel is considered as a separate neuron just like a regular neural network. The last fully-connected layer will contain as many neurons as the number of classes to be predicted. For instance, in CIFAR-10 case, the last fully-connected layer will have 10 neurons.

4.9. Overfitting
Refers to a model that fits the training data too well (it tries to fit every training example) but does not generalize to the testing data.

4.10. Dropout
A regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on the training data.

4.11. 1x1 Convolution
On a plain 2-dimensional signal, a 1x1 convolution is just pointwise scaling and would not make much sense. In a CNN, however, the input volume has depth, so a 1x1 convolution mixes information across channels; it is commonly used with a smaller number of kernels to reduce the depth dimension.

4.12. Spatial Arrangement
Three hyperparameters control the size of the output volume: the depth, the stride, and the zero-padding.
- Depth (D) of the output volume is a hyperparameter: it corresponds to the number of filters we would like to use, each learning to look for something different in the input.
- Stride (S) with which we slide the filter. When the stride is 1, we move the filter one pixel at a time. When the stride is 2 (3 or more is rare in practice), the filter jumps 2 pixels at a time as we slide it around. Larger strides produce smaller output volumes spatially.

- Zero-Padding refers to the process of symmetrically adding zeroes to the input matrix. It’s a commonly used modification that allows the size of the input to be adjusted to our requirement. It is mostly used in designing the CNN layers when the dimensions of the input volume need to be preserved in the output volume.

5. Convolution Layer Formula

  • Accepts an input volume of size W1×H1×D1 (Width x Height x Depth)
  • Requires four hyperparameters:

—Number of filters: K

—The filter size: F

—The Stride Length: S

—The amount of Zero Padding: P

  • Produces an output volume of size W2×H2×D2 where:

—W2 = (W1 − F + 2*P)/S + 1

—H2 = (H1 − F + 2*P)/S + 1

—D2 = K

  • With parameter sharing, it introduces F*F*D1 weights (parameters) per filter, for a total of (F*F*D1)*K weights and K biases.
  • In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by d-th bias.
  • A common setting of the hyperparameters is F=3, S=1, P=1.
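The output-size formula above, written as a function. The two calls check the "common setting" (F=3, S=1, P=1 preserves spatial size) and, looking ahead to section 7, AlexNet's first convolutional layer:

```python
# W2 = (W1 - F + 2*P)/S + 1, the spatial-size formula from section 5.
def conv_output_size(w_in, f, s, p):
    return (w_in - f + 2 * p) // s + 1

print(conv_output_size(32, f=3, s=1, p=1))    # 32: size preserved
print(conv_output_size(227, f=11, s=4, p=0))  # 55: AlexNet's first conv layer
```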

6. Layer Patterns

The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

Where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3). For example, here are some common ConvNet architectures you may see that follow this pattern:

INPUT -> FC, implements a linear classifier. Here N = M = K = 0.

INPUT -> CONV -> RELU -> FC

INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. Here we see that there is a single CONV layer between every POOL layer.

INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.
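The pattern above can be sketched as a small generator that expands given N, M, K into a flat layer list (a toy illustration of the notation, not a network builder):

```python
# Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
# into a flat list of layer names.
def expand_pattern(n, m, k, use_pool=True):
    layers = ["INPUT"]
    for _ in range(m):
        for _ in range(n):
            layers += ["CONV", "RELU"]
        if use_pool:
            layers.append("POOL")
    for _ in range(k):
        layers += ["FC", "RELU"]
    layers.append("FC")
    return layers

# N = M = K = 0: the linear classifier INPUT -> FC.
print(expand_pattern(0, 0, 0))  # ['INPUT', 'FC']
# N=1, M=2, K=1: INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC.
print(expand_pattern(1, 2, 1))
```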

7. Examples

AlexNet

  • Layer 0: Input image

—Size: 227 x 227 x 3

—Note that the network diagram in the original AlexNet paper shows 224x224x3, which appears to be a typo.

  • Layer 1: Convolution with 96 filters, size 11×11, stride 4, padding 0

—Size: 55 x 55 x 96

—(227–11)/4 + 1 = 55 is the size of the outcome

—96 depth because 1 set denotes 1 filter and there are 96 filters

  • Layer 2: Max-Pooling with 3×3 filter, stride 2

—Size: 27 x 27 x 96

—(55–3)/2 + 1 = 27 is size of outcome

—depth is same as before, i.e. 96 because pooling is done independently on each layer

  • Layer 3: Convolution with 256 filters, size 5×5, stride 1, padding 2

—Size: 27 x 27 x 256

—Because of padding of (5–1)/2=2, the original size is restored

—256 depth because of 256 filters

  • Layer 4: Max-Pooling with 3×3 filter, stride 2

—Size: 13 x 13 x 256

—(27–3)/2 + 1 = 13 is size of outcome

—Depth is same as before, i.e. 256 because pooling is done independently on each layer

  • Layer 5: Convolution with 384 filters, size 3×3, stride 1, padding 1

—Size: 13 x 13 x 384

—Because of padding of (3–1)/2=1, the original size is restored

—384 depth because of 384 filters

  • Layer 6: Convolution with 384 filters, size 3×3, stride 1, padding 1

—Size: 13 x 13 x 384

—Because of padding of (3–1)/2=1, the original size is restored

—384 depth because of 384 filters

  • Layer 7: Convolution with 256 filters, size 3×3, stride 1, padding 1

—Size: 13 x 13 x 256

—Because of padding of (3–1)/2=1, the original size is restored

—256 depth because of 256 filters

  • Layer 8: Max-Pooling with 3×3 filter, stride 2

—Size: 6 x 6 x 256

—(13–3)/2 + 1 = 6 is size of outcome

—Depth is same as before, i.e. 256 because pooling is done independently on each layer

  • Layer 9: Fully Connected with 4096 neurons

—In this layer, each of the 6x6x256 = 9216 values is fed into each of the 4096 neurons, and the weights are determined by back-propagation.

  • Layer 10: Fully Connected with 4096 neurons

—Similar to layer #9

  • Layer 11: Fully Connected with 1000 neurons

—This is the last layer and has 1000 neurons because the ImageNet data has 1000 classes to be predicted.
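The layer sizes listed above can be checked by walking the spatial-size formula from section 5 through each convolution and pooling step:

```python
# W2 = (W1 - F + 2*P)/S + 1, applied layer by layer to AlexNet.
def out_size(w, f, s, p):
    return (w - f + 2 * p) // s + 1

size = 227                             # Layer 0: input 227x227x3
size = out_size(size, f=11, s=4, p=0)  # Layer 1: conv 11x11, stride 4 → 55
size = out_size(size, f=3, s=2, p=0)   # Layer 2: max-pool 3x3, stride 2 → 27
size = out_size(size, f=5, s=1, p=2)   # Layer 3: conv 5x5, pad 2 → 27
size = out_size(size, f=3, s=2, p=0)   # Layer 4: max-pool → 13
size = out_size(size, f=3, s=1, p=1)   # Layers 5-7: conv 3x3, pad 1 → 13
size = out_size(size, f=3, s=2, p=0)   # Layer 8: max-pool → 6
print(size, size * size * 256)         # 6 9216: the values fed into Layer 9
```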

See Also

https://www.analyticsvidhya.com/blog/2016/04/deep-learning-computer-vision-introduction-convolution-neural-networks/
