Introduction to Deep Learning For Computer Vision
I would advise the reader to first go through this blog on Artificial Neural Networks.
From the biological science point of view, computer vision aims to come up with computational models of the human visual system. From the engineering point of view, computer vision aims to build autonomous systems that can perform some of the tasks the human visual system can perform (and even surpass it in many cases). The two goals are of course intimately related. Deep learning has been growing rapidly and surpassing traditional machine learning approaches since 2012, often by roughly 10–20 percentage points in accuracy. This blog gives an introduction to Deep Learning and its application in Computer Vision.
Deep Learning
For computer vision tasks, a special Deep Learning architecture called a Convolutional Neural Network is used. First, we look at the basic components of a ConvNet:
Convolution
A Conv layer consists of spatial filters that are convolved along the spatial dimensions and summed up along the depth dimension of the input volume. Due to weight sharing, they are much more efficient than fully connected layers. A Conv layer has w⋅h⋅d⋅n parameters excluding biases (w is the width of the filter, h is the height of the filter, d is the depth of the filter, n is the number of filters) that need to be learned during training. In general, one starts with a large filter size (e.g. 11x11) and a low depth (e.g. 32) and reduces the spatial filter dimensions (e.g. to 3x3) while increasing the depth (e.g. to 256) throughout the network.
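The parameter count above can be checked with a few lines of Python. The concrete values below (11x11 filters over a depth-3 input, with 96 filters) are illustrative, chosen in the spirit of AlexNet's first layer:

```python
# Parameter count of a Conv layer: w*h*d*n weights, plus n biases if used.
def conv_params(w, h, d, n, bias=True):
    """Number of learnable parameters in a Conv layer with n filters of size w x h x d."""
    return w * h * d * n + (n if bias else 0)

print(conv_params(11, 11, 3, 96))              # 34944 (with biases)
print(conv_params(11, 11, 3, 96, bias=False))  # 34848 (weights only)
```

Note that this count is independent of the input's spatial size, which is exactly the efficiency gain from weight sharing compared to a fully connected layer.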
Pooling
Conv layers are often followed by a Pool layer in order to reduce the spatial dimensions of the volume for the next filter; this is the equivalent of a subsampling operation. The pooling operation itself has no learnable parameters. There are two types of pooling: max-pooling splits the input into patches and outputs the maximum value of each patch, whereas average pooling outputs the average value of each patch.
Most of the time, max-pooling layers are used in Deep Learning models due to the easier gradient computation. During backpropagation, the gradient flows only through the single max activation of each patch, which can be computed very efficiently. A few architectures use average pooling instead, mostly at the end of the network or before the fully connected layers, without a noticeable difference in performance.
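Both pooling variants can be sketched in a few lines of NumPy. This is a minimal, loop-based illustration for a single 2-D channel, not an optimized implementation:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max- or average-pool a 2-D array with a square window."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 2., 3.],
              [1., 0., 3., 2.]])
print(pool2d(x, mode="max"))  # [[4. 8.] [1. 3.]]
print(pool2d(x, mode="avg"))  # [[2.5 6.5] [0.5 2.5]]
```

In a full network the same operation is applied independently to every depth slice of the volume, which is why pooling changes the spatial dimensions but not the depth.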
To understand all above operations mathematically one can refer to this.
Fully Connected Layer (Dense Layer)
The Fully Connected (FC) layer connects every output from the previous layer to each of its neurons. Usually, FC layers are used at the end of the network to combine all spatially distributed activations of the previous Conv layers. The FC layers have the highest number of parameters in the model (n_input*n_neurons, where n_input is the number of outputs of the previous layer and n_neurons is the number of neurons), yet most of the computing time (almost 90%) is spent in the early Conv layers.
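As a quick sanity check on the parameter formula, here is a small Python sketch. The 6x6x256 input and 4096 neurons are example values in the spirit of AlexNet's first FC layer:

```python
def fc_params(n_input, n_neurons, bias=True):
    """Learnable parameters in a fully connected layer."""
    return n_input * n_neurons + (n_neurons if bias else 0)

# Flattening a 6x6x256 conv output into a 4096-neuron FC layer:
print(fc_params(6 * 6 * 256, 4096))  # 37752832, roughly 37.8 million parameters
```

A single FC layer like this already dwarfs the parameter count of the Conv layers computed earlier, which is why FC layers dominate the model's memory footprint.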
Final Output Layer
The final output layer of a Deep Neural Network plays a crucial role for the task of the whole network. Common choices are:
- Classification: Softmax layer, which computes values yi ∈ [0,1] with ∑yi = 1, so that yi can be interpreted as the probability that the input x belongs to class i
- Regression: Sigmoid layer, which predicts values yj ∈ [0,1] for a j-dimensional output.
- Regression and classification: the two tasks can be combined by connecting two output layers, hence outputting both values at once. This is used for object detection with a fixed number of objects, e.g. one regression output per class
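The two single-task output layers can be written down directly in NumPy. This is a minimal sketch of my own, not code from a specific framework; the score vectors are made-up example values:

```python
import numpy as np

def softmax(z):
    """Classification head: outputs are in [0, 1] and sum to 1."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    """Regression head: squashes each output into [0, 1] independently."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, 0.1])   # made-up raw class scores
p = softmax(scores)
print(p, p.sum())                    # three probabilities summing to 1.0
print(sigmoid(np.array([-1.0, 0.0, 3.0])))
```

The key difference: softmax couples its outputs (they compete for probability mass), whereas sigmoid treats each output dimension independently.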
After defining the final output layer, one also needs to define a loss function for the given task. Picking the right loss is crucial for training a Deep Neural Net; common choices are:
- Classification: Cross-entropy, which computes the cross-entropy between the output of the network and the ground-truth label and can be used for binary and categorical outputs (via one-hot encoding).
- Regression: Squared error and mean squared error are common choices for regression problems.
- Segmentation: Intersection over Union is well suited for comparing overlapping regions of an image and a prediction; however, it returns zero for any non-overlapping prediction and therefore provides no useful gradient in that case, so Mean Squared Error is often a better choice.
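The two most common losses above are short enough to write out in NumPy. A minimal sketch; the probability vectors below are made-up example values:

```python
import numpy as np

def cross_entropy(p, y):
    """Categorical cross-entropy; y is a one-hot ground-truth vector."""
    return -np.sum(y * np.log(p + 1e-12))  # small epsilon avoids log(0)

def mse(pred, target):
    """Mean squared error for regression targets."""
    return np.mean((pred - target) ** 2)

p = np.array([0.7, 0.2, 0.1])  # example softmax output
y = np.array([1.0, 0.0, 0.0])  # one-hot label for class 0
print(cross_entropy(p, y))     # -log(0.7), approximately 0.357
print(mse(np.array([0.5, 0.8]), np.array([0.4, 1.0])))  # 0.025
```

Note that with a one-hot label the cross-entropy reduces to the negative log-probability assigned to the correct class, so a confident correct prediction gives a loss near zero.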
Deep Learning Architecture
This section describes how one stacks the components described in the previous section.
Convolutional Neural Network (CNN)
These models contain convolutional layers (with a non-linear activation), pooling layers (non-parametric), and fully connected layers at the end.
A convolution layer extracts image features by convolving the input with multiple filters. It contains a set of 2-dimensional filters that are stacked into a 3-dimensional volume, where each filter is applied to a volume constructed from all filter responses of the previous layer. If one considers the RGB channels of a 256x256 input image as a 256x256x3 volume, a 5x5x3 filter would be applied along a 5x5 2-dimensional region of the image and summed up across all 3 color channels. If the first layer after the RGB volume consists of 48 filters, it is represented as a volume of 5x5x3x48 weight parameters and 48 bias parameters. Using a convolution operation on the input volume and the filter volumes, the filter responses (so-called activations) result in an output volume with the dimensions 252x252x48 (using stride 1 and no padding). By padding the input with zeroes, one can keep the spatial dimensions of the activations constant throughout the layers. Each convolutional layer is followed by a non-linear activation function (preferably ReLU), which results in an activation with the same dimensions as the output volume of the previous convolutional layer. A pooling layer subsamples the previous layer and outputs a volume of the same depth but reduced spatial dimensions. Using max-pooling with a 2x2 filter and stride 2 (the filter is shifted by 2 pixels on every step), one ends up with a 126x126x48 volume after pooling.
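The spatial dimensions in this walkthrough follow the standard formula (size - filter + 2*padding) / stride + 1, which can be verified with a short Python helper:

```python
def conv_out(size, filt, stride=1, pad=0):
    """Spatial output size of a conv or pool layer along one dimension."""
    return (size - filt + 2 * pad) // stride + 1

s = conv_out(256, 5)             # 5x5 conv, stride 1, no padding -> 252
print(s)
print(conv_out(s, 2, stride=2))  # 2x2 max-pool, stride 2 -> 126
print(conv_out(256, 5, pad=2))   # zero-padding by 2 keeps the size at 256
```

The last line shows why zero-padding is used to keep spatial dimensions constant: for an odd filter size f, a padding of (f - 1) / 2 with stride 1 leaves the input size unchanged.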
Many layers of convolution (with activation) and pooling are stacked, and the output of the last conv-pool layer is fed into a fully connected layer. The output of the fully connected layer is then fed into the final output layer, which produces the probabilities for the various classes.
For reference, here is the AlexNet architecture:
For further study of ConvNets, here are a few useful links:
- Stanford CS231n course: The course lectures along with the notes are very helpful.
- Colah’s blog: One can read his sections on ANNs and ConvNets to gain more insights.
- Culurciello: Great for comparing different state-of-the-art architectures.