CNN vs MLP for Image Classification
Why is a CNN preferred over an MLP (ANN) for image classification?
MLPs (multilayer perceptrons) dedicate an input unit to every input value (e.g. every pixel in an image), so the number of weights rapidly becomes unmanageable for large images. Because the network is fully connected, it carries far too many parameters: each node connects to every node in the previous and the next layer, forming a very dense web that results in redundancy and inefficiency. As a result, training becomes difficult and overfitting can occur, which makes the model lose its ability to generalize.
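As a rough, illustrative calculation (the image and layer sizes here are hypothetical, not taken from any particular network), consider the very first fully connected layer of an MLP on a modest image:

```python
# Hypothetical example: weights in the first fully connected layer of an MLP
# that takes a flattened 224x224 RGB image and maps it to 1,000 hidden units.
height, width, channels = 224, 224, 3
hidden_units = 1000

inputs = height * width * channels   # 150,528 input values after flattening
weights = inputs * hidden_units      # ~150.5 million weights in a single layer
biases = hidden_units

print(f"Parameters in one dense layer: {weights + biases:,}")  # 150,529,000
```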
Another common problem is that MLPs react differently to an input image and to a shifted version of it: they are not translation invariant. For example, if a cat appears in the top left of one picture and in the bottom right of another, the MLP will adjust itself under the assumption that a cat always appears in whichever part of the image it saw one during training.
Hence, MLPs are not the best choice for image processing. One of the main problems is that spatial information is lost when the image is flattened (from a matrix into a vector) before being fed into an MLP.
We thus need a way to leverage the spatial correlation of the image features (pixels) so that we can see the cat in our picture no matter where it appears. The solution? The CNN!
Convolutional Neural Network (CNN): More generally, CNNs work well with data that has a spatial relationship, which is why they are the go-to method for any prediction problem that takes image data as input.
The benefit of using CNNs is their ability to develop an internal representation of a two-dimensional image. This allows the model to learn position- and scale-invariant structures in the data, which is important when working with images.
A CNN accounts for local connectivity: each filter is panned across the entire image (the convolution) with a chosen filter size and stride (the number of pixels or blocks the filter shifts after every convolution step), which lets the filter find and match a pattern no matter where that pattern is located in the image. The weights are fewer and shared, so the network is less wasteful, easier to train than an MLP, and more effective; it can also be made deeper.
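As a small worked example of how the filter size and stride determine the size of the resulting feature map (assuming no padding; the numbers are purely illustrative):

```python
# Output spatial size when sliding a filter over an image (no padding assumed).
def conv_output_size(input_size: int, filter_size: int, stride: int) -> int:
    return (input_size - filter_size) // stride + 1

# A 5x5 filter moved with stride 1 over a 32x32 image -> 28x28 feature map.
print(conv_output_size(32, 5, 1))  # 28
# The same filter with stride 2 -> 14x14 feature map.
print(conv_output_size(32, 5, 2))  # 14
```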
A CNN takes matrices as well as vectors as inputs.
Its layers are sparsely (partially) connected rather than fully connected: every node does not connect to every other node.
CNNs leverage the fact that nearby pixels are more strongly related than distant ones.
We analyze the influence of nearby pixels using a filter (kernel), which we move across the image from the top left to the bottom right. For each position on the image, a value is calculated based on the filter using a convolution operation.
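A minimal sketch of that sliding computation in NumPy (following the usual deep-learning convention of computing a cross-correlation, i.e. without flipping the kernel; the array sizes and the edge-detecting kernel are purely illustrative):

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slide `kernel` over `image` from top left to bottom right (no padding),
    producing one output value per position as the sum of elementwise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride : i * stride + kh, j * stride : j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Illustrative 6x6 "image" and a 3x3 vertical-edge filter.
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
print(convolve2d(image, kernel).shape)  # (4, 4) feature map
```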
A filter could be related to anything. For pictures of humans, one filter could be associated with seeing noses: our nose filter would give us an indication of how strongly a nose seems to appear in the image, how many times, and in which locations. This reduces the number of weights the neural network must learn compared to an MLP, and it also means that when the location of these features changes, the network is not thrown off.
The panning of filters (you can set the stride and filter size) essentially provides parameter sharing (weight sharing): a filter looks for a specific pattern and is location invariant, so it can find that pattern anywhere in an image. This is very useful for object detection, since patterns can be discovered in more than one part of the image.
Additionally, a CNN can find a similar pattern even if the object is somewhat rotated or tilted, thanks to a concept called pooling, which makes the network more robust to changes in the position of a feature in the image; this is referred to by the technical phrase “local translation invariance.”
After the filters have passed over the image, a feature map is generated for each filter. These are then passed through an activation function, which decides whether a certain feature is present at a given location in the image. We can then do many things, such as adding more filtering layers to create further feature maps, which become more and more abstract as the CNN gets deeper. We can also use pooling layers to select the largest values on the feature maps and use those as inputs to subsequent layers. In theory, any type of operation can be performed in a pooling layer, but in practice max pooling is by far the most common, because a large activation is exactly the signal that the network has seen the feature.
Pooling takes a window and a stride, typically of the same length, applies the window to the input volume, and outputs the maximum value in every sub-region the window covers. The intuition behind the pooling layer is that once we know a specific feature is present in the original input volume (there will be a high activation value), its exact location is not as important as its location relative to the other features. As you can imagine, this layer drastically reduces the spatial dimensions of the input volume (the length and width shrink, but the depth does not).
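A minimal max-pooling sketch (a 2x2 window with stride 2, a common configuration; the feature-map values are illustrative):

```python
import numpy as np

def max_pool2d(feature_map: np.ndarray, size: int = 2, stride: int = 2) -> np.ndarray:
    """Take the maximum of each `size` x `size` sub-region, moving by `stride`."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride : i * stride + size,
                                 j * stride : j * stride + size]
            out[i, j] = window.max()
    return out

fm = np.array([[1., 3., 2., 1.],
               [4., 6., 5., 2.],
               [3., 1., 0., 2.],
               [7., 2., 4., 3.]])
print(max_pool2d(fm))
# [[6. 5.]
#  [7. 4.]]   <- a 4x4 map shrinks to 2x2; only the strongest activations survive
```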
An MLP will learn different interpretations for something that is essentially the same. In a CNN, by contrast, the number of weights depends on the kernel size (see weight sharing) rather than on the input size, which is really important for images. By forcing weights to be shared across the spatial dimensions, which drastically reduces the number of parameters, the convolution kernel acts as a learning framework.
That is how convolutional layers reduce memory usage and speed up computation.
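A back-of-the-envelope comparison (the layer sizes are illustrative, not taken from any specific architecture): the dense layer's parameter count grows with the input size, while the convolutional layer's grows only with the kernel size and the number of filters.

```python
# Illustrative parameter counts for one layer applied to a 224x224x3 image.
h, w, c = 224, 224, 3

# Fully connected layer: every input pixel connects to each of 1,000 units.
dense_params = (h * w * c) * 1000 + 1000   # ~150.5 million

# Convolutional layer: 32 filters of size 3x3x3, shared across all positions.
conv_params = (3 * 3 * c) * 32 + 32        # 896

print(f"dense: {dense_params:,}   conv: {conv_params:,}")
```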
Note 1: Spatial information refers to information that has a location-based relationship with other information. In images, space is the 2D (x-y) plane.
The earlier layers of a CNN are convolutional layers, which treat the image as 2D (spatial) information: the first convolutional layers extract spatial features such as edges and corners, and later convolutional layers combine these into higher-level spatial features such as eyes and noses. Only deeper in the network is that (convolved) information flattened.
This is what spatial information means in images. Plain CNNs do not fully maintain the spatial relationships among features; the more advanced capsule network (CapsNet) addresses this problem.
Note 2: CNNs are designed to be spatially invariant, that is, they are not sensitive to the position of, say, an object in the picture. The deeper you go into the layers, the more similar originally dissimilar (pixel-wise) objects, or usually parts of objects, become; this is achieved via convolution. At the deepest layers we have extracted features with no information on where they were positioned in the original image. We even lose information on the pixel size of the original objects because of another operation in CNNs called pooling.
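Putting the pieces together, here is a minimal sketch of such a stack, assuming TensorFlow/Keras is available (the layer sizes, input resolution, and 10-class output are purely illustrative): early convolution and pooling layers keep the 2D structure while shrinking it, and only at the end is the result flattened and handed to a dense classifier.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    # Early layers see the image as 2D and learn low-level patterns (edges, corners).
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),   # halves the spatial size, keeps the depth
    # Deeper layers combine those into higher-level patterns (eyes, noses, ...).
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    # Only now is the spatial grid flattened and passed to a classifier head.
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.summary()  # prints per-layer output shapes and parameter counts
```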
Convolution is the key to why CNNs perform better than other models at such “human-like” tasks as recognizing specific objects in a picture or words in recorded speech.