A typical classifier, apart from its last few layers, is essentially a feature extractor.
A feature extractor is made of many convolution layers with activation functions and occasionally spatial compression operations called MaxPooling. It produces multiple feature maps that flow down the network.
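To make the convolution, activation and MaxPooling pipeline concrete, here is a minimal NumPy sketch of a single feature-extractor stage. It is only an illustration: the helper names (`conv2d_valid`, `relu`, `max_pool_2x2`), the 10x10 input and the random kernels are assumptions made for this example, and a real network would learn its kernels and use a deep learning framework rather than Python loops.

```python
import numpy as np

def conv2d_valid(image, kernels):
    """Slide each (kh, kw) kernel over a (H, W) image: no padding, stride 1."""
    k, kh, kw = kernels.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1, k))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            # One response per kernel: sum of elementwise products with the patch.
            out[i, j, :] = np.sum(kernels * patch, axis=(1, 2))
    return out

def relu(x):
    """Activation function: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling over (H, W, C); odd trailing rows/cols are dropped."""
    h, w, c = fmap.shape
    h2, w2 = h // 2, w // 2
    trimmed = fmap[:h2 * 2, :w2 * 2, :]
    return trimmed.reshape(h2, 2, w2, 2, c).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((10, 10))                # hypothetical grayscale input
kernels = rng.standard_normal((4, 3, 3))    # 4 random (untrained) 3x3 kernels

# conv (10x10 -> 8x8x4), activation, then pooling (8x8x4 -> 4x4x4)
feature_maps = max_pool_2x2(relu(conv2d_valid(image, kernels)))
print(feature_maps.shape)  # (4, 4, 4): four 4x4 feature maps
```

Each of the four kernels yields its own feature map, and the pooling step halves the spatial size while keeping the number of maps unchanged.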
During training, the feature extractor learns to represent the important features of the objects in the image in different feature maps. While the first few layers are limited to simple features like edges and simple shapes, the later layers learn complex patterns like ‘blue eyes’, ‘hair’, ‘legs’, etc., which provide the ‘meat’ that the output layers use to classify images.
After the so-called feature extractor of the classifier, we have either a Flatten() or a Global Average Pooling layer before the final Sigmoid/Output layer.
The flattening layer
The flattening layer itself has no learnable parameters: it only unrolls the feature maps into a single long vector. The fully connected output layer that follows it learns the best coefficients for linearly combining the attribute intensities in a way that predicts the object class. For example, the coefficients the classifier learns for combining the ‘tail’, ‘fur’ and ‘four legs’ features will be such that a strong intensity in all three features results in the class prediction ‘cat’.
And that is the case with many classical CNN architectures. The final set of layers is often of the fully connected type: Flatten() followed by one or more dense layers ending in a sigmoid (or softmax) output. This is like bolting an MLP onto the end of an image processor. The fully connected layers “interpret” the feature maps from the previous layers (the feature extractor block) and make category predictions.
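The flatten-then-classify head can be sketched in a few lines of NumPy. This is a simplified illustration, not a full MLP: the 4x4x8 feature-map shape is assumed, and the random `weights` stand in for coefficients that training would normally learn.

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
feature_maps = rng.random((4, 4, 8))    # hypothetical extractor output: 8 maps of 4x4

# Flatten: unroll every spatial position of every map into one vector.
# This step has no learnable parameters.
flat = feature_maps.reshape(-1)         # 4 * 4 * 8 = 128 values

# Dense output unit: one coefficient per flattened value, plus a bias.
weights = rng.standard_normal(flat.size) * 0.1   # stand-in for learned coefficients
bias = 0.0

prob = sigmoid(flat @ weights + bias)   # linear combination, then sigmoid
n_params = weights.size + 1             # 128 weights + 1 bias
print(flat.shape, n_params)  # (128,) 129
```

Note that the number of coefficients grows with the spatial size of the feature maps, because every position of every map gets its own weight.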
Global Average Pooling
Global Average Pooling replaces the fully connected layers of classical CNNs. It is an operation that calculates the average output of each feature map in the previous layer. This is the same as applying ordinary average pooling with the pool_size set to the size of the input feature map. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the Softmax layer.
So, instead of downsampling patches of the input feature map, global pooling downsamples the entire feature map to a single value.
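The NumPy sketch below illustrates both points: global average pooling is just a mean over the spatial axes, and it matches ordinary average pooling whose window covers the whole map. The `avg_pool` helper, the 4x4x8 feature-map shape, the three classes and the random output weights are all assumptions made for this example.

```python
import numpy as np

def avg_pool(fmap, pool_size):
    """Non-overlapping average pooling over a (H, W, C) array."""
    h, w, c = fmap.shape
    ph, pw = pool_size
    trimmed = fmap[:h // ph * ph, :w // pw * pw, :]
    return trimmed.reshape(h // ph, ph, w // pw, pw, c).mean(axis=(1, 3))

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
feature_maps = rng.random((4, 4, 8))       # hypothetical extractor output: 8 maps of 4x4

# Global average pooling: one number per feature map.
gap = feature_maps.mean(axis=(0, 1))       # shape (8,)

# Same result as average pooling with the window set to the full map size.
full_window = avg_pool(feature_maps, (4, 4))   # shape (1, 1, 8)
print(np.allclose(full_window.reshape(-1), gap))  # True

# The pooled vector goes straight into the softmax output layer.
num_classes = 3
out_weights = rng.standard_normal((gap.size, num_classes))   # untrained stand-in
probs = softmax(gap @ out_weights)

# Output-layer parameter counts (weights + biases) for the same feature maps.
params_after_flatten = feature_maps.size * num_classes + num_classes
params_after_gap = gap.size * num_classes + num_classes
print(params_after_flatten, params_after_gap)  # 387 27
```

Because the pooled vector has one entry per feature map regardless of spatial size, the output layer after global average pooling needs far fewer parameters than one placed after Flatten().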