[ML] How does a CNN work?

Short Version

A CNN works essentially by building a Stack of Lower and Lower Dimensional, hence (hopefully) More and More Abstract, Representations of the Input.

This strategy automates the critical “Feature Engineering” phase, which was traditionally performed manually.

This Multilayer Stacked Lower Dimensional Representation is typically built by alternating

  • Convolutional Layers, which perform a local information recombination by means of learnable convolutive filters
  • Pooling Layers, which perform the actual Dimensionality Reduction
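To make the two layer types concrete, here is a minimal pure-Python sketch of one Convolutional Layer followed by one Pooling Layer. The helper names (`conv2d_valid`, `max_pool2d`), the toy image and the hand-picked filter are all assumptions made for illustration: in a real CNN the filter weights are learned, and like most Deep Learning libraries this sketch computes a cross-correlation (the filter is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping `size` x `size` window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 6x6 "image": dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-picked 3x3 vertical-edge filter; in a real CNN these weights are learned.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

features = conv2d_valid(image, kernel)   # 6x6 input -> 4x4 feature map
pooled = max_pool2d(features, size=2)    # 4x4 feature map -> 2x2 after pooling
```

The filter responds only where the dark-to-bright edge falls inside its window, and the pooling step then halves each spatial dimension while keeping the strongest response.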

The final Representation produced by this Convolutive Stack lives in a Learned Feature Space, which is used as the Starting Point for further task-specific processing (e.g. a FeedForward Neural Network that solves a Classification Problem).

Long Version

The final goal of the Convolutive Stack (Fig.1 Stack 1) is to find a Representation that is good enough for the remaining part of the Network (Fig.1 Stacks 2 and 3) to effectively solve the final task (e.g. Classification).

Fig.1 : CNN Architecture for Classification as a Sequential Combination of 1) Convolutive Stack, 2) Fully Connected Stack and 3) Softmax Layer
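As a rough sketch of what Stacks 2 and 3 of Fig.1 do with the Representation they receive, the snippet below applies a single fully connected layer followed by a softmax. The feature values, weights, biases and the number of classes are hypothetical, and a real Fully Connected Stack would usually contain several layers with non-linearities in between.

```python
import math


def dense(features, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical flattened output of the Convolutive Stack (4 learned features).
features = [3.0, 3.0, 3.0, 3.0]

# Hypothetical weights for a 3-class problem (3 output units x 4 inputs).
weights = [[0.2, 0.1, 0.0, -0.1],
           [0.0, 0.0, 0.0, 0.0],
           [-0.2, -0.1, 0.0, 0.1]]
biases = [0.0, 0.0, 0.0]

scores = dense(features, weights, biases)   # Stack 2: raw class scores
probs = softmax(scores)                     # Stack 3: class probabilities
predicted_class = probs.index(max(probs))   # index of the most likely class
```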

Let’s not forget that in the Supervised Learning Scenario, what drives the “Evolution” of the Network (i.e. the Training) is the Error, which is defined on the basis of a Training Set with respect to a certain Task (see Note1).

The “Error Backpropagation” across the Network allows for a “Parameter-specific Local Update”, which means, for each Parameter:

  • Gradient Computation (that’s why I used the term local) (see Note2)
  • Step in the Gradient Descent Direction (choosing the Step Size is another problem)
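The two steps above can be sketched on a deliberately tiny model. Everything here is an assumption chosen for illustration: the model `y = w * x + b`, the four training points, the mean squared error as Error Function and the step size. A real Network has many more Parameters and obtains their Gradients via Backpropagation, but each update follows the same pattern.

```python
# Tiny training set for the hypothetical model y = w * x + b
# (targets generated by y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def mean_squared_error(w, b):
    """Error of the model on the training set."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(w, b):
    """Partial derivatives of the mean squared error w.r.t. w and b."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


w, b = 0.0, 0.0
step_size = 0.05  # assumed value; choosing it well is a separate problem
initial_error = mean_squared_error(w, b)

for _ in range(2000):
    grad_w, grad_b = gradients(w, b)   # 1) Gradient Computation
    w -= step_size * grad_w            # 2) Step against the Gradient
    b -= step_size * grad_b

final_error = mean_squared_error(w, b)
```

With these assumed values the Parameters approach `w = 2` and `b = 1`, the values used to generate the targets, and the Error shrinks accordingly.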

This Error-driven learning is not so different from the Representation Learning performed by Autoencoders, which are forced to learn a Lower Dimensional Representation that is good enough to regenerate the Input Data according to some Quality Metric.

In any case, the Deep Network learns a Lower Dimensional Representation of the Input Data, hence it implicitly defines a Lower Dimensional Feature Space which provides a Higher Level of Abstraction (see Note3) and automates the critical Feature Engineering step.


Note1: Without the “Error” working as a feedback mechanism, there would be no “Evolutive Force” shaping the Network Parameters in the direction of improving the Network Inference.
The Error is computed by a Similarity Measure (the Error Function) applied to the Training Data and the Inference Result.

Note2: Assuming all the Neurons’ Transfer Functions are Differentiable, the Gradient is well-defined.
However, from a practical point of view, it is at this step that numerical stability issues (aka Vanishing and Exploding Gradients) can arise.
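A back-of-the-envelope sketch of why these issues appear: by the chain rule, the Gradient reaching an early layer is a product of one factor per layer it has passed through. The setup below is a strong simplification assumed for illustration (a single path with one sigmoid neuron per layer, the same weight in every layer, and every pre-activation fixed at 0), and the function name `chained_gradient_factor` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def chained_gradient_factor(depth, weight, pre_activation=0.0):
    """Product of the per-layer factors weight * sigmoid'(z) along one path."""
    factor = 1.0
    for _ in range(depth):
        factor *= weight * sigmoid_derivative(pre_activation)
    return factor


# sigmoid'(0) = 0.25, so each layer multiplies the gradient by weight * 0.25.
vanishing = chained_gradient_factor(depth=20, weight=1.0)   # 0.25 ** 20
exploding = chained_gradient_factor(depth=20, weight=8.0)   # 2.0 ** 20
```

With a per-layer factor below 1 the product shrinks towards zero as depth grows, and with a factor above 1 it blows up, which is the Vanishing and Exploding behaviour in miniature.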

Note3: This is where interpretability issues typically arise, as the “Semantics” Learned by the Network are not easily understandable by a Human; nevertheless, they allow the Deep Network to solve its task, be it a classification, detection, … or data regeneration task.