Know your neural network architecture better by understanding these terms

Megha Shroff
5 min read · May 16, 2023


https://www.researchgate.net/publication/351889219/figure/fig1/AS:1027883464142848@1622077923275/The-model-architecture-of-YOLO-where-the-backbone-extracts-features-from-an-image-the.png

Convolutional Neural Networks (CNNs) have changed computer vision and are widely used for image classification, object recognition, and image segmentation. Understanding the architecture of these CNN models can be difficult for beginners, who will run into unfamiliar terms. I will list and describe some of these terms here, which I hope will be helpful for your journey in deep learning.

Backbone

Backbone networks are commonly seen in object detection model architectures. The backbone is responsible for extracting and encoding features from the input data. It acts as the core feature extractor, capturing both low-level and high-level features from the input.

https://velog.io/@peterkim/Object-Detection%EC%97%90%EC%84%9C-%EB%A7%90%ED%95%98%EB%8A%94-Backbone-Neck-Head

Beginners often confuse the backbone model with the baseline model. So what exactly is a baseline model? What is the difference between a baseline and a backbone? Let us find out!

Baseline

A baseline model is a simple model that serves as a starting point or point of comparison. It is often used to evaluate the effectiveness of more advanced or complex models and techniques, providing a reference against which their improvement can be measured.

Backbone vs Baseline

The baseline model is a simple reference model used for comparison, whereas the backbone model is a more complex architecture responsible for feature extraction. The baseline serves as a starting point for evaluation, while the backbone provides the feature representation required to solve a specific task.

Neck

The neck is responsible for further transforming and refining the features extracted by the backbone. Its goal is to improve the backbone's extracted features and produce more informative feature representations.

Backbone vs Neck

The backbone is responsible for the initial feature extraction from the input data, while the neck enhances and merges those features to improve the model's performance.

Head

The head is made up of task-specific layers that are designed to produce the final prediction or inference based on the information extracted by the Backbone and Neck.
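To make the backbone → neck → head flow concrete, here is a minimal shape-level sketch in numpy. The layer widths, downsampling factor, and number of classes are all hypothetical placeholders, and each stage is a stand-in (the real stages would be learned convolutional layers), but the way feature shapes flow from one stage to the next matches the description above.

```python
import numpy as np

def backbone(image):
    # Stand-in for a feature extractor: downsample 4x spatially
    # and produce 64 feature channels (widths chosen for illustration).
    h, w, _ = image.shape
    return np.zeros((h // 4, w // 4, 64))

def neck(features):
    # Stand-in for feature refinement: keep the spatial size,
    # widen the channel dimension to 128.
    h, w, _ = features.shape
    return np.zeros((h, w, 128))

def head(features, num_classes=10):
    # Task-specific layers: global-average-pool the spatial axes,
    # then map to one score per class with placeholder weights.
    pooled = features.mean(axis=(0, 1))        # shape (128,)
    weights = np.zeros((128, num_classes))     # placeholder classifier weights
    return pooled @ weights                    # shape (num_classes,)

image = np.zeros((64, 64, 3))
scores = head(neck(backbone(image)))
print(scores.shape)  # (10,)
```

Each stage only needs to know the shape its predecessor produces, which is why backbones, necks, and heads can often be mixed and matched across architectures.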

Bottleneck

In a neural network, a bottleneck is simply a layer with fewer neurons than the layer before or after it. The presence of such a layer forces the network to compress its feature representations to fit into the smaller space as best as possible.

https://qph.cf2.quoracdn.net/main-qimg-23264282c4634e252d8e97504632316f-pjlq
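As a rough sketch of this idea, the snippet below squeezes a 128-dimensional activation through a 32-unit bottleneck and back out again. The widths (128 and 32) and the random weights are arbitrary choices for illustration; in a trained network the compression would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(128)                   # activation from the previous layer

# Hypothetical layer widths: 128 -> 32 (bottleneck) -> 128.
W_down = rng.standard_normal((128, 32)) * 0.1  # compress into the bottleneck
W_up = rng.standard_normal((32, 128)) * 0.1    # expand back out

compressed = np.maximum(x @ W_down, 0)         # only 32 values carry the information
reconstructed = compressed @ W_up

print(compressed.shape, reconstructed.shape)   # (32,) (128,)
```

Everything downstream of the bottleneck can only see those 32 values, which is exactly the pressure that makes the network keep the most useful information.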

Residual Block

A residual block is a stack of layers arranged so that the input of the block is added to the output of a layer deeper in the block.

https://i.ytimg.com/vi/r0HvOIjziw4/maxresdefault.jpg
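The pattern can be sketched in a few lines of numpy. Here each "layer" is a plain matrix multiply (a stand-in for the convolutions a real residual block would use), and the dimensions and weights are arbitrary; the essential part is the final addition, y = x + F(x).

```python
import numpy as np

def residual_block(x, W1, W2):
    # Two stand-in layers (matrix multiplies with a ReLU in between)
    # whose output is added back onto the block's input.
    out = np.maximum(x @ W1, 0)
    out = out @ W2
    return x + out          # the residual (skip) addition

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W1 = rng.standard_normal((16, 16)) * 0.1
W2 = rng.standard_normal((16, 16)) * 0.1

y = residual_block(x, W1, W2)
print(y.shape)  # (16,)
```

Note that the addition requires the block's output to have the same shape as its input; real ResNets insert a projection on the shortcut when the shapes differ.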

Skip Connection

Skip connections are a form of shortcut that connects the output of one layer to the input of a later layer, skipping one or more layers in between.

https://pub.mdpi-res.com/sensors/sensors-19-03929/article_deploy/html/images/sensors-19-03929-g001.png?1569436300

Residual Block vs Skip Connection

Skip connections are a general concept referring to any form of direct connection between layers, whereas residual connections are a specific type of skip connection commonly used in ResNet architectures. The main purpose of residual connections is to let the network learn only the change relative to its input (the residual) rather than the whole transformation.

Downsampling

After extracting features in the first layer of a CNN, feeding the full-size output directly to the second layer makes the process computationally expensive. To reduce the size of the output, we keep only the most salient features and pass those to the next layer. This process of reducing the output size without losing important information is called downsampling. The most common approach used for downsampling is max pooling.

https://learnopencv.com/wp-content/uploads/2023/01/tensorflow-keras-cnn-vgg-architecture.png

Max pooling

Max Pooling is a pooling operation that calculates the maximum value for patches of a feature map, and uses it to create a downsampled feature map.

https://production-media.paperswithcode.com/methods/MaxpoolSample2.png
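A minimal numpy implementation makes the operation concrete: the feature map is split into non-overlapping 2×2 patches (stride 2, no padding, the most common configuration), and each patch is replaced by its maximum.

```python
import numpy as np

def max_pool_2x2(fmap):
    # Split the feature map into non-overlapping 2x2 patches and keep
    # the maximum of each patch (stride 2, no padding).
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([
    [1, 3, 2, 1],
    [4, 2, 0, 5],
    [6, 1, 3, 2],
    [0, 7, 4, 8],
])
print(max_pool_2x2(fmap))
# [[4 5]
#  [7 8]]
```

The 4×4 map becomes 2×2, halving the width and height while keeping the strongest activation in each region.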

1 x 1 Convolution

What? A 1×1 convolution? Convolution filter sizes are supposed to be greater than 1×1 so they can capture spatial context, right? Well, let us understand the 1×1 convolution and why using one makes sense!

The depth of the input, i.e. the number of filters used in the convolutional layers, frequently increases with network depth, resulting in a growing number of feature maps. As seen in the following figure, the highlighted numbers are the number of feature maps after each layer, which increases as 64, 128, 256, 512. A large number of feature maps can become a problem because every convolutional operation must be performed down through the entire depth of the input.

Pooling layers are intended to downscale feature maps by halving their width and height across the network. But pooling layers have no effect on the number of filters, the depth, or the number of channels.

To reduce the number of channels, or depth, of the feature maps, the 1×1 convolution technique is used. A 1×1 convolution simply means the filter is of size 1×1. This 1×1 filter convolves over the entire input, pixel by pixel.

So, for example, given an input of 64×64×3 (height × width × channels), if we choose a single 1×1 filter (which would be 1×1×3), the output will have the same height and width as the input but only one channel: 64×64×1.

https://indoml.files.wordpress.com/2018/03/1x1-convolution1.png?w=677&h=359
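The example above can be reproduced directly in numpy. A single 1×1 filter is just one weight per input channel, and at every pixel it takes a dot product across the channel axis, so the spatial size is preserved while the channel count drops to 1. The random filter values here are placeholders for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64, 3))   # input: height x width x 3 channels

# A single 1x1 filter is just one weight per input channel (1x1x3).
f = rng.standard_normal(3)

# At every pixel, take the dot product across the channel axis;
# height and width are untouched, channels collapse from 3 to 1.
out = np.tensordot(x, f, axes=([2], [0]))[..., np.newaxis]
print(out.shape)  # (64, 64, 1)
```

Using k such filters instead of one would give a 64×64×k output, which is how 1×1 convolutions shrink (or grow) the channel dimension at almost no spatial cost.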

“One day or day one. You decide.” — Unknown
