Top 5 CNN Architectures (GoogLeNet, ResNet, DenseNet, AlexNet and VGGNet) to build your computer vision model

Fully explained with different versions

Mukhriddin Malik
8 min read · Jan 8, 2024

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was first held in 2010. Organized by the ImageNet team, it became an annual competition in the field of computer vision, introduced to encourage the development and benchmarking of algorithms for object recognition and related tasks on a large dataset of images.

Architectures with the best result in ILSVRC

GoogLeNet (Inception module)


The Inception module is one of the most famous architectures used in convolutional neural networks. It was introduced by Google at the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14) to address computational expense and overfitting, among other issues in image classification and object detection. The module won the competition that same year with a top-5 error rate of about 6.67%: the percentage of test images for which the correct label did not appear among the model's top five predictions.

ARCHITECTURE

Model Initialization

  • The GoogLeNet model is initialized with the input shape and the number of output classes (classes). It also defines the name of the model.

Architecture

  • The model starts with a series of convolutional, batch normalization, and ReLU layers to process the input data.
  • It then includes max-pooling layers to downsample the spatial dimensions of the feature maps.
  • The Inception modules (inception_layer) are used to capture features at different scales and complexities.
  • After the Inception modules, there are classifier layers (classifier_layer) connected to different Inception stages (inception_4a and inception_4d).
  • The model ends with additional layers, including batch normalization, ReLU activation, average pooling, dropout, and a dense (fully connected) layer for classification.
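To make the parallel-branch idea concrete, here is a minimal PyTorch sketch of an Inception module: four branches (1x1 conv, 1x1 then 3x3 conv, 1x1 then 5x5 conv, and max-pool plus 1x1 projection) whose outputs are concatenated along the channel axis. The filter counts shown are those of the inception_3a stage in the original paper; this is an illustration, not the full GoogLeNet implementation.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Every branch preserves the spatial size, so the outputs can be
        # stacked along the channel axis.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# inception_3a filter counts: output has 64 + 128 + 32 + 32 = 256 channels
block = InceptionModule(192, c1=64, c3r=96, c3=128, c5r=16, c5=32, pool_proj=32)
out = block(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28])
```

The 1x1 "reduce" convolutions before the 3x3 and 5x5 branches are what keep the module computationally affordable.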

Model Building

  • The GoogleNet class builds three separate models:
  • inception_model: the main model that produces the final output.
  • Two auxiliary classifier models attached to intermediate Inception stages; their predictions are used during training to check for overfitting or underfitting and to strengthen the gradient signal reaching the earlier layers.

The structure of the auxiliary classification block
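As a rough illustration, an auxiliary classification block can be sketched as follows. The pooling settings, 128-filter 1x1 conv, 1024-unit dense layer, and 70% dropout follow the original GoogLeNet paper; the 512-channel 14x14 input matches what inception_4a emits.

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary head attached to an intermediate Inception stage."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.pool = nn.AvgPool2d(5, stride=3)        # 14x14 -> 4x4
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.drop = nn.Dropout(0.7)                  # heavy dropout, per the paper
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.drop(torch.relu(self.fc1(x)))
        return self.fc2(x)

# inception_4a emits 512-channel 14x14 feature maps in the original network
aux = AuxClassifier(512, num_classes=1000)
logits = aux(torch.randn(2, 512, 14, 14))
print(logits.shape)  # torch.Size([2, 1000])
```

At inference time these heads are discarded; they only contribute weighted losses during training.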

Furthermore, if you want to see the code written from scratch, you can visit my GitHub repo.

Different versions and improvements

In total, there are 12,443,648 parameters across 22 convolutional and 5 pooling layers. The number of parameters and layers varies between versions such as Inception v1, Inception v2, Inception v3, Inception v4, Inception-ResNet v1, and Inception-ResNet v2, each refining and enhancing the original concept.

Residual Network (ResNet)

ResNet, short for Residual Network, is a deep learning architecture that introduced the concept of residual learning. It was proposed by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in the paper “Deep Residual Learning for Image Recognition” in 2015.

The key innovation in ResNet is the introduction of skip connections or shortcut connections, which allow the network to bypass one or more layers. These connections enable the flow of information from earlier layers to later ones, helping to mitigate the degradation problem encountered in training deep networks.

Architecture

Each ResNet block consists of several convolutional layers followed by identity shortcut connections. These blocks can be stacked together, allowing for the creation of extremely deep networks — hundreds of layers deep — while still being relatively simple to optimize during training.

The architecture introduced residual learning, where the network learns residual functions rather than trying to learn the desired underlying mapping directly. The residual function is the difference between the desired output of a block and its input, F(x) = H(x) − x, so the block only needs to learn what to add to the input. These residual functions tend to be easier to optimize than the full mapping.
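The residual idea can be sketched in a few lines of PyTorch: a basic block computes F(x) with two 3x3 convolutions, then adds the input x back through the identity shortcut before the final activation (channel counts here are illustrative).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic ResNet block: output = relu(F(x) + x) with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(residual + x)   # the skip connection: add the input back

block = ResidualBlock(64)
out = block(torch.randn(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 64, 56, 56])
```

When the spatial size or channel count changes between stages, real ResNets replace the identity shortcut with a strided 1x1 convolution so the shapes still match.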

Different versions and improvements

ResNet has various versions with different numbers of layers (e.g., ResNet-18, ResNet-50, ResNet-101, ResNet-152). Deeper versions generally perform better on complex tasks but require more computational resources. ResNet's contributions have extended beyond image recognition, finding applications in various fields of deep learning due to its ability to effectively train very deep neural networks.

DenseNet

DenseNet, the densely connected convolutional network, is one of our top 5 architectures. It was created by Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger at Cornell University.

DenseNet is built upon the idea of densely connecting layers within a block. Unlike traditional convolutional neural networks where each layer is connected only to the subsequent layers, DenseNet connects each layer to every other layer in a feed-forward fashion within a dense block. This connectivity pattern results in dense feature reuse and encourages feature propagation throughout the network.

Architecture

The dense connectivity in DenseNet alleviates the vanishing gradient problem by facilitating the flow of gradients throughout the network, enabling easier training of very deep networks. Additionally, DenseNet architectures typically use smaller and more efficient models by reducing the number of parameters compared to other networks like ResNet, while maintaining competitive performance.

DenseNet introduces the concept of concatenating feature maps from different layers rather than adding or averaging them as in some other architectures. Each layer receives feature maps from all preceding layers in the block, leading to rich feature representations being passed along through the network.
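The concatenation pattern described above can be sketched as follows: each layer in a dense block consumes the concatenation of the block input and all earlier layers' outputs, and adds growth_rate new feature maps (the channel counts here are illustrative, not from a specific DenseNet variant).

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all preceding feature maps."""
    def __init__(self, in_ch, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Layer i sees in_ch + i * growth_rate input channels
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1)))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

# 4 layers with growth rate 12 add 48 channels to the 16 input channels
block = DenseBlock(16, growth_rate=12, num_layers=4)
out = block(torch.randn(1, 16, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Because each layer adds only growth_rate channels, dense blocks stay parameter-efficient even though every layer sees all earlier features.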

Different versions and improvements

DenseNet architectures come in various forms, such as DenseNet-121, DenseNet-169, DenseNet-201, etc., denoting the depth of the network. These models have shown impressive performance in tasks like image classification, object detection, and segmentation, among others, due to their efficient use of parameters and feature reusability across layers.

AlexNet

Developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, AlexNet gained attention after winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. It was a deeper CNN than previous models, utilizing convolutional layers, max-pooling, ReLU activations, and dropout layers. AlexNet played a pivotal role in popularizing deep learning in computer vision.

AlexNet not only made an impact through its specific design but also catalyzed a broader recognition of the potential of deep neural networks in computer vision. Its triumph spurred subsequent research, innovation, and progress in artificial intelligence and image comprehension.

Architecture

In a departure from traditional activation functions, AlexNet embraced the rectified linear unit (ReLU), enhancing training speed by introducing non-linearity and mitigating the vanishing gradient issue. AlexNet used local response normalization (LRN) in certain layers to normalize responses, contributing to the model's ability to generalize. To combat overfitting, AlexNet integrated dropout, a regularization technique that selectively deactivates neurons during training to discourage co-dependency among them. AlexNet also relied heavily on GPU computation, enabling accelerated training compared to traditional CPU-based methods.
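A condensed PyTorch sketch of those ingredients, showing ReLU activations, LRN after the early conv layers, and dropout before the fully connected classifier (layer sizes follow the 2012 paper's first two stages; the rest of the network is omitted for brevity):

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(3, stride=2),
)
classifier = nn.Sequential(
    nn.Dropout(0.5),            # randomly deactivate half the neurons in training
    nn.Linear(256 * 13 * 13, 4096), nn.ReLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(4096, 1000),      # 1000 ImageNet classes
)

x = features(torch.randn(1, 3, 227, 227))
print(x.shape)                  # torch.Size([1, 256, 13, 13])
logits = classifier(torch.flatten(x, 1))
print(logits.shape)             # torch.Size([1, 1000])
```

Note that modern reimplementations usually drop LRN in favor of batch normalization, which had not been invented in 2012.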

AlexNet’s victory in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 marked a milestone, significantly advancing the state-of-the-art in image classification and showcasing the prowess of deep learning and CNNs in intricate visual tasks.

Different versions and improvements

Following the groundbreaking success of AlexNet, several variations and improvements have emerged, aiming to refine its architecture, enhance performance, and tackle various challenges in computer vision tasks. Some notable versions and enhancements include:

1. ZFNet: Developed by Matthew Zeiler and Rob Fergus in 2013, ZFNet made modifications to the original AlexNet architecture by adjusting filter sizes and strides in some layers. These alterations aimed to improve the visualizations of learned features and further enhance performance on image recognition tasks.

2. Overfeat: Overfeat, introduced in 2013, integrated both convolutional layers and fully connected layers in a single network, enabling it to perform object detection and localization tasks alongside image classification.

3. VGGNet: While not a direct evolution of AlexNet, VGGNet, proposed by the Visual Geometry Group at the University of Oxford in 2014, adopted a similar architecture with deeper networks consisting of 16 or 19 weight layers. VGGNet's uniform structure with smaller filter sizes (3x3) in all layers contributed to its simplicity and effectiveness.

VGGNet (Visual Geometry Group Network)

VGGNet is a significant convolutional neural network (CNN) architecture developed by the Visual Geometry Group at the University of Oxford, introduced in 2014 by Simonyan and Zisserman.

VGGNet is characterized by its straightforward and uniform architecture. It consists of a series of convolutional layers, followed by max-pooling layers, and concludes with fully connected layers for classification.

Architecture

Throughout the network, VGGNet employs small 3x3 convolutional filters with a stride of 1 and a padding of 1. This uniform use of small filters allows for a deeper network while maintaining a simpler structure. VGGNet uses max-pooling layers with a 2x2 window and a stride of 2 after every few convolutional layers, reducing the spatial dimensions of the feature maps while retaining essential information. Rectified Linear Units (ReLU) serve as the activation function throughout the network, introducing non-linearity and aiding in the convergence of the training process. Following the convolutional layers, VGGNet ends with fully connected layers for classification, employing softmax activation to generate class probabilities.
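That uniform recipe makes VGG easy to express as a reusable block: a stack of 3x3 convolutions (stride 1, padding 1) followed by a 2x2 max-pool. The sketch below builds the first two stages of VGG16 as an illustration.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """Stack of 3x3 convs (stride 1, padding 1) followed by 2x2 max-pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(2, stride=2))   # halves the spatial dimensions
    return nn.Sequential(*layers)

# First two stages of VGG16: 224x224 -> 112x112 -> 56x56
stages = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2))
out = stages(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 128, 56, 56])
```

Because the 3x3/pad-1 convolutions never change spatial size, only the pooling layers do, the whole network's shape bookkeeping stays trivially simple.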

VGGNet’s simplicity, uniform architecture, and use of small filters contributed to its ease of understanding and implementation. While it didn’t introduce major architectural innovations like residual connections in ResNet or inception modules in GoogLeNet, VGGNet served as a baseline model and benchmark for deeper CNNs. Its design principles and performance on image classification tasks have had a lasting impact on the development of convolutional neural networks.

Different versions and improvements

VGGNet comes in variations with different depths, commonly known as VGG16 and VGG19. VGG16 consists of 16 weight layers (13 convolutional and 3 fully connected), while VGG19 is deeper, with 19 weight layers (16 convolutional and 3 fully connected). Some adaptations involve modifying the depth or number of filters within VGG-like architectures to suit specific requirements or computational constraints. For instance, creating shallower versions or altering the number of channels in the convolutional layers.

While there hasn’t been a labeled sequence of improvements or direct versions as with some other architectures like ResNet or Inception, VGGNet’s principles have influenced subsequent developments in deep learning architectures and have been adapted, fine-tuned, or employed as fundamental components in various computer vision tasks and frameworks.

Code using these architectures is available here:

https://github.com/Mukhriddin19980901

