Everything you need to know about VGG16

Great Learning
17 min read · Sep 23, 2021

Author: Rohini G

Introduction

This blog gives you an insight into the VGG16 architecture and explains it with a use case: image classification on a car dataset.

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual competition that benchmarks computer vision models. In the 2014 challenge, Karen Simonyan and Andrew Zisserman from the Visual Geometry Group, Department of Engineering Science, University of Oxford presented their model in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition," which secured first place in the localization task and second place in the classification task. The original paper can be downloaded from the link below:

Very Deep Convolutional Networks for Large-Scale Image Recognition (https://arxiv.org/abs/1409.1556)

What is VGG16

A convolutional neural network (CNN, or ConvNet) is a kind of artificial neural network with an input layer, an output layer, and various hidden layers. VGG16 is a CNN that is still considered one of the best computer vision models to date. Its creators evaluated networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which showed a significant improvement over prior-art configurations. They pushed the depth to 16–19 weight layers, giving the model approximately 138 million trainable parameters.
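To see where those small 3 × 3 filters fit into the parameter count, note that a convolutional layer has (filter height × filter width × input channels + 1 bias) × output channels weights. For the very first VGG16 layer that is (3 × 3 × 3 + 1) × 64 = 1,792 parameters; the bulk of the roughly 138 million parameters actually sits in the fully-connected layers at the end of the network.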

What is VGG16 used for

VGG16 is an image classification architecture (also widely used as a backbone for object detection) that achieves 92.7% top-5 accuracy on the 1000-class ImageNet dataset. It is one of the most popular models for image classification and is easy to use with transfer learning.

VGG16 Architecture

  • The 16 in VGG16 refers to the 16 layers that have weights. VGG16 contains thirteen convolutional layers, five max-pooling layers, and three dense (fully-connected) layers, which add up to 21 layers, but only the sixteen layers with learnable parameters count towards the name.
  • VGG16 takes an input tensor of size 224 × 224 with 3 RGB channels.
  • The most distinctive thing about VGG16 is that, instead of a large number of hyper-parameters, it consistently uses 3 × 3 convolution filters with stride 1 and same padding, together with 2 × 2 max-pooling with stride 2.
  • This arrangement of convolution and max-pooling layers is repeated consistently throughout the whole architecture.
  • The Conv-1 block has 64 filters, Conv-2 has 128, Conv-3 has 256, and Conv-4 and Conv-5 each have 512 filters.
  • Three fully-connected (FC) layers follow the stack of convolutional layers: the first two have 4096 channels each, and the third performs the 1000-way ILSVRC classification and therefore contains 1000 channels (one for each class). The final layer is the softmax layer. (A quick way to inspect this stack in code is shown just after this list.)
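For reference, here is a minimal sketch (assuming TensorFlow 2.x is installed) that builds the canonical VGG16 architecture bundled with Keras and prints its layer stack, so you can verify the 13 convolutional, 5 pooling, and 3 dense layers described above without downloading any weights:

# Minimal sketch: inspect the reference VGG16 architecture shipped with Keras.
# weights=None builds the architecture without downloading the ~500 MB ImageNet weights.
from tensorflow.keras.applications import VGG16 as KerasVGG16

reference_model = KerasVGG16(weights=None, input_shape=(224, 224, 3), classes=1000)
reference_model.summary()  # prints the 13 conv, 5 max-pool and 3 dense layers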

Train VGG16 from scratch

Let us use the Stanford Cars dataset as a use case for image classification. You can download the dataset from the link below.

https://ai.stanford.edu/~jkrause/cars/car_dataset.html

Once you have downloaded the images, proceed with the steps below.

Import the necessary libraries

%tensorflow_version 2.x

import tensorflow as tf
import cv2
import numpy as np
import os
import keras
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Dense, Flatten, Conv2D, Activation, Dropout
from keras.layers import MaxPool2D
from keras import backend as K
from keras.models import Sequential, Model
from keras.models import load_model
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping, ModelCheckpoint
from google.colab.patches import cv2_imshow

Detailed EDA (exploratory data analysis) has to be performed on the dataset. Check the image sizes/dimensions and see whether there are images that are far too big or too small, i.e. outliers. Check other image-related EDA attributes such as blurriness, whiteness, aspect ratio, etc.
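As an example, the snippet below is a small sketch of such a size check (the folder path is hypothetical; point it at wherever you extracted the Stanford Cars images). It walks the dataset, collects the width and height of every image with Pillow, and prints the extremes so unusually small or large images stand out:

import os
from PIL import Image

data_dir = "Car Images/Train Images"  # hypothetical path to the extracted dataset
sizes = []
for root, _, files in os.walk(data_dir):
    for name in files:
        if name.lower().endswith((".jpg", ".jpeg", ".png")):
            with Image.open(os.path.join(root, name)) as img:
                sizes.append(img.size)  # (width, height)

widths, heights = zip(*sizes)
print("images:", len(sizes))
print("width  min/max:", min(widths), max(widths))
print("height min/max:", min(heights), max(heights))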

Import the dataset and normalize the data so that it is suitable for the VGG16 model. The Stanford Cars dataset has cars of various sizes, pixel values, and dimensions, so we resize every image to 224 × 224, the input size VGG16 expects. The objective of ImageDataGenerator is to import the data, with labels, easily into the model. It is a very useful class with many functions to rescale, rotate, zoom, flip, etc. The most useful thing about this class is that it does not affect the data stored on disk: it alters the data on the fly while passing it to the model. ImageDataGenerator will automatically label all the data inside the folder structure, so the data is easily ready to be passed to the neural network.

train_datagen = ImageDataGenerator(zoom_range=0.15, width_shift_range=0.2, height_shift_range=0.2, shear_range=0.15)

test_datagen = ImageDataGenerator()

train_generator = train_datagen.flow_from_directory("/content/drive/MyDrive/Rohini_Capstone/Car Images/Train Images", target_size=(224, 224), batch_size=32, shuffle=True, class_mode='categorical')

test_generator = test_datagen.flow_from_directory("/content/drive/MyDrive/Rohini_Capstone/Car Images/Test Images/", target_size=(224, 224), batch_size=32, shuffle=False, class_mode='categorical')
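Before building the model, it is worth a quick sanity check that the generators picked up the expected 196 classes; flow_from_directory exposes the discovered folder-to-index mapping through class_indices (a small sketch, run after the cell above):

print(len(train_generator.class_indices))              # expect 196 car classes
print(list(train_generator.class_indices.items())[:3])  # a few (class name, index) pairs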

Define the VGG16 model as a Sequential model:

→ 2 x convolution layers of 64 channels with 3x3 kernels and same padding

→ 1 x max-pool layer with 2x2 pool size and stride 2x2

→ 2 x convolution layers of 128 channels with 3x3 kernels and same padding

→ 1 x max-pool layer with 2x2 pool size and stride 2x2

→ 3 x convolution layers of 256 channels with 3x3 kernels and same padding

→ 1 x max-pool layer with 2x2 pool size and stride 2x2

→ 3 x convolution layers of 512 channels with 3x3 kernels and same padding

→ 1 x max-pool layer with 2x2 pool size and stride 2x2

→ 3 x convolution layers of 512 channels with 3x3 kernels and same padding

→ 1 x max-pool layer with 2x2 pool size and stride 2x2

We also add a ReLU (Rectified Linear Unit) activation to each convolutional layer so that negative values are not passed to the next layer.

def VGG16():
    model = Sequential()

    # Block 1: two 64-filter conv layers + max-pool
    model.add(Conv2D(input_shape=(224, 224, 3), filters=64, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(Conv2D(filters=64, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(MaxPool2D(pool_size=(2, 2), strides=(2, 2)))

    # Block 2: two 128-filter conv layers + max-pool
    model.add(Conv2D(filters=128, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(Conv2D(filters=128, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(MaxPool2D(pool_size=(2, 2), strides=(2, 2)))

    # Block 3: three 256-filter conv layers + max-pool
    model.add(Conv2D(filters=256, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(Conv2D(filters=256, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(Conv2D(filters=256, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(MaxPool2D(pool_size=(2, 2), strides=(2, 2)))

    # Block 4: three 512-filter conv layers + max-pool
    model.add(Conv2D(filters=512, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(Conv2D(filters=512, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(Conv2D(filters=512, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(MaxPool2D(pool_size=(2, 2), strides=(2, 2)))

    # Block 5: three 512-filter conv layers + max-pool (named 'vgg16' so we can split the model here)
    model.add(Conv2D(filters=512, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(Conv2D(filters=512, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(Conv2D(filters=512, kernel_size=(3, 3), padding="same", activation="relu"))
    model.add(MaxPool2D(pool_size=(2, 2), strides=(2, 2), name='vgg16'))

    # Custom classification head for the 196 Stanford Cars classes
    model.add(Flatten(name='flatten'))
    model.add(Dense(256, activation='relu', name='fc1'))
    model.add(Dense(128, activation='relu', name='fc2'))
    model.add(Dense(196, activation='softmax', name='output'))

    return model

After the convolutional stack, we flatten the vector that comes out of the convolutions and pass it to the dense layers:

→ 1 x Dense layer of 256 units

→ 1 x Dense layer of 128 units

→ 1 x Dense softmax layer of 196 units

Set the output layer dimension to the number of classes in the line below. The Stanford Cars dataset has 196 classes, hence 196 units in the output layer. The softmax activation function is used because this is a multi-class classification problem: for each class it outputs a value between 0 and 1 representing the model's confidence that the image belongs to that class.

model.add(Dense(196, activation='softmax', name='output'))

Build the model and print the summary to inspect its structure.

model = VGG16()

model.summary()

Vgg16 = Model(inputs=model.input, outputs=model.get_layer('vgg16').output)

Load the pre-trained weights of the VGG16 convolutional base (vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5) so that we do not have to retrain those layers from scratch.

Vgg16.load_weights("/content/drive/MyDrive/Rohini_Capstone/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5")

Set up early stopping so that training halts when validation accuracy stops improving for a given number of epochs (the patience).

es = EarlyStopping(monitor='val_accuracy', mode='max', verbose=1, patience=20)

We use SGD (stochastic gradient descent) as the optimizer; Adam would also work. Categorical cross-entropy is used as the loss because this is a 196-class problem (binary cross-entropy would be used if there were only two classes). We also specify the optimizer's learning rate, set here to 1e-6. If training bounces around a lot across epochs, decrease the learning rate so the optimizer can settle closer to a minimum.

Compile the model with the optimizer, loss, and metrics.

opt = SGD(learning_rate=1e-6, momentum=0.9)

model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

The code below freezes the layers of the pre-trained VGG16 base so they are not trained again and their weights are simply reused for feature extraction. This is transfer learning, which saves a lot of effort and resources compared to re-training from scratch.

for layer in Vgg16.layers:
    layer.trainable = False

for layer in model.layers:
    print(layer, layer.trainable)

Output :

<keras.layers.convolutional.Conv2D object at 0x7f7f0cd09490> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb86edcd0> False

<keras.layers.pooling.MaxPooling2D object at 0x7f7eb9557d90> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb952b710> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb9503950> False

<keras.layers.pooling.MaxPooling2D object at 0x7f7f16ce6ed0> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb94f99d0> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb9514150> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb9510b10> False

<keras.layers.pooling.MaxPooling2D object at 0x7f7eb950e810> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb951df10> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb94aed90> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb95226d0> False

<keras.layers.pooling.MaxPooling2D object at 0x7f7eb951d710> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb951ac10> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb7047ad0> False

<keras.layers.convolutional.Conv2D object at 0x7f7eb95631d0> False

<keras.layers.pooling.MaxPooling2D object at 0x7f7eb94b3390> False

<keras.layers.core.Flatten object at 0x7f7eb94b6850> True

<keras.layers.core.Dense object at 0x7f7eb950e710> True

<keras.layers.core.Dense object at 0x7f7eb94f4bd0> True

<keras.layers.core.Dense object at 0x7f7eb94b35d0> True

Create a model checkpoint that saves only the best model (by validation accuracy) so it can be reused later for testing and validation.

mc = ModelCheckpoint('/content/drive/MyDrive/Rohini_Capstone/vgg16_best_model_1.h5', monitor='val_accuracy', mode='max', save_best_only=True)

Fit the model with the training and test data generators. We train for up to 150 epochs.

H = model.fit_generator(train_generator,validation_data=test_generator,epochs=150,verbose=1,callbacks=[mc,es])

Epoch 1/150

255/255 [==============================] — 8155s 32s/step — loss: 0.0916 — accuracy: 0.9725 — val_loss: 2.6485 — val_accuracy: 0.5815

Epoch 2/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0736 — accuracy: 0.9801 — val_loss: 2.6269 — val_accuracy: 0.5834

Epoch 3/150

255/255 [==============================] — 203s 797ms/step — loss: 0.0849 — accuracy: 0.9763 — val_loss: 2.6085 — val_accuracy: 0.5845

Epoch 4/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0778 — accuracy: 0.9786 — val_loss: 2.5922 — val_accuracy: 0.5859

Epoch 5/150

255/255 [==============================] — 203s 796ms/step — loss: 0.0635 — accuracy: 0.9840 — val_loss: 2.5806 — val_accuracy: 0.5867

Epoch 6/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0722 — accuracy: 0.9806 — val_loss: 2.5694 — val_accuracy: 0.5889

Epoch 7/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0668 — accuracy: 0.9802 — val_loss: 2.5613 — val_accuracy: 0.5897

Epoch 8/150

255/255 [==============================] — 205s 805ms/step — loss: 0.0598 — accuracy: 0.9824 — val_loss: 2.5532 — val_accuracy: 0.5900

Epoch 9/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0605 — accuracy: 0.9833 — val_loss: 2.5450 — val_accuracy: 0.5905

Epoch 10/150

255/255 [==============================] — 204s 803ms/step — loss: 0.0724 — accuracy: 0.9824 — val_loss: 2.5381 — val_accuracy: 0.5911

Epoch 11/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0614 — accuracy: 0.9817 — val_loss: 2.5319 — val_accuracy: 0.5925

Epoch 12/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0642 — accuracy: 0.9842 — val_loss: 2.5265 — val_accuracy: 0.5933

Epoch 13/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0589 — accuracy: 0.9853 — val_loss: 2.5220 — val_accuracy: 0.5945

Epoch 14/150

255/255 [==============================] — 205s 803ms/step — loss: 0.0604 — accuracy: 0.9830 — val_loss: 2.5171 — val_accuracy: 0.5948

Epoch 15/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0543 — accuracy: 0.9866 — val_loss: 2.5129 — val_accuracy: 0.5958

Epoch 16/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0638 — accuracy: 0.9830 — val_loss: 2.5094 — val_accuracy: 0.5964

Epoch 17/150

255/255 [==============================] — 204s 799ms/step — loss: 0.0608 — accuracy: 0.9824 — val_loss: 2.5054 — val_accuracy: 0.5969

Epoch 18/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0626 — accuracy: 0.9844 — val_loss: 2.5017 — val_accuracy: 0.5976

Epoch 19/150

255/255 [==============================] — 203s 797ms/step — loss: 0.0558 — accuracy: 0.9847 — val_loss: 2.4989 — val_accuracy: 0.5972

Epoch 20/150

255/255 [==============================] — 203s 795ms/step — loss: 0.0627 — accuracy: 0.9812 — val_loss: 2.4961 — val_accuracy: 0.5979

Epoch 21/150

255/255 [==============================] — 203s 796ms/step — loss: 0.0582 — accuracy: 0.9848 — val_loss: 2.4937 — val_accuracy: 0.5977

Epoch 22/150

255/255 [==============================] — 204s 799ms/step — loss: 0.0474 — accuracy: 0.9902 — val_loss: 2.4911 — val_accuracy: 0.5966

Epoch 23/150

255/255 [==============================] — 203s 795ms/step — loss: 0.0525 — accuracy: 0.9871 — val_loss: 2.4889 — val_accuracy: 0.5973

Epoch 24/150

255/255 [==============================] — 202s 794ms/step — loss: 0.0596 — accuracy: 0.9828 — val_loss: 2.4870 — val_accuracy: 0.5983

Epoch 25/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0520 — accuracy: 0.9869 — val_loss: 2.4852 — val_accuracy: 0.5982

Epoch 26/150

255/255 [==============================] — 203s 795ms/step — loss: 0.0590 — accuracy: 0.9849 — val_loss: 2.4837 — val_accuracy: 0.5988

Epoch 27/150

255/255 [==============================] — 203s 799ms/step — loss: 0.0519 — accuracy: 0.9862 — val_loss: 2.4820 — val_accuracy: 0.5988

Epoch 28/150

255/255 [==============================] — 203s 796ms/step — loss: 0.0552 — accuracy: 0.9857 — val_loss: 2.4805 — val_accuracy: 0.5994

Epoch 29/150

255/255 [==============================] — 203s 799ms/step — loss: 0.0586 — accuracy: 0.9827 — val_loss: 2.4790 — val_accuracy: 0.5998

Epoch 30/150

255/255 [==============================] — 204s 798ms/step — loss: 0.0551 — accuracy: 0.9854 — val_loss: 2.4775 — val_accuracy: 0.6000

Epoch 31/150

255/255 [==============================] — 203s 799ms/step — loss: 0.0597 — accuracy: 0.9829 — val_loss: 2.4754 — val_accuracy: 0.6009

Epoch 32/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0566 — accuracy: 0.9855 — val_loss: 2.4746 — val_accuracy: 0.6013

Epoch 33/150

255/255 [==============================] — 204s 803ms/step — loss: 0.0635 — accuracy: 0.9827 — val_loss: 2.4741 — val_accuracy: 0.6014

Epoch 34/150

255/255 [==============================] — 205s 807ms/step — loss: 0.0568 — accuracy: 0.9848 — val_loss: 2.4726 — val_accuracy: 0.6010

Epoch 35/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0556 — accuracy: 0.9857 — val_loss: 2.4719 — val_accuracy: 0.6010

Epoch 36/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0518 — accuracy: 0.9889 — val_loss: 2.4708 — val_accuracy: 0.6019

Epoch 37/150

255/255 [==============================] — 205s 803ms/step — loss: 0.0440 — accuracy: 0.9905 — val_loss: 2.4704 — val_accuracy: 0.6017

Epoch 38/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0553 — accuracy: 0.9873 — val_loss: 2.4703 — val_accuracy: 0.6025

Epoch 39/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0561 — accuracy: 0.9871 — val_loss: 2.4694 — val_accuracy: 0.6027

Epoch 40/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0560 — accuracy: 0.9853 — val_loss: 2.4685 — val_accuracy: 0.6033

Epoch 41/150

255/255 [==============================] — 205s 805ms/step — loss: 0.0520 — accuracy: 0.9902 — val_loss: 2.4676 — val_accuracy: 0.6039

Epoch 42/150

255/255 [==============================] — 205s 805ms/step — loss: 0.0584 — accuracy: 0.9874 — val_loss: 2.4670 — val_accuracy: 0.6029

Epoch 43/150

255/255 [==============================] — 205s 806ms/step — loss: 0.0517 — accuracy: 0.9850 — val_loss: 2.4659 — val_accuracy: 0.6038

Epoch 44/150

255/255 [==============================] — 203s 797ms/step — loss: 0.0576 — accuracy: 0.9854 — val_loss: 2.4644 — val_accuracy: 0.6033

Epoch 45/150

255/255 [==============================] — 205s 805ms/step — loss: 0.0617 — accuracy: 0.9829 — val_loss: 2.4634 — val_accuracy: 0.6035

Epoch 46/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0560 — accuracy: 0.9849 — val_loss: 2.4625 — val_accuracy: 0.6039

Epoch 47/150

255/255 [==============================] — 205s 805ms/step — loss: 0.0489 — accuracy: 0.9878 — val_loss: 2.4620 — val_accuracy: 0.6042

Epoch 48/150

255/255 [==============================] — 206s 809ms/step — loss: 0.0494 — accuracy: 0.9878 — val_loss: 2.4613 — val_accuracy: 0.6040

Epoch 49/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0451 — accuracy: 0.9883 — val_loss: 2.4609 — val_accuracy: 0.6040

Epoch 50/150

255/255 [==============================] — 206s 810ms/step — loss: 0.0482 — accuracy: 0.9858 — val_loss: 2.4600 — val_accuracy: 0.6042

Epoch 51/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0514 — accuracy: 0.9871 — val_loss: 2.4597 — val_accuracy: 0.6043

Epoch 52/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0520 — accuracy: 0.9857 — val_loss: 2.4592 — val_accuracy: 0.6043

Epoch 53/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0566 — accuracy: 0.9857 — val_loss: 2.4586 — val_accuracy: 0.6040

Epoch 54/150

255/255 [==============================] — 203s 797ms/step — loss: 0.0485 — accuracy: 0.9882 — val_loss: 2.4585 — val_accuracy: 0.6043

Epoch 55/150

255/255 [==============================] — 203s 799ms/step — loss: 0.0545 — accuracy: 0.9855 — val_loss: 2.4581 — val_accuracy: 0.6045

Epoch 56/150

255/255 [==============================] — 206s 810ms/step — loss: 0.0557 — accuracy: 0.9871 — val_loss: 2.4576 — val_accuracy: 0.6045

Epoch 57/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0466 — accuracy: 0.9875 — val_loss: 2.4571 — val_accuracy: 0.6042

Epoch 58/150

255/255 [==============================] — 204s 803ms/step — loss: 0.0537 — accuracy: 0.9847 — val_loss: 2.4565 — val_accuracy: 0.6045

Epoch 59/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0538 — accuracy: 0.9873 — val_loss: 2.4562 — val_accuracy: 0.6045

Epoch 60/150

255/255 [==============================] — 203s 796ms/step — loss: 0.0422 — accuracy: 0.9912 — val_loss: 2.4560 — val_accuracy: 0.6034

Epoch 61/150

255/255 [==============================] — 203s 799ms/step — loss: 0.0521 — accuracy: 0.9878 — val_loss: 2.4556 — val_accuracy: 0.6034

Epoch 62/150

255/255 [==============================] — 203s 797ms/step — loss: 0.0566 — accuracy: 0.9843 — val_loss: 2.4551 — val_accuracy: 0.6035

Epoch 63/150

255/255 [==============================] — 203s 796ms/step — loss: 0.0496 — accuracy: 0.9881 — val_loss: 2.4541 — val_accuracy: 0.6038

Epoch 64/150

255/255 [==============================] — 203s 797ms/step — loss: 0.0489 — accuracy: 0.9855 — val_loss: 2.4543 — val_accuracy: 0.6042

Epoch 65/150

255/255 [==============================] — 204s 799ms/step — loss: 0.0495 — accuracy: 0.9883 — val_loss: 2.4541 — val_accuracy: 0.6037

Epoch 66/150

255/255 [==============================] — 203s 796ms/step — loss: 0.0453 — accuracy: 0.9899 — val_loss: 2.4540 — val_accuracy: 0.6040

Epoch 67/150

255/255 [==============================] — 203s 796ms/step — loss: 0.0470 — accuracy: 0.9879 — val_loss: 2.4532 — val_accuracy: 0.6043

Epoch 68/150

255/255 [==============================] — 203s 796ms/step — loss: 0.0461 — accuracy: 0.9897 — val_loss: 2.4528 — val_accuracy: 0.6040

Epoch 69/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0504 — accuracy: 0.9855 — val_loss: 2.4524 — val_accuracy: 0.6037

Epoch 70/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0497 — accuracy: 0.9880 — val_loss: 2.4519 — val_accuracy: 0.6039

Epoch 71/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0503 — accuracy: 0.9886 — val_loss: 2.4520 — val_accuracy: 0.6040

Epoch 72/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0526 — accuracy: 0.9836 — val_loss: 2.4516 — val_accuracy: 0.6037

Epoch 73/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0598 — accuracy: 0.9843 — val_loss: 2.4512 — val_accuracy: 0.6039

Epoch 74/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0450 — accuracy: 0.9896 — val_loss: 2.4508 — val_accuracy: 0.6047

Epoch 75/150

255/255 [==============================] — 202s 793ms/step — loss: 0.0475 — accuracy: 0.9858 — val_loss: 2.4505 — val_accuracy: 0.6047

Epoch 76/150

255/255 [==============================] — 202s 793ms/step — loss: 0.0556 — accuracy: 0.9846 — val_loss: 2.4498 — val_accuracy: 0.6047

Epoch 77/150

255/255 [==============================] — 203s 797ms/step — loss: 0.0551 — accuracy: 0.9850 — val_loss: 2.4494 — val_accuracy: 0.6047

Epoch 78/150

255/255 [==============================] — 203s 799ms/step — loss: 0.0652 — accuracy: 0.9849 — val_loss: 2.4487 — val_accuracy: 0.6048

Epoch 79/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0444 — accuracy: 0.9889 — val_loss: 2.4490 — val_accuracy: 0.6053

Epoch 80/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0498 — accuracy: 0.9880 — val_loss: 2.4488 — val_accuracy: 0.6053

Epoch 81/150

255/255 [==============================] — 204s 799ms/step — loss: 0.0462 — accuracy: 0.9883 — val_loss: 2.4489 — val_accuracy: 0.6051

Epoch 82/150

255/255 [==============================] — 205s 805ms/step — loss: 0.0502 — accuracy: 0.9877 — val_loss: 2.4485 — val_accuracy: 0.6056

Epoch 83/150

255/255 [==============================] — 205s 805ms/step — loss: 0.0442 — accuracy: 0.9897 — val_loss: 2.4480 — val_accuracy: 0.6059

Epoch 84/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0432 — accuracy: 0.9889 — val_loss: 2.4478 — val_accuracy: 0.6060

Epoch 85/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0528 — accuracy: 0.9854 — val_loss: 2.4472 — val_accuracy: 0.6068

Epoch 86/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0517 — accuracy: 0.9871 — val_loss: 2.4471 — val_accuracy: 0.6064

Epoch 87/150

255/255 [==============================] — 206s 809ms/step — loss: 0.0431 — accuracy: 0.9890 — val_loss: 2.4468 — val_accuracy: 0.6064

Epoch 88/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0466 — accuracy: 0.9894 — val_loss: 2.4464 — val_accuracy: 0.6070

Epoch 89/150

255/255 [==============================] — 204s 799ms/step — loss: 0.0442 — accuracy: 0.9899 — val_loss: 2.4459 — val_accuracy: 0.6064

Epoch 90/150

255/255 [==============================] — 203s 798ms/step — loss: 0.0445 — accuracy: 0.9887 — val_loss: 2.4462 — val_accuracy: 0.6070

Epoch 91/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0498 — accuracy: 0.9864 — val_loss: 2.4464 — val_accuracy: 0.6066

Epoch 92/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0471 — accuracy: 0.9874 — val_loss: 2.4465 — val_accuracy: 0.6070

Epoch 93/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0541 — accuracy: 0.9841 — val_loss: 2.4458 — val_accuracy: 0.6065

Epoch 94/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0527 — accuracy: 0.9880 — val_loss: 2.4458 — val_accuracy: 0.6061

Epoch 95/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0467 — accuracy: 0.9882 — val_loss: 2.4455 — val_accuracy: 0.6063

Epoch 96/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0529 — accuracy: 0.9862 — val_loss: 2.4447 — val_accuracy: 0.6066

Epoch 97/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0438 — accuracy: 0.9889 — val_loss: 2.4443 — val_accuracy: 0.6069

Epoch 98/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0513 — accuracy: 0.9875 — val_loss: 2.4444 — val_accuracy: 0.6068

Epoch 99/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0458 — accuracy: 0.9889 — val_loss: 2.4442 — val_accuracy: 0.6065

Epoch 100/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0538 — accuracy: 0.9865 — val_loss: 2.4443 — val_accuracy: 0.6059

Epoch 101/150

255/255 [==============================] — 206s 808ms/step — loss: 0.0470 — accuracy: 0.9879 — val_loss: 2.4440 — val_accuracy: 0.6073

Epoch 102/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0477 — accuracy: 0.9870 — val_loss: 2.4436 — val_accuracy: 0.6068

Epoch 103/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0458 — accuracy: 0.9884 — val_loss: 2.4432 — val_accuracy: 0.6070

Epoch 104/150

255/255 [==============================] — 204s 799ms/step — loss: 0.0573 — accuracy: 0.9856 — val_loss: 2.4429 — val_accuracy: 0.6070

Epoch 105/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0552 — accuracy: 0.9870 — val_loss: 2.4427 — val_accuracy: 0.6076

Epoch 106/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0456 — accuracy: 0.9884 — val_loss: 2.4429 — val_accuracy: 0.6075

Epoch 107/150

255/255 [==============================] — 205s 803ms/step — loss: 0.0505 — accuracy: 0.9861 — val_loss: 2.4425 — val_accuracy: 0.6083

Epoch 108/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0523 — accuracy: 0.9865 — val_loss: 2.4423 — val_accuracy: 0.6081

Epoch 109/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0516 — accuracy: 0.9880 — val_loss: 2.4424 — val_accuracy: 0.6086

Epoch 110/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0453 — accuracy: 0.9873 — val_loss: 2.4425 — val_accuracy: 0.6085

Epoch 111/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0459 — accuracy: 0.9894 — val_loss: 2.4424 — val_accuracy: 0.6090

Epoch 112/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0503 — accuracy: 0.9857 — val_loss: 2.4423 — val_accuracy: 0.6084

Epoch 113/150

255/255 [==============================] — 205s 807ms/step — loss: 0.0516 — accuracy: 0.9876 — val_loss: 2.4420 — val_accuracy: 0.6090

Epoch 114/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0503 — accuracy: 0.9857 — val_loss: 2.4419 — val_accuracy: 0.6086

Epoch 115/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0453 — accuracy: 0.9869 — val_loss: 2.4417 — val_accuracy: 0.6084

Epoch 116/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0466 — accuracy: 0.9876 — val_loss: 2.4418 — val_accuracy: 0.6081

Epoch 117/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0431 — accuracy: 0.9901 — val_loss: 2.4422 — val_accuracy: 0.6079

Epoch 118/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0515 — accuracy: 0.9881 — val_loss: 2.4422 — val_accuracy: 0.6080

Epoch 119/150

255/255 [==============================] — 204s 802ms/step — loss: 0.0440 — accuracy: 0.9898 — val_loss: 2.4418 — val_accuracy: 0.6084

Epoch 120/150

255/255 [==============================] — 207s 811ms/step — loss: 0.0478 — accuracy: 0.9879 — val_loss: 2.4417 — val_accuracy: 0.6078

Epoch 121/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0465 — accuracy: 0.9876 — val_loss: 2.4415 — val_accuracy: 0.6081

Epoch 122/150

255/255 [==============================] — 205s 803ms/step — loss: 0.0448 — accuracy: 0.9892 — val_loss: 2.4413 — val_accuracy: 0.6081

Epoch 123/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0496 — accuracy: 0.9885 — val_loss: 2.4411 — val_accuracy: 0.6081

Epoch 124/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0480 — accuracy: 0.9892 — val_loss: 2.4406 — val_accuracy: 0.6083

Epoch 125/150

255/255 [==============================] — 204s 801ms/step — loss: 0.0469 — accuracy: 0.9878 — val_loss: 2.4406 — val_accuracy: 0.6081

Epoch 126/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0460 — accuracy: 0.9887 — val_loss: 2.4406 — val_accuracy: 0.6080

Epoch 127/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0457 — accuracy: 0.9876 — val_loss: 2.4408 — val_accuracy: 0.6080

Epoch 128/150

255/255 [==============================] — 204s 800ms/step — loss: 0.0416 — accuracy: 0.9889 — val_loss: 2.4408 — val_accuracy: 0.6080

Epoch 129/150

255/255 [==============================] — 205s 804ms/step — loss: 0.0429 — accuracy: 0.9884 — val_loss: 2.4405 — val_accuracy: 0.6086

Epoch 130/150

255/255 [==============================] — 205s 805ms/step — loss: 0.0510 — accuracy: 0.9878 — val_loss: 2.4408 — val_accuracy: 0.6085

Epoch 131/150

255/255 [==============================] — 204s 799ms/step — loss: 0.0488 — accuracy: 0.9878 — val_loss: 2.4412 — val_accuracy: 0.6085

Epoch 00131: early stopping

A validation accuracy of about 60% was achieved, and training stopped at epoch 131 due to early stopping.

Once the model is trained, we can also visualize the training and validation accuracy and loss.

import matplotlib.pyplot as plt

# H is the History object returned by model.fit_generator above
plt.plot(H.history["accuracy"])
plt.plot(H.history["val_accuracy"])
plt.plot(H.history["loss"])
plt.plot(H.history["val_loss"])
plt.title("model accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Epoch")
plt.legend(["Accuracy", "Validation Accuracy", "Loss", "Validation Loss"])
plt.show()

To make predictions with the trained model, we load the best saved model, pre-process the image, and pass it to the model for a prediction.

Load the best saved weights, evaluate the model on the test generator, and save the model architecture to JSON for later reuse.

model.load_weights("/content/drive/MyDrive/Rohini_Capstone/vgg16_best_model_1.h5")

model.evaluate_generator(test_generator)

model_json = model.to_json()
with open("/content/drive/MyDrive/Rohini_Capstone/vgg16_cars_model.json", "w") as json_file:
    json_file.write(model_json)

from keras.models import model_from_json

Run the prediction:

def predict_(image_path):
    # Load the model architecture from the JSON file
    json_file = open('/content/drive/MyDrive/Rohini_Capstone/vgg16_cars_model.json', 'r')
    model_json_c = json_file.read()
    json_file.close()
    model_c = model_from_json(model_json_c)

    # Load the weights
    model_c.load_weights("/content/drive/MyDrive/Rohini_Capstone/vgg16_best_model.h5")

    # Compile the model
    opt = SGD(learning_rate=1e-4, momentum=0.9)
    model_c.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

    # Load the image you want to classify
    image = cv2.imread(image_path)
    image = cv2.resize(image, (224, 224))
    cv2_imshow(image)

    # Predict the class of the image
    preds = model_c.predict_classes(np.expand_dims(image, axis=0))[0]
    print("Predicted Label", preds)

predict_("/content/drive/MyDrive/Rohini_Capstone/Car Images/Test Images/Rolls-Royce Phantom Sedan 2012/06155.jpg")

Predicted Label 176

Predicted Label 29
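The printed label is just the class index assigned by flow_from_directory. To map it back to a human-readable class name, you can invert the generator's class_indices dictionary; a small sketch, assuming train_generator from the training step is still in scope:

# Map a predicted class index back to the car class (folder) name
index_to_class = {v: k for k, v in train_generator.class_indices.items()}
print(index_to_class[176])  # prints the folder name behind label 176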

This is a complete implementation of VGG16 in Keras using ImageDataGenerator. The model can be adapted to any number of classes by changing the number of units in the final softmax dense layer to match the classes you need to classify.

What are the challenges of VGG16

  • It is very slow to train (the original VGG model was trained on Nvidia Titan GPUs for 2–3 weeks).
  • The VGG16 weights trained on ImageNet are about 528 MB, so the model takes up considerable disk space and bandwidth, which makes it inefficient to deploy (see the quick check below the list).
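That file size follows almost directly from the parameter count. As a back-of-the-envelope check: roughly 138 million parameters stored as 32-bit (4-byte) floats give about 138,000,000 × 4 ≈ 553 million bytes, which is approximately 528 MB when expressed in binary megabytes, matching the weight file quoted above.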

References

Simonyan, K. and Zisserman, A., "Very Deep Convolutional Networks for Large-Scale Image Recognition" (https://arxiv.org/abs/1409.1556)

"Step by step VGG16 implementation in Keras for beginners" by Rohit Thakur, Towards Data Science

"VGG-16 | CNN model", GeeksforGeeks

Russakovsky, O. et al., "ImageNet Large Scale Visual Recognition Challenge" (https://arxiv.org/abs/1409.0575)
