Five Powerful CNN Architectures

Faisal Shahbaz
Nov 11, 2018 · 12 min read

Let’s go over some of the powerful Convolutional Neural Networks that laid the foundation of today’s deep-learning-based Computer Vision achievements.

LeNet-5 — LeCun et al

LeNet-5 — Architecture

The handwritten digits were digitized into 32×32 grayscale images. Computational capacity at the time was limited, so the technique didn’t scale to larger images.

The model contains 7 layers, excluding the input layer. Since it is a relatively small architecture, let’s go through it layer by layer:

  1. Layer 1: A convolutional layer with a kernel size of 5×5, a stride of 1×1 and 6 kernels in total. The 32x32x1 input image gives an output of 28x28x6. Total params in layer = 5 * 5 * 1 * 6 + 6 (bias terms) = 156
  2. Layer 2: A pooling layer with a 2×2 kernel size, a stride of 2×2 and 6 filters. This pooling layer acted a little differently from what we discussed in the previous post: the input values in the receptive field were summed up, multiplied by a trainable coefficient (1 per filter), and a trainable bias (1 per filter) was added; finally, a sigmoid activation was applied to the output. So the 28x28x6 input from the previous layer gets sub-sampled to 14x14x6. Total params in layer = [1 (trainable coefficient) + 1 (trainable bias)] * 6 = 12
  3. Layer 3: Similar to Layer 1, this is a convolutional layer with the same configuration except it has 16 filters instead of 6. So the 14x14x6 input from the previous layer gives an output of 10x10x16. Total params in layer = 5 * 5 * 6 * 16 + 16 = 2,416. (The original paper connected each of these 16 filters to only a subset of the 6 input maps, which brings the count down to 1,516; treating it as a full convolution keeps the arithmetic simple.)
  4. Layer 4: Again, similar to Layer 2, this is a pooling layer with 16 filters this time around. Remember, the outputs are passed through the sigmoid activation function. The 10x10x16 input from the previous layer gets sub-sampled to 5x5x16. Total params in layer = (1 + 1) * 16 = 32
  5. Layer 5: This time around we have a convolutional layer with a 5×5 kernel size and 120 filters. There is no need to even consider strides, as the input size is 5x5x16, so we get an output of 1x1x120. Total params in layer = 5 * 5 * 16 * 120 + 120 = 48,120
  6. Layer 6: This is a dense layer with 84 units, so the 120 input units are converted to 84 units. Total params = 120 * 84 + 84 = 10,164. The activation used here was a rather unique one, a scaled hyperbolic tangent; the task is simple enough by today’s standards that any common activation will work.
  7. Output Layer: Finally, a dense layer with 10 units (one per digit) is used. Total params = 84 * 10 + 10 = 850.
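To double-check these shapes and counts, here is a minimal sketch in plain Python (no framework needed) that applies the valid-convolution formula, output = (input - kernel) / stride + 1, and tallies weights plus biases; the C3 count assumes full connectivity, as discussed above:

def conv_out(size, kernel, stride=1):
    # valid convolution: (n - k) // s + 1
    return (size - kernel) // stride + 1

size = conv_out(32, 5)    # C1: 32 -> 28
size //= 2                # S2 pooling: 28 -> 14
size = conv_out(size, 5)  # C3: 14 -> 10
size //= 2                # S4 pooling: 10 -> 5
size = conv_out(size, 5)  # C5: 5 -> 1

params = [
    5 * 5 * 1 * 6 + 6,        # C1: 156
    (1 + 1) * 6,              # S2: 12
    5 * 5 * 6 * 16 + 16,      # C3: 2,416 (full connectivity)
    (1 + 1) * 16,             # S4: 32
    5 * 5 * 16 * 120 + 120,   # C5: 48,120
    120 * 84 + 84,            # F6: 10,164
    84 * 10 + 10,             # output: 850
]
print(size, sum(params))      # 1, 61750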

Skipping over the details of the original loss function and why it was chosen, I would suggest using cross-entropy loss with a softmax activation in the last layer. Try out different training schedules and learning rates.

LeNet-5 — CODE

from keras import layers
from keras.models import Model

def lenet_5(in_shape=(32,32,1), n_classes=10, opt='sgd'):
    # a modernized LeNet-5 variant: ReLU and max pooling in place of
    # the paper's sigmoid activations and trainable sub-sampling layers
    in_layer = layers.Input(in_shape)
    conv1 = layers.Conv2D(filters=20, kernel_size=5,
                          padding='same', activation='relu')(in_layer)
    pool1 = layers.MaxPool2D()(conv1)
    conv2 = layers.Conv2D(filters=50, kernel_size=5,
                          padding='same', activation='relu')(pool1)
    pool2 = layers.MaxPool2D()(conv2)
    flatten = layers.Flatten()(pool2)
    dense1 = layers.Dense(500, activation='relu')(flatten)
    preds = layers.Dense(n_classes, activation='softmax')(dense1)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model

if __name__ == '__main__':
    model = lenet_5()
    print(model.summary())

AlexNet — Krizhevsky et al

The network was very similar to LeNet but was much deeper, with around 60 million parameters.

[Figure: architecture diagram from “ImageNet Classification with Deep Convolutional Neural Networks”]

AlexNet — Architecture

Well, that figure certainly looks scary. That is because the network was split into two halves, each trained on a separate GPU. Let’s make this a little easier on ourselves and bring a simpler version into the picture:

[Figure: simplified single-stream AlexNet architecture]

The architecture consists of 5 convolutional layers and 3 fully connected layers. These 8 layers, combined with two concepts that were new at the time, max pooling and ReLU activation, gave the model its edge.

You can see the various layers and their configuration in the figure above. The layers are described in the table below:

[Table: AlexNet layer-by-layer configuration]

Note: ReLU activation is applied to the output of every convolutional and fully connected layer except the final softmax layer.

The authors also used various other techniques (a few of which will be discussed in upcoming posts): dropout, data augmentation and Stochastic Gradient Descent with momentum.
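As a rough sketch of what those training-side pieces look like in Keras (the hyperparameter values here are illustrative, not the paper’s exact settings):

from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator

# Stochastic Gradient Descent with momentum
opt = SGD(lr=0.01, momentum=0.9)

# simple augmentation: random crops approximated here by shifts,
# plus the horizontal flips the authors used
datagen = ImageDataGenerator(width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

Dropout itself already appears in the model code below, as the two Dropout(0.5) layers between the fully connected layers.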

AlexNet — CODE

from keras import layers
from keras.models import Model

def alexnet(in_shape=(227,227,3), n_classes=1000, opt='sgd'):
    # the paper reports 224x224 inputs, but 227x227 makes the
    # stride-4 arithmetic work out: (227 - 11) / 4 + 1 = 55
    in_layer = layers.Input(in_shape)
    conv1 = layers.Conv2D(96, 11, strides=4, activation='relu')(in_layer)
    pool1 = layers.MaxPool2D(3, 2)(conv1)
    conv2 = layers.Conv2D(256, 5, strides=1, padding='same', activation='relu')(pool1)
    pool2 = layers.MaxPool2D(3, 2)(conv2)
    conv3 = layers.Conv2D(384, 3, strides=1, padding='same', activation='relu')(pool2)
    conv4 = layers.Conv2D(384, 3, strides=1, padding='same', activation='relu')(conv3)
    conv5 = layers.Conv2D(256, 3, strides=1, padding='same', activation='relu')(conv4)
    pool3 = layers.MaxPool2D(3, 2)(conv5)
    flattened = layers.Flatten()(pool3)
    dense1 = layers.Dense(4096, activation='relu')(flattened)
    drop1 = layers.Dropout(0.5)(dense1)
    dense2 = layers.Dense(4096, activation='relu')(drop1)
    drop2 = layers.Dropout(0.5)(dense2)
    preds = layers.Dense(n_classes, activation='softmax')(drop2)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model

if __name__ == '__main__':
    model = alexnet()
    print(model.summary())

VGGNet — Simonyan et al

In future posts, we will see how this network became one of the most used choices for feature extraction from images (taking an image and converting it into a smaller array that retains the important information about the image).


VGGNet — Architecture

VGGNet has 2 simple rules of thumb to be followed:

  1. Each convolutional layer has the configuration: kernel size = 3×3, stride = 1×1, padding = same. The only thing that differs is the number of filters.
  2. Each max-pooling layer has the configuration: window size = 2×2, stride = 2×2. Thus, we halve the spatial size of the image at every pooling layer.

The input image was an RGB image of 224×224 pixels, so the input size is 224x224x3.


Total params = 138 million (for VGG-16). Most of these parameters are contributed by the fully connected layers:

  • The first FC layer contributes 4096 * (7 * 7 * 512) + 4096 = 102,764,544
  • The second FC layer contributes 4096 * 4096 + 4096 = 16,781,312
  • The third FC layer contributes 1000 * 4096 + 1000 = 4,097,000

Total params contributed by the FC layers = 123,642,856.
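These numbers are easy to sanity-check in plain Python: the five pooling layers halve 224 down to 7, and the resulting 7x7x512 volume feeds the first FC layer.

size = 224
for _ in range(5):        # one halving per max-pooling layer
    size //= 2
print(size)               # 7

fc1 = 4096 * (7 * 7 * 512) + 4096   # 102,764,544
fc2 = 4096 * 4096 + 4096            # 16,781,312
fc3 = 1000 * 4096 + 1000            # 4,097,000
print(fc1 + fc2 + fc3)              # 123,642,856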

VGGNet — CODE

from keras import layers
from keras.models import Model

from functools import partial

# every VGG convolution: 3x3 kernel, stride 1, 'same' padding, ReLU
conv3 = partial(layers.Conv2D,
                kernel_size=3,
                strides=1,
                padding='same',
                activation='relu')

def block(in_tensor, filters, n_convs):
    conv_block = in_tensor
    for _ in range(n_convs):
        conv_block = conv3(filters=filters)(conv_block)
    return conv_block

def _vgg(in_shape=(224,224,3),
         n_classes=1000,
         opt='sgd',
         n_stages_per_blocks=[2, 2, 3, 3, 3]):
    in_layer = layers.Input(in_shape)

    block1 = block(in_layer, 64, n_stages_per_blocks[0])
    pool1 = layers.MaxPool2D()(block1)
    block2 = block(pool1, 128, n_stages_per_blocks[1])
    pool2 = layers.MaxPool2D()(block2)
    block3 = block(pool2, 256, n_stages_per_blocks[2])
    pool3 = layers.MaxPool2D()(block3)
    block4 = block(pool3, 512, n_stages_per_blocks[3])
    pool4 = layers.MaxPool2D()(block4)
    block5 = block(pool4, 512, n_stages_per_blocks[4])
    pool5 = layers.MaxPool2D()(block5)
    flattened = layers.Flatten()(pool5)   # 7x7x512, as in the parameter count above

    dense1 = layers.Dense(4096, activation='relu')(flattened)
    dense2 = layers.Dense(4096, activation='relu')(dense1)
    preds = layers.Dense(n_classes, activation='softmax')(dense2)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model

def vgg16(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _vgg(in_shape, n_classes, opt)

def vgg19(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _vgg(in_shape, n_classes, opt, [2, 2, 4, 4, 4])

if __name__ == '__main__':
    model = vgg19()
    print(model.summary())

GoogLeNet/Inception — Szegedy et al

[Figure: Inception module]

Reasons for using these inception modules:

  1. Each layer type extracts different information from the input: what a 3×3 convolution gathers differs from what a 5×5 convolution gathers. How do we know which transformation will be best at a given layer? So we use them all!
  2. Dimensionality reduction using 1×1 convolutions! Consider a 128x128x256 input. If we pass it through 20 filters of size 1×1, we get an output of 128x128x20. That is why 1×1 convolutions are applied before the 3×3 and 5×5 convolutions: to reduce the number of channels feeding those expensive layers (see the arithmetic sketch after this list).
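Here is a back-of-the-envelope sketch of the savings, using the inception (3a) numbers: a 28x28x192 input, a 1×1 reduction to 16 channels, and 32 output filters for the 5×5 branch. Multiply counts are positions times output filters times filter volume:

# 5x5 convolution applied directly to a 28x28x192 input
direct = 28 * 28 * 32 * (5 * 5 * 192)        # 120,422,400 multiplies

# 1x1 reduction to 16 channels first, then the 5x5 convolution
reduced = (28 * 28 * 16 * (1 * 1 * 192) +    # the 1x1 reduction
           28 * 28 * 32 * (5 * 5 * 16))      # the 5x5 on 16 channels
print(direct, reduced)                       # 120,422,400 vs 12,443,648

That is roughly a 10x reduction in computation for that branch.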

GoogLeNet/Inception — Architecture

The complete Inception architecture stacks several of these modules one after another, with occasional max-pooling layers in between.

You might see some “auxiliary classifiers” with softmax in this structure. Quoting the paper on this: “By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization.”

But what does that mean? Essentially:

  1. discrimination in the lower stages: lower layers of the network are trained with gradients coming from an earlier classification head, which ensures the network develops some ability to discriminate between different objects early on.
  2. increase the gradient signal that gets propagated back: in deep neural networks, the gradients flowing back through backpropagation often become so small that the earliest layers hardly learn. The earlier classification heads help by injecting a strong gradient signal into those layers.
  3. provide additional regularization: deep neural networks tend to overfit (high variance) while small networks tend to underfit (high bias). The earlier classifiers regularize the overfitting effect of the deeper layers!

Structure of Auxiliary classifiers:

[Figure: structure of an auxiliary classifier]

Note: in the table below,

  • #1×1 is the number of filters in the 1×1 convolution of the inception module.
  • #3×3 reduce is the number of filters in the 1×1 convolution before the 3×3 convolution.
  • #3×3 is the number of filters in the 3×3 convolution.
  • #5×5 reduce is the number of filters in the 1×1 convolution before the 5×5 convolution.
  • #5×5 is the number of filters in the 5×5 convolution.
  • Pool Proj is the number of filters in the 1×1 convolution after the max pool.
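As a worked example, take the inception (3a) row of that table: the four branches produce 64 (1×1), 128 (3×3), 32 (5×5) and 32 (pool proj) feature maps, which are concatenated depth-wise into 64 + 128 + 32 + 32 = 256 output channels. These are exactly the arguments passed as inception_module(pool2, 64, 96, 128, 16, 32, 32) in the code below.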

[Table: GoogLeNet incarnation of the Inception architecture]

The network was trained with aggressive image distortions; later iterations of Inception added batch normalization and RMSprop, things we will discuss in future posts.

GoogLeNet/Inception — CODE

from keras import layers
from keras.models import Model

from functools import partial

conv1x1 = partial(layers.Conv2D, kernel_size=1, activation='relu')
conv3x3 = partial(layers.Conv2D, kernel_size=3, padding='same', activation='relu')
conv5x5 = partial(layers.Conv2D, kernel_size=5, padding='same', activation='relu')

def inception_module(in_tensor, c1, c3_1, c3, c5_1, c5, pp):
    conv1 = conv1x1(c1)(in_tensor)

    conv3_1 = conv1x1(c3_1)(in_tensor)   # 1x1 reduction before the 3x3
    conv3 = conv3x3(c3)(conv3_1)

    conv5_1 = conv1x1(c5_1)(in_tensor)   # 1x1 reduction before the 5x5
    conv5 = conv5x5(c5)(conv5_1)

    # max pool first, then the 1x1 projection, as in the paper
    pool = layers.MaxPool2D(3, strides=1, padding='same')(in_tensor)
    pool_proj = conv1x1(pp)(pool)

    merged = layers.Concatenate(axis=-1)([conv1, conv3, conv5, pool_proj])
    return merged

def aux_clf(in_tensor, n_classes=1000):
    avg_pool = layers.AvgPool2D(5, 3)(in_tensor)
    conv = conv1x1(128)(avg_pool)
    flattened = layers.Flatten()(conv)
    dense = layers.Dense(1024, activation='relu')(flattened)
    dropout = layers.Dropout(0.7)(dense)
    out = layers.Dense(n_classes, activation='softmax')(dropout)
    return out

def inception_net(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    in_layer = layers.Input(in_shape)

    conv1 = layers.Conv2D(64, 7, strides=2, activation='relu', padding='same')(in_layer)
    pad1 = layers.ZeroPadding2D()(conv1)
    pool1 = layers.MaxPool2D(3, 2)(pad1)
    conv2_1 = conv1x1(64)(pool1)
    conv2_2 = conv3x3(192)(conv2_1)
    pad2 = layers.ZeroPadding2D()(conv2_2)
    pool2 = layers.MaxPool2D(3, 2)(pad2)

    inception3a = inception_module(pool2, 64, 96, 128, 16, 32, 32)
    inception3b = inception_module(inception3a, 128, 128, 192, 32, 96, 64)
    pad3 = layers.ZeroPadding2D()(inception3b)
    pool3 = layers.MaxPool2D(3, 2)(pad3)

    inception4a = inception_module(pool3, 192, 96, 208, 16, 48, 64)
    inception4b = inception_module(inception4a, 160, 112, 224, 24, 64, 64)
    inception4c = inception_module(inception4b, 128, 128, 256, 24, 64, 64)
    inception4d = inception_module(inception4c, 112, 144, 288, 32, 48, 64)
    inception4e = inception_module(inception4d, 256, 160, 320, 32, 128, 128)
    pad4 = layers.ZeroPadding2D()(inception4e)
    pool4 = layers.MaxPool2D(3, 2)(pad4)

    # the two auxiliary heads branch off intermediate modules
    aux_clf1 = aux_clf(inception4a, n_classes)
    aux_clf2 = aux_clf(inception4d, n_classes)

    inception5a = inception_module(pool4, 256, 160, 320, 32, 128, 128)
    inception5b = inception_module(inception5a, 384, 192, 384, 48, 128, 128)
    pad5 = layers.ZeroPadding2D()(inception5b)
    pool5 = layers.MaxPool2D(3, 2)(pad5)

    avg_pool = layers.GlobalAvgPool2D()(pool5)
    dropout = layers.Dropout(0.4)(avg_pool)
    preds = layers.Dense(n_classes, activation='softmax')(dropout)

    model = Model(in_layer, [preds, aux_clf1, aux_clf2])
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model

if __name__ == '__main__':
    model = inception_net()
    print(model.summary())

ResNet — Kaiming He et al

[Figure: Residual learning: a building block]

The idea came out of an observation: deep neural networks perform worse as we keep adding layers. But intuitively, this should not be the case. If a network with k layers achieves performance y, then a network with k+1 layers should perform at least as well as y, since the extra layer could simply learn an identity mapping.

The observation brought about a hypothesis: direct mappings are hard to learn. So instead of learning the mapping from a layer’s input to its output, learn the difference between them: learn the residual.

Say x is the input and H(x) is the output to be learnt. Then we need to learn F(x) = H(x) - x. We can do this by making the layers learn F(x) and then adding x back to F(x), recovering H(x). As a result, the same H(x) flows into the next layer as before! This gives rise to the residual block we saw above.
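In Keras terms, the whole trick is an Add() between a small stack of convolutions, F(x), and the block’s own input, x (a minimal sketch, assuming the shapes of F(x) and x already match):

from keras import layers

def residual_block(x, filters):
    # F(x): two 3x3 convolutions
    f = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    f = layers.Conv2D(filters, 3, padding='same')(f)
    # H(x) = F(x) + x, followed by the non-linearity
    return layers.Activation('relu')(layers.Add()([f, x]))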

The results were amazing, because the vanishing gradient problem, which usually numbs deep neural networks to learning, was mitigated. How? The skip connections, or shortcuts as we call them, give the gradients a direct path back to earlier layers, skipping a bunch of layers in between.

ResNet — Architecture

[Figure: ResNet architecture and per-depth configurations, from the paper]

The paper uses bottleneck blocks for the deeper ResNets (50/101/152). Instead of the residual block shown above, these blocks use 1×1 convolutions to first decrease and then restore the number of channels, keeping the 3×3 convolution cheap.
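The payoff is in parameters. For a 256-channel input, compare two plain 3×3 convolutions against the 1×1 / 3×3 / 1×1 bottleneck from the paper’s 256-64-64-256 example (weights only, biases ignored):

plain = 2 * (3 * 3 * 256 * 256)       # two 3x3 convs on 256 channels: 1,179,648
bottleneck = (1 * 1 * 256 * 64 +      # 1x1: reduce 256 -> 64
              3 * 3 * 64 * 64 +       # 3x3 on the thin 64-channel volume
              1 * 1 * 64 * 256)       # 1x1: expand 64 -> 256
print(plain, bottleneck)              # 1,179,648 vs 69,632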


ResNet — CODE

from keras import layers
from keras.models import Model

def _after_conv(in_tensor):
    # BatchNorm + ReLU after every convolution
    norm = layers.BatchNormalization()(in_tensor)
    return layers.Activation('relu')(norm)

def conv1(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=1, strides=1)(in_tensor)
    return _after_conv(conv)

def conv1_downsample(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=1, strides=2)(in_tensor)
    return _after_conv(conv)

def conv3(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=3, strides=1, padding='same')(in_tensor)
    return _after_conv(conv)

def conv3_downsample(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=3, strides=2, padding='same')(in_tensor)
    return _after_conv(conv)

def resnet_block_wo_bottleneck(in_tensor, filters, downsample=False):
    if downsample:
        conv1_rb = conv3_downsample(in_tensor, filters)
    else:
        conv1_rb = conv3(in_tensor, filters)
    conv2_rb = conv3(conv1_rb, filters)

    # project the shortcut when the spatial size changes
    if downsample:
        in_tensor = conv1_downsample(in_tensor, filters)
    result = layers.Add()([conv2_rb, in_tensor])

    return layers.Activation('relu')(result)

def resnet_block_w_bottleneck(in_tensor,
                              filters,
                              downsample=False,
                              change_channels=False):
    # 1x1 reduce -> 3x3 -> 1x1 expand
    if downsample:
        conv1_rb = conv1_downsample(in_tensor, int(filters/4))
    else:
        conv1_rb = conv1(in_tensor, int(filters/4))
    conv2_rb = conv3(conv1_rb, int(filters/4))
    conv3_rb = conv1(conv2_rb, filters)

    # project the shortcut when the spatial size or channel count changes
    if downsample:
        in_tensor = conv1_downsample(in_tensor, filters)
    elif change_channels:
        in_tensor = conv1(in_tensor, filters)
    result = layers.Add()([conv3_rb, in_tensor])

    return layers.Activation('relu')(result)

def _pre_res_blocks(in_tensor):
    conv = layers.Conv2D(64, 7, strides=2, padding='same')(in_tensor)
    conv = _after_conv(conv)
    pool = layers.MaxPool2D(3, 2, padding='same')(conv)
    return pool

def _post_res_blocks(in_tensor, n_classes):
    pool = layers.GlobalAvgPool2D()(in_tensor)
    preds = layers.Dense(n_classes, activation='softmax')(pool)
    return preds

def convx_wo_bottleneck(in_tensor, filters, n_times, downsample_1=False):
    res = in_tensor
    for i in range(n_times):
        if i == 0:
            res = resnet_block_wo_bottleneck(res, filters, downsample_1)
        else:
            res = resnet_block_wo_bottleneck(res, filters)
    return res

def convx_w_bottleneck(in_tensor, filters, n_times, downsample_1=False):
    res = in_tensor
    for i in range(n_times):
        if i == 0:
            res = resnet_block_w_bottleneck(res, filters, downsample_1, not downsample_1)
        else:
            res = resnet_block_w_bottleneck(res, filters)
    return res

def _resnet(in_shape=(224,224,3),
            n_classes=1000,
            opt='sgd',
            convx=[64, 128, 256, 512],
            n_convx=[2, 2, 2, 2],
            convx_fn=convx_wo_bottleneck):
    in_layer = layers.Input(in_shape)

    downsampled = _pre_res_blocks(in_layer)

    conv2x = convx_fn(downsampled, convx[0], n_convx[0])
    conv3x = convx_fn(conv2x, convx[1], n_convx[1], True)
    conv4x = convx_fn(conv3x, convx[2], n_convx[2], True)
    conv5x = convx_fn(conv4x, convx[3], n_convx[3], True)

    preds = _post_res_blocks(conv5x, n_classes)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model

def resnet18(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt)

def resnet34(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape,
                   n_classes,
                   opt,
                   n_convx=[3, 4, 6, 3])

def resnet50(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape,
                   n_classes,
                   opt,
                   [256, 512, 1024, 2048],
                   [3, 4, 6, 3],
                   convx_w_bottleneck)

def resnet101(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape,
                   n_classes,
                   opt,
                   [256, 512, 1024, 2048],
                   [3, 4, 23, 3],
                   convx_w_bottleneck)

def resnet152(in_shape=(224,224,3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape,
                   n_classes,
                   opt,
                   [256, 512, 1024, 2048],
                   [3, 8, 36, 3],
                   convx_w_bottleneck)

if __name__ == '__main__':
    model = resnet50()
    print(model.summary())

References:

  1. Object Recognition with Gradient-Based Learning
  2. ImageNet Classification with Deep Convolutional Neural Networks
  3. Very Deep Convolutional Networks for Large-Scale Image Recognition
  4. Going deeper with convolutions
  5. Deep Residual Learning for Image Recognition
