AlexNet in a Nutshell

Ritik Dutta
Published in Analytics Vidhya
6 min read · May 29, 2020

Hello and welcome back to my blog. I am Ritik Dutta, and in this post I will walk you through the architecture and working of another CNN model, AlexNet. We will also look at the techniques that AlexNet introduced or popularized, and at the end we will see how to train our own AlexNet model.

So why should anyone care about another CNN model named AlexNet? Well, it was one of the biggest breakthroughs in computer vision. Don't just take my word for it; come along and let's get started.

AlexNet was developed by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton in 2012.

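AlexNet was trained on the ImageNet dataset (the ILSVRC subset), which consists of about 1.2 million training images in 1000 different classes. The network has 60 million parameters and 650,000 neurons, and consists of 5 convolutional layers, some of which are followed by max-pooling layers, and 3 fully connected layers ending in a 1000-way softmax.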

AlexNet Architecture

First Layer (Conv1)

In the first layer we apply 96 kernels of size 11*11 with a stride of 4 and no padding, so the output is 55*55*96.

55 because output size = (input size - kernel size + 2*padding)/stride + 1 = (227 - 11 + 2*0)/4 + 1

ReLU was used as the activation.
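The output-size arithmetic can be checked with a few lines of Python. This is only a small sketch: the helper name conv_output_size is my own, and it uses floor division because real layers round down when the stride does not divide evenly.

```python
def conv_output_size(n, k, s=1, p=0):
    """Spatial output size of a convolution or pooling layer.

    n: input height/width, k: kernel size, s: stride, p: padding per side.
    The helper name is illustrative; it is not part of any library.
    """
    return (n - k + 2 * p) // s + 1

# Conv1: 227x227 input, 11x11 kernel, stride 4, no padding
print(conv_output_size(227, 11, s=4, p=0))  # 55
# A 224x224 input through the same layer gives 54 (the fraction is floored)
print(conv_output_size(224, 11, s=4, p=0))  # 54
```

The second call explains why a model fed 224*224 images reports 54*54 after its first convolution instead of 55*55.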

Second Layer

In this layer, overlapping max pooling is applied with a 3*3 kernel and a stride of 2, so the output size is 27 = (55 - 3)/2 + 1.

With overlapping pooling it is slightly harder to overfit the model.

Third Layer (Conv2)

Similarly as above, we have:

No. of kernels = 256

Kernel size = 5*5

Stride = 1

Padding = 2

Output = 27 = (27 - 5 + 2*2)/1 + 1

As you can see, the output size is the same as that of the previous layer. This is because we used padding = 2 with stride = 1.

In general, using padding = (kernel size - 1)/2 with a stride of 1 (and an odd kernel size) keeps the output the same size as the input.
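Here is a quick check of that padding rule. It is a minimal sketch with helper names of my own choosing (same_padding, output_size), and it assumes an odd kernel size and a stride of 1.

```python
def same_padding(k):
    """Padding per side that preserves the size for stride 1 and odd kernel size k."""
    if k % 2 == 0:
        raise ValueError("this simple rule assumes an odd kernel size")
    return (k - 1) // 2

def output_size(n, k, s, p):
    """Standard output-size formula: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1

for k in (3, 5, 11):
    p = same_padding(k)
    # The 27x27 input keeps its size for every odd kernel size tried here
    print(k, p, output_size(27, k, 1, p))
```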

Fourth Layer

Overlapping MaxPooling, as in the second layer:

Kernel size = 3*3

Stride = 2

Output = 13 = (27 - 3)/2 + 1

Fifth Layer (Conv3)

No. of kernels = 384

Kernel size = 3*3

Stride = 1

Padding = 1

Output = 13 = (13 - 3 + 2*1)/1 + 1

Sixth Layer (Conv4)

No. of kernels = 384

Kernel size = 3*3

Stride = 1

Padding = 1

Output = 13 = (13 - 3 + 2*1)/1 + 1

Seventh Layer (Conv5)

No. of kernels = 256

Kernel size = 3*3

Stride = 1

Padding = 1

Output = 13 = (13 - 3 + 2*1)/1 + 1

Eighth Layer

Overlapping MaxPooling:

Kernel size = 3*3

Stride = 2

Padding = 0

Output = 6 = (13 - 3 + 2*0)/2 + 1

Ninth Layer, Tenth Layer

These are fully connected layers, each consisting of 4096 neurons.

Output Layer

The final layer is a fully connected layer with 1000 outputs, one per class, followed by a softmax.
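To see the whole convolutional stack at once, the layer settings listed above can be pushed through the output-size formula. This is only a sketch: the LAYERS table and the function names are mine, it assumes a 227*227*3 input, and it ignores the two-GPU split of the original network.

```python
def output_size(n, k, s, p):
    """Standard output-size formula: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1

# (name, number of kernels or None for pooling, kernel size, stride, padding)
LAYERS = [
    ("conv1", 96, 11, 4, 0),
    ("pool1", None, 3, 2, 0),
    ("conv2", 256, 5, 1, 2),
    ("pool2", None, 3, 2, 0),
    ("conv3", 384, 3, 1, 1),
    ("conv4", 384, 3, 1, 1),
    ("conv5", 256, 3, 1, 1),
    ("pool3", None, 3, 2, 0),
]

def trace_shapes(size=227, channels=3):
    """Return (name, height, width, channels) after every layer."""
    shapes = []
    for name, kernels, k, s, p in LAYERS:
        size = output_size(size, k, s, p)
        if kernels is not None:  # pooling keeps the channel count
            channels = kernels
        shapes.append((name, size, size, channels))
    return shapes

for name, h, w, c in trace_shapes():
    print(f"{name}: {h}x{w}x{c}")
```

Under these assumptions the last pooling layer comes out at 6x6x256, which is what gets flattened and fed to the first fully connected layer.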

New Features Used in AlexNet

ReLU

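tanh is a saturating function: for large positive or negative inputs its gradient becomes very small, which slows down gradient-descent training. ReLU, f(x) = max(0, x), does not saturate for positive inputs and is also cheaper to compute, so using it as the activation function lowers the training time.

As reported in the AlexNet paper, a four-layer CNN with ReLUs reached a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh units.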
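A tiny numeric comparison makes the difference between the two activations concrete. This sketch uses only the standard library; the function names are mine, and it only illustrates the gradients, not the training-speed experiment itself.

```python
import math

def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which shrinks towards 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.5, 2.0, 5.0):
    print(x, relu_grad(x), round(tanh_grad(x), 6))
```

For larger inputs the tanh gradient shrinks towards zero while the ReLU gradient stays at 1, so weight updates flowing through a ReLU unit do not vanish in the same way.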

MaxPooling2D

Training on Multiple GPUs

A single GPU limits the maximum size of the network that can be trained on it, and 1.2 million training examples are enough to train networks that are too big to fit on one GPU. So the authors split the network across two GPUs, which is called GPU parallelization.

Local Response Normalization

LRN normalizes each activation by the activity of the neighbouring channels at the same spatial position. (It is sometimes loosely described as standardization of data, but it is not the same as standardizing the inputs.)

AlexNet introduced this form of LRN to encourage lateral inhibition. This concept comes from neurobiology, where lateral inhibition is the capacity of an excited neuron to reduce the activity of its neighbours. In deep neural networks it is used to carry out local contrast enhancement, so that locally maximal activations are the ones that excite the next layers.
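Here is a minimal pure-Python sketch of LRN at a single spatial position, following the formula from the AlexNet paper: b_i = a_i / (k + alpha * sum of a_j^2 over n neighbouring channels)^beta. The defaults k=2, n=5, alpha=1e-4 and beta=0.75 are the values reported in the paper; the function name and the list-based representation are my own simplification.

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize a list of channel activations taken at one spatial position.

    Each activation is divided by a term that grows with the squared
    activations of up to n neighbouring channels (n // 2 on each side).
    """
    num_channels = len(a)
    half = n // 2
    out = []
    for i in range(num_channels):
        lo = max(0, i - half)
        hi = min(num_channels - 1, i + half)
        squared_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * squared_sum) ** beta)
    return out

# A strongly active neighbour suppresses the channel next to it
print(local_response_norm([1.0, 0.0]))
print(local_response_norm([1.0, 100.0]))
```

In the second call the first channel ends up smaller than in the first call, which is the "inhibition by neighbours" effect.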

Overlapping Pooling

If we set S = Z (stride = kernel size) we get traditional local pooling, but if we set Z > S we get overlapping pooling, which lets us retain more features. Let's understand it in simple language.

Suppose we have a 1D list of elements, as in the figure, with two parameters: stride (S) and kernel size (Z). With S = Z (left part), different lists can produce the same output, but with Z > S (right part) they produce different outputs. With S = Z, information is lost: three different inputs shrink to a single output. The AlexNet authors observed that models with overlapping pooling are slightly harder to overfit.
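The same idea can be tried on a toy example. In this sketch the two lists are made up by me for illustration (they are not the numbers from the figure), and max_pool_1d is a hypothetical helper, not a library function.

```python
def max_pool_1d(xs, z, s):
    """Max pooling over a 1D list with kernel size z and stride s."""
    return [max(xs[i:i + z]) for i in range(0, len(xs) - z + 1, s)]

a = [1, 2, 0, 3, 0]
b = [1, 2, 3, 0, 0]

# Traditional pooling (S = Z = 2): both lists collapse to the same output
print(max_pool_1d(a, z=2, s=2), max_pool_1d(b, z=2, s=2))  # [2, 3] [2, 3]

# Overlapping pooling (Z = 3 > S = 2): the outputs now differ
print(max_pool_1d(a, z=3, s=2), max_pool_1d(b, z=3, s=2))  # [2, 3] [3, 3]
```

With Z > S neighbouring windows share an element, so the position of the value 3 still influences the result.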

Other Techniques Used

Data Augmentation

This technique is used to increase the diversity of the data. It includes cropping, flipping, panning, brightening, etc. As a result, if a test image is unusually zoomed or taken in low light, the model still has a good chance of predicting it correctly.
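As a rough illustration, here are two of these augmentations, random cropping and horizontal flipping, written in plain Python. This is a sketch under simplifying assumptions: the image is a small single-channel 2D list, and the function names are mine. In practice you would use the augmentation utilities of your deep learning framework.

```python
import random

def horizontal_flip(img):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in img]

def random_crop(img, size, rng):
    """Cut a random size x size patch out of a 2D image."""
    height, width = len(img), len(img[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in img[top:top + size]]

def augment(img, size, rng):
    """Random crop followed by a horizontal flip with probability 0.5."""
    patch = random_crop(img, size, rng)
    if rng.random() < 0.5:
        patch = horizontal_flip(patch)
    return patch

rng = random.Random(0)  # seeded so the run is repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(augment(image, 3, rng))
```

Each call produces a slightly different 3x3 view of the same 4x4 image, which is how one training image turns into many.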

Dropout

Deep neural networks tend to overfit the training data, especially when there are few examples. Combining the predictions of many different model configurations reduces overfitting, but it requires much more computational power because all those extra models have to be trained. With dropout, a single model can simulate many different network architectures by randomly dropping out nodes during training.
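A minimal sketch of the idea is shown below. It implements "inverted" dropout, the variant where the surviving activations are scaled up during training so that nothing has to change at test time; this is my choice for the illustration, while the original paper instead scaled the outputs at test time. The function name and signature are hypothetical.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Randomly zero each activation with probability `rate` during training.

    Kept activations are divided by (1 - rate) so the expected value is unchanged.
    At test time (training=False) the input is returned unchanged.
    Assumes 0 <= rate < 1.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(1000)
layer_output = [1.0] * 10
print(dropout(layer_output, 0.4, rng))                  # a different random mask on every call
print(dropout(layer_output, 0.4, rng, training=False))  # unchanged at test time
```

Every training step sees a different random mask, so the network effectively trains a different thinned architecture each time.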

How to Train an AlexNet Model

import warnings
warnings.filterwarnings("ignore")
import keras
from keras.models import Sequential
# BatchNormalization is imported from keras.layers together with the other layers
from keras.layers import Dense, Activation, Dropout, Flatten, Conv2D, MaxPooling2D, BatchNormalization
import numpy as np
np.random.seed(1000)

Get Data

import tflearn.datasets.oxflower17 as oxflower17
x, y = oxflower17.load_data(one_hot=True)

Create a sequential model

model = Sequential()

1st Convolutional Layer

model.add(Conv2D(filters=96, input_shape=(224,224,3), kernel_size=(11,11), strides=(4,4), padding='valid'))
model.add(Activation('relu'))

Pooling

model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))

Batch Normalisation before passing it to the next layer

model.add(BatchNormalization())

2nd Convolutional Layer

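# Note: unlike the architecture described earlier (5x5 kernel, padding 2), this
# tutorial model uses an 11x11 kernel here and 'valid' padding in all convolutions,
# so its feature maps shrink faster (see the model summary).
model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid'))
model.add(Activation('relu'))

Pooling

model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))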

Batch Normalisation

model.add(BatchNormalization())

3rd Convolutional Layer

model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))

Batch Normalisation

model.add(BatchNormalization())

4th Convolutional Layer

model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))

Batch Normalisation

model.add(BatchNormalization())

5th Convolutional Layer

model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))

Pooling

model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))

Batch Normalisation

model.add(BatchNormalization())

Passing it to a dense layer

model.add(Flatten())

1st Dense Layer

# The input size is inferred from the Flatten layer, so no input_shape is needed here
model.add(Dense(4096))
model.add(Activation('relu'))

Add Dropout to prevent overfitting

model.add(Dropout(0.4))

Batch Normalisation

model.add(BatchNormalization())

2nd Dense Layer

model.add(Dense(4096))
model.add(Activation('relu'))

Add Dropout

model.add(Dropout(0.4))

Batch Normalisation

model.add(BatchNormalization())

3rd Dense Layer

model.add(Dense(1000))
model.add(Activation('relu'))

Add Dropout

model.add(Dropout(0.4))

Batch Normalisation

model.add(BatchNormalization())

Output Layer

model.add(Dense(17))
model.add(Activation('softmax'))

model.summary()

Compile

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

Train

model.fit(x, y, batch_size=64, epochs=5, verbose=0, validation_split=0.2, shuffle=True)

# Note: this evaluates on the same x, y used for training and validation,
# so the numbers are not a true held-out test score.
score = model.evaluate(x, y)
print('Test Loss:', score[0])
print('Test accuracy:', score[1])
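Note that model.evaluate(x, y) above scores the model on the same data that was passed to fit. If you want a genuinely held-out test score, one option is to split the sample indices before training. The helper below is a hypothetical sketch using only the standard library; it assumes x and y are NumPy arrays, so that they can be indexed with a list of indices.

```python
import random

def train_test_indices(n_samples, test_fraction=0.2, seed=1000):
    """Shuffle the sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# 1360 is the number of samples implied by the training log (1088 + 272)
train_idx, test_idx = train_test_indices(1360)
print(len(train_idx), len(test_idx))  # 1088 272

# Assumed usage with the arrays loaded earlier:
# x_train, y_train = x[train_idx], y[train_idx]
# x_test, y_test = x[test_idx], y[test_idx]
```

You would then call fit on the training part and evaluate on the test part only.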

As we can see from the score, the accuracy is quite good. You can also try the tanh activation function and compare the accuracy with different activation functions.

Model summary

Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_6 (Conv2D) (None, 54, 54, 96) 34944
_________________________________________________________________
activation_10 (Activation) (None, 54, 54, 96) 0
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 27, 27, 96) 0
_________________________________________________________________
batch_normalization_9 (Batch (None, 27, 27, 96) 384
_________________________________________________________________
conv2d_7 (Conv2D) (None, 17, 17, 256) 2973952
_________________________________________________________________
activation_11 (Activation) (None, 17, 17, 256) 0
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 8, 8, 256) 0
_________________________________________________________________
batch_normalization_10 (Batc (None, 8, 8, 256) 1024
_________________________________________________________________
conv2d_8 (Conv2D) (None, 6, 6, 384) 885120
_________________________________________________________________
activation_12 (Activation) (None, 6, 6, 384) 0
_________________________________________________________________
batch_normalization_11 (Batc (None, 6, 6, 384) 1536
_________________________________________________________________
conv2d_9 (Conv2D) (None, 4, 4, 384) 1327488
_________________________________________________________________
activation_13 (Activation) (None, 4, 4, 384) 0
_________________________________________________________________
batch_normalization_12 (Batc (None, 4, 4, 384) 1536
_________________________________________________________________
conv2d_10 (Conv2D) (None, 2, 2, 256) 884992
_________________________________________________________________
activation_14 (Activation) (None, 2, 2, 256) 0
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 1, 1, 256) 0
_________________________________________________________________
batch_normalization_13 (Batc (None, 1, 1, 256) 1024
_________________________________________________________________
flatten_2 (Flatten) (None, 256) 0
_________________________________________________________________
dense_5 (Dense) (None, 4096) 1052672
_________________________________________________________________
activation_15 (Activation) (None, 4096) 0
_________________________________________________________________
dropout_4 (Dropout) (None, 4096) 0
_________________________________________________________________
batch_normalization_14 (Batc (None, 4096) 16384
_________________________________________________________________
dense_6 (Dense) (None, 4096) 16781312
_________________________________________________________________
activation_16 (Activation) (None, 4096) 0
_________________________________________________________________
dropout_5 (Dropout) (None, 4096) 0
_________________________________________________________________
batch_normalization_15 (Batc (None, 4096) 16384
_________________________________________________________________
dense_7 (Dense) (None, 1000) 4097000
_________________________________________________________________
activation_17 (Activation) (None, 1000) 0
_________________________________________________________________
dropout_6 (Dropout) (None, 1000) 0
_________________________________________________________________
batch_normalization_16 (Batc (None, 1000) 4000
_________________________________________________________________
dense_8 (Dense) (None, 17) 17017
_________________________________________________________________
activation_18 (Activation) (None, 17) 0
=================================================================
Total params: 28,096,769
Trainable params: 28,075,633
Non-trainable params: 21,136
_________________________________________________________________
Train on 1088 samples, validate on 272 samples
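The Param # column in the summary above can be reproduced by hand, which is a nice sanity check on your understanding of each layer. The helper functions below are my own small sketch, not Keras functions: a Conv2D layer has one weight per kernel element per input channel plus one bias for every filter, a Dense layer has one weight per input plus one bias for every unit, and BatchNormalization keeps four values per channel.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units

def batchnorm_params(channels):
    """gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * channels

print(conv2d_params(11, 11, 3, 96))    # 34944, matches conv2d_6
print(conv2d_params(11, 11, 96, 256))  # 2973952, matches conv2d_7
print(dense_params(256, 4096))         # 1052672, matches dense_5
print(batchnorm_params(96))            # 384, matches batch_normalization_9

# Non-trainable params: moving mean and variance of every BatchNormalization layer
bn_channels = [96, 256, 384, 384, 256, 4096, 4096, 1000]
print(2 * sum(bn_channels))            # 21136, matches the summary
```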

So that is it for this blog. I hope you liked it. Have a great day ahead!
