AlexNet in a Nutshell
Hello and welcome back to my blog. I am Ritik Dutta, and in this post I will walk you through the architecture and working of another CNN model, AlexNet. We will also look at the techniques that were used for the first time in AlexNet, and at the end we will see how to train our own AlexNet model.
So, why should anyone care about another CNN model named AlexNet? Well, what if I tell you it was one of the biggest breakthroughs in computer vision? Don't believe me? Come along and let's get started.
AlexNet was developed by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton in 2012.
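AlexNet was trained on the ImageNet dataset, which consists of about 1.2 million training images in 1000 different classes. The network has 60 million parameters and 650,000 neurons, and consists of 5 convolutional layers (some of them followed by max pooling layers) and 3 fully connected layers.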
AlexNet Architecture
First Layer (Conv1)
In the first layer we apply 96 kernels of size 11*11 with a stride of 4 and no padding, so we get an output of 55*55*96.
55 because output size = (input size - kernel size + 2*padding)/stride + 1 = (227 - 11 + 2*0)/4 + 1
ReLU was used as the activation.
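To make the size arithmetic concrete, here is a minimal sketch of the output-size formula in Python. The helper name conv_output_size is my own and not part of any library. It assumes square inputs and kernels, and it rounds down when the division is not exact.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (square case)."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


# First layer of AlexNet: 227x227 input, 11x11 kernel, stride 4, no padding
print(conv_output_size(227, 11, stride=4, padding=0))  # 55
```

The same helper works for pooling layers, since pooling follows the same size rule.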
Second Layer
In this layer overlapping max pooling was used, with a kernel size of 3*3 and a stride of 2, so we get an output of 27 = (55 - 3)/2 + 1
With overlapping pooling it is slightly harder to overfit the model.
Third Layer (Conv2)
Similarly as above, we have:
No. of kernels = 256
Kernel size = 5*5
Stride = 1
Padding = 2
Output = 27 = (27 - 5 + 2*2)/1 + 1
As you can see, the output size is the same as in the previous layer. That is because we used padding = 2 with stride = 1.
By using padding = (kernel size - 1)/2 with a stride of 1, the output keeps the same size as the input.
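The padding rule is easy to check numerically. This small sketch (both helper names are hypothetical) applies padding = (kernel size - 1)/2 with stride 1 to a few odd kernel sizes and prints the resulting size for a 27-wide input.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (square case, rounded down)."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


def same_padding(kernel_size):
    """Padding that preserves the size at stride 1 (odd kernel sizes only)."""
    return (kernel_size - 1) // 2


for k in (3, 5, 7, 11):
    p = same_padding(k)
    # Each line prints: kernel size, padding, output size
    print(k, p, conv_output_size(27, k, stride=1, padding=p))
```

Note that the rule only works cleanly for odd kernel sizes, which is one reason 3*3 and 5*5 kernels are so common.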
Fourth Layer
This is again an overlapping max pooling layer:
Kernel size = 3*3
Stride = 2
Output = 13 = (27 - 3)/2 + 1
Fifth Layer (Conv3)
No. of kernels = 384
Kernel size = 3*3
Stride = 1
Padding = 1
Output = 13 = (13 - 3 + 2*1)/1 + 1
Sixth Layer (Conv4)
No. of kernels = 384
Kernel size = 3*3
Stride = 1
Padding = 1
Output = 13 = (13 - 3 + 2*1)/1 + 1
Seventh Layer (Conv5)
No. of kernels = 256
Kernel size = 3*3
Stride = 1
Padding = 1
Output = 13 = (13 - 3 + 2*1)/1 + 1
Eighth Layer
This is the last overlapping max pooling layer:
Kernel size = 3*3
Stride = 2
Padding = 0
Output = 6 = (13 - 3 + 2*0)/2 + 1
Ninth Layer, Tenth Layer
These are fully connected layers with 4096 neurons each.
Output Layer
The final layer is a fully connected layer with 1000 outputs, one for each class.
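To double-check the whole walkthrough, the sketch below traces the feature-map size through every layer and counts the parameters. It is only an approximation under my own assumptions: the layer table and the function names are mine, and the network is treated as a single tower with no split across GPUs, so the total comes out at about 62 million rather than exactly the 60 million quoted for the original model.

```python
def out_size(n, k, s=1, p=0):
    """Output size of a conv/pool layer: (n - k + 2p) / s + 1, rounded down."""
    return (n - k + 2 * p) // s + 1


# (name, kind, out_channels, kernel, stride, padding)
LAYERS = [
    ("conv1", "conv", 96, 11, 4, 0),
    ("pool1", "pool", None, 3, 2, 0),
    ("conv2", "conv", 256, 5, 1, 2),
    ("pool2", "pool", None, 3, 2, 0),
    ("conv3", "conv", 384, 3, 1, 1),
    ("conv4", "conv", 384, 3, 1, 1),
    ("conv5", "conv", 256, 3, 1, 1),
    ("pool3", "pool", None, 3, 2, 0),
]


def trace(input_size=227, channels=3):
    size, params, shapes = input_size, 0, []
    for name, kind, c_out, k, s, p in LAYERS:
        size = out_size(size, k, s, p)
        if kind == "conv":
            params += k * k * channels * c_out + c_out  # weights + biases
            channels = c_out
        shapes.append((name, size, size, channels))
    # Flatten, then the three fully connected layers
    features = size * size * channels
    for units in (4096, 4096, 1000):
        params += features * units + units
        features = units
    return shapes, params


shapes, total_params = trace()
for shape in shapes:
    print(shape)
print("total parameters:", total_params)
```

Most of the parameters sit in the first fully connected layer, which connects the 6*6*256 flattened features to 4096 neurons.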
New Features Used in AlexNet
ReLU
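Both tanh and ReLU are non-linear functions, but tanh saturates: for large positive or negative inputs its gradient becomes very small, which slows down training with gradient descent. ReLU, f(x) = max(0, x), does not saturate for positive inputs and is also cheaper to compute, so using it as the activation function keeps the training time low.
In the AlexNet paper, a four-layer CNN with ReLU reached a 25% training error rate on CIFAR-10 six times faster than the same network with tanh.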
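One common way to see why tanh trains more slowly is to compare the gradients of the two activations. The sketch below is only an illustration with hand-picked inputs, and the function names are my own; it prints the derivative of ReLU and of tanh as the input grows.

```python
import math


def relu_grad(x):
    """Derivative of max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh(x): 1 - tanh(x)^2, which shrinks as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0):
    # Each line prints: input, ReLU gradient, tanh gradient
    print(x, relu_grad(x), round(tanh_grad(x), 6))
```

The ReLU gradient stays at 1 for every positive input, while the tanh gradient falls towards 0, so weight updates that pass through a saturated tanh unit become tiny.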
Training on Multiple GPUs
Using a single GPU limits the maximum size of the network that can be trained, and a network big enough to make use of 1.2 million training images did not fit on one GPU. So what the authors did was split the network across two GPUs, which is called GPU parallelization.
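In the paper, the kernels of some convolutional layers (the second, fourth and fifth) only read the kernel maps that live on the same GPU. A side effect is that those layers need half as many weights. The sketch below shows this for the second convolutional layer; the function and its groups parameter are my own naming, borrowed from the way frameworks describe grouped convolutions.

```python
def conv_params(kernel_size, in_channels, out_channels, groups=1):
    """Weights plus biases of a conv layer whose kernels each see only
    in_channels / groups of the input maps."""
    weights = kernel_size * kernel_size * (in_channels // groups) * out_channels
    return weights + out_channels


# Second convolutional layer: 256 kernels of size 5x5 over 96 input maps
print(conv_params(5, 96, 256, groups=1))  # 614656, every kernel sees all maps
print(conv_params(5, 96, 256, groups=2))  # 307456, kernels see one GPU's maps
```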
Local Response Normalization
LRN normalizes each activation using the activations of the neighbouring kernel maps at the same spatial position.
AlexNet used LRN to encourage lateral inhibition. This concept comes from neurobiology, where lateral inhibition is the capacity of an excited neuron to reduce the activity of its neighbours. In terms of deep neural networks, it is used to carry out local contrast enhancement, so that the locally maximum values are the ones that excite the next layers.
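Here is a minimal sketch of LRN for a single spatial position, written in plain Python. It follows the formula from the paper, b_i = a_i / (k + alpha * sum of a_j^2 over n neighbouring channels)^beta, with the paper's constants k = 2, n = 5, alpha = 1e-4 and beta = 0.75. The function name and the toy inputs are my own.

```python
def local_response_norm(activations, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize the activations of all channels at one spatial position.

    activations: list with one value per channel (kernel map).
    """
    num_channels = len(activations)
    half = n // 2
    normalized = []
    for i, a in enumerate(activations):
        # Window of up to n neighbouring channels, clipped at the edges
        lo = max(0, i - half)
        hi = min(num_channels - 1, i + half)
        sum_sq = sum(activations[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(a / (k + alpha * sum_sq) ** beta)
    return normalized


# A small activation next to a very large one is damped more strongly
print(local_response_norm([1.0, 100.0]))
print(local_response_norm([1.0]))
```

This is the "inhibition" at work: the strong neighbour pushes down the response of the weak one.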
Overlapping Pooling
Let Z be the kernel size and S the stride of a pooling layer. If we set S = Z we get traditional local pooling, but if we set Z > S we get overlapping pooling and the layer is able to retain more features. OK, let's understand it in simple language.
Suppose we have the 1D lists of elements shown in the figure above, with two parameters, stride and kernel size. It can be clearly seen that if we set S = Z (left part) we get the same output from different lists, but if we set Z > S (right part) we get different outputs. That is because information is lost when S = Z: three different lists shrink to a single output, which makes it easier for the model to overfit.
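The same idea can be tried out in a few lines of Python. The two lists below are made-up examples (not the ones from the figure), and max_pool_1d is a hypothetical helper written just for this illustration.

```python
def max_pool_1d(values, kernel_size, stride):
    """1D max pooling without padding."""
    last_start = len(values) - kernel_size
    return [max(values[i:i + kernel_size])
            for i in range(0, last_start + 1, stride)]


a = [1, 2, 3, 0, 0]
b = [2, 1, 0, 3, 0]

# S = Z = 2: traditional pooling, both lists collapse to the same output
print(max_pool_1d(a, 2, 2), max_pool_1d(b, 2, 2))  # [2, 3] [2, 3]
# Z = 3 > S = 2: overlapping pooling tells the two lists apart
print(max_pool_1d(a, 3, 2), max_pool_1d(b, 3, 2))  # [3, 3] [2, 3]
```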
Some Other Techniques Used
Data Augmentation
This technique is used to increase the diversity of the data. It includes cropping, flipping, panning, brightening, etc. So at test time, if an image is unusually zoomed or in low light, the model may still predict it correctly.
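As a rough illustration, two of these transformations (random cropping and horizontal flipping) can be written in plain Python on an image stored as a list of rows. Everything here is a toy sketch: the function names, the 4*4 "image" and the 50% flip probability are my own choices.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w patch at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(image, crop_h, crop_w, rng):
    """Random crop, followed by a horizontal flip half of the time."""
    patch = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < 0.5:
        patch = horizontal_flip(patch)
    return patch


rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
print(augment(image, 3, 3, rng))
```

Calling augment several times on the same image gives different patches, so the model rarely sees exactly the same input twice.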
Dropout
Deep neural networks tend to overfit the training data, especially when there are few examples. Combining several models with different configurations reduces the chances of overfitting, but it also requires more computational power, as we need to train those extra models. With dropout, a single model can simulate many different network architectures by randomly dropping out nodes during training.
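A minimal sketch of the idea looks like this. Note that it implements "inverted" dropout, the variant most libraries use today, which rescales the surviving units during training; the AlexNet paper instead dropped units with probability 0.5 and multiplied the outputs by 0.5 at test time. The function name and the sample values are my own.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training and scale the survivors so the expected value is unchanged."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)
layer_output = [0.2, 1.5, 0.7, 3.0, 0.9, 1.1]
# Training: a random subset of units is zeroed, the rest are scaled up
print(dropout(layer_output, rate=0.5, rng=rng))
# Inference: the activations pass through unchanged
print(dropout(layer_output, rate=0.5, rng=rng, training=False))
```

Every training step draws a new random mask, so each step effectively trains a different thinned version of the network.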
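How to Train an AlexNet Model
The code below trains an AlexNet-style model in Keras on the 17-class Oxford Flowers dataset. It uses a 224*224 input, 'valid' padding everywhere, an 11*11 kernel in the second convolution and batch normalization instead of LRN, so the layer sizes in the model summary differ from the walkthrough above.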
import warnings
warnings.filterwarnings("ignore")
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Flatten, Conv2D, MaxPooling2D, BatchNormalization
import numpy as np
np.random.seed(1000)
Get Data
import tflearn.datasets.oxflower17 as oxflower17
x, y = oxflower17.load_data(one_hot=True)
Create a sequential model
model = Sequential()
1st Convolutional Layer
model.add(Conv2D(filters=96, input_shape=(224,224,3), kernel_size=(11,11), strides=(4,4), padding='valid'))
model.add(Activation('relu'))
Pooling
model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))
Batch Normalisation before passing it to the next layer
model.add(BatchNormalization())
2nd Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
Pooling
model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))
Batch Normalisation
model.add(BatchNormalization())
3rd Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
Batch Normalisation
model.add(BatchNormalization())
4th Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
Batch Normalisation
model.add(BatchNormalization())
5th Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
Pooling
model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))
Batch Normalisation
model.add(BatchNormalization())
Passing it to a dense layer
model.add(Flatten())
1st Dense Layer
model.add(Dense(4096))
model.add(Activation('relu'))
Add Dropout to prevent overfitting
model.add(Dropout(0.4))
Batch Normalisation
model.add(BatchNormalization())
2nd Dense Layer
model.add(Dense(4096))
model.add(Activation('relu'))
Add Dropout
model.add(Dropout(0.4))
Batch Normalisation
model.add(BatchNormalization())
3rd Dense Layer
model.add(Dense(1000))
model.add(Activation('relu'))
Add Dropout
model.add(Dropout(0.4))
Batch Normalisation
model.add(BatchNormalization())
Output Layer
model.add(Dense(17))
model.add(Activation('softmax'))
model.summary()
Compile
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
Train
model.fit(x, y, batch_size=64, epochs=5, verbose=0, validation_split=0.2, shuffle=True)
# Note: this evaluates on the full dataset (training images included),
# so it is not a score on held-out test data
score = model.evaluate(x, y)
print('Loss:', score[0])
print('Accuracy:', score[1])
As we can see, the accuracy is quite good. You can also try it with the tanh activation function and compare the accuracy with different activation functions.
Model summary
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_6 (Conv2D)            (None, 54, 54, 96)        34944
_________________________________________________________________
activation_10 (Activation)   (None, 54, 54, 96)        0
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 27, 27, 96)        0
_________________________________________________________________
batch_normalization_9 (Batch (None, 27, 27, 96)        384
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 17, 17, 256)       2973952
_________________________________________________________________
activation_11 (Activation)   (None, 17, 17, 256)       0
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 8, 8, 256)         0
_________________________________________________________________
batch_normalization_10 (Batc (None, 8, 8, 256)         1024
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 6, 6, 384)         885120
_________________________________________________________________
activation_12 (Activation)   (None, 6, 6, 384)         0
_________________________________________________________________
batch_normalization_11 (Batc (None, 6, 6, 384)         1536
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 4, 4, 384)         1327488
_________________________________________________________________
activation_13 (Activation)   (None, 4, 4, 384)         0
_________________________________________________________________
batch_normalization_12 (Batc (None, 4, 4, 384)         1536
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 2, 2, 256)         884992
_________________________________________________________________
activation_14 (Activation)   (None, 2, 2, 256)         0
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 1, 1, 256)         0
_________________________________________________________________
batch_normalization_13 (Batc (None, 1, 1, 256)         1024
_________________________________________________________________
flatten_2 (Flatten)          (None, 256)               0
_________________________________________________________________
dense_5 (Dense)              (None, 4096)              1052672
_________________________________________________________________
activation_15 (Activation)   (None, 4096)              0
_________________________________________________________________
dropout_4 (Dropout)          (None, 4096)              0
_________________________________________________________________
batch_normalization_14 (Batc (None, 4096)              16384
_________________________________________________________________
dense_6 (Dense)              (None, 4096)              16781312
_________________________________________________________________
activation_16 (Activation)   (None, 4096)              0
_________________________________________________________________
dropout_5 (Dropout)          (None, 4096)              0
_________________________________________________________________
batch_normalization_15 (Batc (None, 4096)              16384
_________________________________________________________________
dense_7 (Dense)              (None, 1000)              4097000
_________________________________________________________________
activation_17 (Activation)   (None, 1000)              0
_________________________________________________________________
dropout_6 (Dropout)          (None, 1000)              0
_________________________________________________________________
batch_normalization_16 (Batc (None, 1000)              4000
_________________________________________________________________
dense_8 (Dense)              (None, 17)                17017
_________________________________________________________________
activation_18 (Activation)   (None, 17)                0
=================================================================
Total params: 28,096,769
Trainable params: 28,075,633
Non-trainable params: 21,136
_________________________________________________________________
Train on 1088 samples, validate on 272 samples
So that is it for this blog. I hope you liked it. Have a great day ahead!