AlexNet Complete Architecture

Atulanand · Published in CodeX · 5 min read · Nov 8, 2022

Introduction

AlexNet was designed by Alex Krizhevsky together with Ilya Sutskever and his advisor Geoffrey Hinton, and it won the 2012 ImageNet competition. It was after that year that more and deeper neural networks were proposed, such as VGG and GoogLeNet. The official reference model reaches a top-1 accuracy of 57.1% and a top-5 accuracy of 80.2%, which is already quite outstanding compared with traditional machine learning classification algorithms.

The table below explains the network structure of AlexNet:
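Briefly, the original network consists of five convolutional layers followed by three fully connected layers (layer sizes as reported in the 2012 paper):

Input:  224 x 224 x 3 RGB image
Conv1:  96 kernels, 11 x 11, stride 4, ReLU, LRN, 3 x 3 max-pooling (stride 2)
Conv2:  256 kernels, 5 x 5, pad 2, ReLU, LRN, 3 x 3 max-pooling (stride 2)
Conv3:  384 kernels, 3 x 3, pad 1, ReLU
Conv4:  384 kernels, 3 x 3, pad 1, ReLU
Conv5:  256 kernels, 3 x 3, pad 1, ReLU, 3 x 3 max-pooling (stride 2)
FC6:    4096 units, ReLU, dropout
FC7:    4096 units, ReLU, dropout
FC8:    1000 units, softmax (ImageNet classes)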

Why does AlexNet achieve better results?

1. ReLU activation function is used:

The main reason ReLU is used is that it is simple, fast, and empirically it works well. Early papers observed that training a deep network with ReLU tended to converge much more quickly and reliably than training the same network with sigmoid activations.

ReLU function: f(x) = max(0, x)

ReLU is a non-linear activation function used in multi-layer or deep neural networks. For an input value x, the output of ReLU is the maximum of zero and x: negative inputs are mapped to zero, while positive inputs pass through unchanged.
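A minimal NumPy sketch of this behaviour (illustrative only):

import numpy as np

def relu(x):
    # Element-wise ReLU: keep positive values, clip negatives to zero
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]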

2. Standardization (Local Response Normalization):

For an in-depth understanding of LRN, please refer to: https://towardsdatascience.com/difference-between-local-response-normalization-and-batch-normalization-272308c034ac

After applying ReLU, f(x) = max(0, x), you will find that the activation values are not confined to a fixed range the way tanh and sigmoid outputs are, so a normalization step is usually applied after ReLU. LRN was proposed based on a mechanism from neuroscience called "lateral inhibition", which describes the effect an active neuron has on its neighbouring neurons.

The equation below refers to inter-channel LRN:
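As defined in the original AlexNet paper:

$$ b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}} $$

where a^i_{x,y} is the activity of kernel i at spatial position (x, y), N is the total number of kernels in the layer, the sum runs over n neighbouring kernel maps, and k, alpha and beta are hyper-parameters (the paper uses k = 2, n = 5, alpha = 1e-4, beta = 0.75).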

3. Dropout:

Dropout is a frequently mentioned technique that can effectively prevent overfitting in neural networks. Whereas a general linear model relies on a regularization term to keep the model from overfitting, in a neural network dropout works by modifying the structure of the network itself: for a given layer, some neurons are randomly dropped with a defined probability, while the input and output layers are left unchanged, and the parameters are then updated with the usual learning procedure. In the next iteration a new random set of neurons is dropped, and this repeats until the end of training.
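A rough sketch of this idea (inverted dropout in NumPy; Keras handles this internally, so the helper below is only illustrative):

import numpy as np

def dropout(activations, p=0.5, training=True):
    # During training, zero each unit with probability p and scale the
    # survivors by 1/(1-p) so the expected activation stays the same.
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= p) / (1.0 - p)
    return activations * mask

h = np.random.randn(4, 6)      # activations of one hidden layer
h_dropped = dropout(h, p=0.4)  # a fresh random mask is drawn on every call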

4. Data Augmentation:

For a more in-depth understanding: https://medium.com/lansaar/what-is-data-augmentation-3da1373e3fa1

In deep learning, when the amount of data is not large enough, there are generally 4 solutions:

  • Data augmentation: artificially increase the size of the training set by creating a batch of "new" data from existing data by means of translation, flipping, and added noise (see the sketch after this list).
  • Regularization: a relatively small amount of data causes the model to overfit, making the training error small but the test error particularly large. Adding a regularization term to the loss function suppresses overfitting; the disadvantage is that it introduces a hyper-parameter that has to be tuned manually.
  • Dropout: also a regularization method, but unlike the above it works by randomly setting the output of some neurons to zero.
  • Unsupervised pre-training: use an Auto-Encoder or the convolutional form of an RBM to do unsupervised pre-training layer by layer, and finally add a classification layer for supervised fine-tuning.
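A small sketch of the augmentation idea using Keras' ImageDataGenerator (the shift and flip settings here are illustrative choices, not values from the AlexNet paper):

from keras.preprocessing.image import ImageDataGenerator

# Create "new" training images on the fly via small translations and flips
datagen = ImageDataGenerator(
    width_shift_range=0.1,   # random horizontal translation (fraction of width)
    height_shift_range=0.1,  # random vertical translation (fraction of height)
    horizontal_flip=True     # random left-right flips
)

# With x, y loaded as in the code below, augmented batches can be fed to the model:
# model.fit(datagen.flow(x, y, batch_size=64), epochs=1)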

Code Implementation:

!pip install tflearn
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Flatten,\
Conv2D, MaxPooling2D
from keras.layers import BatchNormalization
import numpy as np
np.random.seed(1000)
# (2) Get Data
import tflearn.datasets.oxflower17 as oxflower17
x, y = oxflower17.load_data(one_hot=True)
# (3) Create a sequential model
model = Sequential()
# 1st Convolutional Layer
model.add(Conv2D(filters=96, input_shape=(224,224,3), kernel_size=(11,11),\
strides=(4,4), padding='valid'))
model.add(Activation('relu'))
# Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation before passing it to the next layer
model.add(BatchNormalization())
# 2nd Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation
model.add(BatchNormalization())
# 3rd Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Batch Normalisation
model.add(BatchNormalization())
# 4th Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Batch Normalisation
model.add(BatchNormalization())
# 5th Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation
model.add(BatchNormalization())
# Passing it to a dense layer
model.add(Flatten())
# 1st Dense Layer
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout to prevent overfitting
model.add(Dropout(0.4))
# Batch Normalisation
model.add(BatchNormalization())
# 2nd Dense Layer
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.4))
# Batch Normalisation
model.add(BatchNormalization())
# 3rd Dense Layer
model.add(Dense(1000))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.4))
# Batch Normalisation
model.add(BatchNormalization())
# Output Layer
model.add(Dense(17))
model.add(Activation('softmax'))
model.summary()
# (4) Compile
model.compile(loss='categorical_crossentropy', optimizer='adam',\
metrics=['accuracy'])
# (5) Train
model.fit(x, y, batch_size=64, epochs=1, verbose=1, \
validation_split=0.2, shuffle=True)

Here I used a different dataset, Batch Normalization, and different layer dimensions to reduce the number of trainable parameters so that the model can be trained on a CPU. The original architecture has about 60 million parameters, which requires a GPU to run smoothly.

Atulanand · Writer for CodeX
Data Scientist @Deloitte | Infosys Certified Machine Learning Professional | Google Certified Data Analytics