The structure of AlexNet which changed the history of deep learning

Yuma Ueno
Published in The Deep Hub
Feb 4, 2024 · 4 min read

Hello! I’m Yuma Ueno (https://twitter.com/stat_biz) from Japan. I work in the AI industry and run my own small AI company.

In this article, I would like to explain AlexNet in detail, a model that cannot be left out when telling the history of deep learning!

Recently, AI has been booming thanks to the emergence of large language models, and one of the triggers of this evolution was AlexNet, which I will explain in this article!

What is AlexNet?

AlexNet is the method that won the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2012, a competition for image recognition accuracy using ImageNet, by a huge margin over second place.

ImageNet is a huge dataset of over 14 million labeled images.

(Citation: ImageNet Large Scale Visual Recognition Challenge)

Before 2012, the mainstream approach was for humans to hand-design features extracted from images and build models on top of them. AlexNet, which suddenly appeared in 2012, was a huge neural network architecture that required no hand-crafted features: the model learned the features itself and performed the image recognition.

It won the competition by an overwhelming margin, with a recognition error rate more than 10 percentage points lower than the second-place entry.

What is the structure of AlexNet?

Let’s look at what was revolutionary about the structure of AlexNet based on the paper!

AlexNet was published by Geoffrey Hinton's lab at the University of Toronto, Canada.

The paper is below.

ImageNet Classification with Deep Convolutional Neural Networks

The architecture described in the paper is as follows:

Convolutional layers, pooling layers, and fully connected layers are combined into one huge neural network: five convolutional layers (with max pooling in between) followed by three fully connected layers.
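To make the layer stack concrete, here is a minimal sketch of it in PyTorch. I follow the filter counts from the paper (96, 256, 384, 384, 256), but the original network was split across two GPUs and also used local response normalization after the first two convolutions; both details are omitted here, and the padding values are my own assumption to make the shapes line up (they match torchvision's AlexNet).

```python
import torch
import torch.nn as nn

# Single-GPU sketch of the AlexNet layer stack (two-GPU split and local
# response normalization from the paper are omitted for simplicity).
class AlexNet(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # conv1
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # overlapping max pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2),            # conv2
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),           # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),                            # fc6
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                   # fc7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                            # fc8 (softmax is applied by the loss)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# A 224x224 RGB image goes in, 1000 class scores come out.
print(AlexNet()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```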

Let’s take a look at what exactly was groundbreaking about the paper!

Introducing ReLU as the Activation Function

In deep learning, a function called an activation function is used to transform the weighted sum of a layer's inputs into its output.

There are many types of activation functions, but for a long time the sigmoid function and the tanh function were the common choices.

The sigmoid function, σ(x) = 1 / (1 + e^(−x)), squashes any input into the range 0 to 1.

However, when the sigmoid function is used, the vanishing gradient problem can occur during optimization.

The vanishing gradient problem is the following: in deep learning, the weights are updated using gradients computed by differentiation in order to approach the optimal solution, but if the gradient is close to 0, the updates become so small that the optimal solution is never reached.

During backpropagation, the gradient is obtained by multiplying the derivatives of each layer together, and the maximum value of the sigmoid's derivative is only 0.25. Therefore, when the sigmoid function is used in many intermediate layers, the gradient shrinks as the layers stack up, causing the vanishing gradient problem!

The ReLU function solves this problem.

The ReLU function is a simple function that outputs 0 if the input is less than 0, and outputs the input unchanged if it is greater than 0: ReLU(x) = max(0, x).

The derivative of the ReLU function is 1 for any positive input, so layers can be stacked without the gradient vanishing.

It is almost never used in the output layer, however, since the output layer usually needs a function such as softmax to produce class probabilities (as AlexNet does for its 1000 classes).

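To see the difference numerically, here is a tiny PyTorch check (a toy illustration with one scalar activation per layer, not AlexNet's actual training code): multiplying sigmoid derivatives across 30 layers collapses toward zero, while ReLU's derivative of 1 on active units keeps the gradient alive.

```python
import torch

def sigmoid_grad(x):
    s = torch.sigmoid(x)
    return s * (1 - s)              # peaks at 0.25 when x = 0

def relu_grad(x):
    return (x > 0).float()          # 1 for positive inputs, 0 otherwise

# Toy setup: the gradient flowing back through 30 stacked layers is
# (roughly) the product of the per-layer activation derivatives.
depth = 30
print(torch.prod(sigmoid_grad(torch.zeros(depth))).item())  # 0.25**30 ≈ 8.7e-19 -> vanishes
print(torch.prod(relu_grad(torch.ones(depth))).item())      # 1.0 -> survives
```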

Data Augmentation to Reduce Overfitting

AlexNet is a much larger architecture than earlier models, and because of its large number of parameters, overfitting is a real concern.

The first approach taken is data augmentation, which artificially enlarges the existing dataset.

Specifically, new training examples are created by editing the images: shifting their position (random crops and horizontal flips) and altering the intensities of the RGB channels.
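As a rough sketch of what that looks like in practice, here is a training-time transform pipeline using torchvision. The paper generated random 224×224 crops and horizontal flips from 256×256 images and perturbed the RGB intensities with a PCA-based scheme; ColorJitter below is just a simpler stand-in for that color step, not the paper's exact method.

```python
from torchvision import transforms

# Training-time augmentation in the spirit of AlexNet: random crops,
# horizontal flips, and color perturbation. ColorJitter is a simple
# stand-in for the paper's PCA-based RGB intensity alteration.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])

# Each epoch sees a slightly different version of every training image,
# which effectively enlarges the dataset without collecting new photos.
```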

Dropout to Reduce Overfitting

Dropout is now often used as an approach to reduce overfitting, and AlexNet was one of its first large-scale applications: in the first two fully connected layers, each neuron's output is set to zero with probability 0.5 during training.
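Here is a tiny sketch of what dropout does in PyTorch. Note that nn.Dropout uses "inverted" dropout, rescaling the surviving activations by 1/(1−p) during training, whereas the paper instead halved the outputs at test time; the two are equivalent in expectation.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # used before the first two fully connected layers
x = torch.ones(8)

drop.train()               # training mode: each value is zeroed with probability 0.5
print(drop(x))             # e.g. tensor([2., 0., 2., 0., 0., 2., 2., 0.]) -- survivors scaled by 1/(1-p)

drop.eval()                # evaluation mode: dropout does nothing
print(drop(x))             # tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```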

Overlapping Max Pooling

In deep learning, a layer called a pooling layer is commonly used to shrink the feature maps and help suppress overfitting.

The role of the pooling layer is to compress each local region of the feature map into a single value.

While there are variants that output the maximum value or the average value of each region, AlexNet outputs the maximum value, and it makes the pooling windows overlap by using a 3×3 window with a stride of 2; the paper reports that this slightly improves accuracy and makes the network a little harder to overfit.
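In code terms, "overlapping" simply means the pooling window is larger than the stride. A quick PyTorch sketch (the 13×13 input below is assumed to stand in for the conv5 feature map):

```python
import torch
import torch.nn as nn

# Overlapping max pooling as in AlexNet: the 3x3 window is larger than the
# stride of 2, so adjacent pooling regions share a row/column of inputs.
overlapping_pool = nn.MaxPool2d(kernel_size=3, stride=2)

# Non-overlapping pooling, for comparison, uses stride == kernel size.
non_overlapping_pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 256, 13, 13)          # e.g. a conv5-sized feature map
print(overlapping_pool(x).shape)          # torch.Size([1, 256, 6, 6])
print(non_overlapping_pool(x).shape)      # torch.Size([1, 256, 6, 6])
```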

Summary

With the advent of AlexNet, AI research, which had been losing momentum, was suddenly reignited, leading to the AI boom that continues to this day.

After AlexNet, ResNet and other improved architectures appeared, and AlexNet itself is no longer used in practice, but those successors are still built on the breakthroughs AlexNet made.

Furthermore, the breakthroughs made by AlexNet have undoubtedly paved the way for the recent wave of image generation AI.

Please clap, comment, and follow if you liked it!

See you next time!
