Understanding LeNet-5 CNN Architecture (Deep Learning)

Dhiraj Chaudhari
4 min read · Apr 16, 2022


In this article, you will learn about the LeNet-5 architecture. It was introduced in the 1998 research paper “Gradient-Based Learning Applied to Document Recognition” by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.

Introduction

LeNet-5 is one of the first CNN architectures, and the one after which convolutional neural networks gained popularity. It was built to recognize two-dimensional patterns, such as handwritten characters, with minimal preprocessing. The architecture was used commercially to read handwritten digits, for example the amounts written on bank cheques.

MNIST

The database used to train and test the system was built from NIST Special Database 1 (SD-1) and Special Database 3 (SD-3).

The Modified NIST (MNIST) dataset contains digits that have been size-normalized and centered.

It has 60,000 training images and 10,000 test images.

The network takes a 32×32 image as input (the 28×28 MNIST digits are padded to 32×32) and outputs 10 classes, one for each digit.
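To make this concrete, here is a minimal sketch of loading MNIST and padding the 28×28 digits to 32×32 with torchvision (the `./data` path is illustrative):

```python
from torchvision import datasets, transforms

# Pad the 28x28 MNIST digits with 2 pixels on each side to get the
# 32x32 input that LeNet-5 expects.
transform = transforms.Compose([
    transforms.Pad(2),
    transforms.ToTensor(),
])

train_set = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

x, y = train_set[0]
print(x.shape, len(train_set), len(test_set))  # torch.Size([1, 32, 32]) 60000 10000
```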

LeNet-5 Architecture

[Figure: LeNet-5 architecture, from the original research paper]

LeNet-5 comprises 7 layers, not counting the 32×32 input layer. Convolutional layers are labeled Cx, sub-sampling (pooling) layers are labeled Sx, and fully connected layers are labeled Fx, where x is the layer index.

[Animation: convolution operation with a 3×3 kernel]

Layer C1 is a convolutional layer with 6 feature maps stacked one over the other. Each unit in each feature map is connected to a 5×5 [filter] neighborhood in the input. It learns 156 parameters in total, of which 150 (5×5×1×6) are weights and 6 are biases. The output dimension of C1 is 28×28×6.
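You can verify the C1 parameter count with a quick PyTorch sketch (assuming a standard Conv2d layer, as in modern reimplementations):

```python
import torch.nn as nn

c1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
print(sum(p.numel() for p in c1.parameters()))  # 156 = 5*5*1*6 weights + 6 biases
```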

Layer S2 is a sub-sampling layer with 6 feature maps of size 14×14. It is obtained by 2×2 average pooling over C1 with a stride of 2, which halves each spatial dimension of C1. The dimension of S2 is 14×14×6. Since it simply averages each 2×2 window of C1, there is no learning in this layer. (In the original paper, each sub-sampling map did have a trainable coefficient and bias, but the common modern reading treats S2 as plain average pooling.)

[Animation: sub-sampling (pooling) operation]
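A minimal sketch of S2 as plain 2×2 average pooling, showing the halved spatial dimensions and the absence of trainable parameters:

```python
import torch
import torch.nn as nn

s2 = nn.AvgPool2d(kernel_size=2, stride=2)
out = s2(torch.randn(1, 6, 28, 28))  # a batch of C1-shaped activations
print(out.shape)                                 # torch.Size([1, 6, 14, 14])
print(sum(p.numel() for p in s2.parameters()))   # 0 -- nothing to learn
```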

Layer C3 is a convolutional layer with 16 feature maps stacked one over the other. Each unit in each feature map is connected to 5×5 [filter] neighborhoods in S2 with a stride of 1. In the original paper, each C3 map is connected to only a subset of the 6 S2 maps, giving 1,516 trainable parameters; with full connectivity, as in most modern implementations, it would be (5×5×6 + 1) × 16 = 2,416. The output dimension of C3 is 10×10×16.
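A quick check of the C3 parameter count under full connectivity (the paper's sparse connection table is not reproduced here):

```python
import torch.nn as nn

# Modern reimplementations connect each C3 map to all 6 S2 maps:
c3 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)
print(sum(p.numel() for p in c3.parameters()))  # 2416 = (5*5*6 + 1) * 16
```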

Layer S4 is a sub-sampling layer with 16 feature maps of size 5×5. It is obtained by 2×2 average pooling over C3 with a stride of 2, which halves each spatial dimension of C3. The dimension of S4 is 5×5×16. As with S2, there is no learning in this layer.

Layer C5 is a convolutional layer with 120 feature maps. Each unit is connected to a 5×5 [filter] neighborhood across all 16 maps of S4. Since S4 is 5×5 and the filter is also 5×5, the resulting feature map is 1×1, so C5 amounts to a full connection between S4 and its 120 units, with (5×5×16 + 1) × 120 = 48,120 trainable parameters. C5 is still labeled a convolution rather than a fully connected layer because, if the LeNet-5 input were made bigger with everything else kept constant, the feature map dimension would be larger than 1×1.
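This is easy to demonstrate: the same 5×5 convolution produces a 1×1 map on a 5×5 input but a larger map on a larger input (a small PyTorch sketch, not from the paper):

```python
import torch
import torch.nn as nn

c5 = nn.Conv2d(in_channels=16, out_channels=120, kernel_size=5)

# On S4's 5x5 maps the output is 1x1, so C5 behaves like a fully
# connected layer with (5*5*16 + 1) * 120 = 48,120 parameters:
print(c5(torch.randn(1, 16, 5, 5)).shape)       # torch.Size([1, 120, 1, 1])
print(sum(p.numel() for p in c5.parameters()))  # 48120

# With a larger input the same layer yields a larger feature map,
# which is why it is labelled a convolution:
print(c5(torch.randn(1, 16, 8, 8)).shape)       # torch.Size([1, 120, 4, 4])
```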

Layer F6 is a fully connected layer with 84 units, connected to C5. It has 120×84 + 84 = 10,164 trainable parameters, of which 120×84 are weights and 84 are biases. Layer F7, the output layer, has 10 units, since there are only 10 digits. (In the original paper this output layer consists of Euclidean RBF units; modern implementations typically replace it with a plain fully connected layer followed by softmax.)
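A quick sketch verifying the F6 and output-layer parameter counts, using a plain linear output layer as modern implementations do:

```python
import torch.nn as nn

f6 = nn.Linear(120, 84)
output = nn.Linear(84, 10)  # stands in for the paper's RBF output units
print(sum(p.numel() for p in f6.parameters()))      # 10164 = 120*84 + 84
print(sum(p.numel() for p in output.parameters()))  # 850 = 84*10 + 10
```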

Activation function: tanh

Pooling: 2×2 average pooling

Kernel size: 5×5

Stride: 1 for convolution, 2 for pooling

Padding: none (the 28×28 digits are already padded to 32×32 before entering the network)
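Putting these pieces together, here is a minimal PyTorch sketch of LeNet-5 under the assumptions above (tanh activations, average pooling, full connectivity in C3, and a linear output layer in place of the paper's RBF units):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """A minimal LeNet-5 sketch, not the paper's exact formulation."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),     # C1: 32x32x1 -> 28x28x6
            nn.Tanh(),
            nn.AvgPool2d(2, stride=2),          # S2: -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),    # C3: -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(2, stride=2),          # S4: -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5),  # C5: -> 1x1x120
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                 # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),         # output layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = LeNet5()
logits = model(torch.randn(1, 1, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```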

Animation of LeNet-5

Here you can see digits being scanned through the network, with the network outputting its estimate of which digit it is seeing. You can also see that even when a digit is shifted within the image, the network still recognizes it correctly.

Summary

Input: 32×32×1 image
C1: 5×5 convolution → 28×28×6 (156 parameters)
S2: 2×2 average pooling → 14×14×6 (no learning)
C3: 5×5 convolution → 10×10×16 (1,516 parameters with the paper's sparse connections)
S4: 2×2 average pooling → 5×5×16 (no learning)
C5: 5×5 convolution → 1×1×120 (48,120 parameters)
F6: fully connected → 84 units (10,164 parameters)
Output: fully connected → 10 units, one per digit

Conclusion

  1. The spatial dimensions (nh, nw) decrease while the number of channels nc increases as we go deeper [h = height, w = width, c = channels].
  2. LeNet-5 used average pooling, which is not used much these days; we generally use max pooling now.
  3. The activations used in LeNet were sigmoid and tanh, since it is an old research paper. If we built it today, we would use ReLU.
  4. Toward the output we have fully connected layers, which can be expensive when there are many outputs, e.g. 1000 classes (see the sketch below).
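To illustrate point 4, a quick comparison of output-layer parameter counts for 10 classes versus a hypothetical 1000 classes:

```python
import torch.nn as nn

ten = nn.Linear(84, 10)
thousand = nn.Linear(84, 1000)  # hypothetical 1000-class output
print(sum(p.numel() for p in ten.parameters()))       # 850
print(sum(p.numel() for p in thousand.parameters()))  # 85000
```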

I hope you found the article useful.

  1. Follow me on Medium
  2. Connect with me on LinkedIn
