LeNet with TensorFlow

mrgrhn · Published in Analytics Vidhya · 6 min read · Jan 18, 2021

LeNet is considered the ancestor of convolutional neural networks and is a well-known model in the computer vision community.

LeNet for Digit Recognition

Introduction

LeNet is one of the most fundamental deep learning models, primarily used to classify handwritten digits. Proposed by Yann LeCun[1] in 1989, it is one of the earliest neural networks to employ the convolution operation. By combining the newly developed back-propagation algorithm with convolutional neural networks, LeCun et al. became pioneers of image classification with deep learning. The name LeNet is mostly used interchangeably with LeNet-5, the fifth and best-known iteration of the architecture.

This tutorial is intended for beginners and demonstrates a basic TensorFlow implementation of LeNet on the MNIST dataset. References to the related papers are shared at the end of the blog post. For a better understanding of the underlying concepts, introductory videos on backpropagation and convolutional neural networks are a good starting point.

LeNet with TensorFlow

TensorFlow is one of the most popular deep learning frameworks, if not the most popular, and it lets machine learning enthusiasts prototype models quickly. Although it is not a good habit to build models with the default settings of TensorFlow without inspecting the consequences of certain design choices, Keras (a deep learning library that runs on a TensorFlow backend) makes it fairly easy to test flexible models when used properly.

Without further introduction, we can jump into the LeNet implementation with TensorFlow. The code is written as a Jupyter Notebook[5] hosted on Google Colab[6]. The link to the whole notebook is shared just before the References section. To use Google's free GPUs for your computations, follow Runtime > Change runtime type > Hardware accelerator: GPU in your Colab notebook.
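
To confirm that the GPU is actually visible to TensorFlow in your session, a quick sanity check (using the standard tf.config API):

import tensorflow as tf

# An empty list means the runtime is still running on CPU.
print(tf.config.list_physical_devices('GPU'))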

First, the needed libraries are imported: TensorFlow to design the network and Matplotlib to visualize the results. Moreover, Keras offers architecture and training templates as well as common datasets, which are handy to use.

import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras import datasets, layers, models, losses

The Data

After importing the libraries, the dataset is downloaded with a single line of code. Be careful to unpack the outputs in the correct format, as below:

(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train.shape
(60000, 28, 28)
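
Before any preprocessing, it is worth eyeballing a sample or two. A minimal sketch using the already-imported Matplotlib (the index 0 here is arbitrary):

# Display the first training image together with its label.
plt.imshow(x_train[0], cmap='gray')
plt.title(f'Label: {y_train[0]}')
plt.show()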

The original LeNet model receives 32-by-32 images, so the 28-by-28 MNIST images are padded with zeros, and the 8-bit (0–255 range) pixel values are scaled to between 0 and 1:

x_train = tf.pad(x_train, [[0, 0], [2,2], [2,2]])/255
x_test = tf.pad(x_test, [[0, 0], [2,2], [2,2]])/255
x_train.shape
TensorShape([60000, 32, 32])
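
If the paddings argument looks cryptic: each [before, after] pair gives the number of zeros added at the start and end of the corresponding axis, so [[0, 0], [2, 2], [2, 2]] leaves the batch axis alone and adds a 2-pixel border in height and width. A toy illustration:

# Pad one zero before and after each axis of a 2x2 tensor:
# the result is a 4x4 tensor with a zero border.
t = tf.constant([[1, 2], [3, 4]])
print(tf.pad(t, [[1, 1], [1, 1]]))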

However, most CNNs accept 4-dimensional tensors as inputs, with dimensions for batch size, height, width, and channel. Since MNIST images are grayscale, the channel dimension is missing. We need to expand the tensor and create a dummy dimension at axis 3. (Recall that the tensor initially has axes 0, 1, and 2.)

x_train = tf.expand_dims(x_train, axis=3, name=None)
x_test = tf.expand_dims(x_test, axis=3, name=None)
x_train.shape
TensorShape([60000, 32, 32, 1])
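
A shorter, equivalent idiom adds the channel axis with NumPy-style indexing and tf.newaxis; a minimal sketch on a dummy tensor:

# tf.newaxis appends a singleton dimension, just like expand_dims.
dummy = tf.zeros([4, 32, 32])
print(dummy[..., tf.newaxis].shape)  # (4, 32, 32, 1)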

The last 2000 samples of the training set are reserved for the validation set, which is mainly used for tuning the hyperparameters of the model. The test set is never used before the final evaluation.

x_val = x_train[-2000:,:,:,:] 
y_val = y_train[-2000:]
x_train = x_train[:-2000,:,:,:]
y_train = y_train[:-2000]
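
Keras can also hold out the validation set for you via the validation_split argument of model.fit; the manual slicing above is used because it makes the split explicit. A sketch of the alternative (commented out, since the model is only defined below):

# Hypothetical alternative to the manual split:
# history = model.fit(x_train, y_train, batch_size=64, epochs=40,
#                     validation_split=2000/60000)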

The Model

LeNet has a pretty simple architecture. There are 3 convolutional layers, each with 5-by-5 kernels and 6, 16, and 120 feature maps, respectively. In between, there are 2 subsampling layers implemented as average pooling. The convolutional layers use stride 1, while the average pooling layers use 2-by-2 windows with stride 2 (the Keras defaults). The convolutional layers are followed by hyperbolic tangent activations, whereas the subsampling layers are followed by sigmoid nonlinearities. After the last convolutional layer, the activations are flattened and fed into fully connected layers with 84 and 10 neurons. The output of the last layer (after the softmax operation) represents the probabilities of the classes (digits 0 to 9) for the input image.

Nowadays, tanh and sigmoid activations are rarely used due to saturation problems; ReLU and Leaky ReLU are much more popular. Also, two consecutive 3-by-3 layers are often preferred over a single 5-by-5 layer, since the same receptive field size is obtained with considerably fewer trainable parameters.
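
To see the saving concretely: with C input and C output channels, a single 5-by-5 layer costs 5·5·C·C = 25C² weights, while two stacked 3-by-3 layers cost 2·3·3·C·C = 18C² and cover the same 5-by-5 receptive field. A quick check (bias terms ignored, channel count arbitrary):

C = 16  # arbitrary channel count, for illustration only
print(5 * 5 * C * C)      # 6400 weights for one 5x5 layer
print(2 * 3 * 3 * C * C)  # 4608 weights for two stacked 3x3 layers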

model = models.Sequential()
model.add(layers.Conv2D(6, 5, activation='tanh', input_shape=x_train.shape[1:]))
model.add(layers.AveragePooling2D(2))
model.add(layers.Activation('sigmoid'))
model.add(layers.Conv2D(16, 5, activation='tanh'))
model.add(layers.AveragePooling2D(2))
model.add(layers.Activation('sigmoid'))
model.add(layers.Conv2D(120, 5, activation='tanh'))
model.add(layers.Flatten())
model.add(layers.Dense(84, activation='tanh'))
model.add(layers.Dense(10, activation='softmax'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 28, 28, 6) 156
_________________________________________________________________
average_pooling2d (AveragePo (None, 14, 14, 6) 0
_________________________________________________________________
activation (Activation) (None, 14, 14, 6) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 10, 10, 16) 2416
_________________________________________________________________
average_pooling2d_1 (Average (None, 5, 5, 16) 0
_________________________________________________________________
activation_1 (Activation) (None, 5, 5, 16) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 1, 1, 120) 48120
_________________________________________________________________
flatten (Flatten) (None, 120) 0
_________________________________________________________________
dense (Dense) (None, 84) 10164
_________________________________________________________________
dense_1 (Dense) (None, 10) 850
=================================================================
Total params: 61,706
Trainable params: 61,706
Non-trainable params: 0
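
The parameter counts in the summary can be verified by hand: a Conv2D layer has (kernel_height · kernel_width · input_channels + 1) · filters parameters (the +1 is the bias), and a Dense layer has (inputs + 1) · units:

# Reproduce the Param # column of model.summary():
print((5 * 5 * 1 + 1) * 6)     # conv2d: 156
print((5 * 5 * 6 + 1) * 16)    # conv2d_1: 2416
print((5 * 5 * 16 + 1) * 120)  # conv2d_2: 48120
print((120 + 1) * 84)          # dense: 10164
print((84 + 1) * 10)           # dense_1: 850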

The model is optimized with the Adam optimizer[7]. Sparse categorical cross-entropy measures the negative natural logarithm of the probability the model assigns to the true class. For instance, if the final output of the model is a vector such as [0.03, 0.78, …, 0.05] and the true class of the input image is 1, the loss for this instance will be -ln(0.78) = 0.248. The accuracy metric is reported for each epoch. The model is trained for 40 epochs with a batch size of 64. The history object returned by fit has a property named history, which is useful for keeping track of the training phase.
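
The -ln(0.78) example can be reproduced directly with the Keras loss function. The probability vector below is made up purely for illustration (its entries sum to 1):

import numpy as np

# Hypothetical softmax output for one image whose true class is 1.
y_pred = np.array([[0.03, 0.78, 0.02, 0.02, 0.02, 0.03, 0.02, 0.03, 0.02, 0.03]])
y_true = np.array([1])
print(losses.sparse_categorical_crossentropy(y_true, y_pred).numpy())  # ~[0.248]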

model.compile(optimizer='adam', loss=losses.sparse_categorical_crossentropy, metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=64, epochs=40, validation_data=(x_val, y_val))
Epoch 1/40
907/907 [==============================] - 11s 6ms/step - loss: 1.8860 - accuracy: 0.2950 - val_loss: 0.2266 - val_accuracy: 0.9450
Epoch 2/40
907/907 [==============================] - 4s 5ms/step - loss: 0.3354 - accuracy: 0.8943 - val_loss: 0.1769 - val_accuracy: 0.9490
...
Epoch 40/40
907/907 [==============================] - 5s 5ms/step - loss: 0.0375 - accuracy: 0.9875 - val_loss: 0.0428 - val_accuracy: 0.9915

Results

Losses and accuracies for the training and validation sets are stored in the history object and plotted with the Matplotlib library.

fig, axs = plt.subplots(2, 1, figsize=(15, 15))

axs[0].plot(history.history['loss'])
axs[0].plot(history.history['val_loss'])
axs[0].title.set_text('Training Loss vs Validation Loss')
axs[0].legend(['Train', 'Val'])

axs[1].plot(history.history['accuracy'])
axs[1].plot(history.history['val_accuracy'])
axs[1].title.set_text('Training Accuracy vs Validation Accuracy')
axs[1].legend(['Train', 'Val'])
Losses and Accuracies for Training and Validation Sets

The testing accuracy of the model came out at 98.51%, which is quite satisfactory for this simple task and, reasonably, slightly lower than the training accuracy. The graphs above and the testing accuracy indicate that the model is capable of learning the patterns of digit drawings and generalizes well enough not to overfit.

model.evaluate(x_test, y_test)
313/313 [==============================] - 1s 2ms/step - loss: 0.0468 - accuracy: 0.9851
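
For individual predictions rather than aggregate metrics, model.predict returns the softmax vector for each image; taking the argmax gives the predicted digit. A minimal sketch on the first five test images:

# Predicted classes for the first 5 test images.
probs = model.predict(x_test[:5])
print(tf.argmax(probs, axis=1).numpy())
print(y_test[:5])  # ground-truth labels for comparison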

lenet_tensorflow.ipynb

You can check out the whole notebook uninterrupted on GitHub and Google Colab.

Conclusion

The aims of contemporary CNNs go well beyond simple tasks like MNIST classification; however, analyzing fundamental models like LeNet on common datasets like MNIST is indispensable for grasping the ideas underlying more complex models. Trying to replicate simple models and obtain similar results is a good first step for diving into new tasks. Examining the shortcomings of these models is equally important and useful. In this post, the LeNet architecture was explained and implemented with TensorFlow.

Hope you enjoyed it. See you in the following posts on CNN models.

Best wishes…

mrgrhn


References

  1. LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; Jackel, L. D. (December 1989). “Backpropagation Applied to Handwritten Zip Code Recognition”. Neural Computation. 1(4): 541–551.
  2. LeCun, Y. (June 1989). “Generalization and Network Design Strategies”. Technical Report CRG-TR-89-4. Department of Computer Science, University of Toronto.
  3. LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; Jackel, L. D. (June 1990). “Handwritten Digit Recognition with a Back-Propagation Network”. Advances in Neural Information Processing Systems 2: 396–404.
  4. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (1998). “Gradient-Based Learning Applied to Document Recognition”. Proceedings of the IEEE. 86(11): 2278–2324.
  5. https://jupyter-notebook.readthedocs.io/en/stable/notebook.html
  6. https://colab.research.google.com/notebooks/intro.ipynb
  7. Kingma, D.; Ba, J. (2014). “Adam: A Method for Stochastic Optimization”. International Conference on Learning Representations.
