Dense or Convolutional Neural Network

Part 1 — Architecture, geometry, performance

Antoine Hue
Analytics Vidhya
8 min read · Jan 29, 2020


When it comes to designing a deep neural network (DNN), there are a few top-level architecture choices. One of them is: should I use convolutional layers or dense (also known as perceptron, fully connected, or inner product) layers?

In this series of posts, we will review the differences between these layer types from an architecture, geometry, and performance point of view.

In this first part, we will design and compare the performance of two networks acting as MNIST digit classifiers [1]. The first is a dense-only network, as proposed in the TensorFlow tutorials [2]. The second is a landmark of deep learning usually associated with the MNIST dataset: the LeNet-5 Convolutional Neural Network (CNN) [3].

Architecture review

Convolutional layers are so called because they operate like the convolution filters used in image and video processing, and more generally in signal processing (sound, telecommunications…). A very good introduction to convolutional layers is given in [4]. Convolutions can have any number of dimensions; in the following we will focus on the 2D convolutions used in image processing, and in particular in the LeNet-5 network.

Dense equivalent of a convolution filter

From an architecture point of view, any single convolution can be replaced by a dense layer performing the same association of neighboring pixels for each pixel: one neuron per output pixel, with non-zero coefficients only on that pixel's neighbors. The convolutional layer additionally enforces parameter sharing: every pixel is processed with the same weights by design rather than by learning. This leads to a dramatic reduction in the number of parameters to learn, while keeping very good performance, as we will see in the evaluation section.
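To make the parameter saving concrete, here is a small illustrative sketch (not from the article's notebook) comparing the coefficient count of a single 3x3 convolution with 6 filters to that of a dense layer producing the same number of output values from a flattened 28x28 image:

# Illustration of parameter sharing: convolution vs. fully connected layer
from tensorflow.keras import layers, models

# One 3x3 convolution with 6 filters applied to a 28x28 grayscale image
conv = models.Sequential([
    layers.Conv2D(filters=6, kernel_size=(3, 3), input_shape=(28, 28, 1))
])

# A dense layer producing the same 26*26*6 = 4 056 outputs from the flattened image
dense = models.Sequential([
    layers.Flatten(input_shape=(28, 28, 1)),
    layers.Dense(26 * 26 * 6)
])

print(conv.count_params())    # 3*3*1*6 + 6 = 60 shared coefficients
print(dense.count_params())   # 784*4056 + 4056 = 3 183 960 independent coefficients

Even a sparse dense layer connecting each output only to its 3x3 neighborhood would need about 10 coefficients per output, far more than the 60 coefficients shared across all outputs of the convolution.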

There are other geometrical properties related to convolutional layers, this will be dealt with in Part-3.

The models

For our comparison, we will start from the Dense model of the TensorFlow tutorial [2], and an implementation of LeNet-5 based on Keras [5].

Dense neural network for MNIST classification

The Dense implementation is based on a single large 512-unit hidden layer, followed by a final layer computing the softmax probabilities for each of the 10 categories corresponding to the 10 digits:

from tensorflow.keras import models, layers, activations

modelDense0 = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),              # 28x28 image flattened to a 784-wide vector
    layers.Dense(512, activation=activations.relu),
    layers.Dropout(0.2),                               # regularization between the two dense layers
    layers.Dense(10, activation=activations.softmax)   # probabilities for the 10 digit classes
])

The input image of size 28x28 pixels is transformed into a vector in the Flatten layer, giving a feature space of width 784.

The number of coefficients of this DNN is dominated by the first layer: 512 neurons, each connected to the 784 inputs, that is 784*512 + 512 = 401 920 weights to compute, including the biases. The output layer adds 512*10 + 10 = 5 130, for a total of 407 050 coefficients.
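These figures can be cross-checked directly on the Keras model (assuming the modelDense0 definition above):

# Per-layer and total coefficient counts of modelDense0
for layer in modelDense0.layers:
    print(layer.name, layer.count_params())
print("Total:", modelDense0.count_params())   # 407 050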

Dense neural network loss and accuracy on MNIST train set

The LeNet-5 implementation has more layers but none of them is as big:

modelLeNet0 = models.Sequential([
    layers.Conv2D(filters=6, kernel_size=(3, 3),
                  activation=activations.relu,
                  input_shape=(32, 32, 1)),   # assumed: MNIST digits padded to 32x32, as in the original LeNet-5
    layers.AveragePooling2D(),
    layers.Conv2D(filters=16, kernel_size=(3, 3),
                  activation=activations.relu),
    layers.AveragePooling2D(),
    layers.Flatten(),
    layers.Dense(units=120, activation=activations.relu),
    layers.Dense(units=84, activation=activations.relu),
    layers.Dense(units=10, activation=activations.softmax)
])

The original implementation used the Tanh activation function; it is now more common to use ReLU, which leads to faster training and a lower probability of vanishing gradients.

LeNet5 CNN for MNIST classification

There are two convolutional layers based on 3x3 filters, each followed by average pooling. The feature space is thus reduced from the 32 x 32 x 1 input (the 28 x 28 MNIST digits padded to 32 x 32, as in the original LeNet-5) down to 6 x 6 x 16. They are followed by two dense hidden layers of 120 and 84 neurons, and finally the same 10-neuron softmax layer to compute the probabilities. The total number of coefficients of LeNet-5 is 101 770, about a quarter of the Dense DNN.

LeNet5 neural network loss and accuracy on MNIST train set

The performance of these two baseline networks has been measured on MNIST:

  • Dense DNN, test accuracy = 97.5%
  • LeNet-5 CNN, test accuracy = 98.5%

There is already a clear advantage to the convolutional neural network, both in size and in performance. The only drawback is the training time, which is longer given the number of layers: it used to take several days or weeks to train in the 90s; nowadays, it takes a few minutes.
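For reference, a minimal training and evaluation loop for such a baseline might look as follows. This is a sketch assuming the Adam optimizer, sparse categorical cross-entropy and 10 epochs; the exact training setup is not specified in this post (see the notebook [8]):

# Hypothetical training/evaluation sketch for the Dense baseline
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0    # scale pixel values to [0, 1]

modelDense0.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
modelDense0.fit(x_train, y_train, epochs=10, validation_split=0.1)

test_loss, test_acc = modelDense0.evaluate(x_test, y_test)
print('Test accuracy:', test_acc)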

Model optimization

The two networks evaluated above overfit: their performance drops when tested on new samples. This is observable in two ways:

  • There is a large gap in loss and accuracy between the training and validation evaluations
  • After an initial sharp decrease, the validation loss worsens as training goes on

This is not unexpected since the number of training samples is 60 000, smaller than the number of coefficients to train: the networks are memorizing the training samples rather than learning to generalize.

In the following, we will optimize these two networks by adding regularization and by looking for the best size-performance tradeoff.

Regularization

Regularization is a set of techniques to speed up the convergence during training and to avoid overfitting. There are a few families of regularization:

· Penalization: a penalization term is added to the loss that is back-propagated, in order to push the coefficients back toward 0. The classical penalizations are Lasso (based on the L1 norm) and Ridge (based on the L2 norm); many variants use other norms, and the Elastic-Net [7] combines Lasso and Ridge (a small Keras sketch follows this list).

· Early stopping: since overfitting happens when the network starts memorizing the training samples, and since it is observable through a performance drop on the validation set, training is stopped as soon as such an inflexion is detected.

· Dropout [6]: for each batch, a random portion of the outputs is set to zero in order to avoid strong dependencies between portions of adjacent layers. The technique amounts to training an ensemble of thinned sub-networks, similar in spirit to bagging ensembles of decision trees.

· Data augmentation: more training samples are created from the existing ones through geometrical transformations (translation, scaling, rotation…) or filters (blur).

· Model size reduction: shrinking the network to improve the ratio of training samples to coefficients.
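To make the first two families concrete, here is a minimal Keras sketch (not the article's exact configuration) combining an L2 (Ridge) penalty on a dense layer with an early-stopping callback; the penalty value and patience are illustrative only:

# Illustration: L2 penalization plus early stopping
from tensorflow.keras import models, layers, activations, regularizers, callbacks

modelPenalized = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(512, activation=activations.relu,
                 kernel_regularizer=regularizers.l2(1e-5)),   # Ridge penalty pushing weights toward 0
    layers.Dense(10, activation=activations.softmax)
])

# Stop training when the validation loss stops improving
early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                     restore_best_weights=True)
# modelPenalized.fit(x_train, y_train, epochs=50,
#                    validation_split=0.1, callbacks=[early_stop])

An Elastic-Net penalty can be obtained in the same way with regularizers.l1_l2(l1=..., l2=...).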

Within the Dense model above, there is already a dropout between the two dense layers. Given the observed overfitting, we have applied the recommendations of the original Dropout paper [6]: 20% dropout on the input and 50% between the two layers. The overfitting is much lower, as observed on the following loss and accuracy curves, and the performance of the Dense network is now 98.5%, as high as LeNet-5!
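In Keras terms, the regularized Dense model might look like this (a sketch reflecting the dropout placement described above, not necessarily the notebook's exact code):

# Dense model with 20% dropout on the input and 50% between the dense layers
from tensorflow.keras import models, layers, activations

modelDenseDrop = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dropout(0.2),                               # 20% dropout on the input
    layers.Dense(512, activation=activations.relu),
    layers.Dropout(0.5),                               # 50% dropout between the two dense layers
    layers.Dense(10, activation=activations.softmax)
])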

On the LeNet-5 network, we have also studied the impact of regularization. At the time it was created, in the 90s, penalization-based regularization was a hot topic; Dropout [6], however, was only introduced in the 2010s. Using grid search, we have measured and tuned the regularization parameters for Elastic-Net (combined L1-L2) and for Dropout.

LeNet5 with tuned Dropout leading to accuracy of 99.4%

We have found that the best set of parameters is:

  • For penalization: L2 regularization on the first dense layer with parameter λ = 10⁻⁵, leading to a test accuracy of 99.15%
  • For dropout: dropout applied on the inputs of the first two dense layers with rates of 40% and 30%, leading to a test accuracy of 99.4%

Dropout performs better and is simpler to tune.
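As an illustration, the tuned LeNet-5 could be written as follows. This is a sketch based on the grid-searched rates reported above (40% and 30% on the inputs of the two dense hidden layers); the 32x32 grayscale input is an assumption consistent with the architecture description:

# LeNet-5 with dropout on the inputs of the first two dense layers
from tensorflow.keras import models, layers, activations

modelLeNetDrop = models.Sequential([
    layers.Conv2D(filters=6, kernel_size=(3, 3), activation=activations.relu,
                  input_shape=(32, 32, 1)),            # assumed: MNIST padded to 32x32
    layers.AveragePooling2D(),
    layers.Conv2D(filters=16, kernel_size=(3, 3), activation=activations.relu),
    layers.AveragePooling2D(),
    layers.Flatten(),
    layers.Dropout(0.4),                               # 40% dropout before the first dense layer
    layers.Dense(units=120, activation=activations.relu),
    layers.Dropout(0.3),                               # 30% dropout before the second dense layer
    layers.Dense(units=84, activation=activations.relu),
    layers.Dense(units=10, activation=activations.softmax)
])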

Model size optimization

Since we want to compare the Dense and Convolutional architectures, it makes no sense to simply use the largest network possible: for any CNN there is an equivalent based on the Dense architecture. In [6], results are reported on MNIST with two dense layers of 2048 units reaching accuracy above 99%. Looking at performance only would therefore not lead to a fair comparison.

Dense DNN accuracy as function of layer #0 size

You may also have some extra requirements to optimize either processing time or cost.

And as explained above, decreasing the network size also reduces overfitting.

That is why we have looked for the best performance-size tradeoff on the two regularized networks. Here are our results:

  • Dense network with Dropout, with a hidden layer of 128 units, that is 101 770 coefficients, test accuracy of 98%
  • LeNet-5 network with Dropout, with dense hidden layers of 60 and 42 units (half of the initial sizes), 38 552 coefficients, test accuracy of 99.2%

The CNN is the clear winner: it performs better with only about a third of the number of coefficients.

Conclusion

In this post, we have reviewed the architectural commonalities and differences between a dense-based neural network and a network with convolutional layers. We have shown that the latter consistently performs better, with a smaller number of coefficients.

We have also shown that, given some models available on the Internet, it is always a good idea to evaluate and tune them. Going through this process, you will verify that the selected model matches your actual requirements, get a better understanding of its architecture and behavior, and you may be able to apply new techniques that were not available at the time of its design, for example Dropout on LeNet-5.

The code and details of this survey are available in the Notebook (HTML / Jupyter) [8].

In the next parts, we will continue our comparison by looking at the visualization of internal layers (Part 2) and at the robustness of each network to geometrical transformations (Part 3).

Do not forget to leave a comment or feedback below. You may now give a few claps and continue to Part 2 on interpretability.

References:

  1. MNIST dataset, Yann LeCun et al. — http://yann.lecun.com/exdb/mnist/
  2. Dense implementation of the MNIST classifier, TensorFlow tutorials — https://www.tensorflow.org/tensorboard/get_started
  3. Gradient-Based Learning Applied to Document Recognition, LeCun et al. — http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf
  4. A Beginner’s Guide to Convolutional Neural Networks (CNNs), Suhyun Kim — https://towardsdatascience.com/a-beginners-guide-to-convolutional-neural-networks-cnns-14649dbddce8
  5. LeNet implementation with Tensorflow Keras — https://colab.research.google.com/drive/1CVm50PGE4vhtB5I_a_yc4h5F-itKOVL9
  6. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Nitish Srivastava et al. — http://jmlr.org/papers/v15/srivastava14a.html
  7. Regularization and variable selection via the elastic net, Hui Zou and Trevor Hastie — https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.4696
