Deep study of a not very deep neural network. Part 1: What’s in our data

Rinat Maksutov
Towards Data Science
10 min read · Apr 28, 2018


Introduction

People who begin their journey in Deep Learning are often confused by the problem of selecting the right configuration and hyperparameters for their neural networks. Various courses, tutorials and textbooks only briefly describe the available options and don’t provide guidance on their applicability and performance. In most tutorials, it is simply stated that some function, optimizer, or initializer is preferred over others, without explaining why, or under what conditions. Certain parameters, like the number of units or epochs, are sometimes given without any justification. Students are told to ‘play around’ with them in their spare time to see whether that improves the results. Various authors refer to the process of finding the best configuration for a neural network as an art, and say that the ability to pick the right values comes with experience. I don’t believe that’s always true. Some common knowledge should be made easily available to people taking their first steps in this field, without the need to go through hundreds of research papers.

So I’m starting a series of articles where I describe the results of an experiment I’ve been conducting for the last couple of months. Namely, I have evaluated almost every tunable bit of a very simple neural network to see how the changes affect the resulting accuracy.

Specifically, in this series we will see:

  • How the statistics of the training data affect the quality of the model;
  • Which combinations of an activation function and an optimizer work better for a fully-connected neural network;
  • What effect a learning rate that is too high has, and how to deal with it;
  • How to make the training more stable and reproducible;
  • How initializers, regularizers and batch normalization influence the internals of a neural network;
  • How the performance of the model depends on the number of units and layers.

In this series I will focus solely on a fully-connected neural network designed to classify handwritten digits from the famous MNIST dataset. The reasons for using MNIST are simple: it is well-balanced, so we can be sure that the test results are adequate; it is not too large, so we can train models with various parameter combinations in a reasonable time; and it is built into all the major Deep Learning frameworks.

Fully-connected neural network architecture is probably the simplest one out there, and it is often the first one that is presented in books and courses on Deep Learning. Despite being simple, even this architecture has so many parameters and options to play with, that often people just leave everything at their default values, hoping that these values will result in acceptable performance.

Of course there are much more appropriate ways to deal with MNIST, like using Convolutional Neural Networks. But here our task is not to beat the state-of-the-art model’s score. Instead, we are focusing on the role of each parameter of our network in the resulting accuracy.

This is not an introductory tutorial, and readers are expected to have at least a basic understanding of neural networks: how they work, and how to build and train them. For building the network I will use Keras with the TensorFlow backend.

Experiment design

The baseline code for our experiment will be taken from the Keras library’s examples. We will be slightly modifying it over the course of this series to allow us to see how changes in its parts affect the accuracy on the test set.

The configurations will be evaluated based on the validation accuracy values, because it is the most objective metric in our case. The loss score does not provide much information, and accuracy on the training set is not representative because of possible overfitting.

Each configuration will be tested 5 times (i.e. 5 neural networks will be trained from scratch for each configuration) in order to reduce the influence of the inevitable randomness, and the resulting accuracy of training these configurations will then be averaged. This will help to ensure that we see more representative results.

Our neural network will initially consist of two fully-connected layers with 64 units each, and one output layer with 10 units and softmax activation.
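For reference, here is a minimal sketch of this starting configuration, assuming Keras with the TensorFlow backend (the exact import paths may differ between Keras versions; the ReLU activation and RMSProp optimizer match the baseline example we start from, and are varied later in the series):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Baseline network: two hidden layers of 64 units each and a
# 10-unit softmax output for the 10 digit classes.
model = Sequential([
    Dense(64, activation='relu', input_shape=(784,)),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```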

Data preparation

The first part of this series discusses various techniques used to transform the data before feeding it into a neural network.

The MNIST dataset contains 60,000 training and 10,000 test images, each 28x28 pixels. Each pixel has a value between 0 and 255, where 0 represents completely black and 255 completely white. In data science, input data is usually scaled into small real numbers. The reason for this is to help the model converge faster and find a better minimum.

Here’s how Geoffrey Hinton formulates this problem: when the error surface is a quadratic bowl, “going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.” And here’s how he visualizes it:

Fig.1 Error surfaces with shifted and scaled inputs. Source: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

There are two approaches to transform the data before feeding it to a neural network, and make the error surface more even.

One is to transform the data so that all vector components fit into the [0, 1] range. This is called normalization. Here we take the minimum value found for each component of our input vectors, subtract it from the respective component, and then divide by that component’s range (its maximum minus its minimum).
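As a sketch, component-wise min-max normalization might look like this (a hypothetical helper, not taken from the original example code):

```python
import numpy as np

def normalize(x):
    # Min-max normalize each component (column) into [0, 1].
    # Constant components (min == max) are mapped to 0 to avoid
    # division by zero.
    x = x.astype('float32')
    mins = x.min(axis=0)
    span = x.max(axis=0) - mins
    span = np.where(span > 0, span, 1.0)
    return (x - mins) / span

data = np.array([[0, 10], [5, 20], [10, 30]])
out = normalize(data)  # each column becomes [0, 0.5, 1]
```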

Another approach is to make the data have zero mean and unit variance (equal to 1). It is achieved by computing the mean and standard deviation of each component, then subtracting the mean from each value of the component and dividing by the standard deviation of the respective component. Sometimes this is also referred to as normalization, but here for the sake of clarity we will call it standardization.
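And a matching sketch for component-wise standardization (again, a hypothetical helper for illustration):

```python
import numpy as np

def standardize(x):
    # Zero mean, unit variance per component (column). Constant
    # components get their std replaced by 1 to avoid division
    # by zero.
    x = x.astype('float32')
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / np.where(std > 0, std, 1.0)

data = np.array([[0, 10], [5, 20], [10, 30]])
z = standardize(data)  # each column now has mean 0 and std 1
```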

When dealing with MNIST we work with 28x28 images, and in order to feed them into a fully-connected network, we first need to flatten them, i.e. transform each image into a 28x28 = 784-dimensional vector. In MNIST, all components of this vector represent pixels, which can only have values from the 0 to 255 range.
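In Keras, `mnist.load_data()` returns the images as arrays of shape (60000, 28, 28), so the flattening is a single reshape. A minimal sketch, with a NumPy stand-in for the actual data:

```python
import numpy as np

# Stand-in for x_train as returned by
# tensorflow.keras.datasets.mnist.load_data():
# a uint8 array of shape (60000, 28, 28).
x_train = np.zeros((60000, 28, 28), dtype='uint8')

# Flatten each 28x28 image into a 784-dimensional vector.
x_train = x_train.reshape(x_train.shape[0], 784)
```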

Therefore it may be natural to assume that each image should have at least one completely black and one completely white pixel, and to simply use the global minimum and maximum (or mean and standard deviation) values across the entire dataset. In reality, each sample in the dataset may have its own statistics, so if you transform the samples using the global values, the dataset as a whole will have the desired statistics, but each individual sample will end up shifted and scaled. To illustrate this, let’s look at the first 4 MNIST samples:

Fig.2 MNIST Sample Images

It is almost certain that each of these images has at least one completely black pixel (i.e. zero-valued). But it may not be the case for the value 255, representing the white color. Look at these samples:

Fig.3 MNIST Sample Images

Surprisingly, these samples have no completely white pixels, even though you might expect the bold bright areas to be white. Therefore, if you normalize them using the global min and max values, the value range of each of these images will be scaled differently than for samples whose min and max values equal the global ones. This is not a big issue for our black-and-white images; the human eye would not even notice such a change. But in other cases, such as with colored images, this may completely transform the image, so it is generally better to normalize each sample independently.
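To see the difference, here is a sketch (with a hypothetical helper, not from the original code) that normalizes each sample by its own min and max, compared with dividing everything by the global maximum of 255:

```python
import numpy as np

def normalize_samplewise(x):
    # Normalize each sample (row) by its own min and max, so every
    # image ends up spanning the full [0, 1] range.
    x = x.astype('float32')
    mins = x.min(axis=1, keepdims=True)
    span = x.max(axis=1, keepdims=True) - mins
    span = np.where(span > 0, span, 1.0)
    return (x - mins) / span

# Two tiny "images": one spans 0..255, the other only 0..200.
batch = np.array([[0, 128, 255],
                  [0, 100, 200]])
dataset_wise = batch / 255.0               # second row peaks at ~0.78
sample_wise = normalize_samplewise(batch)  # both rows peak at exactly 1.0
```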

The same applies to standardization. Intuitively, we may understand that the distribution of pixel values is different for each image, and standardizing them using the global mean and standard deviation may incorrectly transform the samples and ultimately lead to poorer model performance. We will check this assumption soon.

Now let’s look at the plot of the normalized data distribution (it is nearly the same for dataset-wise and sample-wise normalization, so I’ll present only the former):

Fig.4 MNIST data. Pixel values distribution after dataset-wise normalization.

This histogram shows the counts of each discrete value in our data. From the histogram on the left we see that a large portion of the data are zeros. In fact, zero values constitute about 20% of all data points in our normalized data, i.e. 20% of pixels in our MNIST image data are black-colored. If we plot the data distribution without zeros, the next large group is ones, representing completely white pixels.

The statistics for the normalized data:

The histogram for dataset-wise standardized data is identical with the exception that the horizontal axis now has the range from -0.424 to 2.822.

Fig.5 MNIST data. Pixel values distribution after dataset-wise standardization.

The form of the data did not change, but it has been shifted and scaled. Now let’s look at the statistics for dataset-wise standardized data:

The dataset is now standardized as a whole (that weird number in the Mean for Dataset is essentially zero), but each sample has different means and variances. Also note that the shape of the data distribution is unchanged, as confirmed by the same global skewness and kurtosis as for the normalized data.

Now let’s standardize each individual sample. This is the plot of the sample-wise standardized data:

Fig.6 MNIST data. Pixel values distribution after sample-wise standardization.

And the stats:

A significant difference. Now each sample is standardized, but the dataset as a whole is also standardized: the mean is so tiny, that we can count it as zero, and the variance is very close to 1.

Now let’s check, if these transformations really make the difference. We will train our simple neural networks using all four transformed datasets.

Training results

The example code initially uses the RMSProp optimizer and ReLU activation, so we will be using these in our first experiment. We will compare the results of training the same neural network configuration on each of the four data transformation types: normalized dataset- and sample-wise, and standardized dataset- and sample-wise.

They will be compared by the following measures:

  • Averaged on Last Epoch: Validation Accuracy value on the 100th epoch averaged across 5 experiments;
  • Averaged Max Achieved: The highest Validation Accuracy value of the averaged training stats across 5 experiments;
  • Overall Max on Last Epoch: The highest Validation Accuracy value on the 100th epoch out of 5 experiments;
  • Overall Max Achieved: The highest Validation Accuracy value observed across all experiments for this configuration;
  • Averaged Max Epoch: The number of the epoch, when the Averaged Max has been achieved.

Note: to calculate the averaged maximum accuracy in this and further experiments, instead of finding the maximum for each of the 5 experiments of a particular configuration and then averaging these maxima, I average the whole training process: for epoch N I take the accuracy observed in each of the five experiments at epoch N and average them. This gives the averaged training progress, whose maximum accuracy I then take. I believe this approach better represents the typical training results you would get with a particular configuration.
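In code, this averaging could be sketched as follows (random values stand in for the real accuracy logs):

```python
import numpy as np

# acc[i, e]: validation accuracy of run i at epoch e
# (5 runs x 100 epochs); random values stand in for real logs.
acc = np.random.default_rng(0).uniform(0.95, 0.98, size=(5, 100))

avg_curve = acc.mean(axis=0)                 # averaged training progress
avg_max = avg_curve.max()                    # "Averaged Max Achieved"
avg_max_epoch = int(avg_curve.argmax()) + 1  # "Averaged Max Epoch", 1-based
```

Note that the maximum of the averaged curve is always less than or equal to the average of the per-run maxima, which is why the two approaches give different numbers.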

It is clear that with our network configuration, normalized data demonstrates better results than standardized data. From these results it is not possible to tell whether dataset-wise normalization is better than sample-wise, but that may simply be due to the specifics of our dataset. With other datasets you will most likely get better results with sample-wise normalization.

As for standardization, we have confirmed that it was wrong to use the global values to standardize our training data. However, the fact that sample-wise standardization resulted in lower accuracy in our experiment doesn’t mean that this is always the case. In the later parts we will see that in some particular neural network configurations sample-wise standardization may lead to much higher results than with normalized data.

Below are the plots showing the accuracy change over the course of training on four types of data:

Fig.7 Validation accuracy for dataset-wise and sample-wise input data transformations.

The dotted lines represent individual experiments, and the black line is the averaged accuracy across these experiments. Here are all four averages compared:

Fig.8 Average validation accuracy for networks trained on various transformations of the input data.

Again, networks trained on normalized data follow each other very closely, and sample-wise standardized training performs just slightly worse, but not as badly as dataset-wise standardization. Also, note that for the standardized data the training is a bit less stable, i.e. the accuracy fluctuates more, both up and down, as the training progresses. This can partially be solved by adapting the learning rate, which we will try in one of the next parts.

The learning points from the first part:

  • Before building your neural network and starting the training, take a look at your data: the stats and its underlying structure may play a big role in achieving high performance of your model;
  • Normalize or standardize the data component-wise, rather than using global stats values, otherwise you may alter it so that some valuable information hidden in your data becomes lost or corrupted.

The code for the experiment is available on my GitHub. In the next part we will investigate activation functions, which are the bits of a neural network that transform your prepared input data into outputs. Stay tuned!

I’m always happy to meet new people and share ideas, so if you liked the article, consider adding me on LinkedIn.



Technology consultant with experience in mobile and web development, artificial intelligence and systems architecture.