A short story about over-fitting your neural network (or the importance of validation data on Keras)

enrique a. · Coinmonks · Jun 26, 2018 · 4 min read


This is a short story about the importance of adding a validation partition that is evaluated at each epoch, instead of waiting until the end of training to measure how well your trained model predicts instances it never saw during the training phase, or, in other words, how well it generalizes.

Partially inspired by real events

Let’s suppose you are working on a text classifier using Recurrent Neural Networks in Keras: a binary problem where a sentence is given to the network word by word and the expected output is a single value predicting whether the sentence is “positive” or “negative” (whatever that means in your context). Yes, yes, just like the one I detailed in my previous story 😉:

To train the classifier you are NOT using a pre-cooked dataset like the ones provided by Keras. You have created your own set, carefully preparing the training examples and their labels.

At this point you either don’t know it is possible, or you don’t plan to send validation data to the Keras fit function, but you do plan to split the dataset so you can evaluate at the end of training (let’s say 90% training, 10% evaluation). For that purpose you choose to write your own function to shuffle and split the dataset, instead of using an already available one (like sklearn’s train_test_split).

However, you unknowingly make a mistake! Instead of shuffling the training examples (X) and their labels (Y) in unison, you shuffle them independently; as a consequence, all the labels (Y) end up mixed up, as the sketch below illustrates.
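
A minimal sketch of what the bug looks like, using hypothetical array names and shapes standing in for the real dataset; the in-unison shuffle shown afterwards is one correct alternative:

```python
import numpy as np

# Hypothetical stand-ins for the real dataset: 1000 encoded sentences (X)
# padded to 50 tokens each, and their binary labels (Y).
X = np.random.randint(0, 10000, size=(1000, 50))
Y = np.random.randint(0, 2, size=(1000,))

# The mistake: shuffling X and Y independently destroys the example/label pairing.
np.random.shuffle(X)
np.random.shuffle(Y)

# One correct alternative: draw a single permutation and apply it to both arrays.
indices = np.random.permutation(len(X))
X, Y = X[indices], Y[indices]

# Then split 90% for training, 10% for evaluation.
split = int(0.9 * len(X))
X_train, Y_train = X[:split], Y[:split]
X_eval, Y_eval = X[split:], Y[split:]
```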

You have messed up the labels! But since you don’t know it yet, you start the training. Because there is no longer any real relationship between the true nature of a sentence and its label, the training will be a total mess, right? Right?

Well, contrary to what we might expect, at first glance it doesn’t look like a total mess:

As you can see, 10 epochs and more than one hour of training have passed, the loss (loss: 0.3721) is decreasing, and the accuracy on the training set has been steadily increasing, up to 80% (acc: 0.8067)! What is happening?

With a fairly complex network architecture, especially one without regularization (as in this case), the training will be able to fit not only the details of the examples but also their noise. Here, since the labels are mixed up, the training set is essentially random noise, and yet the network is still able to “learn” the details of that noise.
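
For reference, a classifier of roughly this kind, sketched here with made-up sizes and with no regularization at all (no dropout, no weight penalties), could look like this in Keras:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# A sketch of a word-by-word binary sentence classifier with no regularization.
# The vocabulary size and layer widths are illustrative, not the exact model.
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128))  # word indices -> dense vectors
model.add(LSTM(128))                                    # recurrent layer reads the sentence
model.add(Dense(1, activation='sigmoid'))               # single "positive"/"negative" output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

With enough parameters and nothing to constrain them, a model like this can happily memorize even randomly labeled sentences.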

What should you have done, then?

If you had started the same training while sending validation data to the fit (or fit_generator) function, you would have seen at each epoch that the accuracy on the validation set (val_acc, last column) is not increasing at all and, moreover, that the validation loss (val_loss) is increasing (from 0.6935 to 1.4955 at the end of epoch 10):

Not surprisingly, on examples never seen by the training process the classifier is no better than flipping a coin.

If you send validation data you will be able to detect this kind of problem after a small number of epochs, instead of waiting until the end of training to find out that something is wrong.

How to do it?

It is pretty simple. If you are using model.fit (without a data generator), there is an argument to automatically split a validation partition from the training set:

  • validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.
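
For example (a sketch; model, X_train, and Y_train are assumed to exist from the steps above):

```python
# Hold out the last 10% of the training data as a validation set,
# evaluated at the end of every epoch (reported as val_loss / val_acc).
history = model.fit(
    X_train, Y_train,
    batch_size=32,
    epochs=10,
    validation_split=0.1,
)
```

Note that, as the documentation says, the split takes the last samples before any shuffling, so make sure the data is already shuffled (in unison!) beforehand.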

On the other hand, if you are using model.fit_generator you will need to pass the validation data or a validation data generator through the following arguments:

  • validation_data: This can be either
  1. a generator or a Sequence object for the validation data,
  2. a tuple (x_val, y_val), or
  3. a tuple (x_val, y_val, val_sample_weights),
  on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data.

  • validation_steps: Only relevant if validation_data is a generator. Total number of steps (batches of samples) to yield from validation_data generator before stopping at the end of every epoch. It should typically be equal to the number of samples of your validation dataset divided by the batch size. Optional for Sequence: if unspecified, will use the len(validation_data) as a number of steps.
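
A sketch of the generator case, with hypothetical generator names and sample counts:

```python
# train_generator, val_generator and the sample counts are assumed to exist.
history = model.fit_generator(
    train_generator,
    steps_per_epoch=num_train_samples // batch_size,
    epochs=10,
    validation_data=val_generator,                    # could also be a (x_val, y_val) tuple
    validation_steps=num_val_samples // batch_size,   # validation batches drawn per epoch
)
```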

More information:


Writing about machine learning, software development, and Python. Living in Japan, working as a machine learning leader at a Japanese company.