Deep Learning — Overfitting

Dejan Jovanovic
NewCryptoBlock
Mar 19, 2019

Part II

A typical starting point for back-propagation learning systems is to prepare a training dataset and then use the back-propagation algorithm to compute the weights of a multilayer perceptron, loading as many training samples into the network as possible. The goal is that the network, as designed, will generalize well. A network is said to generalize well when the input-output mapping it computes is correct, or close to correct, for test data that was never used in training the network. Of course, we are assuming that our test data is drawn from the same population that was used to generate the training data.

A neural network that is designed to generalize well will produce a correct input-output mapping even when the input is slightly different from the examples used to train the network. There is, however, a potential problem. If your neural network learns too many input-output examples, it may end up memorizing the training data, in which case it is little more than a complex lookup table. It may do so by latching onto a feature that is present in the training data (noise in the data) but is not true of the underlying function we are supposed to model. This kind of problem is called overfitting; when the network is overtrained, it loses the ability to generalize.

On the other end of the spectrum, as shown in the figure above, underfitting is the opposite of overfitting: the model in place is overly simplistic. On one side we have an overly complex model, and on the other an overly simplistic one. The trade-off between these two extremes (too simple vs. too complex) is a key concept in machine learning, and it is where we find our good-fit model.

Preparing the Dataset

The challenge with machine learning in general is that we cannot know how well our model will perform on new data until we actually test it.

In order to evaluate a model, we split our dataset into three buckets: a training dataset, a validation dataset, and a test dataset. As the names suggest, you train your model on the training dataset and evaluate it using the validation dataset. Once the model is ready for deployment, you perform a final test using the test dataset.
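To make the three buckets concrete, here is a minimal sketch using scikit-learn's train_test_split; the 80/20 and 75/25 ratios, and the X/Y arrays from the example below, are illustrative assumptions only, not part of the original article's code:

from sklearn.model_selection import train_test_split

# hold out 20% of the data as the final test set, then split the remainder
# into training and validation sets (the ratios here are illustrative only)
X_trainval, X_test, Y_trainval, Y_test = train_test_split(X, Y, test_size=0.2, random_state=7)
X_train, X_val, Y_train, Y_val = train_test_split(X_trainval, Y_trainval, test_size=0.25, random_state=7)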

Going back to our example from Part I, we can see that the dataset we loaded

# load data
dataset = numpy.loadtxt(datasetFileName, delimiter=",")

# split dataset into input and output variables
X = dataset[:, 0:8]
Y = dataset[:, 8]

was split into a training dataset (67%) and a validation dataset (33%) for evaluation purposes:

# create the model
model = create_model()

history = model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, verbose=0)

# evaluate the model
scores = model.evaluate(X, Y)

In order to split the data into training, validation and test datasets, there are three classic evaluation recipes:

  1. Simple hold-out validation
  2. K-fold validation
  3. Iterated K-fold validation with shuffling (a sketch of this recipe follows the K-fold example below)

Using the Keras library, it is very simple to apply any of these three recipes. The previous code example demonstrates simple hold-out validation. Here is an example of K-fold validation in Keras:

import numpy
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import StratifiedKFold

# dataset file
datasetFileName = "pima-indians-diabetes.csv"

# initialize random number generator
seed = 7
numpy.random.seed(seed)

# load data
dataset = numpy.loadtxt(datasetFileName, delimiter=",")

# split dataset into input and output variables
X = dataset[:, 0:8]
Y = dataset[:, 8]

# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []

# define base model
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
    model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))

    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


for train, test in kfold.split(X, Y):
    # create the model
    model = create_model()

    # train the model
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)

    # evaluate the model
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("\nAccuracy: %.2f%% \n" % (scores[1] * 100))
    cvscores.append(scores[1] * 100)


print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))

Detecting Overfitting

Let’s go back to our original topic, the problem of overfitting. So how do we detect overfitting? As we have said, the fundamental issue in machine learning is the tension between optimization and generalization. If your model performs much better on the training dataset than on the test dataset, then we probably have an overfitting problem.

In our next example we are going to use the IMDB dataset, which ships with the Keras library. In fact, there are several datasets included as part of Keras, such as:

  1. IMDB
  2. Boston Housing Prices
  3. CIFAR-10 / CIFAR-100
  4. MNIST
  5. Fashion-MNIST
  6. Reuters

So, here is our code example:

import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.datasets import imdb
import matplotlib.pyplot as plt

# initialize random number generator
seed = 7
numpy.random.seed(seed)

# encoding integer sequences into a binary matrix
def vectorize_sequences(sequences, dimension=10000):
    # create an all-zero matrix of shape (len(sequences), dimension)
    results = numpy.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # set specific indices of results[i] to 1s
        results[i, sequence] = 1.
    return results

def create_model():
    # create model
    model = Sequential()
    model.add(Dense(16, activation='relu', input_shape=(10000,)))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    # compile model
    model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    return model

# import imdb records
(train_data, train_labels), (test_data, test_labels) = \
    imdb.load_data(num_words=10000)

# Dataset preparation
# Vectorize training data
x_train = vectorize_sequences(train_data)
# Vectorize training labels
y_train = numpy.asarray(train_labels).astype('float32')

# Vectorize test data
x_test = vectorize_sequences(test_data)
# Vectorize test labels
y_test = numpy.asarray(test_labels).astype('float32')

# create validation set
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

# Now as we have prepared our training dataset and validation
# dataset lets create the model and run the training
model = create_model()
history = model.fit(partial_x_train, partial_y_train, epochs=20,
                    batch_size=512, validation_data=(x_val, y_val))

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
acc = history_dict['acc']

epochs = range(1, len(acc) + 1)

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('number of epochs')
plt.legend(['training loss', 'validation loss'], loc='upper left')
plt.show()

We also need to vectorize the imported dataset (I will talk about this in one of the next articles); everything else should already be familiar to the reader. At the end we use the matplotlib plotting library to plot the loss as a function of the number of epochs we run.

Early Stopping Technique

In the Model Loss diagram you can see where our model is underfitting and also where it starts overfitting. From the diagram we can see that, in order to get the model properly trained but not overtrained, it is enough to use only about 3 epochs. Another diagram, the Early Stopping graph, shows the relationship between Training Error and Testing Error.
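Once we know roughly where overfitting begins, a simple follow-up (a sketch, not part of the original code) is to retrain the model from scratch for just that many epochs and then check its performance on the held-out test data prepared earlier:

# retrain from scratch for the small number of epochs found above,
# then evaluate on data the model has never seen
model = create_model()
model.fit(x_train, y_train, epochs=3, batch_size=512, verbose=0)
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print("Test accuracy: %.2f%%" % (test_acc * 100))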

Reduce Overfitting

Knowing how to detect overfitting is a very useful skill, but it does not solve our problem. Let’s now review some of the most common strategies used with deep learning models to prevent overfitting. Collectively, these strategies are known as regularization techniques.

  1. Early Stopping. The simplest technique to avoid overfitting; please check the Early Stopping graph above. Watch the validation (testing) curve and stop the training, i.e. the weight updates, as soon as the testing error starts increasing. (A Keras callback sketch is given at the end of this section.)
  2. Reduce the network complexity. With this technique we reduce the number of learnable parameters in the model, which is typically referred to as the network capacity. At the same time, keep in mind that the model should have enough parameters that it does not underfit.
  3. Weight Regularization. This method puts constraints on the complexity of a network by forcing its weights to take only small values, which has the effect of distributing the weight values more regularly. It is done by adding to the loss function of the network a cost associated with having large weights. The cost comes in two types:

a) L1 regularization — the penalty added is proportional to the absolute value of the weight coefficients

b) L2 regularization — the penalty added is proportional to the square of the value of the weight coefficients

Sample code:

from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers


def create_model():
    # create model
    model = Sequential()
    model.add(Dense(16, activation='relu',
                    kernel_regularizer=regularizers.l2(0.001),
                    input_shape=(10000,)))
    model.add(Dense(16, activation='relu',
                    kernel_regularizer=regularizers.l2(0.001)))
    model.add(Dense(1, activation='sigmoid'))

    # compile model
    model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    return model
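The sample code above applies the L2 penalty. Keras also provides regularizers.l1 and regularizers.l1_l2 for the L1 variant and for combining both penalties; a minimal sketch of how such a layer could be declared (the 0.001 factors are illustrative values only):

from keras.layers import Dense
from keras import regularizers

# L1 penalty on the layer's kernel weights
model.add(Dense(16, activation='relu',
                kernel_regularizer=regularizers.l1(0.001)))

# or both L1 and L2 penalties at once
model.add(Dense(16, activation='relu',
                kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.001)))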

4. Dropout. The idea behind dropout is to randomly zero out some of the neuron outputs in a given layer during training. Dropout was introduced in the paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Each layer that has dropout enabled, which won’t be all layers, has some probability of its neurons being dropped. By preventing the network from relying on all of its neurons, we force it to learn alternative representations. Dropout is never used on the output layer, because the output neurons produce a probability distribution over all of the classes. In Keras we introduce dropout into the network via a Dropout layer, which is applied to the output of the layer right before it. Here is a sample code:


from keras.layers import Dropout


model.add(Dropout(0.5))

Thus, if we apply this method to our example, here is how the code is going to look (I am showing only the model-creation function affected by the change; the rest of the code stays the same):


from keras.layers import Dropout

def create_model():
    # create model
    model = Sequential()
    model.add(Dense(16, activation='relu', input_shape=(10000,)))
    model.add(Dropout(0.75))
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.75))
    model.add(Dense(1, activation='sigmoid'))

    # compile model
    model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    return model

And from the Model Loss diagram we can see that our network now performs better: the loss is much smaller and far less sensitive, so we have more options for when to stop the training. This is just one example; we could apply everything we have learned so far to create a network that is not prone to overfitting.
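Finally, going back to the first technique on the list, early stopping does not have to be done by hand. Keras provides an EarlyStopping callback that watches a monitored metric and halts training once it stops improving. A minimal sketch for our IMDB example could look like the following; the patience of 2 epochs is an illustrative choice, and restore_best_weights requires a reasonably recent Keras version:

from keras.callbacks import EarlyStopping

# stop once the validation loss has not improved for 2 consecutive epochs
# and roll back to the best weights observed during training
early_stopping = EarlyStopping(monitor='val_loss', patience=2,
                               restore_best_weights=True)

model = create_model()
history = model.fit(partial_x_train, partial_y_train, epochs=20,
                    batch_size=512, validation_data=(x_val, y_val),
                    callbacks=[early_stopping])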

Summary

Hope you enjoyed this reading. This series on Deep Learning will continue by exploring different aspects and topics related to Deep Learning. I hope to explore different models in the next few articles.

References

  1. Deep Learning with Python, by Francois Chollet, ISBN 9781617294433
  2. Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms, by Jeff Heaton, ISBN 9781493682225
  3. Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks, by Jeff Heaton, ISBN 9781505714340
  4. Develop Deep Learning Models on Theano and TensorFlow Using Keras, by Jason Brownlee
  5. Deep Learning, by Ian Goodfellow, Yoshua Bengio and Aaron Courville, ISBN 9780262035613
  6. Neural Networks and Learning Machines, by Simon Haykin, ISBN 9780131471399
  7. https://hackernoon.com/memorizing-is-not-learning-6-tricks-to-prevent-overfitting-in-machine-learning-820b091dc42
  8. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov

_________________________________________________________________

NewCryptoBlock consists of a team of engineers with extensive technology and business backgrounds, united by a passion for innovation, professional development and building high-quality software products. Innovative technologies have the capacity to bring to life revolutionary ideas that can change and better the world as we know it.

info@newcryptoblock.io
