Tuning a neural net on a noisy dataset

Effects of L2 regularization, neuron dropout and early stopping on model performance

Harsha Goonewardana

Jul 19, 2018

Background

In my previous post, I discussed model selection for best prediction results for Airbnb prices in New York City.

Baseline value is the mean of the price variable: $178.60.

XGBoost, a CART model augmented with gradient boosting, provided the best prediction, with an RMSE of $71.24.
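For readers who want to see how such scores are computed, here is a minimal sketch of RMSE and of a mean-prediction baseline. The prices are made up for illustration and the helper name `rmse` is my own; this is not the code used in the previous post.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (dollars here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical nightly prices, not the Airbnb data
y_train = np.array([100.0, 150.0, 200.0, 250.0])
y_test = np.array([120.0, 180.0, 240.0])

# The baseline model predicts the training mean for every listing
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = rmse(y_test, baseline_pred)
print(f"Baseline RMSE: ${baseline_rmse:.2f}")
```

Any model worth keeping should beat the RMSE of this constant prediction.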

Dataset

The dataset contains 47,542 unique listing locations from 20th January 2009 to 15th May 2018. There are 95 separate features of various types, of which 35 were selected for further examination. The data was scaled before modeling.
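The post does not say which scaler was used. As a sketch, assuming plain standardization (zero mean, unit variance), the important detail is that the statistics come from the training split only and are then reused on the test split. The toy arrays and helper names below are hypothetical; `X_train_s` and `X_test_s` mirror the names used in the model code.

```python
import numpy as np

def fit_scaler(X):
    """Learn per-column mean and standard deviation from the training data."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return mean, std

def apply_scaler(X, mean, std):
    """Standardize X with statistics learned elsewhere."""
    return (X - mean) / std

# Toy data standing in for the real feature matrix
X_train = np.array([[1.0, 10.0, 1.0], [2.0, 20.0, 1.0], [3.0, 30.0, 1.0]])
X_test = np.array([[2.0, 40.0, 1.0]])

mean, std = fit_scaler(X_train)
X_train_s = apply_scaler(X_train, mean, std)
X_test_s = apply_scaler(X_test, mean, std)  # reuses the training statistics
```

Fitting the scaler on the full dataset instead would leak information about the test set into training.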

After creating dummies from the neighbourhoods and unique amenities columns, the final shape of the dataset was:

(37260, 373)
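The dummy-creation step might look roughly like the sketch below. The column names, the comma-separated amenities format and the toy rows are all assumptions for illustration; the real preprocessing is not shown in this post.

```python
import pandas as pd

# Hypothetical listings; column names and values are illustrative only
df = pd.DataFrame({
    "neighbourhood": ["Harlem", "Chelsea", "Harlem"],
    "amenities": ["Wifi,Kitchen", "Wifi", "Kitchen,Heating"],
    "bedrooms": [1, 2, 1],
})

# One column per neighbourhood
hood_dummies = pd.get_dummies(df["neighbourhood"], prefix="hood")

# One column per unique amenity found in the delimited strings
amenity_dummies = df["amenities"].str.get_dummies(sep=",").add_prefix("amenity_")

features = pd.concat(
    [df.drop(columns=["neighbourhood", "amenities"]), hood_dummies, amenity_dummies],
    axis=1,
)
print(features.shape)
```

Every distinct neighbourhood and amenity becomes its own column, which is how a few dozen selected features can grow into several hundred.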

Models

1. Untuned seven-layer neural network

from keras.models import Sequential
from keras.layers import Dense

# Set the input and output shape
input_dim = X_train.shape[1]
output_dim = 1
# Create the model
model = Sequential()
# Specify input layer
model.add(Dense(373, input_dim=input_dim, activation='relu'))
# Add hidden layers
model.add(Dense(256, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(20, activation='relu'))
# Specify output layer
model.add(Dense(output_dim))
# Compile
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
# Fit
history = model.fit(X_train_s, y_train, validation_data=(X_test_s, y_test),
                    epochs=50, verbose=2)

The final RMSE was $84.5, which is better than the baseline but underperformed the XGBoost algorithm. The lowest RMSE score was $77.68 at epoch 4–5.
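The final and lowest RMSE values can be read off the per-epoch validation MSE that Keras records. The sketch below uses a made-up list in place of `history.history['val_loss']`, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical per-epoch validation MSE, standing in for history.history['val_loss']
history_dict = {"val_loss": [9000.0, 7200.0, 6400.0, 6100.0, 6250.0, 6900.0]}

# With loss='mse', the square root of the loss is the RMSE in dollars
val_rmse = np.sqrt(np.array(history_dict["val_loss"]))

best_epoch = int(np.argmin(val_rmse)) + 1  # Keras logs count epochs from 1
lowest_rmse = float(val_rmse[best_epoch - 1])
final_rmse = float(val_rmse[-1])

print(f"Final RMSE: ${final_rmse:.2f}")
print(f"Lowest RMSE: ${lowest_rmse:.2f} at epoch {best_epoch}")
```

One caveat: once weight regularization is added, the reported loss should also include the penalty term, so the MSE metric is the safer quantity to take the root of. Its key in `history.history` depends on the Keras version (for example `val_mse` or `val_mean_squared_error`).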

Let’s visualize the loss curve:

import matplotlib.pyplot as plt

# Per-epoch training and validation (test) loss recorded by model.fit
train_loss = history.history['loss']
test_loss = history.history['val_loss']
plt.plot(train_loss, label='Training loss')
plt.plot(test_loss, label='Test loss')
plt.title("Untuned Neural Network Loss Function")
plt.legend();

This visualization shows a large divergence between the training and test loss, which indicates heavy overfitting.

2. Same network with L2 regularization

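There are two forms of weight regularization commonly used in neural nets: L1 and L2. Both add a penalty on the size of the weights to the loss function, scaled by a tuning parameter, which shrinks the weights and limits the network's ability to fit noise in the training data.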

In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to w. When a particular weight has a large magnitude, |w|, L1 regularization shrinks the weight much less than L2 regularization. By contrast, when |w| is small, L1 regularization shrinks the weight much more than L2 regularization. The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero. [1]
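The difference is easy to see numerically. The sketch below computes only the shrinkage that the penalty term contributes to one gradient step, ignoring the data-dependent part of the gradient. The learning rate, penalty strength and weights are arbitrary illustrative values.

```python
import numpy as np

def l1_step(w, lr, lam):
    """Shrinkage from an L1 penalty lam * |w|: a constant-size step toward 0."""
    return lr * lam * np.sign(w)

def l2_step(w, lr, lam):
    """Shrinkage from an L2 penalty (lam / 2) * w**2: a step proportional to w."""
    return lr * lam * w

w = np.array([5.0, 0.5, 0.05])  # a large, a medium and a small weight
lr, lam = 0.1, 0.1              # illustrative values only

l1 = l1_step(w, lr, lam)
l2 = l2_step(w, lr, lam)
for weight, s1, s2 in zip(w, l1, l2):
    print(f"w={weight:<5} L1 shrink={s1:.4f}  L2 shrink={s2:.4f}")
```

For the large weight the L2 step is the bigger one, and for the small weight the L1 step is, which matches the description above.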

from keras import regularizers

# Create the model
model = Sequential()
# Specify input layer
model.add(Dense(373, input_dim=input_dim, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
# Add hidden layers
model.add(Dense(256, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(128, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(50, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(20, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
# Specify output layer
model.add(Dense(output_dim))
# Compile
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
# Fit
history = model.fit(X_train_s, y_train, validation_data=(X_test_s, y_test),
                    epochs=50, verbose=2)

The final RMSE was $86.58, which is better than the baseline but underperformed the untuned network. The lowest RMSE score was $76.93 at epoch 6.

Dropout:

Instead of modifying the cost function, this method reaches under the hood and modifies the network architecture itself.

With a dropout rate of 0.5, half of the hidden neurons are randomly removed from the network at each training step. This reduces the dependency on particular neurons to determine the output, making the output less sensitive to any single neuron and therefore less dependent on the training dataset.

This in turn results in a network that is capable of better generalization and is less likely to overfit the training data.[2]
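To make the mechanism concrete, here is a small NumPy sketch of "inverted" dropout, the variant in which surviving activations are scaled up during training so that nothing needs rescaling at prediction time. The function name and toy input are my own; this is a sketch of the idea, not Keras's internal code.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out a random fraction `rate` of activations during training."""
    if not training or rate == 0.0:
        return activations  # dropout is switched off at prediction time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Scale the survivors so the expected activation stays the same
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones(10000)  # toy layer output
out = dropout(a, rate=0.5, rng=rng)

print("fraction dropped:", np.mean(out == 0.0))
print("mean activation:", out.mean())
```

A fresh mask is drawn on every call, so each training step effectively trains a different thinned network.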

from keras.layers import Dropout

model = Sequential()
model.add(Dense(373, input_dim=input_dim, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 1
model.add(Dropout(0.5)) # Ben Shaver suggestion
model.add(Dense(256, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 2
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 3
model.add(Dropout(0.5))
model.add(Dense(50, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 4
model.add(Dropout(0.5))
model.add(Dense(20, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 5
model.add(Dropout(0.5))
model.add(Dense(output_dim))
# Compile
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
# Fit
history = model.fit(X_train_s, y_train, validation_data=(X_test_s, y_test),
                    epochs=50, verbose=2)

The final RMSE was $90.97, which is better than the baseline but underperformed all previous networks. The lowest RMSE score was $82.64 at epoch 11.

Overfitting continues to plague the models, possibly because the dataset is small relative to the high number of hidden layers.

Early Stopping

Early stopping monitors the change in the loss function at each epoch and terminates training at the point where improvement stops. This simplifies the setup by removing the need for the user to choose the number of epochs, leaving that decision to the model's measured performance instead. Validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the problem.[3]
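The stopping rule itself is simple. The sketch below is my own plain-Python approximation of a patience-based rule on a made-up loss curve, not the Keras source: training stops once the validation loss has failed to improve for `patience` epochs in a row.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a patience-based rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0  # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch  # ran out of epochs without triggering

# Hypothetical validation losses: the best value is reached at epoch 3
losses = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

Note that the model at the stopping epoch is not the model from the best epoch. Keras's `EarlyStopping` accepts `restore_best_weights=True` to roll back to the best weights; whether that would have changed the scores reported here is untested.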

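from keras.callbacks import EarlyStopping

# Stop once val_loss has gone 6 epochs without improving
callbacks = [EarlyStopping(monitor='val_loss', patience=6)]
input_dim = X_train.shape[1]
output_dim = 1
model = Sequential()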
model.add(Dense(373, input_dim=input_dim, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 1
model.add(Dropout(0.5)) # Ben Shaver suggestion
model.add(Dense(256, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 2
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 3
model.add(Dropout(0.5))
model.add(Dense(50, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 4
model.add(Dropout(0.5))
model.add(Dense(20, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 5
model.add(Dropout(0.5))
model.add(Dense(output_dim))
# Compile
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
# Fit
history = model.fit(X_train_s, y_train, validation_data=(X_test_s, y_test),
                    epochs=50, verbose=2, callbacks=callbacks)

The early stopping callback halted training at epoch 17, once the validation loss had gone six epochs without improving as the training and test losses diverged. The RMSE was $92.95, the worst score to this point.

Is less really more? 5 is better than 7

We have seen that none of these methodologies improved model performance over XGBoost. This may be due to the relative paucity of observations and the resultant overfitting. I reduced the number of layers to five to see if that would solve this problem.

# Set early stopping parameters
callbacks = [EarlyStopping(monitor='val_loss', patience=6)]
model = Sequential()
model.add(Dense(373, input_dim=input_dim, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 1
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu',
               kernel_regularizer=regularizers.l2(0.01)))
# #Dropout layer 2
# model.add(Dropout(0.5))
# model.add(Dense(50, activation='relu',
#                kernel_regularizer=regularizers.l2(0.01)))
#Dropout layer 3
model.add(Dropout(0.5))
model.add(Dense(output_dim))
# Compile
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
# Fit
history = model.fit(X_train_s, y_train, validation_data=(X_test_s, y_test),
                    epochs=50, verbose=2, callbacks=callbacks)

This was by far the best performance of a neural network. The RMSE was $75.87, which is very close to the XGBoost prediction.
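One way to put rough numbers on the "less is more" result is to count trainable parameters in each architecture and compare them with the number of rows. The sketch below assumes all 373 columns of the final dataset are inputs; the helper `dense_params` is my own, and Dropout layers are left out because they add no parameters.

```python
def dense_params(layer_sizes):
    """Count weights and biases in a stack of fully connected layers.

    layer_sizes lists the input width followed by the units in each Dense layer.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_features = 373  # assumed from the dataset shape (37260, 373)
rows = 37260      # total rows, before any train/test split

original = dense_params([n_features, 373, 256, 128, 50, 20, 1])
reduced = dense_params([n_features, 373, 128, 1])

print(f"original network: {original:,} parameters")
print(f"reduced network:  {reduced:,} parameters")
print(f"rows available:   {rows:,}")
```

Under these assumptions the original network has 275,633 parameters and the reduced one 187,503, so both far outnumber the available rows. That fits the overfitting seen throughout, although it does not by itself prove the cause.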