REGULARIZATION IN NEURAL NETS

Airton Kamdem
7 min read · Feb 28, 2022


In this essay, I want to cover the implications of regularization for a model that is notoriously prone to overfitting: the neural net. I'll briefly discuss the key pillars needed to create a basic neural net before turning to regularization. For any model expecting a binary output, also known as binary classification, you should design a single output node, and its activation will always be sigmoid. If it were a regression problem, you'd use a linear activation instead, which essentially does not transform the outputs at all.

For a multiclass output, you'll need 3 or more nodes and a softmax activation, which predicts a probability for each class; you do need one node per class. As you train-test-split your data, classification problems also generally require that you stratify on y, and scaling your data is crucial for creating measurement parity across features.
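The two output activations above can be sketched with NumPy alone (the function names here are just illustrative): sigmoid squashes a single logit into a probability for binary classification, while softmax turns one logit per class into a probability distribution.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1): one output node for binary classification
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # one logit per class; subtracting the max keeps the exponentials numerically stable
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / exps.sum()

print(sigmoid(0.0))                      # 0.5, an uncertain binary prediction
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(np.isclose(probs.sum(), 1.0))      # True: class probabilities sum to 1
```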

Keeping some of these broader points in mind, you can begin to build your topology by instantiating the Sequential class (Sequential()). This is the skeleton, and it will hold 3 total layers, each added with model.add(). First, you'll want to specify a dense layer, here using 70 nodes, and set the appropriate input dimension (input_dim = X_train.shape[1]). This makes up the first layer; keep in mind that hidden-layer activations are typically going to be ReLU, or rectified linear units.

The next layer should also be dense, usually containing fewer nodes (20). At this stage you don't have to specify input dimensions, since they are inferred from the previous layer, but you do have to specify the activation (ReLU). Lastly, the third layer should also be dense and fully connected, with a single node since we are doing binary classification; this is where our activation function is set to sigmoid. As previously stated, a benefit of sigmoid on a binary output layer is that it predicts probabilities between zero and one.

Once the layers are set, you compile the model, specifying 2 additional details: the optimizer (Adam) and the loss function (binary cross-entropy, or BCE). This gives the model something to minimize, since it's a binary classification. The metrics list can focus on accuracy. You can then fit the neural net, passing the validation data as a tuple and setting the number of epochs to 100. Feel free to save the learnings as a history object or variable so you can compare them over time. Batch size can be left unset if the dataset is relatively small (Keras then falls back to its default of 32 samples per batch), and the verbose argument defines how much of a print statement we get.
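Putting the steps above together, a minimal sketch of the full topology might look like this. The layer sizes (70 and 20) come from the walkthrough; the data here is random placeholder data standing in for your own X_train/y_train split, and the epoch count is kept short for illustration.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# placeholder data; swap in your own train/test split
X_train = np.random.rand(100, 13).astype("float32")
y_train = np.random.randint(0, 2, size=(100,))
X_test = np.random.rand(20, 13).astype("float32")
y_test = np.random.randint(0, 2, size=(20,))

model = Sequential([
    Input(shape=(X_train.shape[1],)),  # input dimension from the feature count
    Dense(70, activation="relu"),      # first hidden layer
    Dense(20, activation="relu"),      # second, smaller hidden layer
    Dense(1, activation="sigmoid"),    # single node for binary classification
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",  # what the model minimizes
              metrics=["accuracy"])        # what we watch

# validation data passed as a tuple; history stores per-epoch loss and accuracy
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=5,   # the article uses 100; shortened here
                    verbose=0)
```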

Prior to regularization, your model might resemble the graph below: as we approach 100 epochs, it is clear that the learning and predictions on the training data set vastly diverge from those on the testing data set. This renders our model almost useless for any unseen data, because it is so incredibly well adjusted to the nuances and noise of the training data set.

When interpreting the model, if you notice that the validation accuracy on unseen data is lower than the training accuracy, this usually signifies overfitting. From here, we'll discuss specialized approaches to regularizing, given how frequent this problem is with neural nets.
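That gap can be read straight off the history object. A sketch with a hypothetical per-epoch metrics dict of the shape history.history takes (the numbers are invented to illustrate the pattern):

```python
# hypothetical per-epoch metrics, shaped like history.history from model.fit
history = {
    "accuracy":     [0.62, 0.75, 0.88, 0.96, 0.99],
    "val_accuracy": [0.61, 0.70, 0.72, 0.71, 0.70],
}

train_acc = history["accuracy"][-1]
val_acc = history["val_accuracy"][-1]
gap = train_acc - val_acc

# a large train/validation gap is the classic overfitting signature
print(f"train={train_acc:.2f} val={val_acc:.2f} gap={gap:.2f}")
if gap > 0.1:
    print("likely overfitting: consider regularization")
```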

1) As with linear regression, you can add lasso or ridge regularization to neural nets. Remember, however, that with lasso (L1), weights shrink and eventually snap to zero as the loss is minimized, so if you'd rather just reduce the influence of a node without eliminating it, ridge (L2) may be the better approach.

To add L2, you'd follow the same model creation process; however, when adding the first dense layer (previously set to 70 nodes), you'll want to specify which kernel gets the regularization. You can do this by setting kernel_regularizer to L2, with L2's input set to the alpha (e.g., .001). The same setting should also be applied to the next dense layers, so in 3 total places.
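A sketch of the same topology with L2 applied in all three places. The regularizers module and the 0.001 strength follow the walkthrough; the feature count is a placeholder.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.regularizers import l2

n_features = 13  # placeholder; use X_train.shape[1] with real data

model = Sequential([
    Input(shape=(n_features,)),
    # kernel_regularizer adds alpha * sum(weights**2) to the loss for this layer
    Dense(70, activation="relu", kernel_regularizer=l2(0.001)),
    Dense(20, activation="relu", kernel_regularizer=l2(0.001)),
    Dense(1, activation="sigmoid", kernel_regularizer=l2(0.001)),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```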

After this simple adjustment, as we replot the curves across both charts, what we notice below is that we do have a bit more loss on the L2-regularized model, but it is significantly more successful at keeping the validation loss close to the training loss. This model is already much more useful than the former, as it generalizes better on unseen data.

2) In a neural net, there are a lot of parameters and nodes constantly being set and adjusted, which is part of what may ultimately contribute to overfitting. What we can actually do is tell the model to completely ignore a few nodes at random as it learns from the data; this approach is known as the dropout method. It turns out that randomly changing which nodes are on and off, through what is essentially a coin-flip-like probability (10% or 50%, for example), introduces redundancy and limits overfitting. Keep in mind that when it is time to run predictions, every single node will be turned back on: dropout applies only during the training stage.
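The mechanics can be sketched in NumPy. During training, a random mask zeroes each node's output with the given rate, and the survivors are scaled up to compensate (so-called inverted dropout, which is what lets every node simply stay on at prediction time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, training):
    # training: zero each unit with probability `rate`, scale survivors by 1/(1-rate)
    # inference: identity, every node stays on and no rescaling is needed
    if not training:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.ones(1000)
dropped = dropout(a, rate=0.5, training=True)
print((dropped == 0).mean())                        # roughly 0.5 of units silenced
print(dropout(a, rate=0.5, training=False).mean())  # 1.0, untouched at inference
```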

Neural networks are prone to overfitting, but they aren't guaranteed to overfit. A best practice, in this case, is usually to start out by building the strongest neural net possible and observe the extent to which it overfits. If the impact is unacceptably noticeable, this is when you might begin introducing this type of tactic, because it does bring in a healthy amount of bias and forces your model to overlook certain complexities in your data set. This can be beneficial, but also counterproductive if not used appropriately. Additionally, please remember not to set drop probabilities for layers where you do not need to turn off nodes; this includes your input and output layers, for example. Hidden layers are fair game toward this end.

Ultimately, the best way to bring this to life is to follow our hidden dense layers with a Dropout rate set to whichever figure makes the most sense (.2 for 20%, etc.). Any layer that is not an input or output, and otherwise has an activation function, can be issued a dropout rate.
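In Keras that amounts to inserting Dropout layers after the hidden layers only, a sketch of which follows (the 0.2 rate and layer sizes are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input

n_features = 13  # placeholder for X_train.shape[1]

model = Sequential([
    Input(shape=(n_features,)),
    Dense(70, activation="relu"),
    Dropout(0.2),                    # turn off 20% of these nodes during training
    Dense(20, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),  # never drop the output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```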

Once you've run the model, you may still notice that it regularizes better than the baseline but is still woefully inadequate at keeping the testing and training predictions in line. In such cases, you may simply return to your model and increase the dropout rate, and in most cases you will come out with a model that is better adjusted for unseen or real-world data. A data scientist could also use the first and second regularization techniques together: instead of turning off additional nodes or pathways within your neural net, you could just introduce an L2 penalty with a small alpha and adjust this parameter as you see fit.

3) The third regularization option for neural nets is known as early stopping. If you observe the progress of the lines above, you will notice that early on there are points where the testing and training curves are closely aligned, and as we progress through the epochs, the separation begins to widen significantly. What we can do in this case is just stop the model right before the divergence becomes egregious. The one caution with this approach, however, is that your model might have multiple validation error minimums, meaning that there is likely a local as well as a global minimum. As much as possible, you want to avoid stopping early at a local minimum, because you'd forgo the additional learning that the model has to offer. The best way to avoid this is to try different stopping ranges and observe how this impacts accuracy across epochs. The visual below provides a representation of this.

In this approach, you'll want to import EarlyStopping from tensorflow.keras.callbacks.

In practice, everything will once again look the same in terms of the model's topology, with one exception: after the model has been compiled, you instantiate EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=1, mode='auto'), and further down, as you fit the model, you set callbacks to the early_stop object. Included in this statement are the monitor, which is what you want the model to keep track of; the minimum delta, which is how much change you'd need to see in validation loss for an epoch to count as an improvement; and the patience, which is how many epochs you want the model to wait before completely stopping, just in case your model has run into a local minimum. For example, setting patience to 10 would mean that if your loss keeps moving in the same general direction for 10 consecutive epochs, then you've likely reached a global minimum. A visual of the early stopping procedure is provided below. Paying attention to the light pink line, you can notice just where the model started to diverge and quickly stopped learning; the resulting model is likely the best of the three approaches tested.
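A sketch of the full early-stopping setup, using the monitor, patience, verbose, and mode values from the walkthrough; the data and layer sizes are placeholders as before.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import EarlyStopping

# placeholder data; substitute your own split
X_train = np.random.rand(100, 13).astype("float32")
y_train = np.random.randint(0, 2, size=(100,))
X_test = np.random.rand(20, 13).astype("float32")
y_test = np.random.randint(0, 2, size=(20,))

model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(70, activation="relu"),
    Dense(20, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = EarlyStopping(
    monitor="val_loss",  # the quantity to keep track of
    min_delta=0,         # any rise in val_loss counts as "no improvement"
    patience=5,          # wait 5 epochs past a minimum before halting
    verbose=1,
    mode="auto",
)

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=30,              # training may halt well before this
                    callbacks=[early_stop],
                    verbose=0)
```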
