How to stop Overfitting your ML and Deep Learning models

Andrew Schleiss
Published in Geek Culture
6 min read · Oct 26, 2021

Overfitting is a common problem in machine learning and deep learning. It is the result of a model not generalizing to the data and as such having high variance. In this article we will go through a few common ways to mitigate overfitting.


We will walk through some of the most common solutions to overfitting, using scikit-learn and TensorFlow code examples.

High level solutions to Overfitting

  • Reduce model complexity
  • Regularization (adjust weights)
  • Ensemble models
  • Early stopping
  • More Data
  • Cross validation

Reduce model complexity

Model complexity can be reduced by applying the below to machine learning and deep learning models respectively:

  • Drop Features / Dimensionality reduction (using PCA)
  • Drop neurons (using Dropout layers)

The most controversial way to stop overfitting a model is to reduce its complexity. This can be done in a few ways, but the easiest is to simply drop some features!

Now you're telling me I'm crazy! Isn't more information better?

Generally you would be correct; features that correlate to the target and provide additional information to the feature space will help with the model's accuracy or reduce its error. However, some features may carry information already present in another feature, or they may simply not impact the target. We can therefore remove redundant features or those with low (or no) correlation to the target.

This can be done manually through trial and error by dropping features that carry the same information or by analyzing their correlations; however, this requires in-depth knowledge of the data.

An “automated” way to reduce dimensions/features is to use the unsupervised machine learning algorithm Principal Component Analysis (PCA).
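
A minimal sketch of what this looks like with scikit-learn, using the built-in wine dataset purely as a stand-in for your own feature matrix:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example dataset with 13 features
X, y = load_wine(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# Keep only the 6 principal components that capture the most variance
pca = PCA(n_components=6)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # (178, 6)
print(pca.explained_variance_ratio_.sum())  # share of variance retained
```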

The number of features are reduced to 6 (specified by n_components in PCA)

For deep learning we have an additional option: Dropout layers. As the name suggests, a Dropout layer drops a random percentage of neurons from being used at each training pass.
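
A minimal sketch with the Keras Sequential API (the layer sizes and 20-feature input are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(128, activation="relu"),  # layer 1
    layers.Dropout(0.5),                   # drop 50% of layer 1's neurons during training
    layers.Dense(64, activation="relu"),   # layer 2
    layers.Dropout(0.5),                   # drop 50% of layer 2's neurons as well
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```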

At layer 1 and layer 2 we drop 50% (0.5) of the neurons being used

Regularization

Regularization is the practice of manipulating the coefficients or weightings of the inputs.


There are many ways to do this, and some models include regularization by default, such as the regression models Lasso (L1 regularization) and Ridge (L2 regularization).

  • L1 regularization — tries to set some of the weights to zero, which eliminates those features from the prediction process.
  • L2 regularization — tries to restrict the weights/coefficients of the features towards zero (but not exactly zero).
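
A minimal sketch with scikit-learn (the synthetic dataset and the alpha values are illustrative; larger alpha means stronger restriction):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

# L1: larger alpha pushes more coefficients to exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)

# L2: larger alpha shrinks coefficients towards (but not to) zero
ridge = Ridge(alpha=1.0).fit(X, y)

print(lasso.coef_)
print(ridge.coef_)
```
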
Alpha determines the amount of restriction applied to the feature coefficients

An additional deep learning option is the class_weight argument, which is set when fitting the model.
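
A minimal sketch, reusing the Keras model from earlier and assuming X_train/y_train and X_val/y_val splits already exist; the 1:3 weighting is purely illustrative:

```python
# Give the minority class (label 1) three times the weight of class 0,
# so errors on it contribute more to the loss
class_weight = {0: 1.0, 1: 3.0}

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,
    class_weight=class_weight,
)
```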

Example of setting class weights for binary classification

Ensemble


An interesting way to overcome overfitting is to use ensemble models, which combine "weak learner" models to create a "super" model. This can be done in three ways:

  • Bagging — homogeneous models are run in parallel
  • Boosting — homogeneous models are run in series
  • Stacking — heterogeneous models combined

Bagging is the process of predicting the target variable with multiple similar models in parallel and averaging the individual predictions to form a final prediction. An example of this is the Random Forest model, whereby multiple “weaker” decision trees are run in parallel and the resulting outputs are averaged to form the prediction.

Boosting is similar to bagging, however the models are run linearly (in series), whereby the next model in the series "learns" from the previous models. Popular models such as XGBoost and AdaBoost use this process.

Stacking uses different model types to predict the outcome. This can be seen practically via VotingRegressor or VotingClassifier in scikit-learn.
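
A minimal sketch with scikit-learn's VotingClassifier (the three base models and the built-in breast cancer dataset are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Three different ("heterogeneous") models combined behind a single vote
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("svc", SVC()),
    ],
    voting="hard",  # majority vote on the predicted class labels
)

voting.fit(X, y)
print(voting.score(X, y))
```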

Multiple models are fitted to the data with the “hard” voting

An important hyperparameter of the VotingClassifier is the voting type: "hard" voting takes a majority vote of the predicted class labels, while "soft" voting averages the predicted probabilities and picks the class with the highest average.

Unfortunately we don’t have a built-in package in TensorFlow that does the same; however, we can replicate this manually by running multiple deep learning models and averaging their results. An in-depth example can be found here.
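
A minimal sketch of the idea (the build_model helper, the ensemble size of three and the X_train/y_train/X_test arrays are all illustrative assumptions):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_model():
    # Small binary classifier; any architecture would do here
    model = tf.keras.Sequential([
        layers.Input(shape=(20,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Train several models independently (different random initializations)
models = [build_model() for _ in range(3)]
for m in models:
    m.fit(X_train, y_train, epochs=10, verbose=0)

# Average the individual predictions to form the ensemble prediction
preds = np.mean([m.predict(X_test) for m in models], axis=0)
```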

Early stopping

Early stopping is usually seen in deep learning models, where the epochs (iterations over the training data) are halted when performance on the testing/validation data starts to degrade.


We can see an example using the EarlyStopping callback in TensorFlow, where we set the number of epochs very high:
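
A minimal sketch, again assuming a compiled Keras model and train/validation splits; the patience and monitor settings are illustrative:

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=5,                  # stop after 5 epochs with no improvement
    restore_best_weights=True,   # roll back to the best epoch's weights
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=1000,                 # deliberately far too many epochs
    callbacks=[early_stop],
)
```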

Large number of epochs and callback parameter set to EarlyStopping
Left graph without early stopping. Right graph includes early stopping

We can do something similar in machine learning by setting parameters in certain scikit-learn models, such as early_stopping in SGDClassifier or n_iter_no_change in GradientBoostingClassifier.

More Data

This is always a good option; however, we need to be careful not to include features whose information is already present in others (see Reduce Model Complexity above).


One way to do this is to obtain more samples of data; however, this isn't always possible.

Another way is through feature engineering, where we derive additional features from existing ones. This is shown best in our Kaggle Titanic submission, where we extract additional features such as the length of the passenger's name, the passenger's title (Mr, Miss, etc.), grouping/binning of features, etc.
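
A minimal sketch of this kind of feature engineering with pandas (the two example rows mimic the Titanic Name format; the exact features are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22, 26],
})

# Length of the passenger's name as a new numeric feature
df["NameLength"] = df["Name"].str.len()

# Extract the title (Mr, Miss, Mrs, ...) from the name
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Bin a continuous feature into groups
df["AgeBand"] = pd.cut(df["Age"], bins=[0, 18, 40, 80],
                       labels=["child", "adult", "senior"])
```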

Cross Validation

Cross validation is a resampling process where the dataset is split into k groups, with certain subsets of each group used for training the model and others used for validating or testing it.

K-fold cross validation where k=3

The model is then evaluated on how well it did on the test data, also called the “unknown data” as this data wasn’t used in fitting/training the model.
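
A minimal sketch with scikit-learn's cross_val_score, matching the k=3 split above (the model and dataset choices are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 3-fold cross validation: each fold takes a turn as the held-out test set
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=3)
print(scores)         # one accuracy score per fold
print(scores.mean())  # average performance across folds
```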

There are numerous cross validation methods, such as K-fold, Leave-One-Out, etc. However, the simplest and most well-known method is hold-out, where a portion of the data is set aside for later testing. This is most commonly done with the scikit-learn function train_test_split:
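
A minimal sketch, reusing X and y from the previous example and holding out 20% of the data for testing (the split size and random_state are illustrative):

```python
from sklearn.model_selection import train_test_split

# Hold back 20% of the data for testing; fit the model on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```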

Example of using hold-out cross validation with train_test_split

Conclusion

These were some quick and dirty examples of how to mitigate overfitting, without delving too deeply into each step. I recommend that if you use any of the above steps, you investigate the process fully before implementing it, as some of these steps can hamper model predictions.

For an example of mitigating overfitting using Early stopping and Dropout layers have a look at our Kaggle notebook: Combat Overfitting with Early Stopping & Dropout
