The problem of Overfitting in Machine Learning

Daniel Torres Candil
9 min read · Sep 3, 2023


Picture yourself as a culinary artist, crafting a masterpiece dish with the finest ingredients. You are aiming to achieve the perfect balance ⚖️ of flavours and textures that will leave your diners delighted. In the world of Machine Learning, a similar delicate balance comes into play when we talk about bias and variance, two pivotal elements that orchestrate the performance of ML models. This captivating waltz holds the key to unravelling a common headache: the overfitting of models.

Think of overfitting as a chef who has mastered 🤩 a recipe to the point of precision but struggles 😫 when adapting to the spontaneous nuances of a live cooking competition.

Keep reading this article as we plunge ahead. We will embark on an expedition to demystify overfitting, unravel the intricate relationship between bias and variance, and equip ourselves with strategies to master the culinary art of balancing machine learning models.

What is Overfitting?

The goal of any supervised learning algorithm is crystal clear: craft the most accurate mapping function (let’s call it ‘f’) to predict the outcome (Y) based on the input data (X).

Whenever we embark on the journey of model prediction, it is important to understand the nuances of prediction errors. The errors that sneak into the predictions for any machine learning algorithm can be neatly classified into three categories: bias, variance and irreducible error.

Total error of an ML model: Error = Bias² + Variance + Irreducible Error

The irreducible error cannot be reduced regardless of the algorithm used. It is caused by factors outside our control, such as statistical noise and unknown variables. Essentially, even if we build the most perfect model, there will always be some variables outside of X, and independent of X, that have some small effect on Y. The only way to reduce this error is to identify these outside influences and incorporate them as predictors. In practice, though, no matter how good we make our model, our data will always contain a certain amount of noise or irreducible error that cannot be removed.

So, with the irreducible error in its place, our focus shifts to the dynamic duo of bias and variance: errors that we can mold, shape, and influence with our machine learning magic. Gaining a proper understanding of these will help us not only build accurate models but also avoid the twin mistakes of overfitting and underfitting.

Bias Error

In machine learning, bias is the difference between a model’s average prediction and the correct value we are trying to predict. Models with high bias pay little attention to the training data and oversimplify the problem. This leads to high error on both the training and test data, which results in underfitting.

  • Examples of low-bias machine learning algorithms include Decision Trees, K-Nearest Neighbours and Support Vector Machines.
  • Examples of high-bias machine learning algorithms include Linear Regression and Logistic Regression.

Variance Error

Variance is the variability of a model’s prediction for a given data point. It can also be described as a measure of how spread out the model’s predictions are when it is trained on different samples of the data. Models with high variance pay a lot of attention to the training data and don’t generalise to previously unseen data. As a result, such models perform very well on training data but have high error rates on test data (overfitting).

  • Examples of low-variance machine learning algorithms include Linear Regression, and Logistic Regression.
  • Examples of high-variance machine learning algorithms include Decision Trees, K-Nearest Neighbours and Support Vector Machines.

✅ The goal of any supervised machine learning algorithm is to have both low bias and low variance, so it can achieve good prediction performance. Linear algorithms often have high bias but low variance, while nonlinear algorithms often have low bias and high variance. There is no escaping the relationship between these two in machine learning. Therefore, the parametrisation of machine learning algorithms is often a battle to balance out these errors.

If you reduce one (bias), the other (variance) tends to rise, and vice versa. So, tuning a machine learning model is like tuning a musical instrument. You want the strings neither too tight (high bias) nor too loose (high variance). You’re aiming for the perfect harmony in the middle.
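To make the trade-off tangible, here is a minimal sketch (using scikit-learn; the sine-wave dataset and polynomial degrees are illustrative choices, not from the original post) that fits polynomials of increasing degree to noisy data. A degree-1 line underfits (high bias), a degree-15 polynomial overfits (high variance), and a moderate degree lands in the harmonious middle:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of a sine wave
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 30)[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, size=30)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
errors = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Cross-validated mean squared error for each model complexity
    mse = -cross_val_score(model, X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    errors[degree] = mse
```

When run, the cross-validated error is typically lowest for the moderate degree: the simple model is too stiff, the complex one chases the noise.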

Addressing Overfitting

Overfitting occurs when a model goes overboard trying to remember the training data instead of grasping the fundamental patterns it should. This results in the model making accurate predictions for the data it already knows but stumbling when faced with new, unseen data. To truly excel, a model must generalise, meaning it should be able to make accurate predictions for all sorts of data, not just what it was trained on.

Overfitting could happen for various reasons, including:

  • Having too little training data, which doesn’t give the model a comprehensive view of all possible data scenarios.
  • Dealing with noisy data, where the training data contains lots of irrelevant or distracting information.
  • Letting the model train for an extended period on the same dataset, leading it to memorize rather than understand.
  • Using a highly complex model, causing it to pick up on the noise within the training data instead of focusing on the essential patterns.

So the next question is, how can we identify overfitting?

Identifying overfitting in machine learning is crucial to ensure the model’s generalisation ability and to prevent it from memorising the training data without learning the underlying patterns. In this discussion, we’ll explore some commonly employed techniques to spot overfitting.

Holdout Validation

In machine learning, holdout validation involves dividing the dataset into two distinct portions: a training set and a validation set. The model is then trained on the training set and assessed for its performance using the validation set. If the model exhibits notably superior performance on the training set compared to the validation set, it may be a sign of overfitting.
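As an illustrative sketch (assuming scikit-learn; the dataset and model are arbitrary choices, not from the original post):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the data: 70% for training, 30% held out for validation
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

# An unpruned decision tree can memorise the training set
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
# A large gap between train_acc and val_acc is a red flag for overfitting
```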

Cross-Validation

Rather than relying on a single train-test split, a more robust approach is to employ k-fold cross-validation. This method entails partitioning the dataset into “k” subsets, often referred to as “folds”. The model is then trained and evaluated “k” times, with each iteration using a different fold as the validation set and the remaining folds as the training set. If the model’s performance exhibits significant fluctuations across these different folds, it could be a sign of overfitting.
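A quick sketch of k-fold cross-validation with scikit-learn (the dataset and model are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train and evaluate 5 times; each fold serves once as the validation set
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)

# Large fold-to-fold fluctuations (a high standard deviation) can hint at overfitting
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```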

Learning Curve Analysis

A learning curve analysis involves charting the training and validation scores against the number of epochs or iterations during model training. Overfitting becomes apparent in the learning curve when:

  1. A noticeable divergence emerges between the training and validation scores.
  2. The validation error starts to rise at a certain point while the training error continues to decrease.
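A sketch of this diagnostic with scikit-learn. Note one assumption: scikit-learn’s `learning_curve` varies the training-set size rather than the number of epochs, but the reading is the same, a persistent gap between the two curves signals overfitting:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Compute train/validation scores at 5 increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

# The divergence between the two curves at each training size
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
```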

Validation Curve Analysis

In traditional machine learning models, the validation curve is a valuable tool for diagnosing overfitting. It is an alternative to the learning curve, which is most commonly seen in deep learning models. The validation curve is all about assessing how a single hyperparameter affects both the training and validation scores.

Imagine the x-axis representing different values of a specific hyperparameter, while the y-axis depicts the corresponding training and validation scores. This curve allows us to pinpoint when overfitting starts occurring for a given hyperparameter’s range. To achieve this, we identify the most critical hyperparameter for our model and then graphically visualize how different values of this hyperparameter affect performance using the validation curve.

For instance, let’s take a random forest classifier as an example. We’re interested in understanding how the “max_depth” hyperparameter influences the accuracy scores for both training and validation data.

In the resulting plot, you’ll notice that beyond a “max_depth” value of 6, the model starts to overfit the training data. Here’s the telltale sign: the validation accuracy begins to decline at “max_depth=6” while the training accuracy keeps climbing. This inflection point is a clear indicator of overfitting, and it highlights the importance of choosing the right hyperparameter values for your machine learning model.
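An experiment of this kind can be reproduced along these lines (a sketch with scikit-learn on a synthetic dataset; the exact inflection point will depend on your data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=42)

# Score the model for each candidate value of the hyperparameter
depths = [2, 4, 6, 8, 10, 12]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths, cv=5)

# Plot depths against the mean train/validation scores to spot the inflection point
```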

Evaluation Metrics

Evaluation metrics are powerful tools in detecting overfitting because they provide a quantitative and objective means of assessing a model’s performance. By comparing the metric results on the training and validation sets, we can detect overfitting. If the model’s performance on the training data is significantly better than on the validation data, it’s a sign of overfitting.

The choice of model evaluation metrics depends on the specific machine learning algorithm being used. We often employ a combination of these metrics for a more comprehensive analysis.
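As a small illustration (scikit-learn, with an arbitrary dataset and two common classification metrics), comparing each metric on the training and validation sets side by side:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Compare each metric on training vs. validation data
train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
train_f1 = f1_score(y_train, model.predict(X_train))
val_f1 = f1_score(y_val, model.predict(X_val))
# A pronounced train/validation gap across several metrics points to overfitting
```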

How to prevent overfitting?

We’ve already learned about the dangers of overfitting. Now, let’s become protectors and learn how to stop it from happening.

In this section of the post we will explore how to prevent overfitting using simple strategies. These strategies act like shields, ensuring our models remain robust and reliable. So, let’s dive into these protective measures.

More Data, Less Overfitting

Imagine overfitting as trying to learn from a tiny piece of a puzzle. You’re bound to get the picture wrong. More puzzle pieces make the picture clearer.

Gathering more data is one of the most potent remedies for overfitting. With a larger, more diverse dataset, your model has a better chance to discern the real patterns amidst the noise.

Simplicity is Key

A concise, clear story is often more powerful than a convoluted one.

Overfitting often happens when the model is too complex. The main driver of model complexity is the existence of many features in the data (high dimensionality), and models tend to overfit when the dimensionality is high. Reducing the number of features is therefore a good approach, although we should keep as much of the variance of the original dataset as possible; otherwise, we would lose useful information.

The most common dimensionality reduction method is called Principal Component Analysis (PCA), which creates a new set of uncorrelated features for the data in a lower dimensional form. We will reserve a post in the future for this technique exclusively.
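As a quick taste of PCA in scikit-learn (the digits dataset is an illustrative choice), asking it to keep only as many components as needed to retain 95% of the variance:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 64 pixel features per image

# Keep as few components as needed to retain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape[1], "->", X_reduced.shape[1])  # far fewer features remain
```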

Feature Selection

Feature selection can be considered a dimensionality reduction method, as it removes redundant (unnecessary) features from the dataset. When building a model, some features can be identified as having more impact on the final prediction than others. Feature selection, sometimes called pruning, identifies the most important features within the training set and eliminates irrelevant ones, reducing the dimensionality of the data.

Unlike methods such as PCA, which create an entirely new set of transformed features, feature selection does not change the original values: the features that survive keep exactly the values they had in the raw dataset.
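A minimal sketch with scikit-learn’s `SelectKBest` (the dataset and scoring function are illustrative choices), which keeps only the features most strongly associated with the target:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features most strongly associated with the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

# Unlike PCA, the surviving columns are original features, values unchanged
kept = selector.get_support(indices=True)
```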

Early Stopping

Early stopping stands as a potent countermeasure against overfitting, primarily used in iterative learning algorithms such as neural networks and gradient boosting.

It involves monitoring the model’s performance on a validation set by looking at the learning or validation curve during training. If the model’s performance on the validation set starts to decrease, model training is stopped early.

The idea behind early stopping is that as the model continues to learn from the training data, it might start to memorise noise, leading to overfitting. By stopping training early, we can avoid reaching a point where the model overfits the training data and instead keep the model with better generalisation.
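In scikit-learn’s gradient boosting, early stopping is built in; here is a sketch (the dataset and thresholds are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% as a validation set; stop if its score fails to improve
# for 10 consecutive boosting iterations
model = GradientBoostingClassifier(
    n_estimators=500, validation_fraction=0.2,
    n_iter_no_change=10, random_state=0)
model.fit(X, y)

print(model.n_estimators_)  # trees actually built before stopping
```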

Regularisation

Regularisation is a technique used to reduce error by fitting the function appropriately to the given training set and avoiding overfitting. Techniques like L1 (Lasso), L2 (Ridge), or a combination of both (Elastic Net) help control overfitting by adding penalties to complex model components.

As with PCA, we will reserve an exclusive post in the future for the regularisation technique.
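Until then, a small taste with scikit-learn (the synthetic dataset, with more features than samples, is an illustrative setup where plain least squares is prone to overfit):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# More features than samples: a playground where plain least squares overfits
X, y = make_regression(n_samples=50, n_features=100, noise=10, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty zeroes many out entirely

print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_),
      int((lasso.coef_ == 0).sum()))
```

The penalty terms shrink the coefficient vector: Ridge keeps all features but with smaller weights, while Lasso drives many coefficients exactly to zero, acting as a built-in feature selector.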

By incorporating these strategies into your machine learning adventures, you’re equipping yourself with the tools to conquer overfitting and ensure your models are ready for the real-world challenges they may face.

Summary

In our journey through the world of machine learning, we’ve uncovered the elusive concept of overfitting and equipped ourselves with a toolkit to combat it. From understanding the nuances of overfitting to learning strategies like cross-validation, regularization, and early stopping, we’ve fortified our understanding. But our adventure is far from over. So, stay tuned as we continue to unravel the mysteries and master the art of machine learning together. There are captivating discoveries and knowledge awaiting us just around the corner!
