7 ways to avoid overfitting

Ilyes Talbi
Published in Analytics Vidhya
5 min read · Nov 21, 2020

Overfitting is a very common problem in machine learning. It occurs when your model starts to fit the training data too closely. In this article, I explain how to avoid it.

Overfitting is the data scientist’s nightmare. Before going through the methods we can use to overcome overfitting, let’s see how to detect it.

How to know if a model is overfitting?

In data science, perfect data does not exist. There is always noise and inaccuracy. A model overfits when it starts to learn this noise. The result is a biased model that does not generalize.

In practice, a model that overfits is often easy to detect. Overfitting occurs when the error on the testing dataset starts increasing. Typically, if the error on the training data is much smaller than the error on the testing dataset, your model has probably learned too much.
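As a quick illustration, here is a minimal sketch with scikit-learn that compares the two errors. The dataset X, y and the choice of model are placeholders, not part of the original article:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# X, y are assumed to be your features and labels (hypothetical data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained decision tree tends to overfit
model = DecisionTreeClassifier(max_depth=None)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
# A large gap (e.g. 1.00 on train vs 0.75 on test) is a strong sign of overfitting.
```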

How do we avoid overfitting?

Fortunately, several techniques exist to avoid overfitting. In this part we will introduce the main ones.

Cross-validation

One of the most effective methods to avoid overfitting is cross-validation.

This method differs from the usual approach. Instead of splitting the data into just two sets, cross-validation divides the training data into several folds. The idea is to train the model on all folds except one at each step. With k folds, we train the model k times, using a different validation fold each time. This technique is called k-fold cross-validation.

Well, I admit I’m selling you a dream with cross-validation 🙂

K-fold is mainly used to evaluate the performance of a model and helps select the right machine learning model. When it comes to avoiding overfitting, this method is mainly useful for detecting it more reliably, especially when basic metrics are not enough.
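Here is a minimal sketch of 5-fold cross-validation with scikit-learn; again, X, y and the choice of model are placeholders:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# X, y are assumed to be your features and labels (hypothetical data)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: the model is trained 5 times,
# each time validated on a different fold
scores = cross_val_score(model, X, y, cv=5)
print("fold scores:", scores)
print("mean accuracy:", scores.mean())
# A large spread between folds, or fold scores well below the training score,
# points to overfitting.
```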

Add training data

Obviously, the best solution would be to increase the size of the training data. Having more samples in the training set helps the model generalize. Conversely, if the model is trained with a small amount of data, it is likely to be biased.

Unfortunately, most of the time all our available data is already used. To cope with this we can use data augmentation techniques.

The idea is simple: we make small changes to our training samples to increase the variety of the dataset.

For example, in computer vision our training data consists of images. We can apply filters to slightly modify the colors, rotate the images, or stretch them. This reduces the risk of overfitting.
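As an illustration, here is a small sketch with torchvision, assuming a PyTorch image pipeline; the transforms and their parameters are only examples, not the article’s setup:

```python
from torchvision import transforms

# Each training image is randomly perturbed on the fly,
# so the model rarely sees exactly the same sample twice.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),      # small random rotations
    transforms.ColorJitter(brightness=0.2,      # slight color changes
                           contrast=0.2,
                           saturation=0.2),
    transforms.ToTensor(),
])
# Pass train_transforms as the `transform` argument of your Dataset,
# e.g. torchvision.datasets.ImageFolder(root, transform=train_transforms).
```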

Remove features

One technique for improving the performance of a machine learning model is to select the features correctly.

The idea is to remove all features that don’t add any information. If two variables are correlated, for example, it is better to remove one of them. If a feature has very low variance, it has little impact on what we are studying but can still distort the results.

In this way, we simplify our data as much as possible, improve the performance of the model, and reduce the risk of overfitting.

One way to do this is to train the model several times, each time removing one of the features and studying the impact on training, as in the sketch below. This technique is only practical on data with a small number of features.
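A minimal sketch of this feature-by-feature study, assuming a hypothetical pandas DataFrame X and labels y:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Baseline score with all features
baseline = cross_val_score(model, X, y, cv=5).mean()

# Retrain while dropping one feature at a time
for feature in X.columns:
    score = cross_val_score(model, X.drop(columns=[feature]), y, cv=5).mean()
    print(f"without {feature}: {score:.3f} (baseline {baseline:.3f})")
    # If the score barely moves (or improves), the feature is a candidate for removal.
```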

On datasets that have too many features, we will need to use dimension reduction methods.
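For example, here is a small sketch using PCA from scikit-learn; the 95% explained-variance threshold is just an illustrative choice, and X is again a placeholder:

```python
from sklearn.decomposition import PCA

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape[1], "features reduced to", X_reduced.shape[1], "components")
```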

Regularization methods

Regularization methods are techniques that reduce the overall complexity of a machine learning model. They reduce variance and thus reduce the risk of overfitting.

Take the example of a logistic regression: before regularization, the model was overfitting; regularization solved the problem.

Regularization methods can considerably reduce the variance of the model at the cost of only a small increase in bias. We will return to the bias/variance dilemma in the last section.

Many regularization techniques exist:

  1. L1 regularization (Lasso)
  2. L2 regularization (Ridge)

This article on Medium shows how to choose a regularization method.
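To make this concrete, here is a small sketch with scikit-learn’s LogisticRegression, which supports both L1 and L2 penalties. The data (X_train, y_train, X_test, y_test) are placeholders, and C controls the regularization strength (smaller means stronger):

```python
from sklearn.linear_model import LogisticRegression

# L2 (Ridge-style) regularization; stronger penalty with smaller C
l2_model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)

# L1 (Lasso-style) regularization requires a solver that supports it
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)

l2_model.fit(X_train, y_train)
print("train:", l2_model.score(X_train, y_train), "test:", l2_model.score(X_test, y_test))
# If the train/test gap shrinks compared to an unregularized model, regularization is helping.
```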

Start by designing simple models

The simpler your model, the lower the risk of overfitting. The majority of applications can be solved with simple models. Think about it!

Early stopping

Early stopping is a very intuitive technique. It simply consists of stopping the training before the model overfits.

This requires finding the optimal training time: long enough to avoid underfitting, but short enough to avoid overfitting.

Early stopping is often associated with the famous bias/variance dilemma in statistics. In machine learning we talk about the underfitting/overfitting dilemma.

This technique is mainly used in deep learning for training neural networks. For other machine learning models such as Random Forest or SVM, regularization techniques are often better suited.
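Here is a minimal sketch with Keras, assuming a compiled model and training arrays are already defined; the patience value is just an example:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training when the validation loss has not improved for 5 epochs,
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    validation_split=0.2,   # hold out part of the training data to monitor overfitting
    epochs=200,
    callbacks=[early_stop],
)
```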

To conclude, avoiding overfitting is an art that a good data scientist must master. These different techniques work very well in most cases. Nevertheless, having a good understanding of the theoretical side of the models you use remains the safest approach, even if that means diving quite deeply into the theory of machine learning…


HEY! I am Ilyes, a freelance computer vision engineer and French blogger. I will help you discover the world of AI :)