Overfitting, and avoiding it with regularization

Sean Gahagan
3 min read · Oct 9, 2022


My last note looked at classification models and two important metrics for evaluating the performance of classification: recall and precision.

This note will look at the problem of overfitting machine learning models and a best practice for avoiding it called regularization.

What is overfitting?

Overfitting means that your model fits the training set very well but cannot generalize what it has learned to make accurate predictions on new examples. Overfitting is also known as high variance, and the opposite of overfitting is known as underfitting or high bias (where “bias” is a statistical term, not social bias).

To illustrate this concept, recall our housing price prediction example of supervised learning, where we have a model predicting home sale price based on square footage. If we use a simple linear regression with a single feature, this model may not do a great job of fitting the data (an example of underfitting, or high bias). Alternatively, we could create a polynomial regression model with so many features that our hypothesis function passes through every training example, but this likely wouldn’t be very valuable in predicting other home sale prices (an example of overfitting, or high variance). The best model structure in this example likely lies somewhere in between, as shown in the picture below.
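The trade-off above can be sketched numerically. This is a minimal, hypothetical illustration (the data and prices are made up, not from the original post): fitting polynomials of increasing degree to a handful of noisy house prices, and watching the training error fall toward zero as the model gains capacity to memorize the training set.

```python
import numpy as np

# Hypothetical data: sale price (in $1000s) vs. square footage, with noise.
rng = np.random.default_rng(0)
sqft = np.linspace(800, 3000, 10)
price = 50 + 0.1 * sqft + rng.normal(0, 15, size=sqft.size)

# Scale the feature so high-degree polynomial fits stay numerically stable.
x = (sqft - sqft.mean()) / sqft.std()

train_mse = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, price, degree)   # least-squares polynomial fit
    predictions = np.polyval(coeffs, x)
    train_mse[degree] = np.mean((predictions - price) ** 2)
    print(f"degree {degree}: training MSE = {train_mse[degree]:.2f}")
```

The degree-9 curve threads through all ten training points, so its training error is essentially zero, yet it is the model least likely to predict new home prices well.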

A simple approach to mitigating overfitting

As you can see from the housing price example above, one way to mitigate overfitting is to reduce the number of features used by the model. You could manually select which features to use, or you could test out different combinations of features in different models and pick the one that performs best on your test data.

Unfortunately, this can hinder your ability to build models that use a large number of features.
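That subset-testing approach can be sketched as follows. This is a hedged illustration with made-up data and hypothetical feature names: fit a linear model on every subset of candidate features and keep whichever subset scores best on held-out data. Note how the loop count grows combinatorially with the number of features, which is exactly why this approach doesn't scale.

```python
import itertools
import numpy as np

# Hypothetical housing features; "noise" is deliberately irrelevant.
rng = np.random.default_rng(1)
n = 60
features = {
    "sqft": rng.uniform(800, 3000, n),
    "bedrooms": rng.integers(1, 6, n).astype(float),
    "noise": rng.normal(size=n),
}
price = (50 + 0.1 * features["sqft"] + 20 * features["bedrooms"]
         + rng.normal(0, 15, n))

train, test = slice(0, 40), slice(40, None)
best = None
for k in range(1, len(features) + 1):
    for subset in itertools.combinations(features, k):
        # Design matrix: an intercept column plus the chosen features.
        X = np.column_stack([np.ones(n)] + [features[f] for f in subset])
        theta, *_ = np.linalg.lstsq(X[train], price[train], rcond=None)
        test_mse = np.mean((X[test] @ theta - price[test]) ** 2)
        if best is None or test_mse < best[0]:
            best = (test_mse, subset)

print("best subset:", best[1])
```

With 3 candidate features there are only 7 subsets to try; with 100 features there are over 10^30, which is the scaling problem the paragraph above describes.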

Regularization

Regularization is a way of avoiding overfitting your models while still using many features. Instead of reducing the number of features, regularization shrinks the magnitude of the features’ parameters. You can think of this as smoothing out the hypothesis function.

Mathematically, regularization does this by adding a new term to the cost function: a penalty based on the features’ parameter values, typically the sum of their squares (L2 regularization) or of their absolute values (L1 regularization), scaled by a regularization strength. Then, as the learning algorithm trains the model to reduce the cost function (through gradient descent or another optimization approach), it also seeks to minimize the parameter values of each feature, since they are now part of the cost.
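A minimal sketch of this idea, using L2 regularization in gradient descent for linear regression (the data, the regularization strength `lam`, and the learning rate are all made-up illustration values). The only change from plain gradient descent is the extra penalty term in the gradient:

```python
import numpy as np

# Hypothetical data: 5 features, only the first two actually matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.1, 50)

def fit(X, y, lam, lr=0.1, steps=2000):
    """Gradient descent on the L2-regularized squared-error cost:
    J(theta) = (1/2m) * sum((X @ theta - y)**2) + (lam/2m) * sum(theta**2)
    """
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(steps):
        # The lam * theta / m term is the gradient of the penalty.
        grad = X.T @ (X @ theta - y) / m + lam * theta / m
        theta -= lr * grad
    return theta

plain = fit(X, y, lam=0.0)
ridge = fit(X, y, lam=10.0)
print("unregularized parameter norm:", np.linalg.norm(plain))
print("regularized parameter norm:  ", np.linalg.norm(ridge))
```

The regularized fit ends up with smaller parameter values overall, which is exactly the “smoothing” effect described above; larger values of `lam` shrink the parameters more aggressively.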

Scaling up with regularization

With regularization, machine learning models can continue to use very large numbers of features to inform their predictions while still mitigating overfitting.

Up Next

My next note will begin looking at neural networks.

Past Notes in this Series:

  1. Towards a High-Level Understanding of Machine Learning
  2. Building Intuition around Supervised Machine Learning with Gradient Descent
  3. Helping Supervised Learning Models Learn Better & Faster
  4. The Sigmoid function as a conceptual introduction to activation and hypothesis functions
  5. An Introduction to Classification Models
