Regularization of Machine Learning models — A mathematical guide: Part 1

Chamuditha Kekulawala
5 min read · Jun 8, 2024


Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. For example, if a linear model has two parameters, θ₀ and θ₁, this gives the learning algorithm two degrees of freedom to adapt the model to the training data: It can tweak both the height (θ₀) and the slope (θ₁) of the line. (Read my article on Linear regression if you are not familiar with it 😊)

Similarly, a simple way to regularize a polynomial model is to reduce the number of polynomial degrees.

For a linear model, regularization is typically achieved by constraining the weights of the model. We will now look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways to constrain the weights.

Ridge Regression

Ridge Regression adds a regularization term equal to α · ½ Σⁿᵢ₌₁ θᵢ² to the cost function. This forces the learning algorithm to not only fit the data, but also keep the model weights as small as possible. Note that the regularization term should only be added to the cost function during training. Once the model is trained, you want to evaluate the model’s performance using the unregularized performance measure.
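To make this concrete, here is a minimal NumPy sketch of such a cost function. It is my own illustration, not code from the article: the name ridge_cost and the assumption that X already includes a leading column of 1s for the bias are mine.

import numpy as np

def ridge_cost(theta, X, y, alpha):
    """MSE plus the Ridge penalty; theta[0] is the bias and is not penalized."""
    predictions = X @ theta                          # X is assumed to include a bias column of 1s
    mse = np.mean((predictions - y) ** 2)
    penalty = alpha * 0.5 * np.sum(theta[1:] ** 2)   # the sum starts at theta_1
    return mse + penalty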

Apart from regularization, another reason why the cost function used for training and the performance measure used for testing might differ is that a good training cost function should have optimization-friendly derivatives, while the performance measure used for testing should be as close as possible to the final objective. A good example of this is a classifier trained using a cost function such as the log loss, but evaluated using precision/recall.

The hyperparameter α controls how much you want to regularize the model. If α = 0, then Ridge Regression is just Linear Regression. If α is very large, then all weights end up very close to zero and the result is a flat line going through the data’s mean.
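As a rough illustration (the synthetic data and the extreme α values below are mine, purely for demonstration), you can observe both extremes with Scikit-Learn:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

np.random.seed(42)
X = 2 * np.random.rand(50, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(50)

lin = LinearRegression().fit(X, y)
ridge_tiny = Ridge(alpha=1e-6).fit(X, y)   # practically no penalty: almost identical to Linear Regression
ridge_huge = Ridge(alpha=1e6).fit(X, y)    # huge penalty: the slope shrinks towards 0 (a nearly flat line)

print(lin.coef_, ridge_tiny.coef_, ridge_huge.coef_)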

Note that the sum starts at i = 1, not 0, which means that the bias term θ₀ is not regularized. To understand this, we need to understand norms.

Norms

The cost function we’re trying to minimize right now is the RMSE. (In practice it is simpler to minimize the MSE, which is why we use the MSE instead of the RMSE; since the square root is a monotonically increasing function, minimizing one also minimizes the other.)

We can also use other performance measures. For example, if there are many outliers, you may consider using the Mean Absolute Error (MAE). RMSE and MAE are both ways to measure the distance between two vectors:

  1. The vector of predictions
  2. The vector of target values

Various distance measures (norms) are possible:

  • RMSE corresponds to the Euclidean norm: It is the notion of distance you are familiar with. It is also called the ℓ₂ norm, noted ∥ · ∥₂.
  • MAE corresponds to the ℓ₁ norm, noted ∥ · ∥₁. It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
  • More generally, the ℓₖ norm of a vector v containing n elements is defined as:

∥ v ∥ₖ = (|v₁|ᵏ + |v₂|ᵏ + ⋯ + |vₙ|ᵏ)¹ᐟᵏ

ℓ₀ just gives the number of non-zero elements in the vector, and ℓ∞ gives the maximum absolute value in the vector. The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
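A quick NumPy check of these norms on an arbitrary example vector (my own illustration):

import numpy as np

v = np.array([3.0, -4.0, 0.0, 1.0])

l1   = np.linalg.norm(v, 1)        # 8.0   -> sum of absolute values (Manhattan)
l2   = np.linalg.norm(v, 2)        # ~5.10 -> Euclidean distance
linf = np.linalg.norm(v, np.inf)   # 4.0   -> largest absolute value
l0   = np.count_nonzero(v)         # 3     -> number of non-zero elements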

Coming back to our Ridge regression equation:

If we define w as the vector of feature weights (θ₁ to θₙ), then the regularization term is simply equal to ½(∥ w ∥₂)², where ∥ w ∥₂ represents the ℓ₂ norm of the weight vector. Now let’s understand why we excluded the bias term:

  • The bias term θ₀​ represents the intercept of the regression line or hyperplane. It shifts the entire prediction function up or down without affecting the slope or the fit of the data.
  • Regularizing θ₀​ would imply penalizing the intercept, which can lead to biased estimates, especially when the data is not centered around the origin. In other words, penalizing θ₀ could move the fitted line or plane away from the true center of the data, which is undesirable.
  • The purpose of regularization is to prevent overfitting by shrinking the coefficients of the input features (i.e., θ₁, θ₂, …, θₙ). This ensures that the model does not become too complex and fit the noise in the data.

The above formulation is for Linear Regression. For Gradient Descent, just add αw to the MSE gradient vector. (Read about Gradient Descent here)
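Here is a minimal sketch of that Gradient Descent step. It assumes X already includes a leading column of 1s for the bias; the function name and the learning rate eta are mine, not from the article:

import numpy as np

def ridge_gd_step(theta, X, y, alpha, eta):
    """One Batch Gradient Descent step for Ridge Regression."""
    m = len(y)
    mse_gradients = (2 / m) * X.T @ (X @ theta - y)   # plain MSE gradient
    reg = alpha * theta
    reg[0] = 0                                        # do not regularize the bias term theta_0
    return theta - eta * (mse_gradients + reg)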

In practical applications, θ₀ is often not regularized because it allows the model to freely adjust the base level of predictions to better fit the data without any penalty. Regularizing only θ₁ through θₙ focuses on controlling the complexity of the model’s response to the input features, which is the main goal of regularization.

It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.

The following figure shows several Ridge models trained on some linear data using different α values:

On the left, plain Ridge models are used, leading to linear predictions. On the right, the data is first expanded using PolynomialFeatures (degree=10), then it is scaled using a StandardScaler, and finally the Ridge models are applied to the resulting features: this is Polynomial Regression with Ridge regularization. Note how increasing α leads to flatter (i.e., less extreme, more reasonable) predictions; this reduces the model’s variance but increases its bias.
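Here is a sketch of that right-hand setup as a Scikit-Learn Pipeline. The synthetic data and the α value are mine, chosen only so the snippet runs on its own:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

np.random.seed(42)
X = 3 * np.random.rand(100, 1)                       # illustrative linear data with noise
y = 1 + 0.5 * X[:, 0] + np.random.randn(100) / 1.5

poly_ridge = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("std_scaler", StandardScaler()),
    ("ridge_reg", Ridge(alpha=1)),
])
poly_ridge.fit(X, y)
poly_ridge.predict([[1.5]])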

As with Linear Regression, we can perform Ridge Regression either by computing a closed-form equation or by performing Gradient Descent. The pros and cons are the same.

The equation below shows the closed-form solution:

θ̂ = (Xᵀ X + α A)⁻¹ Xᵀ y

Here A is the (n + 1) × (n + 1) identity matrix, except with a 0 in the top-left cell, corresponding to the bias term.
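A minimal NumPy sketch of this closed-form solution, again assuming X already includes a leading column of 1s for the bias (the function name is mine):

import numpy as np

def ridge_closed_form(X, y, alpha):
    """Compute θ̂ = (Xᵀ X + α A)⁻¹ Xᵀ y."""
    A = np.identity(X.shape[1])      # (n + 1) × (n + 1) identity matrix
    A[0, 0] = 0                      # zero in the top-left cell: the bias term is not regularized
    return np.linalg.inv(X.T @ X + alpha * A) @ X.T @ y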

Here is how to perform Ridge Regression with Scikit-Learn using the closed-form equation:

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1, solver="cholesky")   # closed-form solution via Cholesky factorization
ridge_reg.fit(X, y)                             # X, y: the training features and targets
ridge_reg.predict([[1.5]])

Here, we use a variant of the closed-form equation using a matrix factorization technique by André-Louis Cholesky. This gives the output:

array([[1.55071465]])

Here is Ridge Regression using Stochastic Gradient Descent:

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(penalty="l2")   # "l2" penalty adds the Ridge regularization term
sgd_reg.fit(X, y.ravel())              # ravel() flattens y to the 1D shape SGDRegressor expects
sgd_reg.predict([[1.5]])

The penalty hyperparameter sets the type of regularization term to use. Specifying “l2” indicates that you want SGD to add a regularization term to the cost function equal to half the square of the ℓ₂ norm of the weight vector: this is simply Ridge Regression. This gives us the output:

array([1.47012588])

In the next part, we’ll talk about Lasso Regression. Thanks for reading! 🎉
