Regularization of Machine Learning models — A mathematical guide: Part 2

Chamuditha Kekulawala
Jun 9, 2024


In part 1 we talked about using Ridge regression for regularization. Now let’s talk about Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso Regression).

Lasso Regression

Just like Ridge Regression, it adds a regularization term to the cost function, but it uses the ℓ1 norm of the weight vector instead of half the square of the ℓ2 norm. (We talked about norms in part 1):
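In the usual notation (assumed here: MSE(θ) is the unregularized cost, α is the regularization strength, and θ₁ … θₙ are the feature weights, with the bias term θ₀ left out of the penalty), the Lasso cost function can be written as:

```latex
J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \sum_{i=1}^{n} \lvert \theta_i \rvert
```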

The following figure shows several Lasso models trained on some linear data using different α values:

Compare this to the Ridge regression models we got for the same set of data:

An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero). For example, the dashed line in the right plot on Lasso regression (with α = 10⁻⁷) looks quadratic, almost linear: all the weights for the high-degree polynomial features are equal to zero. In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few non-zero feature weights).
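To see this sparsity on actual numbers, here is a minimal sketch on synthetic data. The dataset, the seed and the α values are assumptions made up for this illustration, not the data behind the figures. Only the first two of five features influence the target, so a sparse model should keep those two and drop the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumed for illustration): 5 features, only the first two matter
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Count the weights that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso weights:", lasso.coef_)
print("Ridge weights:", ridge.coef_)
print("zeroed by Lasso:", n_zero_lasso, "| zeroed by Ridge:", n_zero_ridge)
```

Lasso should set the three irrelevant weights to exactly zero, while Ridge only shrinks them to small non-zero values.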

You can get a sense of why this is the case by the comparison of Lasso vs Ridge regression in the following figure:

On the top-left plot, the background contours (ellipses) represent an unregularized MSE cost function (α = 0), and the white circles show the Batch Gradient Descent (BGD) path with that cost function. The foreground contours (diamonds) represent the ℓ1 penalty, and the triangles show the BGD path for this penalty only (α → ∞). Notice how the path first reaches θ₁ = 0, then rolls down a gutter until it reaches θ₂ = 0.

On the top-right plot, the contours represent the same cost function plus an ℓ1 penalty with α = 0.5. The global minimum is on the θ₂ = 0 axis. BGD first reaches θ₂ = 0, then rolls down the gutter until it reaches the global minimum. The two bottom plots show the same thing but use an ℓ2 penalty instead. The regularized minimum is closer to θ = 0 than the unregularized minimum, but the weights do not get fully eliminated.

On the Lasso cost function, the BGD path tends to bounce across the gutter toward the end. This is because the slope changes abruptly at θ₂ = 0. You need to gradually reduce the learning rate in order to actually converge to the global minimum.
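The bouncing can be reproduced with a few lines of NumPy. This is only a sketch under stated assumptions: the toy data, the start point, α = 0.5 and the learning schedule `eta0 / (1 + decay * epoch)` are all made up for the illustration, and the slope of the ℓ1 term is taken to be sign(θ), with sign(0) = 0. In the code, `theta[1]` plays the role of θ₂.

```python
import numpy as np

# Toy data (assumed for illustration): y depends on the first feature only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)
alpha = 0.5

def lasso_grad(theta):
    """MSE gradient plus alpha * sign(theta) for the l1 term (sign(0) = 0)."""
    mse_grad = 2.0 / len(X) * X.T @ (X @ theta - y)
    return mse_grad + alpha * np.sign(theta)

def run_bgd(eta0, decay, n_epochs=2000):
    """Batch Gradient Descent; returns the whole path of theta values."""
    theta = np.array([0.5, 2.0])  # hypothetical start point
    path = []
    for epoch in range(n_epochs):
        eta = eta0 / (1.0 + decay * epoch)  # decay=0 keeps the rate constant
        theta = theta - eta * lasso_grad(theta)
        path.append(theta.copy())
    return np.array(path)

path_const = run_bgd(eta0=0.1, decay=0.0)
path_decay = run_bgd(eta0=0.1, decay=0.05)

# Largest distance of theta[1] from 0 over the last 50 epochs
bounce_const = np.abs(path_const[-50:, 1]).max()
bounce_decay = np.abs(path_decay[-50:, 1]).max()
print("bounce with constant rate:", bounce_const)
print("bounce with decaying rate:", bounce_decay)
```

With a constant learning rate the second weight keeps jumping from one side of zero to the other, while the decaying schedule makes the jumps shrink toward zero.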

The Lasso cost function is not differentiable at θᵢ = 0 (for i = 1, 2, ⋯, n), but Gradient Descent still works fine if you use a subgradient vector g instead when any θᵢ = 0.

You can think of a subgradient vector at a nondifferentiable point as an intermediate vector between the gradient vectors around that point.

The following equation shows a subgradient vector equation you can use for Gradient Descent with the Lasso cost function:
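For a cost function of the form MSE(θ) + α Σ|θᵢ|, which is how Lasso is described above, a commonly used subgradient vector is:

```latex
g(\boldsymbol{\theta}, J) = \nabla_{\boldsymbol{\theta}}\, \mathrm{MSE}(\boldsymbol{\theta})
  + \alpha
  \begin{pmatrix}
    \operatorname{sign}(\theta_1) \\
    \operatorname{sign}(\theta_2) \\
    \vdots \\
    \operatorname{sign}(\theta_n)
  \end{pmatrix}
\quad \text{where} \quad
\operatorname{sign}(\theta_i) =
  \begin{cases}
    -1 & \text{if } \theta_i < 0 \\
    0 & \text{if } \theta_i = 0 \\
    +1 & \text{if } \theta_i > 0
  \end{cases}
```

The value 0 chosen at θᵢ = 0 lies between the slopes −1 and +1 on either side of that point, which is the "intermediate vector" idea mentioned earlier.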

Here is a small Scikit-Learn example using the Lasso class:

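from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)  # X, y: the training data used for the Ridge models
lasso_reg.predict([[1.5]])  # prediction for a new instance; x = 1.5 is assumed here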

Note that you could instead use an SGDRegressor(penalty="l1").

Here is the output we get from the above code:


(The output we got from Ridge Regression was: array([1.47012588]))

Elastic Net

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio r.

  • When r = 0, Elastic Net is equivalent to Ridge Regression
  • When r = 1, it is equivalent to Lasso Regression
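Assuming the same notation as for the Ridge and Lasso terms (α for the overall strength, r for the mix ratio, θ₁ … θₙ for the feature weights), the Elastic Net cost function is usually written as:

```latex
J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta})
  + r \alpha \sum_{i=1}^{n} \lvert \theta_i \rvert
  + \frac{1 - r}{2} \alpha \sum_{i=1}^{n} \theta_i^{2}
```

Setting r = 0 leaves only the half-squared ℓ2 term, and r = 1 leaves only the ℓ1 term, which matches the two cases listed above.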

Here is a short example using Scikit-Learn’s ElasticNet:

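from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)  # X, y: the same training data as before
elastic_net.predict([[1.5]])  # prediction for a new instance; x = 1.5 is assumed here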

Here, l1_ratio corresponds to the mix ratio r. This gives the output:


When should you use Regularization?

It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net since they tend to reduce the useless feature weights down to zero as we discussed above.

In general, Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
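The correlated-features case can be illustrated with an extreme toy example: two identical copies of one useful feature. Everything here (data, seed, α, and the tighter `tol` and `max_iter` for Elastic Net) is an assumption for the sketch. It relies on Scikit-Learn's default cyclic coordinate descent, so which copy Lasso keeps is an artifact of the solver's update order rather than a property of the data.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Toy data (assumed for illustration): the same feature appears twice
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([x, x])
y = 3.0 * x + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
# Tighter tolerance so the two Elastic Net weights settle to the same value
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-8, max_iter=10_000).fit(X, y)

print("Lasso weights:      ", lasso.coef_)
print("Elastic Net weights:", enet.coef_)
```

Lasso tends to put all of the weight on one copy and zero out the other, while the ℓ2 part of Elastic Net spreads the weight evenly across both.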

Early stopping

A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping. The following figure shows a high-degree Polynomial Regression model being trained using Batch Gradient Descent:

As the epochs go by, the prediction error (RMSE) on the training set naturally goes down, and so does its prediction error on the validation set. However, after a while the validation error stops decreasing and actually starts to go back up. This indicates that the model has started to overfit the training data. With early stopping you just stop training as soon as the validation error reaches the minimum. It is such a simple and efficient regularization technique that it’s called a “beautiful free lunch.” 😅

With Stochastic and Mini-batch Gradient Descent, the curves are not so smooth, and it may be hard to know whether you have reached the minimum or not. One solution is to stop only after the validation error has been above the minimum for some time (when you are confident that the model will not do any better), then roll back the model parameters to the point where the validation error was at a minimum.

Here is a basic implementation of early stopping:

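from copy import deepcopy

from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# prepare the data (X_train, y_train, X_val, y_val are assumed to exist already)
poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler()),  # scaling step implied by the variable names
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

# max_iter=1 runs one epoch per fit() call; tol=None disables the stopping check
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)
minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        # deepcopy keeps the trained weights; clone() would copy only the hyperparameters
        best_model = deepcopy(sgd_reg)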

Note that with warm_start=True, when the fit() method is called, it just continues training where it left off instead of restarting from scratch.
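For the noisier Stochastic and Mini-batch curves, the "wait a while, then roll back" idea described earlier can be sketched as a small helper. The names `fit_with_patience`, `train_one_epoch`, `validation_error` and `patience` are hypothetical, chosen for this illustration, and the toy "model" below is just a counter with a made-up wiggly U-shaped validation curve, not a real learner.

```python
import copy

def fit_with_patience(model, train_one_epoch, validation_error,
                      max_epochs=1000, patience=20):
    """Train until the validation error has not improved for `patience`
    epochs, then return a snapshot of the best model seen so far."""
    best_error = float("inf")
    best_epoch = None
    best_model = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        error = validation_error(model)
        if error < best_error:
            best_error, best_epoch = error, epoch
            best_model = copy.deepcopy(model)  # snapshot = roll-back point
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # confident enough that it will not get better
    return best_model, best_epoch, best_error

# Toy stand-in for a real model (hypothetical): "training" advances a counter
model = {"t": 0}

def train_one_epoch(m):
    m["t"] += 1

def validation_error(m):
    # Wiggly U-shaped curve with its minimum at t = 30 (made up for the demo)
    t = m["t"]
    return 1.0 + (t - 30) ** 2 / 100 + 0.5 * ((t % 3) - 1)

best_model, best_epoch, best_error = fit_with_patience(
    model, train_one_epoch, validation_error, patience=20)
print("trained for", model["t"], "epochs; best epoch index:", best_epoch)
```

Training keeps going past the minimum until the patience runs out, and the returned snapshot is the model as it was at the best epoch. With a real estimator you would replace the two callbacks with one training pass and one validation-set evaluation.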

