On Parameter Selection and Shrinkage

For CS610 Applied Machine Learning

James Koh, PhD
MITB For All
6 min read · May 4, 2024

--

This is a post from Piazza many terms ago, by Prof Dai Bing Tian (whom I am unable to tag here), in response to a student’s question. I’ve resurfaced it and added more details for those who prefer further elaboration.

Context

A key part of machine learning is making predictions. We train a model on seen data, i.e. samples, and want it to be generalizable enough to perform well on other, unseen data. In this post, we will focus on just a simple regression model.

Overfitted models tend to be less generalizable, and hence there are benefits to managing model complexity. There is no single “unquestionable truth” regarding the definition of complexity, and there are different ways to approach this matter.

The question

A student asked (a long time ago) — “Why does reducing the norm of w help to reduce the model complexity? Consider two regression models, m₂: y = 3x₁+5x₂ and m₃: y = 10x₁. The norm, be it L1 or L2, of the latter (m₃: y = 10x₁) is larger than that of the former (m₂: y = 3x₁+5x₂). However, the latter (m₃: y = 10x₁) depends on only one variable, whereas the former (m₂: y = 3x₁+5x₂) depends on two variables. Therefore, which is the more complex model?”

The answer

m₃ is more complex than m₂ on the measure of parameter shrinkage, but less complex on the measure of parameter selection. Both parameter selection and parameter shrinkage are possible choices of model complexity measure.
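To make the two measures concrete, here is a small sketch in Python (my own illustration, not part of the original discussion) that computes both for m₂ and m₃:

```python
import numpy as np

# Coefficient vectors over the same two features (x1, x2):
# m2: y = 3*x1 + 5*x2   and   m3: y = 10*x1 (coefficient on x2 is 0)
w_m2 = np.array([3.0, 5.0])
w_m3 = np.array([10.0, 0.0])

for name, w in [("m2", w_m2), ("m3", w_m3)]:
    print(f"{name}:  L1 norm = {np.abs(w).sum():.2f}   "
          f"L2 norm = {np.linalg.norm(w):.2f}   "
          f"non-zero parameters = {np.count_nonzero(w)}")
```

m₃ has the larger L1 norm (10 vs 8) and L2 norm (10 vs about 5.83), so it is ‘more complex’ under parameter shrinkage, yet it uses only one non-zero parameter versus two, so it is ‘less complex’ under parameter selection.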

It is common to use Lasso or Ridge regression, or a combination such as elastic net, to ‘simplify’ the model. However, there are also other approaches such as Least Angle Regression and Partial Least Squares. If you are interested in the details, you may refer to this Towards Data Science article for additional reading.

Suppose we want to design a regularized regression model with parameter selection, that is, to reduce the number of variables that the model depends on. One might think of doing the following (which is NOT discussed in class) to maximize the number of wⱼ that are set to zero.

Image from Prof Dai’s piazza post

Let’s look at this in separate components.

J(w) is the loss, which is a function of the parameters w. Meanwhile, 𝛿(wⱼ, 0) equals 1 if and only if wⱼ = 0, and equals 0 otherwise. Therefore, the equation above means that we want to find some ideal w which minimizes the loss J while at the same time maximizing the number of wⱼ that equal zero, for j = 1, 2, …, m, where m is the number of features.
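For readers who cannot see the image, the objective described above presumably looks something like the following (this is my reconstruction from the description; the exact weighting in Prof Dai’s post may differ):

```latex
w^{*} = \arg\min_{w} \; \Big( J(w) \;-\; \lambda \sum_{j=1}^{m} \delta(w_j, 0) \Big)
```

Subtracting the count of zero-valued coefficients (scaled by some λ > 0) is equivalent to rewarding the model for setting as many wⱼ as possible to zero.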

The problem with the above equation is that the function is difficult to optimize — it is not even continuous. (Whereas the L2 norm is differentiable, and the L1 norm is at least continuous.)
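To see the discontinuity concretely, here is a minimal Python sketch (my own illustration, not from the original post) comparing how the zero-count penalty, the L1 term and the L2 term behave near wⱼ = 0:

```python
# Compare the three penalties on a single coefficient w_j near zero.
# The zero-count penalty (1 if w_j != 0, i.e. 1 - delta(w_j, 0)) jumps
# abruptly, while |w_j| and w_j**2 change smoothly.
for w in [0.0, 1e-9, 1e-3, 1.0]:
    count_penalty = 0.0 if w == 0 else 1.0
    l1 = abs(w)
    l2 = w ** 2
    print(f"w = {w:>8.1e}   count = {count_penalty}   L1 = {l1:.1e}   L2 = {l2:.1e}")
```

An arbitrarily small change from wⱼ = 0 to wⱼ = 10⁻⁹ flips the count penalty from 0 to 1, which is what makes gradient-based optimization of that objective impractical.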

This is why the above equation is not used in practice; we typically use Lasso or Ridge regularization instead to manage the complexity of the regression model.
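As a quick illustration of the difference in behaviour, here is a small sketch using scikit-learn on synthetic data (the data and hyperparameters are my own, chosen purely for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, but only the first two actually matter
# (coefficients 3 and 5, echoing the m2 example above).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, 5.0] + [0.0] * 8)
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Exact zeros under Lasso:", int(np.sum(lasso.coef_ == 0)))
print("Exact zeros under Ridge:", int(np.sum(ridge.coef_ == 0)))
```

Typically, Lasso drives the coefficients of the irrelevant features exactly to zero (parameter selection), whereas Ridge merely shrinks them towards zero without eliminating them (parameter shrinkage).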

Note

While we can make a comparison between (m₂: y = 3x₁+5x₂) and (m₃: y = 10x₁) for a particular dataset, the coefficients are relative to the scale and nature of y. It would therefore not be fair to compare the complexity of models predicting different outputs on different datasets, particularly when y is not normalized.

About understanding Lasso and Ridge

Viewpoint #1

This is what you have seen in class. There are two components of the loss, which we want to jointly minimize. The first component is J(w), which, as mentioned above, is a function of the parameters w. In the diagram, we simply assume that there are only two features x₁ and x₂ (e.g. as in the example m₂: y = 3x₁+5x₂), such that the model outputs, and hence the loss, depend on w₁ and w₂.

Screenshot (with minor modifications) from lecture notes 2.

The concentric ovals indicate the loss at different values (think of contours indicating the height above sea level of a mountain). The point in the middle, which appears as a ‘dot’, indicates the lowest possible J(w). On the given training dataset, no other combination of w₁ and w₂ can lead to a ‘better’ overall prediction (although ‘better’ depends on our definition of the loss function). However, once we add the L1 or L2 loss, this would no longer be the overall minimum.

This brings us to the diamond (for L1) or circle (for L2) centered about the origin, which represents the regularization loss at some particular constant value. The diamond corresponds to the equation |w₁| + |w₂| = C, while the circle corresponds to w₁² + w₂² = C².
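For completeness, the diagram corresponds to the constrained form of the problem, which, for a suitable pairing of C and λ, is equivalent to the penalized form used in practice (sketched here for Lasso; the Ridge case is analogous with ‖w‖₂²):

```latex
\min_{w} \; J(w) \quad \text{subject to} \quad \lVert w \rVert_1 \le C
\qquad \Longleftrightarrow \qquad
\min_{w} \; J(w) + \lambda \lVert w \rVert_1
```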

Do not be confused into thinking that the point of intersection happens at the outermost boundary of the concentric ovals. That is not true. The loss J(w) could still grow larger; it is simply not shown in the diagram, to keep it clean.

The point of intersection gives the ideal w* to choose for the regression model, such that the overall loss (which includes the ||w||₁ or ||w||₂ component) is at a minimum given our training set.

Notice that for Lasso, the intersection happens along an axis, which means that the resulting model depends on only one of the original two features. Hence, we have actually performed feature selection in the process.

Viewpoint #2

Think of a rubber band around the origin. When we use L1 regularization, we have a diamond-shaped (rhombus) rubber band. When we use L2 regularization, we have a circular rubber band. The raw loss J has a pulling-out effect (since its minimizer could lie anywhere in the space), and this is constrained by the rubber band (i.e. the solution must be somewhere along the rubber band). For L1, the resulting solution tends to end up at a corner of the diamond. On the other hand, the solution on the round rubber band has no tendency to end up along an axis.

Which to choose?

There is no single best one-size-fits-all approach (not just here, but in general for data science).

Sometimes, we want to explain the importance of each and every feature. In this case we may use Ridge (L2), because it performs parameter shrinkage instead of totally removing some features, so you still have a sense of how each feature affects the target variable. At other times, we may want to drop as many features as possible, for example to simplify the visualization process.

Note that things may not be as simple as just dropping anything. Suppose that feature X1 determines feature X2, so the two are correlated. If we run Lasso, it is possible that X1 gets dropped and X2 is kept instead. This would not be ideal, because what really causes the change is X1. However, if we are simply concerned with having the prediction outputs as accurate as possible, this would not be an issue.
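If you would like to see this effect for yourself, here is a hypothetical sketch (synthetic data and hyperparameters chosen purely for illustration) where X2 is almost a copy of X1 and y is driven entirely by X1:

```python
import numpy as np
from sklearn.linear_model import Lasso

# X2 is almost an exact copy of X1, while y is generated from X1 alone.
# With such strongly correlated features, which coefficient Lasso keeps
# (and which it drives to zero) can vary with the noise draw.
for seed in range(5):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=300)
    x2 = x1 + rng.normal(scale=0.05, size=300)   # X2 ≈ X1
    y = 4.0 * x1 + rng.normal(scale=1.0, size=300)

    X = np.column_stack([x1, x2])
    coef = Lasso(alpha=0.2).fit(X, y).coef_
    print(f"seed {seed}:  w1 = {coef[0]:+.2f}   w2 = {coef[1]:+.2f}")
```

Across different seeds you may find that sometimes w₁ is (near) zero and sometimes w₂ is, even though X1 is the true driver; the predictions remain about as accurate either way.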

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
