What are three approaches for variable selection and when to use which
When you work with real-world data, you will inevitably encounter a data set with multiple variables. Among those, how will you choose which variables to include in your model? Luckily, you have several options to consider.
Subset selection
The first option is subset selection, which uses a subset of predictors to make a prediction. There are three types of subset selections that we will look at: best subset selection, forward stepwise selection, and backward stepwise selection.
Best subset selection
As its name suggests, best subset selection finds the best model for each subset size. In other words, when there are p predictors, it produces the best 1-variable model, the best 2-variable model, and so on up to the best p-variable model. For example, with three variables a, b, and c, best subset selection evaluates a, b, and c for the 1-variable model; ab, ac, and bc for the 2-variable model; and abc for the 3-variable model to identify the best model of each size.
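To make the brute-force nature of this search concrete, here is a minimal sketch of best subset selection, assuming X is a pandas DataFrame of predictors and y the response (both hypothetical names); it scores each subset by its training RSS.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """Return the lowest-RSS set of predictors for each subset size."""
    results = {}
    for k in range(1, X.shape[1] + 1):
        best_rss, best_vars = np.inf, None
        # Try every combination of k predictors.
        for subset in combinations(X.columns, k):
            cols = list(subset)
            model = LinearRegression().fit(X[cols], y)
            rss = ((y - model.predict(X[cols])) ** 2).sum()
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        results[k] = best_vars  # best k-variable model
    return results
```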
Forward & Backward selection
Forward stepwise selection starts with a null model and adds the variable that improves the model the most. So for a 1-variable model, it tries adding a, b, or c to the null model and keeps the one that gives the best result. For instance, let's say a is chosen. Then for the 2-variable model, it tries adding b or c and chooses whichever improves the model the most.
Unlike forward stepwise selection, backward stepwise selection starts with all the variables and removes, one at a time, the variable that contributes the least to the model.
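As a sketch, scikit-learn's SequentialFeatureSelector implements both directions; note that it adds or drops variables based on a cross-validated score rather than the training-set improvement described above, and the synthetic data and n_features_to_select=3 below are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for a real data set.
X, y = make_regression(n_samples=200, n_features=8, noise=10, random_state=0)

lm = LinearRegression()

# Forward: start from the null model and add one variable at a time.
forward = SequentialFeatureSelector(lm, n_features_to_select=3,
                                    direction="forward").fit(X, y)
# Backward: start from the full model and drop one variable at a time.
backward = SequentialFeatureSelector(lm, n_features_to_select=3,
                                     direction="backward").fit(X, y)

print(forward.get_support())   # mask of predictors kept by forward selection
print(backward.get_support())  # mask of predictors kept by backward selection
```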
Choosing the best model
So these methods give us models for each subset size. But among those, how do we choose one? Here, you can use Mallows’ Cp, adjusted R², and Bayesian information criterion (BIC).
Mallows' Cp balances a model's fit against its size to estimate the test error. A small Cp therefore indicates a model that is relatively precise in estimating coefficients and predicting the response, which corresponds to a low test error. BIC similarly tells us how well a model will predict new data relative to other models, assigning low values to models with low test error (while penalizing model size more heavily than Cp). So for both metrics, we want relatively low values.
Adjusted R², on the other hand, tells us how well a model explains the response variable. And unlike Cp and BIC, a high adjusted R² corresponds to a low test error.
However, these are only metrics. The most important step in selecting variables is examining the candidate models with domain knowledge to judge whether they are reasonable.
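As a rough sketch, the formulas below follow the versions given in An Introduction to Statistical Learning [5]; rss and tss are the residual and total sums of squares of a fitted model, n the number of observations, d the number of predictors, and sigma2 an estimate of the error variance (all assumed to be computed beforehand).

```python
import numpy as np

def adjusted_r2(rss, tss, n, d):
    # Penalizes extra predictors; higher is better.
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

def mallows_cp(rss, sigma2, n, d):
    # Estimate of the test MSE; lower is better.
    return (rss + 2 * d * sigma2) / n

def bic(rss, sigma2, n, d):
    # Like Cp but with a heavier penalty on model size; lower is better.
    return (rss + np.log(n) * d * sigma2) / n
```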
Shrinkage
The second option you can consider is shrinkage. Unlike the previous approaches, shrinkage methods keep all the predictors and shrink the coefficient estimates toward zero so that the corresponding predictors have less impact on the response. The two methods that use shrinkage are ridge regression and lasso regression.
Ridge regression
Similar to how least squares regression estimates coefficients by minimizing the RSS, ridge regression estimates coefficients by minimizing

RSS + λ∑βⱼ²

In this equation, λ is called a tuning parameter and λ∑βⱼ² is called the penalty term. When λ is equal to zero, the penalty term has no effect, and the estimates reduce to the least squares estimates. As λ approaches infinity, the ridge coefficient estimates approach zero and the model becomes a null model. As you can see, the coefficient estimates differ greatly depending on the value of λ, so it's critical to choose a good one.
One thing to note is that ridge regression is not scale equivariant: the coefficient estimates depend on the scale of the predictors. For instance, a predictor measured as 1000 cm will have a much greater impact on the model than one measured as $1, simply because 1000 is a much larger number than 1. Therefore, it's crucial to standardize the predictors before performing the regression.
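A minimal ridge sketch with scikit-learn, where the predictors are standardized inside a pipeline before fitting; the synthetic data and the penalty strength (alpha, scikit-learn's name for λ) are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real data set.
X, y = make_regression(n_samples=200, n_features=10, noise=15, random_state=0)

# Standardize first, then fit ridge with a chosen penalty strength.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X, y)
print(ridge.named_steps["ridge"].coef_)  # shrunken, but typically nonzero
```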
Lasso regression
Lasso regression, by comparison, estimates coefficients by minimizing

RSS + λ∑|βⱼ|

Apart from the different penalty term, lasso regression behaves much like ridge regression. As with ridge, standardizing the predictors and selecting a good λ are critical, and the model is a least squares fit when λ = 0 but a null model as λ approaches infinity. So what difference does the penalty term make between lasso and ridge?
Because the penalty term is squared for ridge regression but an absolute value for lasso regression, the edges of the constraint regions defined by the penalty terms are round for ridge but straight for lasso. To grasp the idea easily, see the plots below.
In the plots, the red ellipses are the contours of the RSS and the blue regions are the constraint regions of lasso and ridge. As mentioned above, the edges for ridge regression are round because the penalty term behaves like a circular constraint: x² + y² = r². Therefore, the constraint region generally does not meet the RSS contours on an axis, where a coefficient estimate would be zero.
In contrast, because the absolute values in the lasso penalty form straight edges (imagine |x| + |y| = 1), the constraint region has corners on the axes. Therefore, it can meet the RSS contours on an axis and set some coefficient estimates exactly to zero. The plots are drawn for p = 2 for simplicity, but the same concept applies when p > 2.
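The geometric argument above is easy to check numerically: on the same standardized synthetic data (arbitrary alpha values), lasso sets some coefficients exactly to zero while ridge merely shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Only 3 of the 10 predictors truly drive the response.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10, random_state=0)
X = StandardScaler().fit_transform(X)

ridge_coef = Ridge(alpha=10).fit(X, y).coef_
lasso_coef = Lasso(alpha=10).fit(X, y).coef_

print("zeros in ridge:", np.sum(ridge_coef == 0))  # usually 0
print("zeros in lasso:", np.sum(lasso_coef == 0))  # typically several exact zeros
```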
Choosing the best model
As I noted above, selecting a good λ value is directly related to choosing the best model. However, we do not know in advance what a good value for λ is, so we compute the cross-validation error over a grid of λ values and choose the λ that gives the lowest error.
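A sketch of that search using scikit-learn's built-in cross-validated estimators; the logarithmic alpha grid and the synthetic data are arbitrary examples.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=15, random_state=0)
X = StandardScaler().fit_transform(X)

# Grid of candidate lambda (alpha) values spanning several orders of magnitude.
alphas = np.logspace(-3, 3, 50)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("best ridge lambda:", ridge_cv.alpha_)
print("best lasso lambda:", lasso_cv.alpha_)
```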
Dimension Reduction methods
The last approach involves transforming the independent variables and then fitting a model on these transformed variables. Principal component regression and partial least squares fall into this category.
Principal component analysis
The fundamental idea of this method is to explain the data with a low-dimensional set of features derived from a large set of variables. To achieve this, it uses principal component directions of the form

Z₁ = ϕ₁₁X₁ + ϕ₂₁X₂ + … + ϕₚ₁Xₚ
In this expression, the ϕ values are called principal component loadings, and they define the direction of the component. To capture the direction along which the data vary the most, the loadings are chosen to maximize the variance of the projected points. In this case, variance means the average of the squared distances from the projected points (the red points in the plot below) to the origin (0, 0). So to increase the variance, the projected points should lie far from the origin, which is achieved by aligning the principal component direction with the direction of the data points. To see this idea visually, check the graph below.
One restriction on the principal component loadings is that the sum of their squares must equal 1 (∑ϕⱼ₁² = 1); without this restriction, their values could be made arbitrarily large.
Another important thing to note is standardizing the variables. Because the principal component direction puts the most weight on the predictor with the highest variance, a predictor with a wider range of values has a higher chance of dominating the direction. For instance, regardless of the true relationship, the loadings would put more weight on a predictor ranging from 0 to 1000 than on one ranging from 0 to 1, simply because the former is more spread out. Standardizing also centers the variables, which is why the data points in the graph above have a mean of 0.
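A minimal PCA sketch showing both points: the variables are standardized first, and the resulting loading vectors (the rows of components_) each have unit norm, matching the sum-of-squares restriction above. The synthetic data are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_regression(n_samples=200, n_features=6, random_state=0)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_std)
print(pca.components_)                # loading vectors; each row has unit norm
print(pca.explained_variance_ratio_)  # variance captured by each direction
```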
Principal component regression — PCR
With p predictors in a data set, we can construct up to p principal component directions. The first principal component points along the direction in which the observations vary the most. The second principal component is uncorrelated with the first (and therefore perpendicular to it) and captures the largest remaining variation in the data points.
The key idea of PCR is that a small number M of principal components is often enough to explain most of the variability, so we regress the response on those M components instead of on the original p predictors. Choosing an appropriate M is therefore crucial.
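A PCR sketch as a scikit-learn pipeline: standardize, project onto the first M principal components, then regress the response on those components (M = 3 and the synthetic data are arbitrary choices).

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=15, random_state=0)

pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))  # in-sample R-squared of the 3-component fit
```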
Partial least squares — PLS
Another dimension reduction method we can use is PLS. This method is just like PCR, except that it is a supervised version, so the calculation of the loadings is a little different.
Instead of choosing loadings that maximize variance, PLS computes the directions by setting each loading equal to the coefficient from the simple linear regression of Y onto Xⱼ [5]. Therefore, unlike PCR, which puts the most weight on the directions of highest variance, PLS puts the most weight on the variables most strongly related to the response.
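A matching PLS sketch using scikit-learn's PLSRegression, which computes its directions using y as well as X; again, the synthetic data and n_components = 3 are arbitrary.

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=15, random_state=0)
X = StandardScaler().fit_transform(X)

pls = PLSRegression(n_components=3).fit(X, y)
print(pls.score(X, y))  # in-sample R-squared of the 3-component fit
```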
Choosing the best model
Just as with shrinkage, the cross-validation error is used to identify the number of components M to use in our model.
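As a sketch, M can be tuned with a grid search over the pipeline's n_components step, picking the value with the best cross-validated score (the synthetic data and the 1–10 grid are arbitrary).

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=15, random_state=0)

pcr = Pipeline([("scale", StandardScaler()),
                ("pca", PCA()),
                ("lm", LinearRegression())])
search = GridSearchCV(pcr, {"pca__n_components": range(1, 11)}, cv=5)
search.fit(X, y)
print(search.best_params_)  # the M with the best cross-validated score
```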
TL;DR
Subset selection: selects the best model for each subset size → use Cp, BIC, adjusted R², and domain knowledge to choose a model
Shrinkage: estimates coefficients by minimizing RSS plus a penalty term → find an appropriate λ value by evaluating the cross-validation error
Dimension reduction: uses M principal component directions to reduce the dimension of the predictors → find an appropriate M value by evaluating the cross-validation error
When to use which
With all these methods, you might wonder which one is appropriate in which situation. So here is a quick overview of the pros and cons of each approach. First, let's take a look at subset selection.
Subset selection
Compared to the other subset selection methods, the main advantage of best subset selection is that it finds the best model for every subset size. However, because it explores all 2ᵖ combinations of variables, it quickly becomes computationally expensive as p increases: at p = 20 there are already 1,048,576 combinations. Forward and backward selection address this limitation.
Because they don't explore every combination, they are computationally cheaper than best subset selection. But as always, the advantage comes at a cost: they are not guaranteed to find the best models the way best subset selection is.
However, some people are critical of selection methods because they view them as a form of p-hacking: the procedure tries many models, obtains low p-values by chance, and declares those results significant. It is like flipping 10 coins 1,000 times until you get 10 heads in a row, and then claiming the 50:50 ratio of heads to tails is wrong, even though the streak happened merely by chance.
Therefore, rather than using selection methods for interpretive analysis, it is preferable to use them as an exploratory step to find a few promising predictors. But even then, there are better approaches.
Shrinkage
Because of the difference in penalty terms between ridge and lasso, ridge tends to do better when many predictors are associated with the response to a similar degree. In contrast, lasso tends to do better when a few predictors are strong and the coefficients of the others are close to zero. However, we do not know the relationship between the predictors and the response before performing any analysis, so it's best to try both and choose a model with cross-validation.
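"Try both and let cross-validation decide" can look like the sketch below, which compares the cross-validated mean squared error of tuned ridge and lasso fits on the same synthetic data (all settings arbitrary).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=10, random_state=0)

alphas = np.logspace(-3, 3, 50)
for name, model in [("ridge", RidgeCV(alphas=alphas)),
                    ("lasso", LassoCV(alphas=alphas))]:
    pipe = make_pipeline(StandardScaler(), model)
    # Average cross-validated MSE; lower is better.
    mse = -cross_val_score(pipe, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(name, round(mse, 1))
```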
Another difference is interpretability. Because ridge keeps all variables in the model, its results are often hard to interpret. Since lasso performs variable selection, it is clearer which predictors are related to the response and which are not. Therefore, when you want to understand the relationships between the predictors and the response, lasso could be the better choice.
Dimension reduction
One big difference between PCR and PLS is that PCR is an unsupervised approach whereas PLS is a supervised one: PCR does not include the response in its calculation, while PLS does. So when the response is related to directions that have low variance, PCR will fail to predict accurately, since it determines the principal directions only by maximizing variance. PLS, because it takes the response into account, will perform much better in that case.
Regardless of the pros and cons of each method, both can be used with highly correlated variables, since they replace the original variables with linearly independent directions. They also tend to do well when the first few directions are sufficient to reduce bias rapidly, and poorly when many principal components are required to predict the response.
Last note
It's hard to know the true relationship between the predictors and the response before performing any kind of analysis, so no single method dominates the others universally. I think trying different methods and analyzing the resulting models with domain knowledge to understand the true relationship is better than blindly applying one method and moving forward based on numerical values alone.
Reference
[1] AB, Cliff. “Are There Any Circumstances Where Stepwise Regression Should Be Used?” Cross Validated, 25 Jan. 2017, stats.stackexchange.com/questions/258026/are-there-any-circumstances-where-stepwise-regression-should-be-used.
[2] Chen, Haowen. Model Selection for Linear Regression Model, jbhender.github.io/Stats506/F17/Projects/Group21_Model_Selection.html.
[3] Flom, Peter. “Stopping Stepwise: Why Stepwise Selection Is Bad and What You Should Use Instead.” Medium, Towards Data Science, 11 Dec. 2018, towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df.
[4] Jaadi, Zakaria. “A Step-by-Step Explanation of Principal Component Analysis (PCA).” Built In, 1 Apr. 2021, builtin.com/data-science/step-step-explanation-principal-component-analysis.
[5] James, Gareth, et al. An Introduction to Statistical Learning: with Applications in R. Springer, 2021.
[6] meh. "When AIC and Adjusted R² Lead to Different Conclusions." Cross Validated, 25 Apr. 2018, stats.stackexchange.com/questions/140965/when-aic-and-adjusted-r2-lead-to-different-conclusions.
[7] Munier, Robin. “PCA vs Lasso Regression: Data Science and Machine Learning.” Kaggle, www.kaggle.com/questions-and-answers/180977.
[8] Oleszak, Michał. “Regularization: Ridge, Lasso and Elastic Net.” DataCamp Community, www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net.
[9] "Principal Component Regression vs Partial Least Squares Regression." scikit-learn, scikit-learn.org/stable/auto_examples/cross_decomposition/plot_pcr_vs_pls.html.
[10] Wicklin, Rick. “Should You Use Principal Component Regression?” The DO Loop, 25 Oct. 2017, blogs.sas.com/content/iml/2017/10/25/principal-component-regression-drawbacks.html.