Feature Selection Strategies For Regression Models

Ro Data Sip-and-Share Q1 2019

Ying Ma
Apr 9 · 5 min read
  1. Statistical Inference
  2. Greedy Search
  3. Regularization

Statistical Inference

The statistical inference approach estimates the standard errors of the coefficients of a regression model, and then constructs confidence intervals and p-values to test whether each coefficient is significantly different from 0. If the null hypothesis that a coefficient is zero is rejected with a small p-value, it means this feature has some genuine effect on the target.

Example of statsmodels’ logistic regression summary.

Greedy Search

Compared with the statistical approach, the greedy search method is more practical and leans toward the realm of machine learning engineering. The general idea of greedy search is to generate models with various combinations of features and narrow down to the feature subsets with the optimal model performance. There are several variations of the greedy search strategy, and here I will discuss two of them.
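One common variant is forward selection, which can be sketched as follows (a hypothetical implementation using scikit-learn for the model and cross-validation; the helper name `forward_select` and the synthetic dataset are my own assumptions): start from an empty set, and at each step greedily add the single feature that most improves cross-validated R², stopping when no candidate helps.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, cv=5):
    """Greedy forward selection: add the best-scoring feature each round."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score each candidate feature when added to the current subset
        scores = [
            (np.mean(cross_val_score(LinearRegression(),
                                     X[:, selected + [j]], y, cv=cv)), j)
            for j in remaining
        ]
        score, j = max(scores)
        if score <= best_score:
            break  # no candidate improves performance; stop searching
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
print(forward_select(X, y))
```

Backward elimination is the mirror image: start from all features and greedily drop the one whose removal hurts performance least.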


Regularization

Regularization is another way of identifying and down-weighting unimportant features to prevent overfitting, but without actively removing any features from the original dataset. To minimize the impact of meaningless and correlated features on the model, regularization shrinks the coefficients of these features so that they do not contribute to the prediction results. This goal is achieved by adding a penalty on the coefficients to the loss function.

Demonstration of L1 (left) and L2 (right) regularization. (An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani)
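The practical difference between the two penalties can be seen in a short sketch (using scikit-learn's `Lasso` and `Ridge` on synthetic data, which is my own illustrative setup, not the article's): the L1 penalty drives the coefficients of uninformative features exactly to zero, effectively performing feature selection, while the L2 penalty only shrinks them toward zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# First 3 columns are informative (shuffle=False); the other 7 are noise
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, shuffle=False, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

# L1 zeroes out noise-feature coefficients; L2 merely shrinks them
print("L1 exact-zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("L2 exact-zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```

The strength of the penalty (`alpha`) controls how aggressively coefficients are shrunk and is typically tuned by cross-validation.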

Ro Data Team Blog

Ro Data Team Blog: data analytics, data engineering, data science
