Feature Selection and EDA in Machine Learning
How to Use Data Visualization to Guide Feature Selection
In the machine learning lifecycle, feature selection is a critical process that selects a subset of input features relevant to the prediction task. Including irrelevant variables, especially those with poor data quality, can contaminate the model output.
Additionally, feature selection has the following advantages:
1) it avoids the curse of dimensionality, as some algorithms (e.g. generalized linear models and decision trees) perform poorly when the dimensionality is high
2) it reduces the computational cost and complexity that come with a large amount of data
3) it reduces overfitting, so the model is more likely to generalize to new data
4) it increases the explainability of models
In this article, we will discuss two main feature selection techniques — Filter Methods and Wrapper Methods — and how to take advantage of data visualization to guide decision-making.
Data Preprocessing
Before jumping into feature selection, we should always load the dataset and perform data preprocessing and data transformation:
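A minimal sketch of these steps, assuming pandas and scikit-learn are available; the toy DataFrame and its column names stand in for the article's actual dataset, which would normally be loaded from a file:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data in place of a real load, e.g. df = pd.read_csv("data.csv")
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [40000, 52000, 61000, None],
    "city": ["NY", "SF", "NY", "LA"],
})

# 1) Handle missing values: impute numeric columns with the median
num_cols = ["age", "income"]
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 2) Encode categorical variables with one-hot encoding
df = pd.get_dummies(df, columns=["city"], drop_first=True)

# 3) Scale numeric features to zero mean and unit variance
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

print(df.shape)
```

After these steps, every column is numeric and missing values are filled, so the feature selection techniques discussed next can be applied directly.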