How to measure feature importance in a binary classification model

An example in R language of how to check feature relevance in a binary classification problem

Gianluca Malato
Data Science Reporter
6 min readMar 7, 2019

--

One of the main tasks that a data scientist must face when he builds a machine learning model is the selection of the most predictive variables. Selecting predictors with low predictive power can lead, in fact, to overfitting or low model performance. In this article, I’ll show you some techniques to better select the predictors of a dataset in a binary classification model.

When a data scientist starts working at some model, he often doesn’t have a real idea of which the predictors should be. Maybe the previous phase of business understanding discarded some useless variables but we often have to face a giant table of hundreds of variables.

Training a model on such a huge table is not a good idea. You really run the risk of collinearity (i.e. correlations between variables). So we have to choose the best set of variables to use in order to make our model learn properly from the business information we are giving it.

Our goal is to increase the predictive power of our model against our binary target, so we must find those variables that are strongly correlated with it. Remember: information is hidden

--

--

Gianluca Malato
Data Science Reporter

Theoretical Physicists, Data Scientist and fiction author. I teach Data Science, statistics and SQL on YourDataTeacher.com. E-mail: gianluca@gianlucamalato.it