How to measure feature importance in a binary classification model

An example in R language of how to check feature relevance in a binary classification problem

Gianluca Malato
Mar 7, 2019 · 6 min read

One of the main tasks a data scientist must face when building a machine learning model is the selection of the most predictive variables. Selecting predictors with low predictive power can, in fact, lead to overfitting or poor model performance. In this article, I’ll show you some techniques to better select the predictors of a dataset for a binary classification model.

When a data scientist starts working on a model, they often don’t have a clear idea of which predictors to use. Maybe the previous business-understanding phase discarded some useless variables, but we often still have to face a giant table of hundreds of variables.

Training a model on such a huge table is not a good idea. You run a real risk of collinearity (i.e. correlations between predictor variables). So we have to choose the best set of variables to use in order to make our model learn properly from the business information we are giving it.

Our goal is to increase the predictive power of our model against our binary target, so we must find those variables that are strongly correlated with it. Remember: information is hidden inside the dataset and we must provide all the necessary conditions to make our model extract it. So we have to prepare data before the training phase in order to make the model work properly.

Numerical and categorical predictors require different approaches, and I’ll show you both.

Numerical variables

If we have two variables, say x and y, their linear correlation coefficient is given by the formula:

r(x, y) = cov(x, y) / (σ(x) · σ(y))

That is, the covariance divided by the product of the standard deviations.

We are not interested in the sign of correlation. We just need to know its intensity. That’s why we use the absolute value.
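To make the definition concrete, here is a quick sketch in R (using the built-in iris dataset and a virginica-based 0/1 target, anticipating the example later in this article) showing that the covariance-over-standard-deviations formula matches what the built-in cor function computes:

```r
# Pearson correlation computed manually as covariance over the product
# of standard deviations, compared against R's built-in cor().
x <- iris$Petal.Width
y <- ifelse(iris$Species == "virginica", 1, 0)  # 0/1 target

manual  <- cov(x, y) / (sd(x) * sd(y))
builtin <- cor(x, y)

all.equal(manual, builtin)  # TRUE: the two computations agree
abs(builtin)                # the absolute value is what we rank features by
```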

I have often seen this approach in many AI projects and tools. Honestly, I have to say that it’s not completely correct to calculate the correlation coefficient in this way: for a perfect predictor we would expect an absolute Pearson coefficient equal to 1, but we may not reach that value when the binary outcome is treated as a 0/1 number. This is not a problem for our purpose, however. We are using the Pearson correlation coefficient only to sort our features from the most relevant to the least relevant, so as long as the coefficient is calculated the same way for every feature, we can compare the features with one another.

The Pearson correlation coefficient is not flawless, however. It only measures linear correlation, and our variables may not be linearly correlated with the target. But as a first approximation, it is easy to calculate and serves our purpose.
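A tiny sketch of that limitation: a relation that is deterministic but symmetric has a Pearson coefficient of essentially zero.

```r
# Pearson correlation misses non-linear dependence: y is a perfect
# function of x, yet the linear correlation is (numerically) zero
# because the relation is symmetric around x = 0.
x <- seq(-1, 1, length.out = 101)
y <- x^2
cor(x, y)  # ~0 despite perfect (non-linear) dependence
```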

Categorical variables

Let’s say we have a histogram of N different categories with observed counts O_i that sum up to n, and let’s say we want to compare it with a theoretical histogram given by probabilities p_i. We can build a chi-square variable in this way:

χ² = Σ_{i=1..N} (O_i − n·p_i)² / (n·p_i)

This variable is asymptotically distributed as a chi-square distribution with N − 1 degrees of freedom.

If our variable is not correlated with the target, we expect that, for each one of its values, we get 50% zeroes and 50% ones in our dataset. This is the theoretical histogram we would expect under no correlation, so a chi-square test comparing the real histogram against it should give us a p-value close to 1 (i.e. a low chi-square value) when our variable is not correlated with the target. On the contrary, a perfect predictor will push the p-value towards 0 (i.e. higher chi-square values).
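As an illustrative sketch (the counts here are made up), this is how the 50/50 expectation translates into R for a single value of a categorical predictor:

```r
# Hypothetical observed counts of target = 0 and target = 1 for one
# level of a categorical predictor.
counts <- c(48, 52)

# Goodness-of-fit test against the 50/50 split expected under no
# correlation: a high p-value, as here, means no evidence of
# correlation for this level.
chisq.test(counts, p = c(0.5, 0.5))
```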

Example in R

Remember that R has a powerful function, cor, which calculates the correlation matrix, and the function chisq.test, which performs the chi-square test.

First, we create a column named target that is equal to 1 when the species is virginica and 0 otherwise. Then we’ll check the correlations with the other variables.

Let’s start with the numerical features. With this simple code, it’s very easy to find the most correlated ones.

# Load iris dataset (built into R)
data(iris)
# Generate a binary target column
iris$target <- ifelse(iris$Species == "virginica", 1, 0)
# Absolute correlation of each numeric column with the target
numeric_columns <- setdiff(names(iris), "Species")
target_corr <- abs(cor(iris[, numeric_columns])["target", ])

As you can see, the most correlated one is the petal width, then comes the petal length and so on. The correlation of the target with itself is obviously 1.
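Sorting the vector makes the ranking explicit (a self-contained sketch that recomputes target_corr from the snippet above):

```r
# Recompute the absolute correlations and sort them, most relevant first.
iris$target <- ifelse(iris$Species == "virginica", 1, 0)
numeric_columns <- setdiff(names(iris), "Species")
target_corr <- abs(cor(iris[, numeric_columns])["target", ])
sort(target_corr, decreasing = TRUE)
# target itself comes first (correlation 1), then Petal.Width, Petal.Length, ...
```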

Let’s take a look at the plot of the target variable against the petal width:

As you can see, higher values of petal width lead to 1 and lower values lead to 0. That’s a clear correlation.

Now, let’s take a look at the plot of the target against the sepal length, which has been classified as the least representative variable:

It’s clear that there is a wide region, approximately between 5.5 and 7, inside which we get 0 and 1 almost alternately. The lack of a graphical pattern is always a good reason to suspect a lack of correlation.

For the categorical case, we’ll calculate the correlation between the target and the Species variable. Of course, we expect a strong correlation, because we built the target as a direct function of the species.

I’ll show you the single-line code and the results:
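The original snippet is not reproduced here, but based on the functions named in this article (table and chisq.test), it amounts to the following sketch:

```r
# Build the binary target, then run the chi-square test on the
# contingency table of Species vs. target in a single line.
iris$target <- ifelse(iris$Species == "virginica", 1, 0)
chisq.test(table(iris$Species, iris$target))
# The p-value is essentially zero: Species and target are strongly related.
```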

The table function generates the contingency table, and the chisq.test function performs exactly the chi-square test we want for our case.

A very low p-value means a very strong difference from the uncorrelated case. As usual in hypothesis testing, you never actually accept the null hypothesis; you can only reject it or fail to reject it.

We can get further confirmation by taking a look at the contingency table:
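A sketch reproducing that table from the data built earlier:

```r
# Contingency table of Species vs. the binary target: setosa and
# versicolor fall entirely in the 0 column, virginica entirely in 1.
iris$target <- ifelse(iris$Species == "virginica", 1, 0)
table(iris$Species, iris$target)
```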

As you can see, the column values are very unbalanced, which is exactly what we are looking for.


This simple procedure can be used as a first check of the most important variables and as a starting point for a deeper analysis to find the best set of predictors for our model.

Data Science Reporter

Articles about data science, machine learning, artificial intelligence and their impact in our life.

Gianluca Malato

Written by

Theoretical Physicist, Data Scientist and fiction author. I teach Data Science, statistics and SQL on

