An Overview of the Statistical Approach: Feature Selection in Machine Learning

Elias Hossain
Published in Analytics Vidhya
5 min read · Jun 20, 2020

Feature selection is one of the most essential tasks in Machine Learning (ML). Without a sound understanding of it, you are likely to run into difficulties when building models. Several feature selection techniques exist; this article focuses on one reliable and widely used option, the Pearson correlation coefficient. It reviews the method and then walks through a hands-on implementation.

A few words before the beginning

This article assumes some background, so it helps to know basic statistics and ML beforehand. Statistics is crucial for ML and Data Science, and it plays an essential role in scientific research in general.

The Pearson correlation coefficient

In short, correlation is a relationship between two variables: it indicates the extent to which they vary together. The Pearson correlation coefficient measures the strength of the linear relationship and ranges from -1 to 1. The closer the value is to 1 or -1, the stronger the positive or negative relationship; the closer it is to 0, the weaker the relationship. Fig.1 shows the Pearson correlation equation for r, and Fig.2 shows the correlation scale.

Fig.1: Pearson correlation equation

Here:

  • rxy — the correlation coefficient of the linear relationship between the variables x and y
  • xi — the values of the x-variable in a sample
  • x̄ — the mean of the values of the x-variable
  • yi — the values of the y-variable in a sample
  • ȳ — the mean of the values of the y-variable
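
Fig.1 is an image and is not reproduced in this text. Assuming it shows the usual sample form of the Pearson coefficient, the equation can be written with the symbols listed above as follows (n, the number of paired observations, is a symbol introduced here for illustration):

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how x and y deviate from their means together, and the denominator rescales that quantity so the result always falls between -1 and 1.
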
Fig.2: Pearson’s Correlation scale

To understand the correlation coefficient, look at the correlation matrix below (Fig.3), which shows several plots together with the strength of the correlation in each. The diagram helps you read the scale of the relationship between variables.

Fig.3: Correlation Matrix

Let us find out the correlation through a mathematical equation. We will use the example of a dataset as a scenario and see how to do the feature selection.

Scenario: Suppose you are a Machine Learning (ML) expert and Mr. X is an ice cream seller. X is very disappointed with his ice cream sales in May 2020. The seller's view is that ice cream sales increase in hot weather, as they did in past months, yet sales were poor in June despite the heat. In that case, the ML expert should ask the seller for the ice cream sales data for the past month. X now gives you the data you asked for, and you have to find the correlation.

Step 01: Collect dataset.

Table.1: Dataset

Step 02: Calculate the average of each variable, “Temperature” and “Sold”, since we are going to find the correlation between them. Have a look at Table.2.

Table.2: Calculate the average for each variable

Step 03: After calculating the averages, we can find the other values: the deviations from each mean, their squares, and their products. A summary of the calculations is given in Table.3:

Table.3: Find the other values
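
Steps 02 and 03 can be sketched in plain Python. The tables are images and are not reproduced here, so the numbers below are hypothetical stand-ins, not the values from Table.1, and the function name `pearson_r` is chosen only for this illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation, computed as in Steps 02 and 03."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 values")

    # Step 02: average of each variable
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Step 03: deviations from the means, their products and squares
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)

    # A constant variable makes the denominator zero, so r is undefined
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical values, NOT the article's Table.1
temperature = [20, 25, 30, 35, 40]
sold = [100, 140, 150, 210, 250]

r = pearson_r(temperature, sold)
print(round(r, 4))
```

The three sums correspond to the column totals that a table like Table.3 would contain, which is why only those totals appear in the final division.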

r = 4730 / √(540 × 45891.667) = 0.9502
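
The arithmetic can be checked in a couple of lines. The three sums are taken from the calculation above; which of the two squared sums belongs to Temperature and which to Sold is an assumption here, and it does not affect the result.

```python
from math import sqrt

# Sums taken from the calculation above
sum_xy = 4730        # sum of products of the deviations
sum_xx = 540         # sum of squared deviations (assumed: Temperature)
sum_yy = 45891.667   # sum of squared deviations (assumed: Sold)

r = sum_xy / sqrt(sum_xx * sum_yy)
print(round(r, 4))  # 0.9502
```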

So, the value of r is 0.9502.

On the Pearson correlation scale, r = 0.9502 indicates a strong positive relationship between the Temperature and Sold variables. If you have many variables, you can compute the correlation between each candidate variable and the target in the same way, and use the result to choose the independent variables for your machine learning model. That was the theory behind the correlation between two variables. Now we will build a machine learning model with this dataset and work through it in practice using Python and a Jupyter notebook.
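
Before moving on, the idea of choosing independent variables by correlation can be sketched with pandas. Everything in this example is hypothetical: the `DayOfMonth` column, the values, and the 0.5 cut-off are made up for illustration and do not come from the article's dataset.

```python
import pandas as pd

# Hypothetical data: column names and values are illustrative only
df = pd.DataFrame({
    "Temperature": [20, 25, 30, 35, 40, 22, 28, 33],
    "DayOfMonth": [3, 17, 9, 25, 1, 12, 21, 6],
    "Sold": [100, 140, 150, 210, 250, 110, 150, 190],
})

target = "Sold"
threshold = 0.5  # arbitrary cut-off chosen for this sketch

# Pearson correlation of every candidate feature with the target
corr_with_target = df.corr(method="pearson")[target].drop(target)

# Rank by absolute value: a strong negative correlation is informative too
ranking = corr_with_target.abs().sort_values(ascending=False)
selected = ranking[ranking >= threshold].index.tolist()

print(ranking)
print(selected)
```

Ranking by absolute value keeps features that move strongly in the opposite direction to the target. Note that Pearson correlation only captures linear relationships, so a feature with a low score may still be useful to a non-linear model.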

Step 01: Import the essential modules and libraries & load the dataset

Fig.4: Including library and load the dataset
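
The code in Fig.4 is an image and is not reproduced here. A minimal sketch of this step, assuming the data is a CSV with the two columns "Temperature" and "Sold", could look like this. The rows are hypothetical and are read from an in-memory string so the example runs without the original file.

```python
import io

import pandas as pd

# Stand-in for the real file: hypothetical rows, not the article's Table.1
csv_text = """Temperature,Sold
20,100
25,140
30,150
35,210
40,250
22,110
28,150
33,190
"""

# With a real file this would be pd.read_csv("<path to your CSV>")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows, to confirm the columns loaded correctly
print(df.describe())  # count, mean, std, min and max for each column
```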

Step 02: Visualize the data graphically (not necessary, but it helps to understand the dataset)

Fig.5: Data visualization by scatter plot
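
The plotting code in Fig.5 is likewise an image. One way to draw such a scatter plot with matplotlib is sketched below, using hypothetical values and a made-up output file name.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical values, not the article's dataset
temperature = [20, 25, 30, 35, 40, 22, 28, 33]
sold = [100, 140, 150, 210, 250, 110, 150, 190]

fig, ax = plt.subplots()
ax.scatter(temperature, sold)
ax.set_xlabel("Temperature")
ax.set_ylabel("Sold")
ax.set_title("Temperature vs. ice cream sold")

# Hypothetical file name; in a notebook the figure is displayed inline instead
fig.savefig("temperature_vs_sold.png")
```

Points that rise from the lower left to the upper right suggest a positive correlation, which is what the coefficient then quantifies.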

When you plot the data, you will see the correlation as in Fig.6.

Fig.6: Correlation between the variables (Scatter plot)

Step 03: Define Independent and Dependent variable & Split Dataset for Train and Test

Fig.7: Feature selection & perform train test split
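
A sketch of this step with scikit-learn follows. The data is hypothetical, and the 25% test share and the `random_state` value are assumptions made for the example, not settings taken from the article's notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data, not the article's dataset
df = pd.DataFrame({
    "Temperature": [20, 25, 30, 35, 40, 22, 28, 33],
    "Sold": [100, 140, 150, 210, 250, 110, 150, 190],
})

X = df[["Temperature"]]  # independent variable, kept 2-D as scikit-learn expects
y = df["Sold"]           # dependent variable

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```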

Step 04: Fit data into the Machine Learning Algorithm

Fig.8: Fit data into the Machine Learning Algorithm
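
The text does not name the algorithm used in Fig.8. Assuming a simple linear regression, which suits one numeric feature and a numeric target, fitting could look like this sketch. The training values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data, not the article's dataset
X_train = np.array([[20], [25], [30], [35], [40], [22], [28], [33]])
y_train = np.array([100, 140, 150, 210, 250, 110, 150, 190])

# Assumption: linear regression stands in for the unnamed algorithm
model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_[0], model.intercept_)  # slope and intercept of the fitted line

# Predicted units sold at a temperature of 30
prediction = model.predict(np.array([[30]]))
print(prediction[0])
```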

Step 05: Check the accuracy of the model & find out the correlation

Fig.9: Accuracy checking & find out the correlation coefficient
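
A sketch of both checks follows, again with hypothetical data and an assumed linear regression model. For scikit-learn regressors, `score` returns the coefficient of determination R² rather than a classification accuracy.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data, not the article's dataset
df = pd.DataFrame({
    "Temperature": [20, 25, 30, 35, 40, 22, 28, 33],
    "Sold": [100, 140, 150, 210, 250, 110, 150, 190],
})
X = df[["Temperature"]]
y = df["Sold"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# R^2 on the held-out rows (noisy with so few test samples)
r2 = model.score(X_test, y_test)
print(r2)

# Pairwise Pearson correlation matrix of all columns
corr = df.corr(method="pearson")
print(corr)

# The single coefficient of interest
r = corr.loc["Temperature", "Sold"]
print(r)
```

The diagonal of the matrix is always 1, because every variable is perfectly correlated with itself; the off-diagonal entry is the Temperature and Sold coefficient.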

When you check the correlation, you will see a result like Fig.10.

Fig.10: Correlation coefficient

To conclude, machine learning work is done in a few basic steps, and selecting the independent and dependent variables is an essential one. Statistics is inextricably linked with data science and machine learning. This article reviewed a statistical approach, the Pearson correlation coefficient, giving both a theoretical explanation and a practical example of feature selection for a machine learning model.

Find the Jupyter Notebook: https://github.com/eliashossain001/Machine_Learning_Feature-selection

I consulted the following references while writing this article:

https://corporatefinanceinstitute.com/resources/knowledge/finance/correlation/

https://www.researchgate.net/publication/260989251_The_Influence_of_Transparency_on_the_Leaders'_Behaviors_A_Study_among_the_Leaders_of_the_Ministry_of_Finance_Yemen

https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_correlation_matrix_plot.htm

If you would like to find my research papers and other updates, follow these links:

ResearchGate: https://www.researchgate.net/profile/Elias_Hossain7

LinkedIn: https://www.linkedin.com/in/elias-hossain-b70678160/

Twitter: https://twitter.com/eliashossain_

I am a Software Engineer. My research interest is diverse, intelligent systems, and I am eager to learn more about them