Feature Selection in Machine Learning Model Building

Using the chi-square test with Python.

Raghu Bayya
Analytics Vidhya
Dec 28, 2019


Picture by Raghu Bayya, library

Feature selection is one of the most important concepts in machine learning: it can improve a model's performance while reducing the number of input variables. Irrelevant features can hurt the performance of a machine learning model, so feature selection is performed before the model is trained.

Feature selection identifies the features that are relevant to model training. Keeping only the features that best fit the model can shorten training time and improve performance.

As in the previous article (part 2), we use the student dataset, this time to find the features we can use to train our model.

What is the chi-square test?

Before looking at what the chi-square test is, here is the terminology you should know.

  1. Stating the hypotheses (null hypothesis and alternative hypothesis).
  2. Statistical significance.
  3. Contingency table.
  4. Level of significance.
  5. Level of confidence.
  6. Degrees of freedom.
  7. P-value.
  8. Critical value.
  9. Alpha value.
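
Several of these terms fit together in one small calculation. The sketch below is only an illustration: the table shape, the alpha of 0.05 and the observed statistic are assumed values, not numbers from the student dataset. It uses scipy.stats.chi2 to turn alpha and the degrees of freedom into a critical value and a p-value.

```python
from scipy.stats import chi2

alpha = 0.05                     # level of significance (assumed for illustration)
confidence = 1 - alpha           # level of confidence
rows, cols = 2, 3                # shape of a hypothetical contingency table
dof = (rows - 1) * (cols - 1)    # degrees of freedom

# Critical value: the chi-square value that leaves `alpha` in the upper tail.
critical_value = chi2.ppf(confidence, dof)

# p-value of a hypothetical observed chi-square statistic.
observed_stat = 7.5
p_value = chi2.sf(observed_stat, dof)

# Comparing the statistic with the critical value is equivalent to
# comparing the p-value with alpha.
reject_null = observed_stat > critical_value

print(f"dof={dof}, critical value={critical_value:.3f}, p-value={p_value:.4f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Either comparison can be used for the decision; they always agree.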

Analytics deals with what you know. Statistics deals with what you don’t.

The chi-square test measures how expected counts compare with the data actually observed. It is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
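
Written as a formula, this comparison is the standard Pearson chi-square statistic. The notation is an assumption made here for illustration: O_i is the observed frequency in category i and E_i is the frequency expected under the null hypothesis.

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```

The larger the gap between observed and expected counts, the larger the statistic.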

The chi-square test is often used for hypothesis testing. A hypothesis is a premise or claim that we want to test: a proposed explanation, based on limited evidence, that serves as the starting point for further analysis. In a hypothesis test we examine the null hypothesis H0 and the alternative hypothesis H1.
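
As a minimal sketch of such a test, the snippet below runs a chi-square test of independence with scipy.stats.chi2_contingency. The counts are made up for illustration and the row and column meanings are assumptions, not values from the student dataset. H0 says the two variables are independent; H1 says they are not.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts.
# Rows: attendance (low, high). Columns: grade (fail, pass).
observed = np.array([[30, 20],
                     [10, 40]])

# correction=False gives the plain Pearson statistic (no Yates correction).
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

alpha = 0.05                      # assumed level of significance
reject_h0 = p_value < alpha       # True means the variables look dependent

print(f"chi2={stat:.3f}, dof={dof}, p-value={p_value:.6f}")
print("Expected counts under H0:")
print(expected)
```

Here the expected counts are what we would see if attendance had no relationship with the grade.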

Now we use the student dataset to find the features (also called variables or attributes) that will be used to train the model. We use the selected features to predict the student grade more accurately.

Importing the packages, loading the dataset, and preparing the data for chi-square analysis.

Source : Jupyter Notebook
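
Since the notebook screenshot is not reproduced here, the following is only a stand-in for that step. The rows are invented and every column except Consultations and No of visits is an assumed name; with the real data you would load the file with pandas instead of building the frame by hand.

```python
import pandas as pd

# Hypothetical stand-in for the student dataset (values are made up).
students = pd.DataFrame({
    "attendance": ["high", "low", "high", "medium", "low", "high"],
    "Consultations": [5, 0, 3, 1, 0, 4],
    "No of visits": [2, 7, 4, 1, 6, 3],
    "grade": ["A", "C", "A", "B", "C", "B"],
})

# Split the columns by type: numeric ones may need binning,
# non-numeric ones will need encoding before the chi-square test.
numeric_cols = students.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in students.columns if c not in numeric_cols]

print(students.head())
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```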

To perform a chi-square test, all features (variables) should be organized into a contingency table. In the image above, Consultations needs to be converted to categorical values.

Source : Jupyter Notebook
Source : Jupyter Notebook
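
One way to do this conversion is sketched below with pd.cut and pd.crosstab. The bin edges, the level names and the sample rows are assumptions chosen for illustration; the notebook may have used different cut points.

```python
import pandas as pd

# Hypothetical sample rows (values are made up).
students = pd.DataFrame({
    "Consultations": [0, 1, 5, 3, 0, 4, 2, 6],
    "grade": ["C", "C", "A", "B", "C", "A", "B", "A"],
})

# Bin the numeric column into categories.
# Assumed edges: 0-1 -> low, 2-3 -> medium, 4 and above -> high.
students["Consultations_level"] = pd.cut(
    students["Consultations"],
    bins=[-1, 1, 3, float("inf")],
    labels=["low", "medium", "high"],
)

# Contingency table: counts for each (level, grade) combination.
contingency = pd.crosstab(students["Consultations_level"], students["grade"])
print(contingency)
```

Each cell of the table holds the number of students that fall into that combination of categories.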

The chi-square function in scikit-learn expects numeric input, so we use Label Encoder to convert the categorical values into numerical values.

Source : Jupyter Notebook
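
A minimal sketch of the encoding step with scikit-learn's LabelEncoder follows. The column names and values are assumed for illustration. One encoder is kept per column so the integer codes can be mapped back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns (values are made up).
students = pd.DataFrame({
    "attendance": ["high", "low", "medium", "high", "low"],
    "grade": ["A", "C", "B", "A", "C"],
})

encoders = {}
encoded = students.copy()
for col in students.columns:
    enc = LabelEncoder()
    # Classes are sorted alphabetically and mapped to 0, 1, 2, ...
    encoded[col] = enc.fit_transform(students[col])
    encoders[col] = enc  # kept so that inverse_transform is possible later

print(encoded)
print(list(encoders["attendance"].classes_))
```

Note that the scikit-learn documentation describes LabelEncoder as intended for target labels; OrdinalEncoder is the equivalent meant for feature columns, and either yields the non-negative integers the chi-square function needs.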

Chi-Square Test Statistics

from sklearn.feature_selection import chi2

Source : Jupyter Notebook
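
To show what the imported function returns, here is a small sketch on made-up, already encoded data (the arrays are assumptions, not the student dataset). chi2 takes a non-negative feature matrix X and a class target y, and returns one chi-square score and one p-value per feature.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical label-encoded features (must be non-negative) and target.
X = np.array([
    [0, 1],
    [0, 0],
    [0, 1],
    [2, 0],
    [2, 1],
    [2, 0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# One score and one p-value per column of X.
chi2_scores, p_values = chi2(X, y)

for i, (score, p) in enumerate(zip(chi2_scores, p_values)):
    print(f"feature {i}: chi2={score:.3f}, p-value={p:.4f}")
```

In this toy data the first feature follows the target closely and gets a high score, while the second barely changes between classes and gets a low one.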

P-Value (Probability value)

Source : Jupyter Notebook
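
The p-values are easier to read when they are attached to the feature names. The sketch below does that with a pandas Series; the data is invented for illustration and the target name grade is an assumption.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical encoded features and target (values are made up).
X = pd.DataFrame({
    "Consultations": [0, 0, 0, 2, 2, 2, 0, 2],
    "No of visits":  [1, 0, 1, 0, 1, 0, 0, 1],
})
y = pd.Series([0, 0, 0, 1, 1, 1, 0, 1], name="grade")

_, p = chi2(X, y)

# Label each p-value with its feature and sort from smallest to largest.
p_values = pd.Series(p, index=X.columns).sort_values()
print(p_values)
```

Features at the top of the sorted Series show the strongest evidence of a relationship with the target.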

Visualizing the features that can be used for model training.

Source : Jupyter Notebook
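
A plot like this can be sketched with matplotlib as shown below. The p-values are placeholder numbers chosen for illustration, not the results from the notebook, and the alpha line at 0.05 is an assumed threshold.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder p-values for illustration only.
p_values = pd.Series({
    "attendance": 0.002,
    "Consultations": 0.03,
    "No of visits": 0.71,
}).sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(p_values.index, p_values.values)

# Reference line: bars above it are features we would not keep.
ax.axhline(0.05, color="red", linestyle="--", label="alpha = 0.05")
ax.set_ylabel("p-value")
ax.set_title("Chi-square p-value per feature")
ax.legend()
fig.tight_layout()
```

In a notebook the figure is displayed automatically at the end of the cell.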

In the image above, note that No of visits has the highest p-value. A high p-value means we cannot reject the null hypothesis of independence: the plot indicates that No of visits is independent of attendance, so it should not be considered for model training.
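
Dropping such a feature can also be automated. The sketch below uses scikit-learn's SelectFpr, which keeps only the features whose p-value is below alpha. This is one possible approach, not necessarily what the notebook did, and the data and the 0.05 threshold are assumed.

```python
import pandas as pd
from sklearn.feature_selection import SelectFpr, chi2

# Hypothetical encoded features and target (values are made up).
X = pd.DataFrame({
    "Consultations": [0, 0, 0, 2, 2, 2, 0, 2],
    "No of visits":  [1, 0, 1, 0, 1, 0, 0, 1],
})
y = [0, 0, 0, 1, 1, 1, 0, 1]

# Keep the features whose chi-square p-value is below alpha.
selector = SelectFpr(chi2, alpha=0.05).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
X_selected = X[selected]

print("Selected features:", selected)
```

With this toy data only Consultations passes the threshold, so No of visits is left out of the training set.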

Source: more on GitHub.

About the author: Raghu Bayya, Data Scientist, ML/Deep Learning.

Expert in Big Data
