Feature Selection in Machine Learning Model Building
Using the chi-square test with Python
Feature selection is one of the most important concepts in machine learning: it can improve the performance of a model and it reduces the number of input variables. Irrelevant features can hurt the performance of a machine learning model. Feature selection is performed before training the model.
Feature selection identifies the features that are relevant to the model. Training on only those features reduces training time and can improve performance.
As in part 2 of this series, we use the student dataset to find the features we can use to train our model.
What is the chi-square test?
Before looking at what the chi-square test is, here is the terminology you should be familiar with:
- Stating the hypothesis (null hypothesis and alternative hypothesis).
- Statistical significance.
- Contingency table.
- Level of significance.
- Level of confidence.
- Degrees of freedom.
- P-value.
- Critical value.
- Alpha value.
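Several of the terms above are tied together by simple arithmetic. The snippet below is a minimal sketch of that relationship using SciPy; the alpha value of 0.05 and the 2 x 3 table shape are assumptions chosen only for illustration, not values taken from the student dataset.

```python
from scipy import stats

alpha = 0.05                     # level of significance (assumed value)
confidence = 1 - alpha           # level of confidence

rows, cols = 2, 3                # shape of a hypothetical contingency table
dof = (rows - 1) * (cols - 1)    # degrees of freedom for a test of independence

# Critical value: the chi-square value that leaves `alpha` in the upper tail
# of the chi-square distribution with `dof` degrees of freedom.
critical_value = stats.chi2.ppf(confidence, dof)

print(f"confidence={confidence:.2f}, dof={dof}, critical value={critical_value:.3f}")
```

A test statistic larger than the critical value corresponds to a p-value smaller than alpha.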
Analytics deals with what you know. Statistics deals with what you don’t.
The chi-square statistic measures how expected counts compare with the counts actually observed. The chi-square test uses it to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
The chi-square test is often used for hypothesis testing. A hypothesis is a premise or claim that we want to test: a proposed explanation, based on the available evidence, that serves as the starting point for further analysis. In a hypothesis test we examine the null hypothesis H0 against the alternative hypothesis H1.
Now we use the student dataset to select the features (also called variables or attributes) used to train the model. The selected features are then used to predict the student grade more accurately.
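To make the comparison of observed and expected frequencies concrete, here is a small sketch that computes the statistic by hand as the sum of (observed - expected)^2 / expected over all categories. The counts are made up for illustration and do not come from the student dataset.

```python
# Hypothetical observed and expected counts for three categories.
observed = [18, 22, 20]
expected = [20, 20, 20]

# Chi-square statistic: squared deviations from the expected counts,
# each scaled by the expected count, summed over the categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(f"chi-square statistic: {chi_square:.2f}")
```

The closer the observed counts are to the expected counts, the closer the statistic is to zero.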
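The decision between H0 and H1 can be sketched as follows. This example uses `scipy.stats.chisquare` on made-up counts, with an assumed alpha of 0.05; here H0 states that the observed counts follow the expected counts.

```python
from scipy import stats

# Hypothetical counts; both lists must sum to the same total.
observed = [30, 10, 20]
expected = [20, 20, 20]
alpha = 0.05  # assumed level of significance

statistic, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

# Reject H0 when the p-value falls below alpha.
reject_null = bool(p_value < alpha)

print(f"statistic={statistic:.2f}, p-value={p_value:.4f}, reject H0: {reject_null}")
```

When the p-value is at or above alpha, we fail to reject H0; that is not the same as proving H0 true.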
Importing the packages, loading the dataset, and preparing the data for chi-square analysis.
To perform a chi-square test, all features (variables) should be organized into a contingency table. From the image above, we need to convert Consultations to categorical values.
scikit-learn's chi2 function expects numeric, non-negative input, so we use LabelEncoder to convert the categorical values into numerical values.
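As a rough sketch of this step, the code below bins a numeric Consultations column into categories with `pd.cut`, builds a contingency table with `pd.crosstab`, and runs `scipy.stats.chi2_contingency` on it. The data, the `Grade` column name, the bin edges, and the labels are all assumptions for illustration; the real student dataset may look different.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical sample; the real student dataset is not reproduced here.
df = pd.DataFrame({
    "Consultations": [0, 1, 2, 5, 6, 7, 1, 8, 3, 9],
    "Grade": ["C", "C", "B", "A", "A", "A", "C", "A", "B", "A"],
})

# Convert the numeric column to categories (bin edges are assumed).
df["Consultations_level"] = pd.cut(
    df["Consultations"], bins=[-1, 2, 5, 10], labels=["low", "medium", "high"]
)

# Contingency table: counts for each (Consultations_level, Grade) pair.
table = pd.crosstab(df["Consultations_level"], df["Grade"])
print(table)

# Chi-square test of independence on the contingency table.
statistic, p_value, dof, expected = chi2_contingency(table)
print(f"statistic={statistic:.3f}, p-value={p_value:.4f}, dof={dof}")
```

With a sample this small the expected counts are low, so the result is only meant to show the mechanics.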
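A minimal sketch of the encoding step, assuming a small made-up frame with two categorical columns (the column names and values are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical data standing in for the student dataset.
df = pd.DataFrame({
    "Consultations": ["low", "high", "medium", "low"],
    "Grade": ["C", "A", "B", "C"],
})

# Encode each column separately; LabelEncoder assigns integer codes
# to the sorted unique values of the column it is fitted on.
encoded = df.copy()
for column in encoded.columns:
    encoded[column] = LabelEncoder().fit_transform(encoded[column])

print(encoded)
```

Note that the integer codes follow the alphabetical order of the labels, not any natural ordering such as low, medium, high.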
Chi-Square Test Statistics
from sklearn.feature_selection import chi2
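Here is a short sketch of how `chi2` is typically called. It takes a feature matrix with non-negative values and a target vector, and returns one chi-square score and one p-value per feature. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical non-negative feature matrix: 6 samples, 2 features.
X = np.array([
    [3, 1],
    [3, 1],
    [0, 1],
    [0, 1],
    [3, 1],
    [0, 1],
])
# Hypothetical binary target (for example, pass/fail).
y = np.array([1, 1, 0, 0, 1, 0])

# One score and one p-value per feature column.
chi2_scores, p_values = chi2(X, y)

for index, (score, p) in enumerate(zip(chi2_scores, p_values)):
    print(f"feature {index}: chi2={score:.3f}, p-value={p:.4f}")
```

In this toy data the first feature changes with the target and the second is constant, so the first feature gets the higher score and the lower p-value.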
P-Value (Probability Value)
Visualizing the feature selection variables that are used for model training.
From the image above, note that No of visits has the highest p-value. The plot indicates that No of visits is independent of attendance, so it should not be considered for model training.
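The same reading of the plot can be done in code by ranking the features by p-value and keeping those below a chosen alpha. The sketch below assumes a made-up frame with two feature columns, a hypothetical binary target, and an alpha of 0.05; none of these values come from the actual student dataset.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical non-negative features.
features = pd.DataFrame({
    "Attendance": [9, 8, 9, 2, 1, 2, 8, 1],
    "No of visits": [3, 4, 3, 4, 3, 4, 4, 3],
})
# Hypothetical binary target.
target = [1, 1, 1, 0, 0, 0, 1, 0]

_, p_values = chi2(features, target)

# Rank features from lowest to highest p-value.
p_series = pd.Series(p_values, index=features.columns).sort_values()
print(p_series)

alpha = 0.05  # assumed level of significance
selected = p_series[p_series < alpha].index.tolist()
print("selected features:", selected)
```

Features whose p-value is above alpha show no significant dependence on the target in this test and are candidates for removal.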
Source: for more, see GitHub
About Author : Raghu Bayya, Data Scientist ML/Deep Learning.
Expert in Big Data