Feature Elimination Using p-values

FAHAD ANWAR · Analytics Vidhya · Oct 13, 2019

Introduction

In this post, we'll talk about hypothesis testing, how the p-value is calculated, and how it drives the test, using a simple example of students' scores. Then we'll move on to how p-values help us eliminate features in a medical expenses dataset before fitting a model to it.

P-values

In statistical hypothesis testing, the p-value or probability value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct.

Wikipedia

Ok! Let's break that down. Some of you may be wondering what hypothesis testing, the null hypothesis and the p-value actually are in the Wikipedia definition above.

Hypothesis Testing

A hypothesis is nothing but an assumption that has not been tested yet, and hypothesis testing is simply checking whether that assumption holds.

Take the example of school scores:
your teacher claims that students score an average of 70% or more, and you want to prove that the average is actually lower than that.

As a general rule, we set the null hypothesis (H₀) to be the opposite of what we want to test, and the alternate hypothesis (Hₐ) to be what we want to test.
In our case,
H₀: Students score an average of 70% or more.
Hₐ: Students score an average of less than 70%.

At first, H₀ is assumed to be true, just like an accused in a court trial is innocent until proven guilty. It is the prosecutor's job to prove that the accused is guilty. Right now H₀ is on trial, and we have to provide evidence to reject it. But what if we don't find any evidence to support our claim? In that case, we say that we have 'failed to reject the null hypothesis'. We don't say that H₀ is true just because we have not found suitable evidence. It could be that we haven't looked in the right place!

So, how do we find that evidence?
Let's say we have the past couple of years of school scores (we call this the population data, or simply the population). From the population, we take samples and try to find evidence that students score an average of less than 70%.

Let's dig into how exactly we do it.
Assume that we take 1,000 samples at random (so that nobody can accuse us of foul play!), compute the average of each sample, and plot those averages.
Now, the interesting part: even if the population is not normally distributed, the means of the samples will be approximately normally distributed, with a mean very close to the population mean (for more info, read about the Central Limit Theorem).

Normally Distributed Sample Means Graph

Here μ refers to the mean and σ to the standard deviation. μ−σ to μ+σ covers about 68% of the curve, and μ−2σ to μ+2σ covers about 95%. This holds for any normal distribution (check out the 68–95–99.7 rule).
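To make that concrete, here is a minimal simulation sketch (with made-up, deliberately skewed score data, not the real dataset) showing that the sample means cluster normally around the population mean:

import numpy as np

rng = np.random.default_rng(0)
# A made-up, skewed population of scores between 0 and 100 (not normal).
population = rng.beta(a=2, b=5, size=100_000) * 100

# Draw 1000 random samples of 50 scores and record each sample's mean.
sample_means = [rng.choice(population, size=50).mean() for _ in range(1000)]

print(f"population mean:      {population.mean():.2f}")
print(f"mean of sample means: {np.mean(sample_means):.2f}")  # very close
# Plotting a histogram of sample_means would show a bell-shaped curve.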
Now that we have plotted all the sample means, what next?

Now, We need evidence to Reject our null hypothesis. Enter P-Value.

The p-value is simply a 'random chance' probability. Assuming the null hypothesis is true (remember, innocent until proven guilty), it tells us the probability that the observed value comes out to be less than 70 just by random chance.

So, if this value is high, we say that x < 70 is just random chance, and we 'fail to reject the null hypothesis'. But if this value is low, we say it is highly unlikely that the observed value came out below 70 just by random chance, and we reject the null hypothesis.

But this p-value is quite elusive. To find the p-value, we must first find the z-value.

The z-value tells us how many standard deviations away from the mean the observed value is.

Z-value formula: z = (x − μ) / σ

where x = the observed value, μ is the mean, and σ is the standard deviation.

After calculating the z-value, we can look up the p-value associated with it in a Z-table (or let software do the lookup, as below).
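In practice we don't need a printed Z-table; scipy, for instance, can do the lookup. A minimal sketch with made-up numbers (the sample mean, hypothesized mean and standard error below are all illustrative, not from real data):

from scipy.stats import norm

x_bar = 66.0  # observed sample mean score (made up)
mu = 70.0     # mean under the null hypothesis H0
se = 2.5      # standard deviation of the sample means, i.e. standard error (made up)

z = (x_bar - mu) / se  # how many standard errors below mu we are
p_value = norm.cdf(z)  # left-tailed test: P(Z <= z) under H0

alpha = 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")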

Let's say the p-value comes out to be 0.15, or even 0.05. What do we consider as the threshold? Even before the experiment starts, we need to set a significance level α. Generally α = 0.05 is used in most business scenarios; use it if nothing else is specified in the problem statement.

If the p-value comes out to be, say, 0.20 (> α), it means the probability of observing an average below 70% just by random chance is 20%, so the result is not significant.
And if the p-value comes out to be 0.03 (< α), it means the probability of observing an average below 70% just by random chance is only 3%, so there is likely some truth to our claim.

In this case, we can confidently say that we have rejected the null hypothesis (and thereby supported the alternate hypothesis that students score an average of less than 70%).

Now that you have a fair idea of what a p-value is, what it signifies and how to use it in hypothesis testing, we can go ahead and see how it helps with feature elimination while fitting a linear regression model.

Feature Elimination using p-value

Let's take a medical insurance dataset and try to predict the medical expenses of an individual, based on factors like age, sex, bmi, etc., so that the insurance company can set premiums accordingly.

How does Hypothesis testing and p-value fit into this?

We want to find out if the columns/features do indeed affect the medical expenses.

H₀: Column/Feature does not affect medical expenses.
Hₐ: Column/Feature affects medical expenses.

So, if a column shows a p-value <= 0.05, we reject the null hypothesis and say that the column/feature affects medical expenses.

We don't have to calculate p-values for each column by hand. We can simply use OLS from statsmodels.api, which fits a linear regression model and reports the p-values for us.

Let’s jump right into the code.

Importing Libraries.

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# Let's load our csv data into DataFrame
df = pd.read_csv("insurance.csv")
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
age 1338 non-null int64
sex 1338 non-null object
bmi 1338 non-null float64
children 1338 non-null int64
smoker 1338 non-null object
region 1338 non-null object
expenses 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

Let’s take a peek into the data

# Take a peek into data 
df.head()

Output:

  age sex bmi children smoker region expenses 
0 19 female 27.9 0 yes southwest 16884.92
1 18 male 33.8 1 no southeast 1725.55
2 28 male 33.0 3 no southeast 4449.46
3 33 male 22.7 0 no northwest 21984.47
4 32 male 28.9 0 no northwest 3866.86

Since data cleaning and preparation are not in the scope of this article, I have skipped them here; a minimal sketch of the idea follows. If you want to see the full one-hot-encoding and outlier-removal steps on the dependent column (expenses), check out the notebook in my GitHub link posted at the bottom.
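For orientation, this is roughly the kind of preprocessing involved (a sketch only, assuming the mappings female→0/male→1 and no→0/yes→1; the linked notebook has the exact steps):

# Binary-encode sex and smoker, one-hot encode region (dropping the
# first category, region_northeast, to avoid a redundant column).
df['sex'] = df['sex'].map({'female': 0, 'male': 1})
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})
df = pd.get_dummies(df, columns=['region'], drop_first=True)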

After Data Cleaning and pre-processing,

   age  sex   bmi  children  smoker  expenses  region_northwest  region_southeast  region_southwest
0   19    0  27.9         0       1  16884.92                 0                 0                 1
1   18    1  33.8         1       0   1725.55                 0                 1                 0
2   28    1  33.0         3       0   4449.46                 0                 1                 0
3   33    1  22.7         0       0  21984.47                 1                 0                 0

Now we will try to fit a model to this data and try to predict the expenses (dependent variable).

x = df[df.columns[df.columns != 'expenses']]
y = df.expenses
# statsmodels OLS requires us to add a constant (intercept) column.
x = sm.add_constant(x)
model = sm.OLS(y, x)
results = model.fit()
print(results.summary())

Output:

As we can see ,
R-squared: 0.753
Adj. R-squared: 0.752

The p-values can be found under the P>|t| column of the summary.

We also see p-values > 0.05 for the columns sex and region_northwest. We will remove these columns one at a time and check how the model's metrics change.

x.drop('sex',axis=1, inplace=True) 
model = sm.OLS(y,x)
results = model.fit()
print(results.summary())

Output:

R-squared: 0.753
Adj. R-squared: 0.752

R-squared remains the same but Adj. R-squared increased (the change is too small to show at this rounding). That is because Adj. R-squared takes the number of predictors into consideration, whereas R-squared does not. So it's always good to watch Adj. R-squared while removing/adding columns. In this case, removing sex has improved the model, since Adj. R-squared increased and moved closer to R-squared.

x.drop('region_northwest',axis=1, inplace=True) 
model = sm.OLS(y,x)
results = model.fit()
print(results.summary())

Output:

R-squared: 0.753
Adj. R-squared: 0.752

We can see that region_southwest and region_southeast now have p-values of 0.056 and 0.053. We can choose to keep these columns, since their p-values are only marginally above α (0.05).
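As an aside, this drop-one-at-a-time process can be automated. Here is a hedged sketch (not the notebook's code) of backward elimination: repeatedly refit and drop the worst feature until every remaining p-value is at or below a chosen threshold:

def backward_eliminate(x, y, alpha=0.05):
    """Drop the highest-p-value feature until all p-values <= alpha."""
    x = x.copy()
    while True:
        results = sm.OLS(y, x).fit()
        pvals = results.pvalues.drop('const', errors='ignore')  # keep the intercept
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            return results
        x = x.drop(worst, axis=1)

Note that with α = 0.05 this sketch would also drop the two region columns we chose to keep above; pass a slightly looser alpha if you want to retain them.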

So finally,
predicted_expense = (age × 255.3) + (bmi × 318.62) + (children × 509.21) + (smoker × 23240) − (region_southeast × 777.08) − (region_southwest × 765.40)
As we can see, the factor with the largest effect is whether the person is a smoker: a smoker tends to pay about 23,240 more in medical expenses than a non-smoker, all else being equal.
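Finally, the sklearn metrics import from earlier can be put to use to sanity-check the fit. A small sketch (the full notebook may well evaluate on a proper train/test split instead):

pred = results.predict(x)                            # in-sample predictions
rmse = np.sqrt(metrics.mean_squared_error(y, pred))  # root mean squared error
print(f"RMSE: {rmse:,.2f}")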

If you’ve learnt nothing from this post, at least you would have learnt that smoking not only burns your lungs, but your wallet too!
