P-values in statistics simplified, and their application in data science.
Prerequisite: Hypothesis and Hypothesis Test
A hypothesis is an assumption about an expected association between variables. For example:
- Increasing apple fruit consumption will result in a decreased frequency of visits to a doctor.
- Businesses that give customers loyalty points have more customer loyalty than businesses that don’t.
In research and statistics, the objective is to determine whether there is enough statistical evidence in favor of a particular belief or hypothesis about a population parameter, as in the examples above. This process is known as hypothesis testing.
Examples of Hypothesis Tests in the Real World
- Clinical trials to determine whether some new treatment or drug causes improved outcomes in patients.
- Marketing experiments to determine which advertising medium leads to increased sales or customer acquisition.
- Farming trials to determine whether a fertilizer results in increased plant growth.
Null and Alternate Hypothesis
In every experiment, researchers are testing for an effect, such as a difference between groups.
In reality, however, there is always a possibility that there isn’t any effect between the groups under observation. The assumption that there is no difference is called the null hypothesis.
Null Hypothesis (H0): This is the assumption that there is no statistical relationship between two sets of observed data.
- The new fertilizer has no impact on the growth of plants.
- There is no relationship between loyalty points and customer loyalty in a business.
Alternate Hypothesis (Ha): This is the logical opposite of the Null Hypothesis. It is a claim about a population that contradicts the Null Hypothesis, i.e., there is a statistical relationship between two sets of observed data.
- The new fertilizer results in faster-growing plants.
- Giving customers loyalty points results in longer memberships.
When we say someone is playing the ‘devil’s advocate’ what do we mean? It means that they are ‘pretending’ in a discussion to be against the idea the majority of the people support in order to provoke debate or test the strength of the opposing arguments.
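P-values in statistics play the devil’s advocate role: we start by assuming that the null hypothesis is true, and then ask how surprising our data would be under that assumption. In research and science, p-values often influence which studies get published and which projects get funding.
The p-value is the probability of observing results at least as extreme as the ones we actually got, assuming the null hypothesis is true. It is not the probability that the null hypothesis itself is true.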
Let’s say we want to test the effectiveness of fertilizer on the growth of plants. In this case, fertilizer A is applied to the first group of crops and fertilizer B is applied to the second group of crops.
Let’s begin by formulating our Null Hypothesis and Alternate Hypothesis.
Null Hypothesis: There is no difference in the effect of fertilizers A and B on the growth of our crops, i.e., the effect of fertilizers A and B is the same.
Alternate Hypothesis: There is a difference in the effect of fertilizers A and B on the growth of our crops, i.e., the effect of fertilizers A and B is different.
The next step will be to conduct a Hypothesis Test.
For instance, let’s assume a t-test (a type of hypothesis test) has been conducted to get the p-value.
The p-value can take any value between 0 and 1. P-values are expressed as decimals, although they can be easier to read as percentages. For example, a p-value of 0.03 is 3%.
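To make this concrete, here is a minimal Python sketch of how a p-value could be computed for the fertilizer example. It uses a permutation test rather than the t-test mentioned above, because a permutation test needs only the standard library: we shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. The growth measurements, the function name `permutation_p_value`, and the number of permutations are all made up for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative)."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling simulates a world where the fertilizer label is irrelevant.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance so floating-point ties still count as "as extreme".
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / n_permutations

# Hypothetical plant growth in cm (made-up numbers, not real data).
growth_a = [20.1, 22.3, 19.8, 21.5, 23.0, 20.7, 22.1, 21.9]
growth_b = [22.8, 24.1, 23.5, 21.9, 25.2, 24.4, 23.0, 24.8]

p_value = permutation_p_value(growth_a, growth_b)
print(f"p-value = {p_value:.4f}")
```

The result is always a number between 0 and 1. The more the two groups differ relative to the spread within each group, the smaller it gets.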
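Let’s assume we get a p-value of 0.2.
- This means that if the null hypothesis were true and we conducted the experiment 100 times, we would expect about 20 of those experiments to show a difference at least as large as the one we observed.
If we get a p-value of 0.1:
- This means that if the null hypothesis were true and we conducted the experiment 100 times, we would expect about 10 of those experiments to show a difference at least as large as the one we observed.
If we get a p-value of 0.05:
- This means that if the null hypothesis were true and we conducted the experiment 100 times, we would expect about 5 of those experiments to show a difference at least as large as the one we observed.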
The above were examples to show us how to interpret the p-value.
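A simulation can show what a p-value threshold means in practice. The sketch below repeats a made-up experiment 2,000 times with both groups drawn from the same normal distribution, so the null hypothesis is true by construction. For simplicity it uses a two-sided z-test with a known standard deviation instead of a t-test. The means, sample sizes, seed, and the helper name `z_test_p_value` are all assumptions chosen for illustration. Even though there is no real effect, roughly 5% of the experiments should still produce a p-value of 0.05 or less, roughly 10% a p-value of 0.10 or less, and so on.

```python
import random
from statistics import NormalDist, mean

def z_test_p_value(sample_a, sample_b, sigma):
    """Two-sided z-test for equal means, assuming a known common std dev."""
    standard_error = sigma * (1 / len(sample_a) + 1 / len(sample_b)) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(42)
n_experiments = 2000
p_values = []
for _ in range(n_experiments):
    # Both groups come from the SAME distribution: the null hypothesis is true.
    group_a = [rng.gauss(20, 2) for _ in range(30)]
    group_b = [rng.gauss(20, 2) for _ in range(30)]
    p_values.append(z_test_p_value(group_a, group_b, sigma=2))

# Share of experiments whose p-value falls at or below each threshold.
shares = {}
for threshold in (0.05, 0.10, 0.20):
    shares[threshold] = sum(p <= threshold for p in p_values) / n_experiments
    print(f"share with p <= {threshold:.2f}: {shares[threshold]:.3f}")
```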
Let’s say we conduct the experiment on the fertilizer effect and find that the p-value is 0.03. This means that if there were truly no difference between the fertilizers and we conducted the experiment 100 times, only about 3 of those experiments would show a difference at least as large as the one we observed. Remember that our null hypothesis states that there is no difference in the effect of fertilizers A and B on our crops.
Is our p-value significant?
The p-value does not give us enough information by itself. The significance level, also known as alpha or α, is a measure of the strength of the evidence that must be present in our sample before we reject the null hypothesis and conclude that the effect is statistically significant. The choice of significance level is entirely up to you as the researcher, and it is usually decided before an experiment. Like the p-value, alpha is expressed as a decimal and can be easier to read as a percentage.
p-value < alpha
Let’s assume that before the experiment we decide our level of significance is 0.05. In this case the p-value < alpha, i.e., 0.03 < 0.05.
We would therefore reject the null hypothesis and conclude that the effects of fertilizers A and B are significantly different. In other words, the evidence in our sample is strong enough to reject the null hypothesis at the population level.
p-value > alpha
Let’s assume that before the experiment we decide our level of significance is 0.01. In this case the p-value > alpha, i.e., 0.03 > 0.01. We will therefore fail to reject the null hypothesis: we do not have enough evidence to conclude that there is a significant difference between fertilizers A and B. In other words, the evidence in our sample is not strong enough to reject the null hypothesis at the population level. Note that this is not the same as proving that the two fertilizers have the same effect.
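The comparison between the p-value and alpha can be written as a small function. This is only a sketch: the function name `decide` is made up for illustration, and it follows this article’s rule of rejecting the null hypothesis when the p-value is strictly less than alpha.

```python
def decide(p_value, alpha):
    """Return the hypothesis-test decision for a given p-value and alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# The two scenarios from the fertilizer example (p-value = 0.03).
print(decide(0.03, alpha=0.05))
print(decide(0.03, alpha=0.01))
```

The same p-value leads to different decisions depending on alpha, which is why alpha should be fixed before looking at the data.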
In summary, hypothesis testing follows these steps:
- State the Hypothesis.
- Formulate the Null and Alternate Hypothesis.
- Determine the Level of Significance (alpha).
- Determine the Test Statistic to use.
- Compute the Test Statistic and get the p-value.
- Compare the p-value with the Level of Significance.
- Reject or Fail to Reject the Null Hypothesis.
P-Value Interpretation in Data Science
Let’s now see how to use this p-value knowledge in a data science context during model development and evaluation.
Let’s begin by rephrasing the Null and Alternate Hypothesis in data science terms.
Null Hypothesis: The independent variable has no significant effect on the target variable.
Alternate Hypothesis: The independent variable has a significant effect on the target variable.
In data science, we perform feature selection because not all the independent variables in our dataset have a significant impact on our dependent variable (target variable). Furthermore, the more independent variables we use, the more complex the model becomes, and including variables with no real effect can hurt the model’s performance on new data. We therefore want to keep only the significant independent variables.
We can use the p-values of the independent variables in a dataset to determine whether they have a significant impact on the dependent variable (target variable).
The image below is the model summary for a dataset used to determine which of the following has the most impact on the yearly amount a customer spends in the store: the average customer session in a physical store (x1), the total time a customer spends on an app (x2), the customer’s length of membership (x3), and the total time a customer spends shopping via a website (x4).
Let’s choose our alpha value to be 0.05.
Let’s analyze variables x1, x2, x3, and x4 in the statistics summary image above. The P>|t| column, highlighted in yellow, gives the p-values for variables x1, x2, x3, and x4.
We would retain variables x1, x2, and x3 because the p-values (0.000) for these variables are less than the alpha value. When the p-value < alpha, we reject the null hypothesis, which states that the independent variable has no significant effect on the target variable.
For variable x4, the p-value is 0.721. This means that the p-value > alpha (0.721 > 0.05). We fail to reject the null hypothesis: we have no evidence that this independent variable has a significant effect on the target variable, and we therefore remove it.
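The same filtering can be written in a few lines of Python. The p-values below are the ones quoted above from the model summary. The dictionary and variable names are assumptions made for illustration, and in practice you would read these values from your regression library’s output rather than typing them in.

```python
ALPHA = 0.05

# p-values as quoted from the model summary (0.000 means "below 0.0005").
feature_p_values = {"x1": 0.000, "x2": 0.000, "x3": 0.000, "x4": 0.721}

# Keep features whose p-value is below alpha; drop the rest.
keep = [name for name, p in feature_p_values.items() if p < ALPHA]
drop = [name for name, p in feature_p_values.items() if p >= ALPHA]

print("keep:", keep)
print("drop:", drop)
```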
In later articles, I will expound on the following concepts.
- One-tailed Hypothesis test.
- Two-tailed Hypothesis test.
- Types of Hypothesis tests, e.g., Z-test, T-test, Chi-Square.
- Type 1 and Type 2 Errors in hypothesis tests.
- Demystifying the misconceptions about the p-value.