Data Analytics Using Python (Part_4)

Published in

Budding Data Scientist

10 min read · Apr 20, 2020

This is the fourth post in a 12-part series on Data Analytics using Python. In this post, we look at what hypothesis testing is, how it can be performed, and the errors associated with it.

Index

Hypothesis Testing
Errors in Hypothesis Testing
Three Approaches for Hypothesis Testing
Tests About a Population Mean: When σ is Unknown
Hypothesis Testing — Proportion

Hypothesis Testing

Hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected. The null hypothesis, denoted by Ho, is a tentative assumption about a population parameter. The alternative hypothesis, denoted by Ha, is the opposite of what is stated in the null hypothesis. The hypothesis testing procedure uses data from a sample to test the two competing statements indicated by Ho and Ha.

Null and Alternative Hypotheses about a Population Mean μ:

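The equality part of the hypotheses always appears in the null hypothesis. In general, a hypothesis test about the value of a population mean μ must take one of the following three forms (where μo is the hypothesized value of the population mean):

Lower-tailed test: Ho: μ ≥ μo v/s Ha: μ < μo
Upper-tailed test: Ho: μ ≤ μo v/s Ha: μ > μo
Two-tailed test: Ho: μ = μo v/s Ha: μ ≠ μo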

Let us look at an example. A major hospital in Chennai provides one of the most comprehensive emergency medical services in the world. Operating in a multiple hospital system with approximately 10 mobile medical units, the service goal is to respond to medical emergencies with a mean time of 8 minutes or less. The director of medical services wants to formulate a hypothesis test that could use a sample of emergency response times to determine whether or not the service goal of 8 minutes or less is being achieved. For μ = mean response time for the population of medical emergency requests, the null and alternative hypotheses would be:

Ho: μ ≤ 8 (the service goal is being met) v/s Ha: μ > 8 (the service goal is not being met)

Errors in Hypothesis Testing

Type I Error (α)

Because hypothesis tests are based on sample data, we must allow for the possibility of errors. A Type I error is rejecting Ho when it is true. The probability of making a Type I error when the null hypothesis is true as an equality is called the level of significance, α. Applications of hypothesis testing that only control the Type I error are often called significance tests.

Type II Error (β)

A Type II error is accepting Ho when it is false. It is difficult to control the probability of making a Type II error. Statisticians avoid the risk of making a Type II error by concluding “do not reject Ho” rather than “accept Ho”.

For a fixed sample size n, enlarging the acceptance region (that is, decreasing α) increases β. Type I and Type II errors are inversely related: with everything else unchanged, reducing the probability of one error increases the probability of the other. Increasing n can decrease both types of error.

Factors Affecting Type II Error

True value of the population parameter: β increases as the difference between the hypothesized value and the true value decreases.
Significance level α: β increases as α decreases.
Population standard deviation σ: β increases as σ increases.
Sample size n: β increases as n decreases.
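To see these four effects in numbers, here is a minimal sketch for an upper-tailed z test (Ho: μ ≤ μo) with σ known. The function name type_ii_error and every number in it (μo = 8, a true mean of 9, σ = 3, n = 40, α = 0.05) are illustrative assumptions, not values from a real data set.

```python
from math import sqrt
from scipy import stats

def type_ii_error(mu0, mu_true, sigma, n, alpha):
    """beta = P(do not reject Ho) for an upper-tailed z test when the true mean is mu_true."""
    se = sigma / sqrt(n)
    # Largest sample mean that still falls in the non-rejection region
    cutoff = mu0 + stats.norm.ppf(1 - alpha) * se
    return stats.norm.cdf((cutoff - mu_true) / se)

# Hypothetical baseline scenario (assumed values)
base = dict(mu0=8, mu_true=9, sigma=3, n=40, alpha=0.05)

beta_base = type_ii_error(**base)
beta_closer_mean = type_ii_error(**{**base, "mu_true": 8.5})  # true mean closer to mu0
beta_smaller_alpha = type_ii_error(**{**base, "alpha": 0.01})  # stricter significance level
beta_larger_sigma = type_ii_error(**{**base, "sigma": 5})      # noisier population
beta_smaller_n = type_ii_error(**{**base, "n": 20})            # fewer observations

for label, beta in [
    ("baseline", beta_base),
    ("true mean closer to mu0", beta_closer_mean),
    ("smaller alpha", beta_smaller_alpha),
    ("larger sigma", beta_larger_sigma),
    ("smaller n", beta_smaller_n),
]:
    print(f"{label:25s} beta = {beta:.3f}")
```

Each of the four changes should print a β larger than the baseline, which matches the list above.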

Power of the Test

The probability of correctly rejecting Ho when it is false is called the power of the test. For any particular value of μ, the power is 1 − β. We can show graphically the power associated with each value of μ. Such a graph is called a power curve.
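The points of a power curve can be computed directly. The sketch below does this for an upper-tailed z test with σ known. The values μo = 8, σ = 3, n = 40, α = 0.05 and the grid of candidate true means are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Assumed test setup (illustrative values only)
mu0, sigma, n, alpha = 8, 3, 40, 0.05

se = sigma / np.sqrt(n)
# Sample means above this cutoff lead to rejecting Ho
cutoff = mu0 + stats.norm.ppf(1 - alpha) * se

# Candidate true values of the population mean
mu_values = np.array([8.0, 8.5, 9.0, 9.5, 10.0])

beta = stats.norm.cdf((cutoff - mu_values) / se)  # P(do not reject Ho | mu)
power = 1 - beta                                  # P(reject Ho | mu)

for mu, pw in zip(mu_values, power):
    print(f"mu = {mu:4.1f}   power = {pw:.3f}")
```

Plotting power against mu_values gives the power curve. At μ = μo the power equals α, and it rises towards 1 as the true mean moves away from μo.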

Three Approaches for Hypothesis Testing

1. p-Value Approach

The p-value is the probability, computed using the test statistic, that measures the support (or lack of support) provided by the sample for the null hypothesis. If the p-value is less than or equal to the level of significance α, the value of the test statistic is in the rejection region. Reject Ho if the p-value ≤ α.

Steps of Hypothesis Testing — P value approach:

Step 1. Develop the null and alternative hypotheses.
Step 2. Specify the level of significance α.
Step 3. Collect the sample data and compute the test statistic.
Step 4. Use the value of the test statistic to compute the p-value.
Step 5. Reject Ho if p-value ≤ α.

For a lower-tailed test, we first compute the z-value of the test statistic and then use the Python code stats.norm.cdf(z_value) to find the corresponding p-value, which is the area to the left of z. If needed, the critical z-value for the significance level can be found with stats.norm.ppf(significance_level). If the p-value is less than or equal to α, we reject the null hypothesis.

Lower-Tailed Test About a Population Mean: When σ is Known
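As a minimal sketch of the lower-tailed case, assume Ho: μ ≥ 3 v/s Ha: μ < 3 with a known σ. All the numbers below (sample mean 2.92, σ = 0.18, n = 36, α = 0.01) are hypothetical and only serve to show the calls.

```python
from math import sqrt
from scipy import stats

# Hypothetical inputs (assumed for illustration)
mu0, x_bar, sigma, n = 3.0, 2.92, 0.18, 36
alpha = 0.01

# z test statistic for a mean when sigma is known
z_value = (x_bar - mu0) / (sigma / sqrt(n))

# Lower-tailed p-value: area to the left of z
p_value = stats.norm.cdf(z_value)

reject_h0 = p_value <= alpha
print(f"z = {z_value:.3f}, p-value = {p_value:.4f}, reject Ho: {reject_h0}")
```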

For an upper-tailed test, we first compute the z-value of the test statistic and then use the Python code 1 - stats.norm.cdf(z_value) to find the corresponding p-value, which is the area to the right of z. If the p-value is less than or equal to α, we reject the null hypothesis.

Upper-Tailed Test About a Population Mean: When σ is Known
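For the upper-tailed case, the sketch below reuses the emergency response setting (Ho: μ ≤ 8 v/s Ha: μ > 8). The sample figures (mean 8.75 minutes, σ = 3.2, n = 40) and α = 0.05 are made up for illustration. It also uses stats.norm.sf, SciPy's survival function, which returns the same area as 1 - cdf but is more accurate far out in the tail.

```python
from math import sqrt
from scipy import stats

# Hypothetical sample summary (assumed values)
mu0, x_bar, sigma, n = 8.0, 8.75, 3.2, 40
alpha = 0.05

z_value = (x_bar - mu0) / (sigma / sqrt(n))

# Upper-tailed p-value: area to the right of z
p_value = stats.norm.sf(z_value)          # same as 1 - stats.norm.cdf(z_value)

reject_h0 = p_value <= alpha
print(f"z = {z_value:.3f}, p-value = {p_value:.4f}, reject Ho: {reject_h0}")
```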

p-Value Approach to Two-Tailed Hypothesis Testing:

Compute the p-value using the following three steps:

1. Compute the value of the test statistic z.
2. If z is in the upper tail (z > 0), find the area under the standard normal curve to the right of z. If z is in the lower tail (z < 0), find the area under the standard normal curve to the left of z.
3. Double the tail area obtained in step 2 to obtain the p-value. We can use the Python code (1 - stats.norm.cdf(abs(test_statistic_value))) * 2 to get this value. Taking the absolute value makes the same expression work whether z is positive or negative.

The rejection rule:

Reject Ho if the p-value ≤ α.

Two-Tailed Tests About a Population Mean: When σ is Known
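The three steps for the two-tailed p-value fit in a small helper. The function name two_tailed_p_value is my own, and the z-values passed to it are arbitrary examples.

```python
from scipy import stats

def two_tailed_p_value(z):
    """Double the tail area beyond |z| under the standard normal curve."""
    return 2 * stats.norm.sf(abs(z))

# The sign of z does not matter for a two-tailed test
p_positive = two_tailed_p_value(1.96)
p_negative = two_tailed_p_value(-1.96)
print(f"{p_positive:.4f} {p_negative:.4f}")
```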

2. Critical Value Approach to Hypothesis Testing

When the test statistic z has a standard normal probability distribution, we can use the standard normal probability distribution table to find the z-value with an area of α in the lower (or upper) tail of the distribution. The value of the test statistic that establishes the boundary of the rejection region is called the critical value for the test. The rejection rule is:

Lower tail: Reject Ho if z < -z(α).

Upper tail: Reject Ho if z > z(α).

Steps of Hypothesis Testing (One-Tailed) — Critical Value Approach:

Step 1. Develop the null and alternative hypotheses.
Step 2. Specify the level of significance α.
Step 3. Collect the sample data and compute the test statistic.
Step 4. Use the level of significance α to determine the critical value and the rejection rule.
Step 5. Use the value of the test statistic and the rejection rule to determine whether to reject Ho.

For a lower-tailed test, after finding the value of the test statistic z, we can find the critical value with the code stats.norm.ppf(significance_level), which returns −z(α) directly. We then compare z with −z(α) and reject or do not reject the null hypothesis according to the rejection rule.

For an upper-tailed test, after finding the value of the test statistic z, we can find z(α) with the code stats.norm.ppf(1 - significance_level). We then compare z with z(α) and reject or do not reject the null hypothesis according to the rejection rule.

Critical Value Approach to Two-Tailed Hypothesis Testing:

The critical values will occur in both the lower and upper tails of the standard normal curve.
Use the standard normal probability distribution table to find z(α/2) (the z-value with an area of α/2 in the upper tail of the distribution), or use the Python code stats.norm.ppf(1 - significance_level / 2).
The rejection rule is: Reject Ho if z < -z(α/2) or z > z(α/2).
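The critical value rules for all three kinds of test can be collected in one function. This is only a sketch: the name reject_by_critical_value and the "lower" / "upper" / "two" labels are my own conventions, and the z-values in the usage lines are arbitrary.

```python
from scipy import stats

def reject_by_critical_value(z, alpha, tail):
    """Return True if Ho is rejected under the critical value approach."""
    if tail == "lower":
        # ppf(alpha) is already negative, i.e. it equals -z(alpha)
        return bool(z < stats.norm.ppf(alpha))
    if tail == "upper":
        return bool(z > stats.norm.ppf(1 - alpha))
    if tail == "two":
        # Reject if z < -z(alpha/2) or z > z(alpha/2)
        return bool(abs(z) > stats.norm.ppf(1 - alpha / 2))
    raise ValueError("tail must be 'lower', 'upper' or 'two'")

print(reject_by_critical_value(-2.67, 0.01, "lower"))
print(reject_by_critical_value(1.48, 0.05, "upper"))
print(reject_by_critical_value(2.74, 0.03, "two"))
```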

3. Confidence Interval Approach

Confidence Interval Approach to Two-Tailed Tests About a Population Mean:

Select a simple random sample from the population and use the value of the sample mean to develop the confidence interval for the population mean μ. If the confidence interval contains the hypothesized value μo, do not reject Ho. Otherwise, reject Ho. (Strictly speaking, Ho should also be rejected if μo happens to be equal to one of the end points of the confidence interval.)

Consider an example: Assume that a sample of 30 milk cartons provides a sample mean of 505 ml. The population standard deviation is believed to be 10 ml. Perform a hypothesis test, at the 0.03 level of significance, of whether the population mean is 500 ml, to help determine whether the filling process should continue operating or be stopped and corrected.

This is a two-tailed test with α/2 = 0.015. The code stats.norm.ppf(0.015) gives −2.17, so z(α/2) = 2.17. The 97% confidence interval for μ is therefore:

x̄ ± z(α/2) × σ/√n = 505 ± 2.17 × 10/√30 = 505 ± 3.96, that is, 501.04 to 508.96

Because the hypothesized value for the population mean, μo = 500 ml, is not in this interval, the hypothesis-testing conclusion is that the null hypothesis, Ho: μ = 500, is rejected.
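The milk carton example can be checked in a few lines. All inputs come from the example above; only the variable names are mine.

```python
from math import sqrt
from scipy import stats

# Values from the milk carton example
x_bar, sigma, n = 505, 10, 30
mu0, alpha = 500, 0.03

# z(alpha/2): z-value with an area of alpha/2 in the upper tail
z_half = stats.norm.ppf(1 - alpha / 2)

margin = z_half * sigma / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

# Reject Ho when mu0 is not strictly inside the interval
reject_h0 = not (lower < mu0 < upper)
print(f"{(1 - alpha):.0%} CI: ({lower:.2f}, {upper:.2f}), reject Ho: {reject_h0}")
```

The interval comes out at roughly 501.04 to 508.96, so 500 lies outside it and Ho is rejected.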

Tests About a Population Mean: When σ is Unknown

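When the standard deviation of the population is unknown, we estimate it with the sample standard deviation s and use the following t test statistic:

t = (x̄ − μo) / (s / √n)

This test statistic has a t distribution with n − 1 degrees of freedom.

Rejection Rule: p-Value Approach: Reject Ho if p-value ≤ α.

Rejection Rule: Critical Value Approach:

Lower tail: Reject Ho if t ≤ −t(α).
Upper tail: Reject Ho if t ≥ t(α).
Two-tailed: Reject Ho if t ≤ −t(α/2) or t ≥ t(α/2).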

For a given sample, say, x=[10,12,20,21,22,24,18,15], we can find the p-value using the following code:

import numpy as np
from scipy import stats

x = [10, 12, 20, 21, 22, 24, 18, 15]
# One-sample t test of Ho: mean = 15 (two-sided by default)
stats.ttest_1samp(x, 15)
# Ttest_1sampResult(statistic=1.5623450931857947, pvalue=0.1621787560592894)

Here, x is the sample, stats.ttest_1samp() performs the one-sample t test, and 15 is the hypothesized mean. The output gives a test statistic of 1.5623 and a p-value of 0.1622. This p-value is two-sided. For a one-tailed test, divide it by 2, provided the test statistic lies in the direction of the alternative hypothesis. If the p-value is less than or equal to α, the null hypothesis is rejected. The area under the t curve to the left of the test statistic can also be found directly with stats.t.cdf(test_statistic_value, degrees_of_freedom).
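To connect the output of ttest_1samp with the formula, here is a sketch that recomputes the same test by hand for the sample x above. The level α = 0.05 used for the critical value is an assumption, since the example does not state one.

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats

x = [10, 12, 20, 21, 22, 24, 18, 15]
mu0 = 15
alpha = 0.05            # assumed significance level

n = len(x)
df = n - 1
# stdev() is the sample standard deviation (divides by n - 1)
t_stat = (mean(x) - mu0) / (stdev(x) / sqrt(n))

p_two_sided = 2 * stats.t.sf(abs(t_stat), df)   # two-tailed p-value
p_upper = stats.t.sf(t_stat, df)                # upper-tailed p-value
t_critical = stats.t.ppf(1 - alpha / 2, df)     # t(alpha/2) for the two-tailed rule

print(f"t = {t_stat:.4f}, two-sided p = {p_two_sided:.4f}, t(alpha/2) = {t_critical:.4f}")
```

Because the test statistic is positive, the upper-tailed p-value is exactly half of the two-sided one. Recent SciPy versions also accept an alternative argument in ttest_1samp ("two-sided", "less" or "greater") that returns the one-tailed p-value directly.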

Hypothesis Testing — Proportion

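The equality part of the hypotheses always appears in the null hypothesis. In general, a hypothesis test about the value of a population proportion p must take one of the following three forms (where p0 is the hypothesized value of the population proportion):

Lower-tailed test: Ho: p ≥ p0 v/s Ha: p < p0
Upper-tailed test: Ho: p ≤ p0 v/s Ha: p > p0
Two-tailed test: Ho: p = p0 v/s Ha: p ≠ p0

The test statistic is given by:

z = (p̄ − p0) / √(p0(1 − p0)/n), where p̄ is the sample proportion

Rejection Rule: p-Value Approach: Reject Ho if p-value ≤ α.

Rejection Rule: Critical Value Approach:

Lower tail: Reject Ho if z ≤ −z(α).
Upper tail: Reject Ho if z ≥ z(α).
Two-tailed: Reject Ho if z ≤ −z(α/2) or z ≥ z(α/2).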

Let us understand this using an example. For a New Year’s week, the City Traffic Police claimed that 50% of the accidents would be caused by drunk driving. A sample of 120 accidents showed that 67 were caused by drunk driving. Use these data to test the Traffic Police’s claim with α = 0.05.

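1. Determine the hypotheses: Ho: p = 0.5 v/s Ha: p ≠ 0.5.
2. Specify the level of significance: α = 0.05.
3. Compute the value of the test statistic: the sample proportion is p̄ = 67/120 = 0.5583 and the standard error is √(0.5 × 0.5/120) = 0.0456, so z = (0.5583 − 0.5)/0.0456 = 1.278.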

p-value approach:

4. Compute the p-value: For z = 1.28, the cumulative probability is 0.8997. Since this is a two-tailed test, the p-value = 2(1 − 0.8997) = 0.2006.

5. Determine whether to reject Ho: Since the p-value = 0.2006 > α = 0.05, we cannot reject Ho. Hence, the sample does not provide enough evidence against the Traffic Police's claim.

The python code to do the above calculation is:

from statsmodels.stats.proportion import proportions_ztest

count = 67        # accidents caused by drunk driving
P = 0.5           # hypothesized proportion
samplesize = 120  # accidents in the sample
proportions_ztest(count, samplesize, P)
# (1.286806739751111, 0.1981616572238455)

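Here we get a z-value of 1.286 and a corresponding p-value of 0.198. These differ slightly from the hand calculation (z = 1.278) because proportions_ztest, by default, estimates the standard error from the sample proportion rather than from the hypothesized proportion. Since 0.198 > 0.05, we do not reject the null hypothesis.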
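The two versions of the z statistic can be reproduced with SciPy alone. This is a sketch using the numbers from the traffic example. The variable names are mine, and z_null follows the textbook formula with p0 in the standard error, while z_sample uses the sample proportion instead.

```python
from math import sqrt
from scipy import stats

count, n, p0 = 67, 120, 0.5
p_hat = count / n

# Standard error based on the hypothesized proportion (hand calculation)
z_null = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
# Standard error based on the sample proportion (matches the output above)
z_sample = (p_hat - p0) / sqrt(p_hat * (1 - p_hat) / n)

p_null = 2 * stats.norm.sf(abs(z_null))
p_sample = 2 * stats.norm.sf(abs(z_sample))

print(f"p0-based:     z = {z_null:.3f}, p-value = {p_null:.4f}")
print(f"sample-based: z = {z_sample:.3f}, p-value = {p_sample:.4f}")
```

Both p-values are well above 0.05, so the conclusion is the same either way. If I read the statsmodels documentation correctly, passing prop_var=P to proportions_ztest makes it use the hypothesized proportion in the standard error, which should reproduce the hand calculation.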

Critical Value Approach:

4. Determine the critical value and rejection rule: For α/2 = 0.05/2 = 0.025, z(0.025) = 1.96. So, we reject Ho if z ≤ −1.96 or z ≥ 1.96.

5. Determine whether to reject Ho: Since −1.96 < 1.278 < 1.96, we cannot reject Ho.

In this way, we can decide whether or not to reject the null hypothesis in a single-sample test.