Data Analytics Using Python (Part_5)

Teena Mary
Budding Data Scientist
May 25, 2020

This is the fifth post in a 12-part series in which we will learn about Data Analytics using Python. In this post, we will look into hypothesis testing of two samples, an introduction to the ANOVA test, and post-hoc analysis.

Index

  1. Hypothesis Testing of Population Means with Independent Samples
  2. Hypothesis Testing of Population Means with Dependent Samples
  3. Hypothesis Testing of Population Proportions
  4. Hypothesis Testing of Population Variances
  5. Determining Sample Sizes
  6. Analysis of Variance (ANOVA)
  7. Multiple Comparisons following ANOVA

The two-sample tests can be classified by the parameter being compared: means, proportions or variances. For the difference between two means, there are two sub-cases: when the variances of the populations from which the samples are drawn are known, and when they are unknown.

CASE I: Population Variances are known

First, we will look into the case where the population variances are known. The assumptions are that the samples are randomly and independently drawn, both population distributions are normal and also that the population variances are known.

For the null hypothesis H0: μ1 − μ2 = D0, the test statistic is given by:

z = ((x̅1 − x̅2) − D0) / √(σ1²/n1 + σ2²/n2)

Hypothesis tests for two population means with independent samples, and their corresponding decision rules, are:

  • Lower-tail test: H0: μ1 − μ2 ≥ D0 vs. H1: μ1 − μ2 < D0; reject H0 if z < −zα
  • Upper-tail test: H0: μ1 − μ2 ≤ D0 vs. H1: μ1 − μ2 > D0; reject H0 if z > zα
  • Two-tail test: H0: μ1 − μ2 = D0 vs. H1: μ1 − μ2 ≠ D0; reject H0 if |z| > zα/2

The sampling distribution of the difference of two sample means x̅1 − x̅2 is normal, with mean μ1 − μ2 and standard error:

σ(x̅1 − x̅2) = √(σ1²/n1 + σ2²/n2)

The interval estimate of μ1 − μ2 when σ1 and σ2 are known is given by:

(x̅1 − x̅2) ± zα/2 √(σ1²/n1 + σ2²/n2)

Let us see how to solve an example of the above case using Python. A product developer is interested in reducing the drying time of a primer paint. Two formulations of the paint are tested; formulation 1 is the standard chemistry, and formulation 2 has a new drying ingredient that should reduce the drying time. From experience, it is known that the standard deviation of drying time is 8 minutes, and this inherent variability should be unaffected by the addition of the new ingredient. Ten specimens are painted with formulation 1, another 10 specimens are painted with formulation 2, and the 20 specimens are painted in random order. The two sample average drying times are x̅1 = 121 minutes and x̅2 = 112 minutes, respectively. What conclusions can the product developer draw about the effectiveness of the new ingredient, using α = 0.05?

import pandas as pd
import numpy as np
import math
from scipy import stats

def Z_and_p(x1,x2,sigma1,sigma2,n1,n2):
    # Two-sample z statistic for H0: mu1 - mu2 = 0 with known sigmas
    z=(x1-x2)/(math.sqrt(((sigma1**2)/n1)+((sigma2**2)/n2)))
    # One-tailed p-value from the standard normal distribution
    if (z<0):
        p=stats.norm.cdf(z)
    else:
        p=1-stats.norm.cdf(z)
    print(z,p)

Z_and_p(121,112,8,8,10,10)   #2.5155764746872635 0.00594189462107364

So, the calculated z value is 2.5155764746872635 and the p-value is 0.00594189462107364. Since the p-value 0.0059 is less than the α value of 0.05, we reject the null hypothesis: there is a difference in the drying time of the primer paint after adding the new ingredient. Hence, the ingredient is effective.
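As a quick cross-check, here is a minimal sketch (not part of the original example) that applies the interval-estimate formula above to the same paint data; the 95% interval for μ1 − μ2 excludes zero, which agrees with rejecting H0.

import math
from scipy import stats
x1, x2 = 121, 112        # sample mean drying times
sigma1 = sigma2 = 8      # known population standard deviation
n1 = n2 = 10
z = stats.norm.ppf(1 - 0.025)                 # 1.96 for 95% confidence
se = math.sqrt(sigma1**2/n1 + sigma2**2/n2)   # standard error of x1 - x2
print((x1 - x2) - z*se, (x1 - x2) + z*se)     # about (1.99, 16.01)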

CASE II: Population Variances are unknown, but assumed equal

Now we will look into the second case, where the population variances are unknown but assumed to be equal. The assumptions in this case are that the samples are randomly and independently drawn and that the populations are normally distributed.

Since the population variances are assumed equal, we can pool the two sample variances to estimate σ²:

sp² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)

and use the test statistic t = ((x̅1 − x̅2) − D0) / (sp √(1/n1 + 1/n2)), which follows a t distribution with (n1 + n2 − 2) degrees of freedom.

Hypothesis tests for two population means with independent samples and unknown variances (assumed equal) have the same form as in CASE I, with t in place of z and (n1 + n2 − 2) degrees of freedom:

  • Lower-tail test: H0: μ1 − μ2 ≥ D0 vs. H1: μ1 − μ2 < D0; reject H0 if t < −tα
  • Upper-tail test: H0: μ1 − μ2 ≤ D0 vs. H1: μ1 − μ2 > D0; reject H0 if t > tα
  • Two-tail test: H0: μ1 − μ2 = D0 vs. H1: μ1 − μ2 ≠ D0; reject H0 if |t| > tα/2

Let us see how to solve an example of the above case using Python. Two catalysts are being analyzed to determine how they affect the mean yield of a chemical process. Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable. Since catalyst 2 is cheaper, it should be adopted, provided it does not change the process yield. A test is run in the pilot plant and the results are tabulated. Is there any difference between the mean yields? Use α = 0.05, and assume equal variances.

a=[91.5,94.18,92.18,95.39,91.79,89.07,94.72,89.21]   # catalyst 1 yields
b=[89.19,90.95,90.46,93.21,97.19,97.04,91.07,92.75]  # catalyst 2 yields
stats.ttest_ind(a,b,equal_var=True)   # pooled-variance t-test
#Ttest_indResult(statistic=-0.3535908643461798, pvalue=0.7289136186068217)
stats.t.ppf(0.025,14)   # critical t value, df = n1 + n2 - 2 = 14
#-2.1447866879169277

Since t0 = -0.353 lies inside the acceptance region (-2.144, 2.144), and the p-value 0.729 exceeds 0.05, we fail to reject the null hypothesis. That is, at the 0.05 level of significance, we do not have strong evidence to conclude that catalyst 2 results in a mean yield that differs from the mean yield when catalyst 1 is used.
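To see the pooling at work, here is a short sketch that reproduces the scipy statistic by hand from the pooled-variance formula above, using the same catalyst samples:

import math
import numpy as np
a=[91.5,94.18,92.18,95.39,91.79,89.07,94.72,89.21]
b=[89.19,90.95,90.46,93.21,97.19,97.04,91.07,92.75]
n1, n2 = len(a), len(b)
s1_sq, s2_sq = np.var(a, ddof=1), np.var(b, ddof=1)   # sample variances
sp = math.sqrt(((n1-1)*s1_sq + (n2-1)*s2_sq) / (n1 + n2 - 2))   # pooled std dev
t = (np.mean(a) - np.mean(b)) / (sp * math.sqrt(1/n1 + 1/n2))
print(t)   # about -0.354, matching stats.ttest_ind above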

CASE III: Population Variances are unknown, but assumed unequal

Now we will look into the third case when the population variances are unknown but assumed to be unequal. The assumptions in this case are that the samples are randomly and independently drawn and that the populations are normally distributed.

To form the interval estimates, the population variances are assumed unequal, so a pooled variance is not appropriate. Hence, we use a t-value with v degrees of freedom, where v is given by the Welch-Satterthwaite formula (rounded down to the nearest integer):

v = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

The test statistic for μ1 − μ2 is given by:

t = ((x̅1 − x̅2) − D0) / √(s1²/n1 + s2²/n2)

Hypothesis tests for two population means with independent samples and unknown variances (assumed unequal) and their corresponding decision rules are the same as in CASE II, using the degrees of freedom v above.

Let’s see a use case using Python. Arsenic concentration in public drinking water supplies is a potential health risk. An article in the Arizona Republic (Sunday, May 27, 2001) reported drinking water arsenic concentrations in parts per billion (ppb) for 10 metropolitan Phoenix communities and 10 communities in rural Arizona. Data was collected and is given. Determine if there is any difference in mean arsenic concentrations between metropolitan Phoenix communities and communities in rural Arizona.

metro=[3,7,25,10,15,6,12,25,15,7]
rural=[48,44,40,38,33,21,20,12,1,18]
stats.ttest_ind(metro,rural,equal_var=False)   # Welch's t-test (unequal variances)
#Ttest_indResult(statistic=-2.7669395785560558, pvalue=0.015827284816100885)
stats.t.ppf(0.025,13)   # critical t value; the Welch-Satterthwaite df rounds down to 13
#-2.160368656461013

Since t0 value = -2.766 is less than the critical t-value, t(0.025,13)=-2.1603, we reject the null hypothesis. Hence, there is evidence to conclude that mean arsenic concentration in the drinking water in rural Arizona is different from the mean arsenic concentration in metropolitan Phoenix drinking water.
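The 13 degrees of freedom used above come from the Welch-Satterthwaite formula; a minimal sketch of that computation on the same data:

import numpy as np
metro=[3,7,25,10,15,6,12,25,15,7]
rural=[48,44,40,38,33,21,20,12,1,18]
v1 = np.var(metro, ddof=1) / len(metro)   # s1^2 / n1
v2 = np.var(rural, ddof=1) / len(rural)   # s2^2 / n2
v = (v1 + v2)**2 / (v1**2/(len(metro)-1) + v2**2/(len(rural)-1))
print(v)   # about 13.2, which rounds down to the df of 13 used above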

Dependent Samples

Another possibility is that the samples are dependent. Then the data consist of paired or matched sample points; for example, data collected from the same population before and after a certain period of time. In such cases, we work with the differences between the paired samples, i.e., d(i) = x(i) − y(i). Here also, the assumption is that both populations are normally distributed.

The test statistic for the mean difference is a t value with n − 1 degrees of freedom, given by:

t = (d̄ − μD) / (sd / √n)

where d̄ and sd are the mean and standard deviation of the n paired differences.

Hypothesis tests for two population means with dependent samples and their corresponding decision rules are:

  • Lower-tail test: H0: μD ≥ 0 vs. H1: μD < 0; reject H0 if t < −tα
  • Upper-tail test: H0: μD ≤ 0 vs. H1: μD > 0; reject H0 if t > tα
  • Two-tail test: H0: μD = 0 vs. H1: μD ≠ 0; reject H0 if |t| > tα/2

where μD is the population mean difference.

Example: An article in the Journal of Strain Analysis (1983, Vol. 18, No. 2) compares several methods for predicting the shear strength of steel plate girders. Data for two of these methods, the Karlsruhe and Lehigh procedures, when applied to nine specific girders, are shown in the table. Determine whether there is any difference (on average) between the two methods.

KARL=[1.186,1.151,1.322,1.339,1.200,1.402,1.365,1.537,1.559]   # Karlsruhe method
LEH=[1.061,0.992,1.063,1.062,1.065,1.178,1.037,1.086,1.052]    # Lehigh method
stats.ttest_rel(KARL,LEH)   # paired t-test
#Ttest_relResult(statistic=6.0819394375848255, pvalue=0.00029529546278604066)
stats.t.ppf(0.025,8)   # lower critical t value; the two-tailed bounds are -2.306 and +2.306
#-2.306004135033371

Here, since t0 = 6.081 > 2.306 = t(0.025,8), we reject the null hypothesis and conclude that the two strength prediction methods yield different results.
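The same statistic can be reproduced directly from the paired differences d(i), which makes the formula above concrete:

import numpy as np
karl = np.array([1.186,1.151,1.322,1.339,1.200,1.402,1.365,1.537,1.559])
leh = np.array([1.061,0.992,1.063,1.062,1.065,1.178,1.037,1.086,1.052])
d = karl - leh                                      # paired differences d(i)
t0 = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # t with n - 1 = 8 df
print(t0)   # about 6.08, matching stats.ttest_rel above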

Population Proportion

A population proportion, generally denoted by p, is a parameter that describes a percentage value associated with a population. For a proportion p1 from the first population and p2 from the second population, the expected value and standard deviation of the difference of sample proportions p̄1 − p̄2 are given by:

E(p̄1 − p̄2) = p1 − p2 and σ(p̄1 − p̄2) = √( p1(1 − p1)/n1 + p2(1 − p2)/n2 )

If the sample sizes are large, the sampling distribution of p̄1 − p̄2 can be approximated by a normal probability distribution.

The sample sizes are sufficiently large if all of these conditions are met: n1p1 ≥ 5, n1(1 − p1) ≥ 5, n2p2 ≥ 5 and n2(1 − p2) ≥ 5.

Interval Estimation: the interval estimate of p1 − p2 is

(p̄1 − p̄2) ± zα/2 √( p̄1(1 − p̄1)/n1 + p̄2(1 − p̄2)/n2 )

Hypothesis Tests about p1 − p2:

We focus on tests involving no difference between the two population proportions (i.e., p1 = p2).

The test statistic is given by:

z = (p̄1 − p̄2) / √( p̄(1 − p̄)(1/n1 + 1/n2) )

where p̄ = (n1p̄1 + n2p̄2)/(n1 + n2) is the pooled estimate of the common proportion.

Let us look at an example. Extracts of St. John’s Wort are widely used to treat depression. An article in the April 18, 2001 issue of the Journal of the American Medical Association (“Effectiveness of St. John’s Wort on Major Depression: A Randomized Controlled Trial”) compared the efficacy of a standard extract of St. John’s Wort with a placebo in 200 outpatients diagnosed with major depression. Patients were randomly assigned to two groups; one group received the St. John’s Wort, and the other received the placebo. After eight weeks, 19 of the placebo-treated patients showed improvement, whereas 27 of those treated with St. John’s Wort improved. Is there any reason to believe that St. John’s Wort is effective in treating major depression? Use α = 0.05.

import math
from scipy import stats

def two_sample_proportion(p1,p2,n1,n2):
    # Pooled proportion under H0: p1 = p2
    p_pool=((p1*n1)+(p2*n2))/(n1+n2)
    x=(p_pool*(1-p_pool)*((1/n1)+(1/n2)))
    s=math.sqrt(x)
    z=(p1-p2)/s
    # Two-tailed p-value
    if (z<0):
        p_val=stats.norm.cdf(z)
    else:
        p_val=1-stats.norm.cdf(z)
    return z,p_val*2

two_sample_proportion(0.27,0.19,100,100)
#(1.3442056254198995, 0.17888190308175567)
stats.norm.ppf(1-0.025)   # critical z value for alpha = 0.05, two-tailed
#1.959963984540054

Since z0 = 1.344 does not exceed z(0.025) = 1.96, we cannot reject the null hypothesis; the p-value is P ≈ 0.179. There is insufficient evidence to support the claim that St. John’s Wort is effective in treating major depression.
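For comparison, statsmodels ships a ready-made two-proportion z-test; a quick sketch on the same counts (27 and 19 improvements out of 100 patients each) gives the same result:

from statsmodels.stats.proportion import proportions_ztest
z, p = proportions_ztest(count=[27, 19], nobs=[100, 100])
print(z, p)   # about 1.344 and 0.179, matching the manual calculation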

Tests for Two Variances

To test hypotheses about two population variances, the F-test is used. Here also, the two populations are assumed to be independent and normally distributed. The hypotheses are H0: σ1² = σ2² against H1: σ1² ≠ σ2² (or the corresponding one-sided alternatives).

The following random variable F has an F distribution with (n1 − 1) numerator degrees of freedom and (n2 − 1) denominator degrees of freedom:

F = (s1²/σ1²) / (s2²/σ2²)

When we assume that the populations have equal variances (σ1² = σ2²), the F statistic becomes:

F = s1² / s2²

Most of the time the F-test is a right-tailed test: since variances are always positive, only the upper tail of the distribution needs to be considered. The decision rule for two variances is to reject H0 if F exceeds the upper critical value Fα (for a two-tailed test, reject if F > Fα/2 or F < F1−α/2).

Example: A company manufactures impellers for use in jet-turbine engines. One of the operations involves grinding a particular surface finish on a titanium alloy component. Two different grinding processes can be used, and both processes can produce parts with identical mean surface roughness. The manufacturing engineer would like to select the process having the least variability in surface roughness. A random sample of n1 = 11 parts from the first process results in a sample standard deviation s1 = 5.1 microinches, and a random sample of n2 = 16 parts from the second process results in a sample standard deviation of s2 = 4.7 microinches. Find a 90% confidence interval on the ratio of the two standard deviations.

import numpy as np
import math
import scipy
from scipy import stats

# 90% CI on sigma1/sigma2 for the impeller example:
# s1 = 5.1 (n1 = 11) and s2 = 4.7 (n2 = 16), so dfn = n2 - 1 = 15 and dfd = n1 - 1 = 10
f_upper=scipy.stats.f.ppf(q=1-0.05,dfn=15,dfd=10)   #2.8450165269958436
f_lower=scipy.stats.f.ppf(q=0.05,dfn=15,dfd=10)     #0.3931252536255495
ratio=5.1**2/4.7**2                                 # s1^2 / s2^2
print(math.sqrt(ratio*f_lower),math.sqrt(ratio*f_upper))   # about 0.680 and 1.830

# A separate illustration: an F-test for equal variances on the arsenic data of CASE III
X=[3,7,25,10,15,6,12,25,15,7]
Y=[48,44,40,38,33,21,20,12,1,18]
F=np.var(X,ddof=1)/np.var(Y,ddof=1)   # ratio of sample variances
dfn=len(X)-1
dfd=len(Y)-1
p_value=scipy.stats.f.cdf(F,dfn,dfd)   # lower-tail probability
p_value
#0.024680183438910465

The 90% confidence interval on σ1/σ2 runs from about 0.68 to 1.83. Since this interval includes unity, we cannot claim that the standard deviations of surface roughness for the two processes are different at the 90% level of confidence. (The second half of the code runs the equal-variance F-test on the arsenic data: the lower-tail p-value 0.024 corresponds to a two-sided p-value of about 0.049 < 0.05, so equal variances would be rejected there, which is why CASE III used the unequal-variances t-test.)

Determining Sample Sizes

(a) Determining Sample Size when Estimating μ

While estimating the value of μ, the z statistic is z = (x̅ − μ) / (σ/√n).

Setting the margin of error E = zα/2 σ/√n and solving for n gives the required sample size:

n = (zα/2 σ / E)²

(b) Determining Sample Size when Estimating P

While estimating the value of P, the z statistic is z = (p̂ − P) / √(P(1 − P)/n).

From this statistic, with a desired margin of error E, the required sample size is:

n = zα/2² p(1 − p) / E²
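Here is a minimal sketch of both sample-size formulas; the helper names and the σ, E and p values below are illustrative placeholders, not taken from a specific example:

import math
from scipy import stats
def sample_size_for_mean(sigma, E, alpha=0.05):
    # n = (z(alpha/2) * sigma / E)^2, rounded up
    z = stats.norm.ppf(1 - alpha/2)
    return math.ceil((z * sigma / E)**2)
def sample_size_for_proportion(p, E, alpha=0.05):
    # n = z(alpha/2)^2 * p * (1 - p) / E^2, rounded up
    z = stats.norm.ppf(1 - alpha/2)
    return math.ceil(z**2 * p * (1 - p) / E**2)
print(sample_size_for_mean(sigma=8, E=2))          # 62
print(sample_size_for_proportion(p=0.5, E=0.05))   # 385 (worst case, p = 0.5)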

Analysis of Variance (ANOVA)

ANOVA is a test that provides a global assessment of a statistical difference in more than two independent means; it is used for the comparison of more than two populations. Chi-square tests can be viewed as a generalization of z-tests of proportions: if a comparison of proportions across more than two populations is needed, use the chi-square test. Analysis of Variance (ANOVA) can be viewed as a generalization of t-tests: a comparison of differences of means across more than two groups. As with chi-square, if there are only two groups, the two analyses produce identical results, so either a t-test or ANOVA can be used with two groups. If H0 is rejected, we cannot conclude that all population means are different; rejecting H0 only means that at least two population means have different values.

Importance of ANOVA

Suppose, in a production process, there are different input variables: X1, X2, …, Xp are the controllable inputs, and Z1, Z2, …, Zq are the uncontrollable inputs. The inputs may be raw materials, components and sub-assemblies, and the output is a quality characteristic Y. If the input factors affect Y, then an optimal combination of X1, X2, …, Xp is to be found that yields a better value of Y. So the purpose of ANOVA is to find out how these input variables affect the quality characteristic of the output.

Assumptions for Analysis of Variance:

  • For each population, the response (dependent) variable is normally distributed.
  • The variance of the response variable, denoted σ², is the same for all of the populations.
  • The observations must be independent.

Sampling Distribution of x̅ Given Ho is True:

Sample means are close together because there is only one sampling distribution when Ho is true.

Sampling Distribution of x̅ Given Ho is False:

Sample means come from different sampling distributions and are not as close together when Ho is false.
Types of ANOVA

The most common types are one-way ANOVA, which studies the effect of a single factor with several levels, and two-way ANOVA, which studies two factors and, optionally, their interaction.

ANOVA: Implementation

Consider an example with four independent groups and a continuous outcome measure. The independent groups might be defined by a particular characteristic of the participants, such as BMI (e.g., underweight, normal weight, overweight, obese), or by the investigator (e.g., randomizing participants to one of four competing treatments, call them A, B, C and D). Suppose that the outcome is systolic blood pressure, and we wish to test whether there is a statistically significant difference in mean systolic blood pressures among the four groups. The sample data are organized by comparison group, with the sample size nj and sample mean x̅j recorded for each group j.

The hypotheses of interest in an ANOVA are as follows:

  • H0: μ1 = μ2 = μ3 = μ4
  • H1: The means are not all equal.

The null hypothesis in ANOVA is always that there is no difference in means. The research or alternative hypothesis is always that the means are not all equal and is usually written in words rather than in mathematical symbols. The research hypothesis captures any difference in means and includes, for example, the situation where all four means are unequal, where one is different from the other three, where two are different, and so on. The alternative hypothesis, as shown above, captures all possible situations other than equality of all of the means specified in the null hypothesis.

The test statistic for testing H0: μ1 = μ2 = … = μk is:

F = MSB / MSE = [Σ nj (x̅j − x̅)² / (k − 1)] / [ΣΣ (X − x̅j)² / (N − k)]

and the critical value is found in a table of probability values for the F distribution with (degrees of freedom) df1 = k-1, df2=N-k. In the test statistic, nj = the sample size in the jth group (e.g., j =1, 2, 3, and 4 when there are 4 comparison groups), x̅j is the sample mean in the jth group, and x̅ is the overall mean. k represents the number of independent groups (in this example, k=4), and N represents the total number of observations in the analysis. Note that N does not refer to a population size, but instead to the total sample size in the analysis (the sum of the sample sizes in the comparison groups, e.g., N=n1+n2+n3+n4). The test statistic is complicated because it incorporates all of the sample data.

Pre-requisites for ANOVA Implementation:

  • Between-Treatments Estimate of Population Variance σ²: The estimate of σ² based on the variation of the sample means is called the mean square due to treatments and is denoted by MSB.
  • Within-Treatments Estimate of Population Variance σ²: The estimate of σ² based on the variation of the sample observations within each sample is called the mean square error and is denoted by MSE.

We will next illustrate the ANOVA procedure using the five-step approach. Because the computation of the test statistic is involved, the computations are often organized in an ANOVA table. The ANOVA table breaks down the components of variation in the data into variation between treatments and error or residual variation. Statistical computing packages also produce ANOVA tables as part of their standard output for ANOVA, and the ANOVA table is set up as follows:

Source of Variation | Sums of Squares (SS) | Degrees of Freedom (df) | Mean Squares (MS) | F
Between treatments | SSB = Σ nj (x̅j − x̅)² | k − 1 | MSB = SSB/(k − 1) | F = MSB/MSE
Error (residual) | SSE = ΣΣ (X − x̅j)² | N − k | MSE = SSE/(N − k) |
Total | SST = ΣΣ (X − x̅)² | N − 1 | |

where

  • X = individual observation,
  • x̅j = sample mean of the jth treatment (or group),
  • x̅= overall sample mean,
  • k = the number of treatments or independent comparison groups, and
  • N = total number of observations or total sample size.

The ANOVA table above is organized as follows.

  • The first column is entitled “Source of Variation” and delineates the between treatment and error or residual variation. The total variation is the sum of the between treatment and error variation.
  • The second column is entitled “Sums of Squares (SS)”. The between-treatment sums of squares is

SSB = Σ nj (x̅j − x̅)²

and is computed by summing the squared differences between each treatment (or group) mean and the overall mean. The squared differences are weighted by the sample sizes per group (nj). The error sums of squares is:

SSE = ΣΣ (X − x̅j)²

and is computed by summing the squared differences between each observation and its group mean (i.e., the squared differences between each observation in group 1 and the group 1 mean, the squared differences between each observation in group 2 and the group 2 mean, and so on). The double summation (ΣΣ) indicates summation of the squared differences within each treatment and then summation of these totals across treatments to produce a single value. (This will be illustrated in the following examples.) The total sums of squares is:

SST = ΣΣ (X − x̅)²

and is computed by summing the squared differences between each observation and the overall sample mean. In an ANOVA, data are organized by comparison or treatment groups. If all of the data were pooled into a single sample, SST would reflect the numerator of the sample variance computed on the pooled or total sample. SST does not figure into the F statistic directly. However, SST = SSB + SSE, thus if two of the sums of squares are known, the third can be computed from the other two.

  • The third column contains the degrees of freedom. The between-treatment degrees of freedom is df1 = k − 1. The error degrees of freedom is df2 = N − k. The total degrees of freedom is N − 1 (and it is also true that (k − 1) + (N − k) = N − 1).
  • The fourth column contains the “Mean Squares (MS)”, which are computed by dividing sums of squares (SS) by degrees of freedom (df), row by row. Specifically, MSB = SSB/(k − 1) and MSE = SSE/(N − k). MSB is also known as the Mean Square due to Treatments (MSTR). Dividing SST/(N − 1) produces the variance of the total sample. The F statistic is in the rightmost column of the ANOVA table and is computed as the ratio MSB/MSE.

Conclusion: The appropriate critical value can be found in a table of probabilities for the F distribution. In order to determine the critical value of F, we need the degrees of freedom df1 = k − 1 and df2 = N − k. We reject the null hypothesis if the calculated F value is greater than the critical F value.
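To make the table concrete, here is a minimal from-scratch sketch of SSB, SSE, the mean squares and F; the three small samples are the same toy teaching-method scores used in the Python section below:

import numpy as np
from scipy import stats
groups = [[4, 3, 2], [2, 4, 6], [2, 1, 3]]
k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.mean([x for g in groups for x in g])
ssb = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)   # between treatments
sse = sum(((np.array(g) - np.mean(g))**2).sum() for g in groups)   # within treatments
msb, mse = ssb / (k - 1), sse / (N - k)
F = msb / mse                               # 1.5
f_crit = stats.f.ppf(0.95, k - 1, N - k)    # about 5.14 at alpha = 0.05
print(F, f_crit)                            # F < critical value: do not reject H0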

Python Implementation of ANOVA

Consider a test of the efficiency of three teaching methods. We need to see which teaching methodology is more influential on student performance at the 5% level of significance.

import scipy
from scipy import stats
import pandas as pd
import numpy as np
import math
import statsmodels.api as sm
from statsmodels.formula.api import ols
from matplotlib import pyplot as plt
a=[4,3,2]   # scores under teaching method 1
b=[2,4,6]   # scores under teaching method 2
c=[2,1,3]   # scores under teaching method 3
stats.f_oneway(a,b,c)   # one-way ANOVA
#F_onewayResult(statistic=1.5, pvalue=0.2962962962962962)

Here, the p-value is greater than the α value, hence we fail to reject the null hypothesis and conclude that the three teaching methods are equally efficient teaching methodologies.

The pandas.melt command: pd.melt allows you to ‘unpivot’ data from a ‘wide format’ into a ‘long format’, with each row representing a single data point.

# Build the wide-format frame from the three samples above (column names are illustrative)
data=pd.DataFrame({'Teaching Method1':a,'Teaching Method2':b,'Teaching Method3':c})
data_new=pd.melt(data.reset_index(),id_vars=['index'],value_vars=['Teaching Method1','Teaching Method2','Teaching Method3'])
data_new.columns=['index','treatment','value']
data_new

In order to get the result as an ANOVA table, we use the following code.

model=ols('value ~ C(treatment)',data=data_new).fit()
anova_table=sm.stats.anova_lm(model,typ=1)   # one-way ANOVA table
anova_table

So, we get the same results as above.

Multiple Comparisons Following the ANOVA

When the null hypothesis is rejected in the ANOVA, we know that some of the treatment or factor level means are different. ANOVA doesn’t identify which means are different. Methods for investigating this issue are called multiple comparisons methods.

Fisher’s least significant difference (LSD) method

The Fisher LSD method compares all pairs of means, with the null hypotheses H0: μi = μj (for all i ≠ j), using the t statistic:

t0 = (x̅i − x̅j) / √( 2·MSE/n )

Assuming a two-sided alternative hypothesis, the pair of means i and j is declared significantly different if |x̅i − x̅j| > LSD, where, for a common sample size n per treatment:

LSD = tα/2, N−k √( 2·MSE/n )

If the sample sizes are different in each treatment, the LSD is defined as:

LSD = tα/2, N−k √( MSE (1/ni + 1/nj) )
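A small sketch of the equal-n LSD, using the toy teaching-methods data from the ANOVA example above (MSE = 2, n = 3 observations per group, N − k = 6):

import math
from scipy import stats
mse, n, df_error = 2.0, 3, 6                # values from the toy ANOVA above
t_crit = stats.t.ppf(1 - 0.025, df_error)   # two-sided, alpha = 0.05
lsd = t_crit * math.sqrt(2 * mse / n)
print(lsd)   # about 2.83; the pairwise mean differences (1, 1, 2) all fall below it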

The Tukey-Kramer Test for Post Hoc analysis

This test tells us which population means are significantly different. It is done after the hypothesis of equal means has been rejected in ANOVA, and it allows pairwise comparisons by comparing the absolute mean differences with a critical range.

For example, after rejecting the null hypothesis we might find that μ1 and μ2 are equal while μ3 is different. Which pairs of means are equal and which differ is exactly what the Tukey-Kramer test tells us.

Tukey-Kramer Critical Range:

Critical Range = qα √( (MSE/2)(1/ni + 1/nj) )

where qα is the upper critical value of the Studentized range distribution for k groups and N − k error degrees of freedom.
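In Python, the Tukey HSD procedure (with the Kramer adjustment for unequal group sizes built in) is available in statsmodels; a sketch on the same toy teaching-methods data:

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd
values = np.array([4,3,2, 2,4,6, 2,1,3])              # all observations
labels = ['m1']*3 + ['m2']*3 + ['m3']*3               # group labels
print(pairwise_tukeyhsd(values, labels, alpha=0.05))  # no pair flagged as different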

Reference:

  1. http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_HypothesisTesting-ANOVA/BS704_HypothesisTesting-Anova_print.html
