Hypothesis Testing: T-distribution (Part III)

Pritul Dave :)
6 min read · Sep 30, 2022


Introduction to t-test

The formula for performing a z-transform over a feature set is

z = (x − μ) / σ

where σ is the standard deviation of the population. Often, however, we only know the standard deviation of the sample and not of the whole population; the t-distribution was introduced to overcome this problem.

Thus, in the t-distribution we use the standard deviation of the sample instead of the standard deviation of the population. Because the sample standard deviation is itself an estimate, the variability is greater, and the concept of degrees of freedom comes into the picture.
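For reference, here is a sketch of the one-sample t statistic in the form used in the examples below (assuming a sample of size n with mean x̄, sample standard deviation s, and a hypothesised population mean μ₀):

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},
\qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}},
\qquad
v = n - 1 \ \text{degrees of freedom}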

Just as the P-value for a z-test is an area under the z-curve, the P-value for a t-test is an area under the t-curve. (Please refer to my Part 1 and Part 2 articles.)

Degrees of Freedom in T-test

  • Degrees of freedom is the number of values in a calculation that are free to vary when estimating a statistical parameter.
  • For example, consider the equation of the mean: x̄ = (1/n) Σ xᵢ. All n observations are free to vary, so the mean has n degrees of freedom.
  • Now consider the sample variance: s² = (1/(n − 1)) Σ (xᵢ − x̄)². The variance formula depends on the mean, which has already been estimated from the same data; that constraint removes one degree of freedom, so the variance has n − 1 degrees of freedom (see the NumPy sketch below).
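As a minimal sketch of how this shows up in practice, NumPy's var and std take a ddof ("delta degrees of freedom") argument: ddof=0 (the default) divides by n, while ddof=1 divides by n − 1 to give the sample variance and standard deviation.

import numpy as np

x = [42, 39, 48, 60, 41]
print(np.var(x))           # population variance, divides by n         -> 58.0
print(np.var(x, ddof=1))   # sample variance, divides by n - 1         -> 72.5
print(np.std(x, ddof=1))   # sample standard deviation, sqrt of above  -> ~8.51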

Pooled variance and Pooled standard deviation

  • Pooled standard deviation is the square root of the pooled variance.
  • Pooled variance is a method for estimating the variance of several different populations when the mean of each population may be different, but one may assume that the variance of each population is the same.

The formula for the pooled standard deviation is as follows:

sp = √[ ((n₁ − 1)·s₁² + (n₂ − 1)·s₂²) / (n₁ + n₂ − 2) ]

The standard error based on the pooled standard deviation is then calculated as

SE = sp · √(1/n₁ + 1/n₂)

where sp is the pooled standard deviation, s₁ and s₂ are the two sample standard deviations, and n₁ and n₂ are the two sample sizes.
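As a minimal sketch (with hypothetical helper names), the same calculation can be written as two small functions; Example 2 below carries it out inline:

import numpy as np

def pooled_std(s1, s2, n1, n2):
    # Weighted combination of the two sample variances,
    # assuming both populations share the same variance
    return np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def pooled_std_error(sp, n1, n2):
    # Standard error of the difference between the two sample means
    return sp * np.sqrt(1/n1 + 1/n2)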

Examples of applying the t-test

Example 1:
There is a sample of 5 patients under a medical treatment whose weights are 42, 39, 48, 60, and 41 kg. Test whether the average weight of the population is 48 kg or not at the 5% significance level.

Null Hypothesis: population mean weight = 48 kg
Alternate Hypothesis: population mean weight ≠ 48 kg
We will apply a two-tailed test.

import numpy as np

X = [42, 39, 48, 60, 41]
N = len(X)
mean = np.mean(X)
# Note: np.std defaults to ddof=0 (population standard deviation);
# pass ddof=1 if you want the sample standard deviation instead.
std_deviation = np.std(X)
std_error = std_deviation/np.sqrt(N)
print("Mean:", mean)
print("STD Deviation", std_deviation)
print("STD Error", std_error)

Output:
Mean: 46.0
STD Deviation 7.615773105863909
STD Error 3.4058772731852804

Now let’s apply the t-test.

t = (48 - mean)/std_error
print("T value", t)

Output:
T value 0.5872202195147035

Applying the degrees of freedom

v = N-1

Converting the significance level into a critical t-value

For this we will refer to the t-table: https://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf. For a two-tailed test at 5% significance with v = 4 degrees of freedom, the critical value is 2.776.

Since the calculated t-value (0.59) is less than the critical value (2.776), we fail to reject (i.e., we accept) the null hypothesis: the data are consistent with a population mean weight of 48 kg.
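As a cross-check (a minimal sketch, assuming SciPy is available), scipy.stats can compute both the critical value and the test itself. Note that ttest_1samp uses the sample standard deviation (ddof=1), so its t-value differs slightly in magnitude from the manual calculation above:

from scipy import stats

X = [42, 39, 48, 60, 41]

# Critical t-value for a two-tailed test at 5% significance with v = 4
critical = stats.t.ppf(1 - 0.05/2, df=4)
print("Critical t value:", critical)   # ~2.776

# One-sample t-test against a hypothesised mean of 48
t_stat, p_value = stats.ttest_1samp(X, 48)
print("T statistic:", t_stat)          # ~-0.53 (negative because the sample mean 46 < 48)
print("P value:", p_value)             # > 0.05, so we fail to reject the null hypothesis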

Example 2:
Comparing whether there is a significant difference between two datasets. The datasets are independent of each other.
Note: since each sample has fewer than 30 observations, we apply the t-test.

import pandas as pd

A = pd.DataFrame([18,20,36,50,49,36,34,49,41], columns=["A"])
B = pd.DataFrame([29,28,26,35,30,44,46], columns=["B"])

A.T
B.T

Step 1:
Calculating the mean and standard deviation of each dataset

mean_A = A.mean().values[0]
mean_B = B.mean().values[0]
N_A = A.shape[0]
N_B = B.shape[0]
# Note: pandas .std() uses ddof=1 (sample standard deviation) by default.
std_A = A.std().values[0]
std_B = B.std().values[0]
print("Mean A dataset:", mean_A)
print("Mean B dataset:", mean_B)
print("Standard Deviation A dataset:", std_A)
print("Standard Deviation B dataset:", std_B)

Output:
Mean A dataset: 37.0
Mean B dataset: 34.0
Standard Deviation A dataset: 11.905880899790658
Standard Deviation B dataset: 8.020806277010642

Step 2:
Here we need to calculate the pooled standard deviation because we have two independent samples (with different means) that are assumed to share the same population variance.

import numpy as np

pooled_std = np.sqrt(((N_A-1)*(std_A**2) + (N_B-1)*(std_B**2))/(N_A+N_B-2))
print("Pooled standard deviation is: ", pooled_std)

Output:
Pooled standard deviation is:  10.419761445034554

Step 3:
Calculating the standard error

std_error = pooled_std*(np.sqrt((1/N_A)+(1/N_B)))
print("Standard error is: ", std_error)

Output:
Standard error is:  5.251066191272466

Step 4:
Calculating the t value

t_value = (mean_A-mean_B)/(std_error)
print(t_value)

Output:
0.5713125469616341

Step 5:
Calculating the degrees of freedom
Here v = n1 + n2 - 2 = 9 + 7 - 2 = 14

Step 6:
Converting the significance level into a critical t-value

From the t-table, the critical value for a two-tailed test at 5% significance with v = 14 degrees of freedom is 2.145. Since the calculated t-value (0.57) is less than the critical value, we fail to reject (i.e., we accept) the null hypothesis: there is no significant difference between the two datasets.
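As a cross-check (a minimal sketch, assuming SciPy is available), the same independent two-sample test with pooled variance can be run with scipy.stats.ttest_ind using equal_var=True:

from scipy import stats

A = [18, 20, 36, 50, 49, 36, 34, 49, 41]
B = [29, 28, 26, 35, 30, 44, 46]

# equal_var=True pools the variances, matching the manual calculation above
t_stat, p_value = stats.ttest_ind(A, B, equal_var=True)
print("T statistic:", t_stat)   # ~0.57
print("P value:", p_value)      # > 0.05, so we fail to reject the null hypothesis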

Example 3:
Performing the t-test on dependent (paired) samples

10 students took a particular test. They were then given extra classes for one month and retook the test. Test whether the extra classes have benefited the students or not.

test1 = pd.DataFrame([36,40,38,36,42,38,40,46,58,62], columns=["Test 1"])
test2 = pd.DataFrame([32,42,30,24,18,64,32,40,52,38], columns=["Test 2"])

test1.T
test2.T

Since the sample size is less than 30, the t-test is appropriate.

Step 1:
Calculating the difference between the two tests

d = test1.values-test2.values
pd.DataFrame(d,columns=["difference in marks"]).T

Step 2:
We will calculate the mean and standard deviation over the difference.

mean = np.mean(d)
# Note: np.std defaults to ddof=0 (population standard deviation);
# pass ddof=1 if you want the sample standard deviation instead.
std_deviation = np.std(d)
print("Mean of the difference array:", mean)
print("Standard deviation of the difference array", std_deviation)

Output:
Mean of the difference array: 6.4
Standard deviation of the difference array 13.35065541462291

Step 3:
Calculating the standard error

N = len(d)
std_error = std_deviation/np.sqrt(N)
print("Standard error:", std_error)

Output:
Standard error: 4.221847936626804

Step 4
Calculating the t value

t_value = mean/std_error
print('T value:', t_value)

Output:
T value: 1.5159238551622276

Step 5
Checking the degrees of freedom
v = N - 1 = 10 - 1 = 9

Step 6
Converting the 5% significance level into a critical t-value

From the t-table, the critical value for a two-tailed test at 5% significance with v = 9 degrees of freedom is 2.262. Since the calculated t-value (1.52) is less than the critical value, we fail to reject the null hypothesis: there is no statistically significant evidence that the extra classes benefited the students.
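As a cross-check (a minimal sketch, assuming SciPy is available), the paired test can be run with scipy.stats.ttest_rel. It uses the sample standard deviation of the differences (ddof=1), so its t-value is slightly smaller than the manual calculation above:

from scipy import stats

test1 = [36, 40, 38, 36, 42, 38, 40, 46, 58, 62]
test2 = [32, 42, 30, 24, 18, 64, 32, 40, 52, 38]

# Paired (dependent) two-sample t-test on the per-student score differences
t_stat, p_value = stats.ttest_rel(test1, test2)
print("T statistic:", t_stat)   # ~1.44
print("P value:", p_value)      # > 0.05, so we fail to reject the null hypothesis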

Conclusion:

The t-test should be chosen over other tests when testing a dataset with fewer than 30 observations. It should also be chosen whenever the standard deviation of the population is unknown. Using a t-test, one can analyse the difference between two samples or determine whether both samples come from the same population. This is an important step when analysing data and forms part of exploratory data analysis; many hidden insights can be uncovered by performing the t-test.

Summary: when to use which type of test

Below is a summary of selecting between the z-test and the t-test:

  • Z-test: the population standard deviation is known and the sample size is at least 30.
  • T-test: the population standard deviation is unknown or the sample size is less than 30.
