Statistics for Interview

Yashwanth Reddy
25 min read · Feb 18, 2023



  1. Central Limit Theorem (applies even when the underlying data is not normally distributed)

Note: If the data is normally distributed, we can apply correlation.

https://www.geeksforgeeks.org/introduction-of-statistics-and-its-types/

A simulation to explain the Central Limit Theorem: even when the population is not normally distributed, if you draw many samples and take the average of each one, those averages will be approximately normally distributed.

or

The sample mean will approximately be normally distributed for large sample sizes, regardless of the distribution from which we are sampling.

The Central Limit Theorem suggests that if you randomly draw a sample of your customers, say 1000 customers, this sample itself might not be normally distributed. But if you now repeat the experiment, say 100 times, then the 100 means of those 100 samples (of 1000 customers each) will be approximately normally distributed.

Suppose we are sampling from a population with a finite mean μ and a finite standard deviation σ. Then the mean and standard deviation of the sampling distribution of the sample mean are given by

$$\mu_{\bar{X}} = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

where $\bar{X}$ represents the sample mean of samples of size n each, and μ and σ are the mean and standard deviation of the population respectively.

The distribution of the sample mean tends towards the normal distribution as the sample size increases.

Code: Python implementation of the Central Limit Theorem

import numpy
import matplotlib.pyplot as plt

# sample sizes to compare
num = [1, 10, 50, 100]
# list of sample means, one list per sample size
means = []

# For each sample size j, draw j random integers from -40 to 39,
# take their mean, and repeat this 1000 times.
for j in num:
    # Fix the seed so that we get the same result
    # every time the loop is run
    numpy.random.seed(1)
    x = [numpy.mean(
        numpy.random.randint(
            -40, 40, j)) for _i in range(1000)]
    means.append(x)

# plot the distribution of the means for every sample size in one figure
k = 0
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
for i in range(0, 2):
    for j in range(0, 2):
        # Histogram for each x stored in means
        ax[i, j].hist(means[k], 10, density=True)
        ax[i, j].set_title(label=f"sample size = {num[k]}")
        k = k + 1
plt.show()
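
As a numerical companion to the histograms, the sketch below checks that the spread of the sample means shrinks like sigma / sqrt(n). It is only an illustration: the names `results` and `population_sd`, the seed, and the choice of 10,000 repetitions are assumptions made here, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(1)

low, high = -40, 40                         # integers -40..39, the same range as above
k = high - low                              # number of distinct values
population_sd = np.sqrt((k ** 2 - 1) / 12)  # sd of a discrete uniform distribution

results = {}
for n in [1, 10, 50, 100]:
    # 10,000 samples of size n, one mean per sample
    sample_means = rng.integers(low, high, size=(10_000, n)).mean(axis=1)
    # observed sd of the sample means vs. the theoretical sigma / sqrt(n)
    results[n] = (sample_means.std(), population_sd / np.sqrt(n))
    print(n, results[n])
```

For each sample size the two printed numbers should be close to each other, which is what the standard error formula predicts.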

https://www.analyticsvidhya.com/blog/2019/05/statistics-101-introduction-central-limit-theorem/#:~:text=Formally%20Defining%20the%20Central%20Limit%20Theorem&text=These%20samples%20should%20be%20sufficient,of%20your%20samples%20gets%20larger.&text=The%20central%20limit%20theorem%20has,of%20applications%20in%20many%20fields.

further ref->

another approach:-

We import the necessary packages and define a population of size 1000000 consisting of random numbers. The population is completely random as in real life scenarios.

import numpy.random as np  # note: np is an alias for numpy.random here, not numpy
import seaborn as sns
import matplotlib.pyplot as plt
population_size = 1000000
population = np.rand(population_size)

We define the number of resampling times or the number of samples drawn from population with replacement to be 10000. As of now ‘sample_means’ is randomly initialised. Later it will be used to store the means of samples drawn from population. We define the ‘sample_size’ to be 1. Later we will experiment with different values of ‘sample_size’.

number_of_samples = 10000
sample_means = np.rand(number_of_samples)
sample_size = 1

We run a 'for loop' 10000 times. Each time, 'c' takes 'sample_size' random integer indices between 0 and population_size - 1. The sample at those indices is drawn from the population and its mean is stored in 'sample_means'.

for i in range(0, number_of_samples):
    # draw sample_size random indices into the population (with replacement)
    c = np.randint(0, population_size, sample_size)
    sample_means[i] = population[c].mean()

The following lines of code are for plotting the histogram and density of sample mean.

plt.subplot(1, 2, 1)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
# sns.distplot is deprecated in recent seaborn releases; histplot and kdeplot replace it
sns.histplot(sample_means, bins=int(180/5), kde=False)
plt.title('Histogram of Sample mean', fontsize=20)
plt.xlabel('Sample mean', fontsize=20)
plt.ylabel('Count', fontsize=20)
plt.subplot(1, 2, 2)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
sns.kdeplot(sample_means)
plt.title('Density of Sample mean', fontsize=20)
plt.xlabel('Sample mean', fontsize=20)
plt.ylabel('Density', fontsize=20)
plt.subplots_adjust(bottom=0.1, right=2, top=0.9)
plt.show()

Now that we have understood the code, let us look at the graph of ‘sampling distribution of sample mean’ for different values of sample size.

[Figures: histogram and density of the sample mean for sample size = 1, 2, 5, 10, and 30]

We can see that the distribution approaches normal as the sample size gets larger. In theory the distribution is exactly normal only as the sample size tends to infinity. In practice, a common rule of thumb is to treat the distribution as approximately normal when the sample size is greater than or equal to 30.

What is population and sample in statistics?

A population is the entire group that you want to draw conclusions about. A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population.

Null and alternative Hypothesis:-

Difference between Type 1 and Type 2 Error

A Type 1 error is also known as a false positive, i.e., we reject a null hypothesis that is actually true, whereas a Type 2 error is also known as a false negative, i.e., we fail to reject a null hypothesis that is actually false.

Researchers start from assumptions that they try to support or refute with data. These assumptions are known as hypotheses. There are two main types, the null hypothesis and the alternative hypothesis, and they are mutually exclusive statements. A null hypothesis states that there is no relationship between the two variables. In contrast, an alternative hypothesis states that there is a statistical relationship between the two variables. While doing hypothesis testing, we can make two types of errors: Type 1 and Type 2.

Type 1 and Type 2 errors are interconnected: for a fixed sample size, reducing the probability of one generally increases the probability of the other.

https://www.naukri.com/learning/articles/difference-between-type-1-and-type-2-error/

(or)

Type I error and Type II error:

  • Type I error: occurs when we reject the null hypothesis even though it is true. The probability of this error is denoted by alpha.
  • Type II error: occurs when we fail to reject the null hypothesis even though it is false. The probability of this error is denoted by beta.
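
The two error rates can be estimated by simulation. The sketch below is a minimal illustration, assuming normally distributed data, a one-sample t-test of H0: mean = 0, and a hypothetical true mean of 0.5 for the case where H0 is false; the helper `rejection_rate` is made up for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_trials, n = 5000, 30

def rejection_rate(true_mean):
    """Share of simulated one-sample t-tests of H0: mean = 0 that reject at level alpha."""
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n_trials, n))
    p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue
    return np.mean(p_values < alpha)

# H0 is true: every rejection is a Type I error, so the rate should be near alpha
type1_rate = rejection_rate(true_mean=0.0)
# H0 is false: every failure to reject is a Type II error
type2_rate = 1 - rejection_rate(true_mean=0.5)
print(type1_rate, type2_rate)
```

Lowering alpha in this sketch reduces the Type I rate but raises the Type II rate, which is the trade-off described above.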

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

ANOVA Test

ANOVA, also known as Analysis of Variance, is used to investigate the relationship between categorical variables and a continuous variable. It is a hypothesis test that compares group means by analysing the variance between groups and within groups.

ANOVA test involves setting up:

  • Null Hypothesis: All population means are equal.
  • Alternate Hypothesis: At least one population mean is different from the others.

ANOVA tests are of two types:

  • One way ANOVA: It takes one categorical factor into consideration.
  • Two way ANOVA: It takes two categorical factors into consideration.

The examples below use SciPy, which can be installed with:

pip3 install scipy

Stepwise Implementation:

(A one-way ANOVA is used to determine whether or not there is a statistically significant difference between the means of three or more independent groups)

Conducting a One-Way ANOVA test in Python is a step by step process and these steps are explained below:

Step 1: Creating data groups.

The first step is to create four arrays that hold the performance of cars when each of four engine oils is applied.

# Performance when each of the engine
# oils is applied
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]

Step 2: Conduct the one-way ANOVA:

Python provides us f_oneway() function from SciPy library using which we can conduct the One-Way ANOVA.

# Importing library
from scipy.stats import f_oneway

# Performance when each of the engine
# oils is applied
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]

# Conduct the one-way ANOVA
f_oneway(performance1, performance2, performance3, performance4)


Step 3: Analyse the result:

The F statistic and p-value turn out to be 4.625 and 0.016336498 respectively. Since the p-value is less than 0.05, we reject the null hypothesis. This implies that we have sufficient evidence to say that there is a difference in mean performance among the four engine oils.
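
To see where the F statistic comes from, the following sketch recomputes it by hand from the between-group and within-group sums of squares. It uses the same four performance lists; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy.stats import f

groups = [np.array(g) for g in (
    [89, 89, 88, 78, 79],
    [93, 92, 94, 89, 88],
    [89, 88, 89, 93, 90],
    [81, 78, 81, 92, 82],
)]
k = len(groups)                               # number of groups
n_total = sum(len(g) for g in groups)         # total number of observations
grand_mean = np.concatenate(groups).mean()

# variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_stat = ms_between / ms_within
p_value = f.sf(f_stat, k - 1, n_total - k)    # right-tail probability
print(f_stat, p_value)
```

The result should agree with the values returned by f_oneway() for the same data.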

How to Perform Tukey’s Test in Python

A one-way ANOVA is used to determine whether or not there is a statistically significant difference between the means of three or more independent groups.

If the overall p-value from the ANOVA table is less than some significance level, then we have sufficient evidence to say that at least one of the means of the groups is different from the others.

However, this doesn’t tell us which groups are different from each other. It simply tells us that not all of the group means are equal. In order to find out exactly which groups are different from each other, we must conduct a post hoc test.

One of the most commonly used post hoc tests is Tukey’s Test, which allows us to make pairwise comparisons between the means of each group while controlling for the family-wise error rate.

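The following example shows how to perform Tukey's Test in Python with statsmodels.

import seaborn as sns
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Assumption: mpg is seaborn's example 'mpg' dataset (the original snippet does not define it)
mpg = sns.load_dataset('mpg')
df = mpg[mpg['cylinders'] == 4][['mpg', 'origin']]
result = pairwise_tukeyhsd(endog=df['mpg'], groups=df['origin'], alpha=0.05)
print(result)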
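
A self-contained variant that needs no external dataset is sketched below. It applies the same pairwise_tukeyhsd() call to the engine oil data from the one-way ANOVA example; the group labels 'oil1' to 'oil4' are made up for illustration.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# engine oil performance data from the one-way ANOVA example, flattened
scores = np.array([89, 89, 88, 78, 79,
                   93, 92, 94, 89, 88,
                   89, 88, 89, 93, 90,
                   81, 78, 81, 92, 82])
# one (hypothetical) label per observation, five observations per oil
labels = np.repeat(['oil1', 'oil2', 'oil3', 'oil4'], 5)

tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey)          # one row per pair of groups
print(tukey.reject)   # True where the pairwise difference is significant
```

Four groups give six pairwise comparisons, each with its own adjusted p-value and confidence interval.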

Two-Way ANOVA:-

Two-Way ANOVA (Analysis of Variance) is used to check whether there is a statistically significant difference between the mean values of groups that have been split on two factors.

Factors: Factors are categorical variables used to group the data; each distinct value of a factor is called a level. Levels can be strings or integers.

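# Importing libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe
# Assumption: Watering alternates row by row so that the two factors are crossed.
# The original used the same np.repeat pattern for both columns, which makes
# Fertilizer and Watering identical and their effects impossible to separate.
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.tile(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11,
                                     14, 15, 16, 16, 17, 18, 14, 13,
                                     14, 14, 14, 15, 16, 16, 17, 18,
                                     14, 13, 14, 14, 14, 15]})

# Performing two-way ANOVA
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',
            data=dataframe).fit()
# the keyword is typ (not type); typ=2 requests Type II sums of squares
result = sm.stats.anova_lm(model, typ=2)

# Print the result
print(result)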

Need to refer ->

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Variance test(mcr)

We use the chi-square statistic for a one-variance test and the F statistic for a two-variance test.

Chi-square test

  • For testing the population variance against a specified value
  • For testing goodness of fit of some probability distribution
  • For testing independence of two attributes (contingency tables)

F-test

  • For testing equality of two variances from different populations
  • For testing equality of several means with the technique of ANOVA

Goodness of fit (Mcr calculation):- chisquare test,contingency table,(cross tab,pivot table->pandas 2 Categorical table)

Chisquare _test(degree of freedom)

To test if the sample is coming from a population with specific distribution

H0: The data follow a specified distribution.

Ha: The data do not follow the specific distribution.

We will talk about Chi square test again in goodness of fit and contingency table section.

from scipy import stats

exp = [50, 50]  # expected counts
obs = [40, 60]  # observed counts
stats.chisquare(obs, exp)
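
The statistic that stats.chisquare() reports can be reproduced by hand, which also makes the degrees of freedom explicit. A minimal sketch using the same observed and expected counts:

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([40, 60])
expected = np.array([50, 50])

# sum over categories of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).sum()
# degrees of freedom = number of categories - 1
dof = len(observed) - 1
# right-tail probability of the chi-square distribution
p_value = chi2.sf(chi2_stat, dof)
print(chi2_stat, dof, p_value)
```

Here each category contributes 100 / 50 = 2 to the statistic, so the total is 4 with one degree of freedom.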

Contingency Tables

  • To find relationship between two discrete variables.

Null hypothesis is that there is no relationship between the row and the columns

Alternate hypothesis is that there is a relationship. Alternate hypothesis does not tell what type of relationship exists.

import numpy as np
from scipy import stats

sh_op = np.array([[22, 26, 23], [28, 62, 26], [72, 22, 66]])
stats.chi2_contingency(sh_op)
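
In practice the contingency table is usually built from two categorical columns with pandas. The sketch below assumes a hypothetical DataFrame with 'shift' and 'operator' columns (the data is invented for illustration) and passes the crosstab to chi2_contingency().

```python
import pandas as pd
from scipy import stats

# Hypothetical raw data: one row per observation, two categorical columns
df = pd.DataFrame({
    'shift':    ['day', 'day', 'day', 'night', 'night', 'night', 'day', 'night'] * 10,
    'operator': ['A',   'A',   'B',   'B',     'B',     'A',     'A',   'B'] * 10,
})

# Cross-tabulate the two categorical variables into a contingency table
table = pd.crosstab(df['shift'], df['operator'])
print(table)

# Test of independence; expected holds the counts implied by independence
chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
print(chi2_stat, p_value, dof)
print(expected)
```

For a 2x2 table SciPy applies Yates' continuity correction by default, so the statistic is slightly smaller than the uncorrected hand calculation.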

Ftest:-

For testing equality of two variances from different populations

For testing equality of several means with technique of ANOVA.

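from scipy.stats import f
F_cal = 11/1.21
F_cal
# right-tail critical value at the 5% level with (4, 7) degrees of freedom
F_cri_right = f.isf(0.05, 4, 7)
F_cri_right
# left-tail critical value
F_cri_left = f.isf(0.95, 4, 7)
F_cri_left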
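
The snippet above stops at the critical values. A minimal sketch of the decision step for a right-tailed test follows, assuming the same variance ratio and degrees of freedom; the lowercase variable names are chosen here for illustration.

```python
from scipy.stats import f

f_cal = 11 / 1.21          # ratio of the two sample variances
dfn, dfd = 4, 7            # numerator and denominator degrees of freedom
alpha = 0.05

f_cri_right = f.isf(alpha, dfn, dfd)   # right-tail critical value
p_value = f.sf(f_cal, dfn, dfd)        # right-tail p-value of the observed ratio

# Reject H0 (equal variances) when the ratio exceeds the critical value,
# which is equivalent to the p-value being below alpha
reject = f_cal > f_cri_right
print(f_cal, f_cri_right, p_value, reject)
```

Comparing the statistic with the critical value and comparing the p-value with alpha always lead to the same decision.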

another approach:-

import numpy as np
import scipy.stats

# Create data
group1 = [0.28, 0.2, 0.26, 0.28, 0.5]
group2 = [0.2, 0.23, 0.26, 0.21, 0.23]

# converting the list to array
x = np.array(group1)
y = np.array(group2)

# calculate the sample variance of each group
print(np.var(x, ddof=1), np.var(y, ddof=1))

def f_test(group1, group2):
    # ratio of the two sample variances
    f = np.var(group1, ddof=1) / np.var(group2, ddof=1)
    nun = group1.size - 1  # numerator degrees of freedom
    dun = group2.size - 1  # denominator degrees of freedom
    # right-tailed p-value
    p_value = 1 - scipy.stats.f.cdf(f, nun, dun)
    return f, p_value

# perform F-test
f_test(x, y)

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Z-test:-

  • Random samples
  • Each observation should be independent of the others: sampling with replacement, or, if sampling without replacement, the sample size should not be more than 10% of the population
  • Sampling distribution approximates a normal distribution: the population is normally distributed and the population standard deviation is known, OR the sample size is >= 30

A Z-test is a statistical test whose test statistic can be approximated by a normal distribution under the null hypothesis. It is used to determine whether a sample mean differs from a claimed value, or whether two sample means differ, when the population variance is known and the sample size is large (>= 30).

#ztest≥30sample ->mean

When to Use Z-test:

  • The sample size should be at least 30. Otherwise, we should use the t-test.
  • Samples should be drawn at random from the population.
  • The standard deviation of the population should be known.
  • Samples that are drawn from the population should be independent of each other.
  • The data should be normally distributed; for a large sample size, however, the sample mean is approximately normally distributed regardless.

Type of Z-test

  • Left-tailed Test: In this test, our region of rejection is located to the extreme left of the distribution. Here our null hypothesis is that the population mean is greater than or equal to the claimed value.
  • Right-tailed Test: In this test, our region of rejection is located to the extreme right of the distribution. Here our null hypothesis is that the population mean is less than or equal to the claimed value.
  • Two-tailed test: In this test, our region of rejection is located to both extremes of the distribution. Here our null hypothesis is that the population mean is equal to the claimed value.
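
The three variants differ only in which tail of the standard normal distribution the p-value is taken from. A short sketch, using a hypothetical z statistic of 2.0:

```python
from scipy.stats import norm

z = 2.0  # hypothetical z statistic, chosen for illustration

p_left = norm.cdf(z)            # left-tailed: probability of a value <= z
p_right = norm.sf(z)            # right-tailed: probability of a value >= z
p_two = 2 * norm.sf(abs(z))     # two-tailed: both extremes
print(p_left, p_right, p_two)
```

With this z the right-tailed and two-tailed tests reject at the 5% level, while the left-tailed test does not.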

How to Perform One Sample & Two Sample Z-Tests in Python

You can use the ztest() function from the statsmodels package to perform one sample and two sample z-tests in Python.

This function uses the following basic syntax:

statsmodels.stats.weightstats.ztest(x1, x2=None, value=0)

where:

  • x1: values for the first sample
  • x2: values for the second sample (if performing a two sample z-test)
  • value: mean under the null (in one sample case) or mean difference (in two sample case)

The following examples show how to use this function in practice.

Example 1: One Sample Z-Test in Python

Suppose the IQ in a certain population is normally distributed with a mean of μ = 100 and standard deviation of σ = 15.

A researcher wants to know if a new drug affects IQ levels, so he recruits 20 patients to try it and records their IQ levels.

The following code shows how to perform a one sample z-test in Python to determine if the new drug causes a significant difference in IQ levels:

from statsmodels.stats.weightstats import ztest
#enter IQ levels for 20 patients
data = [88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
        105, 109, 109, 109, 110, 112, 112, 113, 114, 115]
#perform one sample z-test
ztest(data, value=100)

(1.5976240527147705, 0.1101266701438426)

The test statistic for the one sample z-test is 1.5976 and the corresponding p-value is 0.1101.

Since this p-value is not less than .05, we do not have sufficient evidence to reject the null hypothesis. In other words, we have no evidence that the new drug significantly affects IQ level.
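
Note that ztest() estimates the standard deviation from the sample rather than using the stated population value σ = 15. The sketch below computes the textbook z statistic with the known σ instead; it is an illustration alongside the result above, not a replacement for it.

```python
import numpy as np
from scipy.stats import norm

data = np.array([88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
                 105, 109, 109, 109, 110, 112, 112, 113, 114, 115])
mu0 = 100     # mean under the null hypothesis
sigma = 15    # known population standard deviation

# z = (sample mean - hypothesised mean) / (sigma / sqrt(n))
z = (data.mean() - mu0) / (sigma / np.sqrt(len(data)))
# two-sided p-value from the standard normal distribution
p_value = 2 * norm.sf(abs(z))
print(z, p_value)
```

The statistic is smaller than the one from ztest() because the known σ is larger than the sample standard deviation, but the conclusion (fail to reject) is the same.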

Example 2: Two Sample Z-Test in Python

Suppose the IQ levels among individuals in two different cities are known to be normally distributed with known standard deviations.

A researcher wants to know if the mean IQ level between individuals in city A and city B are different, so she selects a simple random sample of 20 individuals from each city and records their IQ levels.

The following code shows how to perform a two sample z-test in Python to determine if the mean IQ level is different between the two cities:

from statsmodels.stats.weightstats import ztest as ztest

#enter IQ levels for 20 individuals from each city
cityA = [82, 84, 85, 89, 91, 91, 92, 94, 99, 99,
105, 109, 109, 109, 110, 112, 112, 113, 114, 114]

cityB = [90, 91, 91, 91, 95, 95, 99, 99, 108, 109,
109, 114, 115, 116, 117, 117, 128, 129, 130, 133]

#perform two sample z-test
ztest(cityA, cityB, value=0)

(-1.9953236073282115, 0.046007596761332065)

The test statistic for the two sample z-test is -1.9953 and the corresponding p-value is 0.0460.

Since this p-value is less than .05, we have sufficient evidence to reject the null hypothesis. In other words, the mean IQ level is significantly different between the two cities.

or

#two-sample z-test
# df is assumed to be a pandas DataFrame with 'Machine' and 'Volume' columns
from statsmodels.stats import weightstats
m1 = df[df['Machine'] == 'Machine 1']['Volume']
m2 = df[df['Machine'] == 'Machine 2']['Volume']

weightstats.ztest(m1, m2)

or

# imports
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Generate a random array of 50 numbers having mean 110 and sd 15
# similar to the IQ scores data we assume above
mean_iq = 110
sd_iq = 15
alpha = 0.05
null_mean = 100
data = sd_iq*randn(50) + mean_iq
# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))

# Now we perform the test. We pass the data, the mean under the null
# hypothesis in the value parameter, and alternative='larger' to test
# whether the true mean is larger than null_mean.
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')
# The function returns the z-score and the corresponding p-value. We compare
# the p-value with alpha: if it is greater than alpha we do not reject the
# null hypothesis, else we reject it.

if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

T-test:-

Condition for T-test

  • Random samples
  • Each observation should be independent of the others: sampling with replacement, or, if sampling without replacement, the sample size should not be more than 10% of the population
  • Sampling distribution approximates a normal distribution: the population is normally distributed and the population standard deviation is unknown, AND the sample size is < 30
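
One way to see why small samples call for the t-test is to compare critical values. The sketch below prints the two-sided 5% critical value of the t distribution for a few sample sizes next to the normal one; the dictionary `t_crits` is a name chosen here for illustration.

```python
from scipy.stats import norm, t

# two-sided 5% critical value of the standard normal distribution
z_crit = norm.ppf(0.975)

t_crits = {}
for n in [5, 10, 30, 100]:
    # the t distribution has n - 1 degrees of freedom for a one-sample test
    t_crits[n] = t.ppf(0.975, df=n - 1)
    print(n, round(t_crits[n], 3), round(z_crit, 3))
```

The t critical value is noticeably larger for small n and approaches the normal value as n grows, which is why the two tests give nearly identical answers for large samples.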

How to Conduct a One Sample T-Test in Python

A one sample t-test is used to determine whether or not the mean of a population is equal to some value.

This tutorial explains how to conduct a one sample t-test in Python.

Example: One Sample t-Test in Python

Suppose a botanist wants to know if the mean height of a certain species of plant is equal to 15 inches. She collects a random sample of 12 plants and records each of their heights in inches.

Use the following steps to conduct a one sample t-test to determine if the mean height for this species of plant is actually equal to 15 inches.

Step 1: Create the data.

First, we’ll create an array to hold the measurements of the 12 plants:

data = [14, 14, 16, 13, 12, 17, 15, 14, 15, 13, 15, 14]

Step 2: Conduct a one sample t-test.

Next, we’ll use the ttest_1samp() function from the scipy.stats library to conduct a one sample t-test, which uses the following syntax:

ttest_1samp(a, popmean)

where:

  • a: an array of sample observations
  • popmean: the expected population mean

Here’s how to use this function in our specific example:

import scipy.stats as stats
#perform one sample t-test
stats.ttest_1samp(a=data, popmean=15)

(statistic=-1.6848, pvalue=0.1201)

The t test statistic is -1.6848 and the corresponding two-sided p-value is 0.1201.

Step 3: Interpret the results.

The two hypotheses for this particular one sample t-test are as follows:

H0: µ = 15 (the mean height for this species of plant is 15 inches)

HA: µ ≠15 (the mean height is not 15 inches)

Because the p-value of our test (0.1201) is greater than alpha = 0.05, we fail to reject the null hypothesis of the test. We do not have sufficient evidence to say that the mean height for this particular species of plant is different from 15 inches.
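
The value returned by ttest_1samp() can be reproduced directly from the definition of the t statistic. A minimal sketch with the same 12 plant heights:

```python
import numpy as np
from scipy.stats import t

data = np.array([14, 14, 16, 13, 12, 17, 15, 14, 15, 13, 15, 14])
mu0 = 15                 # hypothesised population mean
n = len(data)

# t = (sample mean - hypothesised mean) / (sample sd / sqrt(n))
t_stat = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(n))
# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * t.sf(abs(t_stat), df=n - 1)
print(t_stat, p_value)
```

Note the use of ddof=1, which gives the sample standard deviation; NumPy's default (ddof=0) would give a slightly different statistic.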

How to Conduct a Two Sample T-Test in Python

A two sample t-test is used to test whether or not the means of two populations are equal.

This tutorial explains how to conduct a two sample t-test in Python.

Example: Two Sample t-Test in Python

Researchers want to know whether or not two different species of plants have the same mean height. To test this, they collect a simple random sample of 20 plants from each species.

Use the following steps to conduct a two sample t-test to determine if the two species of plants have the same height.

Step 1: Create the data.

First, we’ll create two arrays to hold the measurements of each group of 20 plants:

import numpy as np
group1 = np.array([14, 15, 15, 16, 13, 8, 14, 17, 16, 14, 19, 20, 21, 15, 15, 16, 16, 13, 14, 12])
group2 = np.array([15, 17, 14, 17, 14, 8, 12, 19, 19, 14, 17, 22, 24, 16, 13, 16, 13, 18, 15, 13])

Step 2: Conduct a two sample t-test.

Next, we’ll use the ttest_ind() function from the scipy.stats library to conduct a two sample t-test, which uses the following syntax:

ttest_ind(a, b, equal_var=True)

where:

  • a: an array of sample observations for group 1
  • b: an array of sample observations for group 2
  • equal_var: if True, perform a standard independent 2 sample t-test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variances. This is True by default.

Before we perform the test, we need to decide if we’ll assume the two populations have equal variances or not. As a rule of thumb, we can assume the populations have equal variances if the ratio of the larger sample variance to the smaller sample variance is less than 4:1.

#find variance for each group
print(np.var(group1), np.var(group2))

7.73 12.26

The ratio of the larger sample variance to the smaller sample variance is 12.26 / 7.73 = 1.586, which is less than 4. This means we can assume that the population variances are equal.

Thus, we can proceed to perform the two sample t-test with equal variances:

import scipy.stats as stats
#perform two sample t-test with equal variances
stats.ttest_ind(a=group1, b=group2, equal_var=True)

(statistic=-0.6337, pvalue=0.53005)

The t test statistic is -0.6337 and the corresponding two-sided p-value is 0.53005.

Step 3: Interpret the results.

The two hypotheses for this particular two sample t-test are as follows:

H0: µ1 = µ2 (the two population means are equal)

HA: µ1 ≠µ2 (the two population means are not equal)

Because the p-value of our test (0.53005) is greater than alpha = 0.05, we fail to reject the null hypothesis of the test. We do not have sufficient evidence to say that the mean height of plants between the two populations is different.
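
If the equal-variance assumption is in doubt, Welch's t-test is the alternative mentioned in the syntax description. The sketch below runs both versions on the same two groups so the results can be compared; the names `pooled` and `welch` are chosen here for illustration.

```python
import numpy as np
from scipy import stats

group1 = np.array([14, 15, 15, 16, 13, 8, 14, 17, 16, 14,
                   19, 20, 21, 15, 15, 16, 16, 13, 14, 12])
group2 = np.array([15, 17, 14, 17, 14, 8, 12, 19, 19, 14,
                   17, 22, 24, 16, 13, 16, 13, 18, 15, 13])

# standard two sample t-test (assumes equal population variances)
pooled = stats.ttest_ind(group1, group2, equal_var=True)
# Welch's t-test (does not assume equal population variances)
welch = stats.ttest_ind(group1, group2, equal_var=False)

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

Because both groups have the same size, the two t statistics coincide; only the degrees of freedom, and therefore the p-value, differ slightly.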

How to Conduct a Paired Samples T-Test in Python

A paired samples t-test is used to compare the means of two samples when each observation in one sample can be paired with an observation in the other sample.

This tutorial explains how to conduct a paired samples t-test in Python.

Example: Paired Samples T-Test in Python

Suppose we want to know whether a certain study program significantly impacts student performance on a particular exam. To test this, we have 15 students in a class take a pre-test. Then, we have each of the students participate in the study program for two weeks. Then, the students retake a test of similar difficulty.

To compare the difference between the mean scores on the first and second test, we use a paired samples t-test because for each student their first test score can be paired with their second test score.

Perform the following steps to conduct a paired samples t-test in Python.

Step 1: Create the data.

First, we’ll create two arrays to hold the pre and post-test scores:

pre = [88, 82, 84, 93, 75, 78, 84, 87, 95, 91, 83, 89, 77, 68, 91]
post = [91, 84, 88, 90, 79, 80, 88, 90, 90, 96, 88, 89, 81, 74, 92]

Step 2: Conduct a Paired Samples T-Test.

Next, we’ll use the ttest_rel() function from the scipy.stats library to conduct a paired samples t-test, which uses the following syntax:

ttest_rel(a, b)

where:

  • a: an array of sample observations from group 1
  • b: an array of sample observations from group 2

Here’s how to use this function in our specific example:

import scipy.stats as stats
#perform the paired samples t-test
stats.ttest_rel(pre, post)

(statistic=-2.9732, pvalue=0.0101)

The test statistic is -2.9732 and the corresponding two-sided p-value is 0.0101.

Step 3: Interpret the results.

In this example, the paired samples t-test uses the following null and alternative hypotheses:

H0: The mean pre-test and post-test scores are equal

HA:The mean pre-test and post-test scores are not equal

Since the p-value (0.0101) is less than 0.05, we reject the null hypothesis. We have sufficient evidence to say that the true mean test score is different for students before and after participating in the study program.
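
A paired samples t-test is equivalent to a one sample t-test on the per-student differences, tested against a mean of zero. The sketch below checks this with the same pre and post scores; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

pre = np.array([88, 82, 84, 93, 75, 78, 84, 87, 95, 91, 83, 89, 77, 68, 91])
post = np.array([91, 84, 88, 90, 79, 80, 88, 90, 90, 96, 88, 89, 81, 74, 92])

# difference for each student (pre minus post, matching the argument order above)
diff = pre - post

paired = stats.ttest_rel(pre, post)
one_sample = stats.ttest_1samp(diff, popmean=0)

print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)
```

Both calls give the same statistic and p-value, which is why the pairing matters: the test works on the differences, not on the two groups separately.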

— — — — — — — — — — — — — — — — — another approach — — — — — — -

Two sample T-Test in Python

Let us consider an example: we are given two samples, each containing the heights of students from one class. We need to check whether the students of the two classes have the same mean height. There are three ways to conduct a two-sample T-Test in Python.

Method 1: Using Scipy library

Scipy stands for scientific python and as the name implies it is a scientific python library and it uses Numpy under the cover. This library provides a variety of functions that can be quite useful in data science. Firstly, let’s create the sample data. Now let’s perform two sample T-Test. For this purpose, we have ttest_ind() function in Python.

Syntax: ttest_ind(data_group1, data_group2, equal_var=True/False)

Here,

  • data_group1: First data group
  • data_group2: Second data group
  • equal_var = “True”: The standard independent two sample t-test will be conducted by taking into consideration the equal population variances.
  • equal_var = “False”: The Welch’s t-test will be conducted by not taking into consideration the equal population variances.

Note that by default equal_var is True

Before conducting the two-sample T-Test we need to find out whether the given data groups have the same variance. If the ratio of the larger variance to the smaller variance is less than 4:1, then we can consider the given data groups to have equal variance. To find the variance of a data group, we can use the below syntax,

Syntax: print(np.var(data_group))

Here,

  • data_group: The given data group

# Python program to display variance of data groups

# Import libraries
import numpy as np
import scipy.stats as stats

# Creating data groups
data_group1 = np.array([14, 15, 15, 16, 13, 8, 14,
                        17, 16, 14, 19, 20, 21, 15,
                        15, 16, 16, 13, 14, 12])
data_group2 = np.array([15, 17, 14, 17, 14, 8, 12,
                        19, 19, 14, 17, 22, 24, 16,
                        13, 16, 13, 18, 15, 13])

# Print the variance of both data groups
print(np.var(data_group1), np.var(data_group2))

Here, the ratio is 12.260 / 7.7275, which is less than 4:1.

Performing Two-Sample T-Test

# Python program to demonstrate how to
# perform two sample T-test

# Import the libraries
import numpy as np
import scipy.stats as stats

# Creating data groups
data_group1 = np.array([14, 15, 15, 16, 13, 8, 14,
                        17, 16, 14, 19, 20, 21, 15,
                        15, 16, 16, 13, 14, 12])
data_group2 = np.array([15, 17, 14, 17, 14, 8, 12,
                        19, 19, 14, 17, 22, 24, 16,
                        13, 16, 13, 18, 15, 13])

# Perform the two sample t-test with equal variances
stats.ttest_ind(a=data_group1, b=data_group2, equal_var=True)

Analyzing the result:

Two sample t-test has the following hypothesis:

H0 => µ1 = µ2 (population mean of dataset1 is equal to dataset2)

HA => µ1 ≠µ2 (population mean of dataset1 is different from dataset2)

Here, since the p-value (0.53004) is greater than alpha = 0.05, we cannot reject the null hypothesis of the test. We do not have sufficient evidence to say that the mean height of students differs between the two data groups.

Method 2: Two-Sample T-Test with Pingouin

Pingouin is a statistics package based on Pandas and NumPy. It provides a wide range of features: besides conducting the T-Test, it also reports the degrees of freedom, the Bayes factor, etc.

Firstly, let’s create the sample data. We are creating two arrays and now let’s perform two sample T-Test. For this purpose, we have ttest() function in the pingouin package of Python. The syntax is given below,

Syntax: ttest(data_group1, data_group2, correction = True/False)

Here,

  • data_group1: First data group
  • data_group2: Second data group
  • correction = “True”: Welch’s t-test will be conducted, which corrects for unequal variances and does not rely on the homogeneity assumption.
  • correction = “False”: The standard independent two sample t-test will be conducted, which relies on the homogeneity (equal variance) assumption.

Note that in pingouin the default is correction = 'auto', which applies Welch’s correction only when the two sample sizes are unequal.

Example:

# Python program to conduct two-sample
# T-test using pingouin library

# Importing library
import numpy as np
import pingouin as pg

# Creating data groups
data_group1 = np.array([160, 150, 160, 156.12, 163.24,
                        160.56, 168.56, 174.12,
                        167.123, 165.12])
data_group2 = np.array([157.97, 146, 140.2, 170.15,
                        167.34, 176.123, 162.35, 159.123,
                        169.43, 148.123])

# Conducting two-sample ttest
result = pg.ttest(data_group1,
                  data_group2,
                  correction=True)

# Print the result
print(result)

Interpreting the result

Now we can analyze the result. The p-value of the test comes out to be equal to 0.523, which is greater than the significance level alpha (that is, 0.05). This means we do not have sufficient evidence to say that the average height of students in one class differs from the average height of students in the other class. Also, the Cohen’s d reported with the t-test measures the relative strength (effect size) of the difference. According to Cohen:

  • cohen-d = 0.2 is considered as the ‘small’ effect size
  • cohen-d = 0.5 is considered as the ‘medium’ effect size
  • cohen-d = 0.8 is considered as the ‘large’ effect size

It implies that if the two data groups’ means differ by less than 0.2 standard deviations, the difference is trivial, even if it is statistically significant.
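
Cohen's d can also be computed by hand as the difference in means divided by the pooled standard deviation. The sketch below uses the two height arrays from the pingouin example; it is an illustration of the formula for two independent groups, and the variable names are chosen here.

```python
import numpy as np

g1 = np.array([160, 150, 160, 156.12, 163.24,
               160.56, 168.56, 174.12, 167.123, 165.12])
g2 = np.array([157.97, 146, 140.2, 170.15,
               167.34, 176.123, 162.35, 159.123, 169.43, 148.123])

n1, n2 = len(g1), len(g2)
# pooled standard deviation from the two sample variances (ddof=1)
pooled_sd = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1))
                    / (n1 + n2 - 2))
# standardised difference between the two group means
cohen_d = (g1.mean() - g2.mean()) / pooled_sd
print(cohen_d)
```

The value falls between the 'small' and 'medium' thresholds listed above.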

Method 3: Two-Sample T-Test with Statsmodels

Statsmodels is a Python library used to estimate statistical models and to conduct statistical tests. It supports R-style formulas and pandas DataFrames.

Firstly, let’s create the sample data. We are creating two arrays and now let’s perform the two-sample T-test. Statsmodels library provides ttest_ind() function to conduct two-sample T-Test whose syntax is given below,

Syntax: ttest_ind(data_group1, data_group2)

Here,

  • data_group1: First data group
  • data_group2: Second data group

Example:

# Python program to conduct
# two-sample t-test using statsmodels

# Importing library
from statsmodels.stats.weightstats import ttest_ind
import numpy as np

# Creating data groups
data_group1 = np.array([160, 150, 160, 156.12,
                        163.24,
                        160.56, 168.56, 174.12,
                        167.123, 165.12])
data_group2 = np.array([157.97, 146, 140.2, 170.15,
                        167.34, 176.123, 162.35,
                        159.123, 169.43, 148.123])

# Conducting two-sample ttest
ttest_ind(data_group1, data_group2)

Interpreting the result:

Now we can analyze the result. The p-value of the test comes out to be equal to 0.521, which is greater than the significance level alpha (that is, 0.05). This means we do not have sufficient evidence to say that the average height of students in one class differs from the average height of students in the other class.

https://www.statology.org/one-sample-t-test-python/
https://www.statology.org/two-sample-t-test-python/
https://www.statology.org/paired-samples-t-test-python/
https://www.geeksforgeeks.org/how-to-conduct-a-two-sample-t-test-in-python/
