Hypotheses Testing: Chi-Squared Test and one-way ANOVA (Part IV)

Pritul Dave :)
7 min read · Sep 30, 2022


Chi-Square test

Often we are interested in the variance within a population, i.e., how the data varies between two or more samples of the same population. In that case, a variance test such as the chi-square test is used.

So whenever the study involves the mean of a sample or population, the z-test or t-test is used, and whenever only the variation needs to be studied, the chi-square test is applicable.

In the Chi-Squared test, when
Chi value < Chi critical (we fail to reject the null hypothesis)
Chi value > Chi critical (we reject the null hypothesis)

Formulas of the chi-square test

  1. Chi-square formula for a categorical dataset:
     χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
     where Oᵢ is the observed frequency and Eᵢ is the expected frequency in category i.
  2. Chi-square formula for a numerical dataset (variance test):
     χ² = (n − 1)σs² / σp²
     where σs² is the variance of the sample, σp² is the variance of the population, and n is the sample size.

Applications of the chi-square test

The Chi-square test is performed for various purposes, some of which are:

  • This method is commonly used by researchers to determine the differences between different categorical variables in a population.
  • A Chi-square test can also be used as a test for goodness of fit. It enables us to observe how well the theoretical distribution fits the observed distribution.
  • It also works as a test of independence where it enables the researcher to determine if two attributes of a population are associated or not.

Note: The chi-square test is non-parametric, so it does not assume any particular distribution of the underlying data.
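As a quick sketch of the goodness-of-fit use case, the following minimal example (with made-up die-roll counts, purely for illustration) computes the statistic by hand:

```python
# Goodness-of-fit sketch: are 120 rolls of a die consistent with a fair die?
# (The counts below are hypothetical, chosen only for this example.)
observed = [18, 22, 16, 25, 20, 19]                          # counts for faces 1..6
expected = [sum(observed) / len(observed)] * len(observed)   # fair die: 20 per face
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print("Chi-square value:", chi_square)                       # 2.5
```

With v = 6 − 1 = 5 degrees of freedom, the 5% critical value from the table is 11.070; since 2.5 < 11.070, we fail to reject the hypothesis that the die is fair.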

Degrees of freedom in the Chi-Square test

Please refer to my article on t-tests to understand what is degrees of freedom.

  • If the data is a single series (a row or a column), the degrees of freedom are v = N - 1, where N is the number of categories.
  • If the data is arranged in a tabular format known as a contingency table, the degrees of freedom are v = (R-1)*(C-1), where R is the number of rows and C is the number of columns.

For a chi-square distribution, the mean equals the degrees of freedom v and the standard deviation is √(2v).
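This property is easy to verify empirically. The sketch below (assuming NumPy is available) draws chi-square samples with v = 19 degrees of freedom and compares their mean and standard deviation against v and √(2v):

```python
import numpy as np

# Empirical check of the chi-square mean and standard deviation.
rng = np.random.default_rng(0)
v = 19
samples = rng.chisquare(v, size=200_000)
print("Sample mean:", samples.mean())   # close to v = 19
print("Sample std :", samples.std())    # close to sqrt(2*19) ≈ 6.16
```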

Conditions of chi-square test

For the chi-square test to be performed, the following conditions are to be satisfied:

  • The observations are to be recorded and collected on a random basis.
  • The items in the samples should all be independent.
  • The frequency of data in each group should not be less than 10; if it is, items should be regrouped by combining adjacent categories.
  • The total number of individual items in the sample should also be reasonably large, about 50 or more.
  • The constraints in the frequencies should be linear and not contain squares or higher powers.

Examples of chi-square

Example 1:

A machine producing product X has a population variance of 7.2. A random sample of size 20 has a variance of 8. At the 5% significance level, test whether the variability in the sample is greater than that of the population.



Null Hypothesis: The sample variance is not greater than the population variance.

Alternate Hypothesis: The sample variance is greater than the population variance.

Here, since we are comparing variances rather than means, the chi-square test (not the z-test or t-test) is used.

import math

n = 20
Vpop = 7.2  # population variance
Vsamp = 8   # sample variance
std_pop = math.sqrt(Vpop)
std_samp = math.sqrt(Vsamp)
chi_square = (std_samp**2)*(n-1)/(std_pop**2)
print("Chi-square value",chi_square)
>>> Chi-square value 21.111111111111114

The degree of freedom is v = n-1 = 19

Now, from the chi-square table, the chi critical value at the 5% significance level with 19 degrees of freedom is 30.144.

The table for conversion is provided at https://people.richland.edu/james/lecture/m170/tbl-chi.html


Now, since the chi-square value (21.11) is less than the chi critical value (30.144), we fail to reject the null hypothesis: the sample variability is not significantly greater than that of the population.
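As a cross-check, the sketch below (assuming NumPy is available) recomputes the statistic directly from the variances and estimates the 5% critical value by simulation instead of reading it from the table:

```python
import numpy as np

# Under H0, (n-1)*s^2/sigma^2 follows a chi-square distribution with n-1
# degrees of freedom; we estimate the 95th-percentile critical value empirically.
rng = np.random.default_rng(42)
n, var_pop, var_samp = 20, 7.2, 8.0
chi_stat = (n - 1) * var_samp / var_pop                      # ≈ 21.11
crit = np.quantile(rng.chisquare(n - 1, size=500_000), 0.95) # ≈ 30.1, near the tabled 30.144
print(chi_stat, crit)
print(chi_stat < crit)   # True -> fail to reject H0
```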

Example 2: (Application in the survey)

This example is taken from the source practicalsurveys.com

A yes/no survey was taken from male and female respondents; the counts are shown in the arrays below.

It was then suspected that the survey was flawed, so the survey was retaken.

Test at the 5% significance level whether the old survey differs from the new survey.

import numpy as np
a1 = np.array([[45,5],[15,35]])
a2 = np.array([[30,20],[30,20]])
print("Old data:\n",a1)
print("New data:\n",a2)
>>>>Old data:
[[45 5]
[15 35]]
New data:
[[30 20]
[30 20]]

Calculating the difference between the observed (old) and expected (new) values

d = (a1-a2)
print("d:\n",d)
print("d-square:\n",d**2)
>>>>d:
[[ 15 -15]
[-15 15]]
d-square:
[[225 225]
[225 225]]

Applying chi-square formula

value = d**2 / a2
print("d-square/expected is:\n",value)
>>>>d-square/expected is:
[[ 7.5 11.25]
[ 7.5 11.25]]
chi_square = np.sum(value)
print(chi_square)
>>>> 37.5

Degrees of freedom
v = (R-1)*(C-1) = (2-1)*(2-1) = 1

Calculating the chi critical value at 5%: from the chi-square table, the critical value with 1 degree of freedom is 3.841.

Now, since the chi-square value (37.5) is greater than the chi critical value (3.841), we reject the null hypothesis that the two surveys agree: the old survey differs significantly from the new one.
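This conclusion can also be cross-checked numerically. The sketch below (assuming NumPy is available) recomputes the survey statistic and estimates the 5% critical value for 1 degree of freedom by simulation rather than from the table:

```python
import numpy as np

# Recompute the survey chi-square, treating the new survey as the expected counts.
old = np.array([[45, 5], [15, 35]])
new = np.array([[30, 20], [30, 20]])
chi_stat = ((old - new) ** 2 / new).sum()                # 37.5

# Empirical 5% critical value for df = 1 (tabled value: 3.841).
rng = np.random.default_rng(7)
crit = np.quantile(rng.chisquare(1, size=500_000), 0.95)
print(chi_stat, crit)
print(chi_stat > crit)   # True -> reject H0: the surveys differ
```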

So this is all about chi-square hypothesis testing. Next, we will look at ANOVA (Analysis of Variance).

ANOVA testing

ANOVA is used to check if the means of two or more groups are significantly different from each other by analyzing comparisons of variance estimates.

ANOVA checks the impact of one or more factors by comparing the means of different samples.

In the one-way ANOVA, when
F value < F critical (we fail to reject the null hypothesis)
F value > F critical (we reject the null hypothesis)

When the ANOVA test is applicable

  1. The samples should be independent of one another.
  2. The variances of the samples should be roughly equal (homogeneity of variance).

Just as there are one-tailed and two-tailed tests, there are two types of ANOVA: one-way ANOVA and two-way ANOVA.

One way ANOVA
The one-way ANOVA is a single test to determine the significance of the difference between the means of three or more groups.


One drawback of one-way ANOVA is that it can tell us that at least two groups differ, but not which specific groups are different from each other.

Statistical terminologies as the base of the ANOVA test

1. Grand Mean: The mean of all the sample means. Xgm = (Xm1 + Xm2 + … + Xmk)/k, where k is the number of samples.

2. Variability between groups (SSbetween): The difference between the mean of a particular sample and the grand mean is the variability between groups

3. Variability within groups (SSwithin): It refers to the deviation or variance in a particular sample from their respective sample mean.

4. Degrees of freedom: The total number of observations minus the number of groups (v = N-k).

Hypotheses in the ANOVA test

  • In an ANOVA test, the null hypothesis is that there is no significant difference between the means of the different samples.
  • The alternate hypothesis is that the mean of at least one sample differs from the others.

Example:

As shown below there is one dataset having 3 features.

import pandas as pd

df = pd.DataFrame([[13,17,19,11,20,15,18,9,12],
                   [12,8,6,16,12,14,10,18,4],
                   [7,19,15,14,10,16,18,11,14]]).T
df.rename({0:"Feature 1",1:"Feature 2",2:"Feature 3"},axis=1,inplace=True)

The mean of each feature and the grand mean are as follows:

print("Mean of each sample is as follows:\n",df.mean().values)
print("\nGrand Mean is as follows:\n",df.mean().mean())
>>>>
Mean of each sample is as follows:
[14.88888889 11.11111111 13.77777778]

Grand Mean is as follows:
13.25925925925926

Variability between groups will be calculated as follows:

mean_diff = df.mean().values - df.mean().mean()
squared_mean_diff = mean_diff**2

sample_size_N1 = df["Feature 1"].shape[0]
sample_size_N2 = df["Feature 2"].shape[0]
sample_size_N3 = df["Feature 3"].shape[0]

between_diff1 = sample_size_N1*squared_mean_diff[0]
between_diff2 = sample_size_N2*squared_mean_diff[1]
between_diff3 = sample_size_N3*squared_mean_diff[2]
SSbetween = between_diff1+between_diff2+between_diff3
print("SSbetween:",SSbetween)
>>>> SSbetween: 67.85185185185189

Mean SSbetween is calculated as follows:

Number_of_features = df.shape[1]
degree_of_freedom_ssb = Number_of_features - 1
MSST = SSbetween/degree_of_freedom_ssb
print("Mean of SSbetween is:",MSST)
>>>> Mean of SSbetween is: 33.925925925925945

Variability within group will be calculated as follows:

import numpy as np

diff_1 = df['Feature 1']-df.mean().values[0]
diff_2 = df['Feature 2']-df.mean().values[1]
diff_3 = df['Feature 3']-df.mean().values[2]
diff_1_squared = diff_1**2
diff_2_squared = diff_2**2
diff_3_squared = diff_3**2
SSwithin_1 = np.sum(diff_1_squared)
SSwithin_2 = np.sum(diff_2_squared)
SSwithin_3 = np.sum(diff_3_squared)
print("Variability within group is as follows: ")
print("Feature 1:",SSwithin_1)
print("Feature 2:",SSwithin_2)
print("Feature 3:",SSwithin_3)
>>>>>
Variability within group is as follows:
Feature 1: 118.88888888888889
Feature 2: 168.88888888888889
Feature 3: 119.55555555555554
SSwithin = SSwithin_1 + SSwithin_2 + SSwithin_3
print("SS within",SSwithin)
>>> SS within 407.3333333333333

Mean of SSwithin

degrees_of_freedom_SSwithin = df.shape[0]*df.shape[1] - df.shape[1]
MSSW = SSwithin / degrees_of_freedom_SSwithin
print("Within degrees of freedom:",degrees_of_freedom_SSwithin)
print("Mean of SSwithin:",MSSW)

>>>>
Within degrees of freedom: 24
Mean of SSwithin: 16.97222222222222

F-Statistics
The F-statistic is defined as the ratio of the mean sum of squares between groups (MSST) to the mean sum of squares within groups (MSSW):

F = MSST / MSSW

F_stat = MSST/MSSW
print(F_stat)

>>>> 1.998908892525915

F critical value
Now, using the F-distribution table, we convert the 5% significance level into the F critical value.

It is obtained by looking up the between-groups degrees of freedom (k-1 = 2) along one axis and the within-groups degrees of freedom (N-k = 24) along the other.

The F critical value at the 5% level with (2, 24) degrees of freedom is 3.40.

Now, since the calculated F-value (1.99) is less than the F critical value (3.40), we fail to reject the null hypothesis: the sample means do not differ significantly.
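As a cross-check, the entire one-way ANOVA above can be condensed into a few lines of NumPy (this is a sketch reproducing the same computation, not a replacement for the step-by-step version):

```python
import numpy as np

# One-way ANOVA on the same three features, computed end to end.
data = np.array([
    [13, 17, 19, 11, 20, 15, 18, 9, 12],   # Feature 1
    [12, 8, 6, 16, 12, 14, 10, 18, 4],     # Feature 2
    [7, 19, 15, 14, 10, 16, 18, 11, 14],   # Feature 3
], dtype=float)
k, n = data.shape                          # 3 groups, 9 observations each
grand_mean = data.mean()
ss_between = (n * (data.mean(axis=1) - grand_mean) ** 2).sum()
ss_within = ((data - data.mean(axis=1, keepdims=True)) ** 2).sum()
msst = ss_between / (k - 1)                # mean square between groups
mssw = ss_within / (k * n - k)             # mean square within groups
print("F-statistic:", msst / mssw)         # ≈ 1.9989, matching the value above
```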

This is all about the chi-squared and ANOVA tests.

Thank you for reading my article!!!
