Understanding One-Way ANOVA: Complete Guide and Implementation Insights

Vamshinaikjarpula
8 min read · Jul 9, 2024


ANOVA (Analysis of Variance):

Analysis of variance (ANOVA) is a powerful statistical technique used to compare the means of three or more groups to see if at least one of the group means is significantly different from the others. By partitioning the total variability observed in the data into components attributable to different sources of variation, ANOVA helps in understanding whether the differences in sample means reflect true population differences or are merely due to random chance.

Key Concepts of ANOVA:

  1. Independent Variable (Factor): The variable that categorizes the data into different groups. For example, in a clinical trial, this could be the type of treatment administered.
  2. Dependent Variable: The outcome variable measured in the study, which is expected to change in response to different levels of the independent variable. For example, this could be the blood pressure level of patients.
  3. Null Hypothesis (H0): Assumes that all group means are equal, indicating no effect of the independent variable.
  4. Alternative Hypothesis (Ha): Assumes that at least one group mean is different from the others, indicating a significant effect of the independent variable.

INTRODUCTION TO ONE-WAY ANOVA:

One-way analysis of variance (ANOVA) is a statistical method for testing for differences in the means of three or more groups. One-way ANOVA is commonly applied when you want to examine the effect of a single independent variable, also known as a factor, on a dependent variable. The purpose is to determine whether different levels or variations of that factor lead to significant changes in the dependent variable.

One-way ANOVA (Analysis of Variance) is a statistical technique used to test the null hypothesis (H0) that the means of three or more populations are equal against the alternative hypothesis (Ha) that at least one mean is different.
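
For comparison with the step-by-step calculation that follows, here is a minimal sketch using SciPy's built-in one-way ANOVA, scipy.stats.f_oneway. The three groups and their values are made up purely for illustration.

import numpy as np
from scipy.stats import f_oneway

# Three illustrative samples (e.g. measurements under three treatments)
group_a = np.array([23.1, 25.4, 24.8, 26.0, 25.1])
group_b = np.array([27.3, 28.1, 26.9, 29.0, 27.5])
group_c = np.array([22.0, 23.5, 22.8, 24.1, 23.0])

# f_oneway returns the F-statistic and the corresponding p-value
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

A p-value below the chosen significance level (e.g., 0.05) suggests that at least one group mean differs from the others.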

Null Hypothesis (H0):

The null hypothesis states that there are no significant differences between the group means. This implies that any observed differences in sample means are due to random variation or sampling error.

Mathematical Expression: H0: μ1 = μ2 = … = μk, where μi denotes the population mean of the i-th group and k is the number of groups.

Alternative Hypothesis (Ha):

The alternative hypothesis states that at least one of the group means is significantly different from the others. This suggests that the differences observed in the sample means reflect true differences in the population means.

Mathematical Expression: Ha: μi ≠ μj for at least one pair of groups (i, j).

Assumptions of ANOVA (Analysis of Variance)

  • Independence: Observations within and between groups are independent of each other.
  • Normality: The dependent variable follows a normal distribution within each group.
  • Homogeneity of Variances: Variances of the dependent variable are equal across all groups.
  • Random Sampling: Data points are randomly sampled from the population.
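
The implementation later in this post checks normality with the Jarque-Bera test and equal variances with Levene's test. As a quick illustration, the sketch below checks the same assumptions on the made-up groups from the previous example, using the Shapiro-Wilk test as an alternative normality check.

from scipy.stats import shapiro, levene

groups = {"A": group_a, "B": group_b, "C": group_c}

# Normality within each group (Shapiro-Wilk test)
for name, values in groups.items():
    stat, p = shapiro(values)
    print(f"Group {name}: normality {'met' if p > 0.05 else 'not met'} (p = {p:.3f})")

# Homogeneity of variances across groups (Levene's test)
stat, p = levene(*groups.values())
print(f"Equal variances {'met' if p > 0.05 else 'not met'} (p = {p:.3f})")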

Steps Involved in One-Way ANOVA to Find F-Statistic

Performing a one-way ANOVA involves several steps, from formulating hypotheses to calculating the F-statistic. Here’s a detailed outline of the process:

1. Formulate Hypotheses

Null Hypothesis (H0): The means of all groups are equal.

Alternative Hypothesis (Ha): At least one group mean is different from the others.

2. Calculate Group Means and Overall Mean

Group Mean (X̄i): Calculate the mean of each group: X̄i = (Σj xij) / ni, where ni is the number of observations in group i.

Overall Mean (X̄): Calculate the overall mean of all observations: X̄ = (Σi Σj xij) / N, where N is the total number of observations.
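
As a running example for the remaining steps, the sketch below puts three made-up groups into a wide-format pandas DataFrame (one column per group, the layout the implementation later in this post expects) and computes the group means and overall mean.

import pandas as pd

data = pd.DataFrame({
    "A": [23.1, 25.4, 24.8, 26.0, 25.1],
    "B": [27.3, 28.1, 26.9, 29.0, 27.5],
    "C": [22.0, 23.5, 22.8, 24.1, 23.0],
})

group_means = data.mean()          # X̄i: mean of each group (column)
overall_mean = data.values.mean()  # X̄: mean of all observations
print(group_means)
print(f"Overall mean: {overall_mean:.3f}")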

3. Calculate Sum of Squares

Total Sum of Squares (SST): Measures the total variability in the data: SST = Σi Σj (xij − X̄)².

Between-Group Sum of Squares (SSB): Measures the variability due to the differences between group means: SSB = Σi ni (X̄i − X̄)².

Within-Group Sum of Squares (SSW): Measures the variability within each group: SSW = Σi Σj (xij − X̄i)². Note that SST = SSB + SSW.
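
Continuing the running example, the sketch below computes the three sums of squares and verifies the partition SST = SSB + SSW.

import numpy as np

sst = ((data.values - overall_mean) ** 2).sum()                 # total variability
ssb = (data.count() * (group_means - overall_mean) ** 2).sum()  # between-group variability
ssw = ((data - group_means) ** 2).to_numpy().sum()              # within-group variability

print(f"SST = {sst:.3f}, SSB = {ssb:.3f}, SSW = {ssw:.3f}")
print("SST = SSB + SSW:", np.isclose(sst, ssb + ssw))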

4. Calculate Degrees of Freedom

Between-Group Degrees of Freedom (dfB): dfB = k − 1

Within-Group Degrees of Freedom (dfW): dfW = N − k

where k is the number of groups and N is the total number of observations.

5. Calculate Mean Squares

Mean Square Between Groups (MSB): MSB = SSB / dfB

Mean Square Within Groups (MSW): MSW = SSW / dfW

6. Calculate the F-Statistic

F-Statistic: F = MSB / MSW
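
Steps 4 to 6 follow directly from the quantities above. The sketch below continues the running example and uses scipy.stats.f to turn the F-statistic into a p-value.

from scipy.stats import f

k = data.shape[1]   # number of groups
N = data.size       # total number of observations

df_between = k - 1
df_within = N - k

msb = ssb / df_between   # mean square between groups
msw = ssw / df_within    # mean square within groups
f_stat = msb / msw

p_value = f.sf(f_stat, df_between, df_within)  # right-tail probability under the F-distribution
print(f"F({df_between}, {df_within}) = {f_stat:.3f}, p = {p_value:.4f}")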

7. Compare F-Statistic to Critical Value

Determine the Critical Value: Using the F-distribution table and the degrees of freedom dfB and dfW, find the critical value at the chosen significance level (e.g., 0.05).

Decision Rule: If the calculated F-statistic is greater than the critical value, reject the null hypothesis. Otherwise, fail to reject the null hypothesis.
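
Instead of looking up an F-table, the critical value can be obtained from scipy.stats.f.ppf, continuing the running example:

alpha = 0.05
f_critical = f.ppf(1 - alpha, df_between, df_within)  # upper critical value at the 5% level

if f_stat > f_critical:
    print(f"F = {f_stat:.3f} > critical value {f_critical:.3f}: reject H0")
else:
    print(f"F = {f_stat:.3f} <= critical value {f_critical:.3f}: fail to reject H0")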

Implementing Statistical Analysis with Python: ANOVA and Kruskal-Wallis Tests

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import jarque_bera
from scipy.stats import f

class ANOVA_Oneway:
    def __init__(self, data, alpha=0.05):
        self.data = data
        self.alpha = alpha

    # Calculate the mean of each column (group) in the DataFrame.
    def calculate_group_means(self):
        group_means = self.data.mean()
        print("\nGroup Means:")
        print("*" * 60)
        print(group_means)
        return group_means

    # Calculate residuals by subtracting the group means from the data.
    def calculate_residuals(self, group_means):
        residuals = self.data.subtract(group_means, axis=1)
        print("\nResiduals:")
        print("*" * 60)
        print(residuals)
        print("*" * 60)
        return residuals

    # Apply the Jarque-Bera test to each column to assess normality.
    def Jarque_Bera(self, residuals):
        normality_results = {}
        for column in residuals.columns:
            stat, p_value = jarque_bera(residuals[column])
            normality_results[column] = p_value > 0.05
            print(f"Normality for {column}: {'Met' if p_value > 0.05 else 'Not met'} (p-value: {p_value})")
        return normality_results

    # Perform Levene's test for homogeneity of variances across groups.
    def Levene_test(self):
        stat, p_value = stats.levene(*[self.data[col] for col in self.data.columns])
        homogeneity_met = p_value > 0.05
        print("*" * 60)
        print(f"\nHomogeneity of variance: {'Met' if homogeneity_met else 'Not met'} (p-value: {p_value})")
        print("*" * 60)
        return homogeneity_met

    # Calculate the Sum of Squares Within (SSW) groups.
    def SSW(self):
        ssw = 0
        for name, values in self.data.items():
            group_mean = sum(values) / len(values)
            for observation in values:
                ssw += (observation - group_mean) ** 2
        return ssw

    # Calculate the Sum of Squares Between (SSB) groups.
    def SSB(self):
        ssb = 0
        total = sum([len(row) for row in self.data.values])
        overall_mean = sum([sum(row) for row in self.data.values]) / total
        for name, values in self.data.items():
            group_mean = sum(values) / len(values)
            ssb += len(values) * (group_mean - overall_mean) ** 2
        return ssb

    # Degrees of freedom for the Sum of Squares Between (SSB): k - 1.
    def df_SSB(self):
        return self.data.shape[1] - 1

    # Degrees of freedom for the Sum of Squares Within (SSW): N - k.
    def df_SSW(self):
        return (self.data.shape[0] - 1) * self.data.shape[1]

    # Calculate the Mean Square Within (MSW) groups.
    def MSW(self):
        msw = self.SSW() / self.df_SSW()
        return msw

    # Calculate the Mean Square Between (MSB) groups.
    def MSB(self):
        msb = self.SSB() / self.df_SSB()
        return msb

    # Calculate the F-statistic for the data.
    def F_statistic(self):
        return self.MSB() / self.MSW()

    # Perform the hypothesis test for ANOVA and interpret the result.
    def re(self):
        p_value = 1 - f.cdf(self.F_statistic(), self.df_SSB(), self.df_SSW())
        if p_value < self.alpha:
            print("We reject the null hypothesis (significant differences between groups).")
        else:
            print("Fail to reject the null hypothesis (no significant differences between groups).")
        return p_value

    # Determine and perform the appropriate test (ANOVA or Kruskal-Wallis) based on data assumptions.
    def perform_anova_or_kruskal(self):
        group_means = self.calculate_group_means()
        residuals = self.calculate_residuals(group_means)
        normality_results = self.Jarque_Bera(residuals)
        homogeneity_met = self.Levene_test()

        all_normality_met = all(normality_results.values())

        # Check Jarque-Bera test results
        if all_normality_met:
            print("\nAll groups meet the normality assumption (Jarque-Bera test).")
            print("*" * 60)
        else:
            print("\nNot all groups meet the normality assumption (Jarque-Bera test).")
            print("*" * 60)

        # Check Levene's test result
        if homogeneity_met:
            print("\nHomogeneity of variance is met (Levene's test).")
            print("*" * 60)
        else:
            print("\nHomogeneity of variance is not met (Levene's test).")
            print("*" * 60)

        # Determine which test to use based on the conditions
        if all_normality_met and homogeneity_met:
            print("\nAll assumptions met. Performing custom ANOVA calculation...")
            print("*" * 60)
            f_stat = self.F_statistic()
            p_value = self.re()
            print(f"ANOVA results: F-statistic = {f_stat}, p-value = {p_value}")
            print("*" * 60)
            return 'ANOVA', f_stat, p_value
        else:
            print("\nAssumptions not fully met. Performing Kruskal-Wallis test...")
            print("*" * 60)
            h_stat, p_value = stats.kruskal(*[self.data[col] for col in self.data.columns])
            print(f"Kruskal-Wallis results: H-statistic = {h_stat}, p-value = {p_value}")
            print("*" * 60)
            return 'Kruskal-Wallis', h_stat, p_value


# Usage example with CSV data
df2 = pd.read_csv(r"C:\Users\RAGHAVENDRA KUMAR\Downloads\annovaSample.csv")
df2.drop("Unnamed: 0", axis=1, inplace=True)

# Create the ANOVA_Oneway object
anova_test = ANOVA_Oneway(df2)

# Check whether the data is suitable for ANOVA and run the appropriate test
anova_test.perform_anova_or_kruskal()
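
Since the sample CSV is not included with this post, the class can also be exercised on synthetic wide-format data (one column per group). The column names and distribution parameters below are made up for illustration only.

# Synthetic demo data: three groups drawn from normal distributions
np.random.seed(42)
df_demo = pd.DataFrame({
    "Treatment_A": np.random.normal(50, 5, 30),
    "Treatment_B": np.random.normal(55, 5, 30),
    "Treatment_C": np.random.normal(50, 5, 30),
})

demo_test = ANOVA_Oneway(df_demo, alpha=0.05)
test_name, statistic, p_value = demo_test.perform_anova_or_kruskal()
print(f"Selected test: {test_name}")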


Explanation:

  • Imports: The code imports necessary libraries such as pandas, numpy, matplotlib, seaborn, and scipy.stats for data manipulation, plotting, statistical tests, and ANOVA calculations.
  • Class ANOVA_Oneway: This class is initialized with data (self.data) and an alpha significance level (self.alpha).
  • Initialization (__init__): Sets up the data and alpha value for hypothesis testing.
  • Mean Calculation (calculate_group_means): Computes the mean for each group in the data.
  • Residual Calculation (calculate_residuals): Calculates residuals by subtracting group means from the data.
  • Normality Test (Jarque_Bera): Uses the Jarque-Bera test to check normality assumption for each group's residuals.
  • Homogeneity Test (Levene_test): Performs Levene's test to check homogeneity of variances across groups.
  • Sum of Squares Within (SSW): Computes the sum of squares within groups.
  • Sum of Squares Between (SSB): Computes the sum of squares between groups.
  • Degrees of Freedom (df_SSB and df_SSW): Calculates degrees of freedom for between and within groups.
  • Mean Squares (MSB and MSW): Computes mean squares between and within groups.
  • F-Statistic (F_statistic): Calculates the F-statistic for ANOVA.
  • Hypothesis Testing (re): Performs the ANOVA hypothesis test and interprets the result based on the F-statistic and degrees of freedom.
  • Perform ANOVA or Kruskal-Wallis (perform_anova_or_kruskal): Checks assumptions of normality and homogeneity of variance. If assumptions are met, performs ANOVA; otherwise, performs Kruskal-Wallis test.

Conclusion:

In summary, this blog has provided a practical guide to performing one-way ANOVA using Python. By implementing a custom ANOVA_Oneway class, we calculated group means, assessed data assumptions, and conducted hypothesis testing to determine significant group differences. This approach equips data professionals with the tools to confidently analyze and interpret data, fostering informed decision-making in various analytical contexts.
