ANOVA from a Machine Learning Perspective

Cameron Wasson
3 min read · Jun 28, 2024


It is no secret that statistical analysis is a key pillar of machine learning’s foundation. Oftentimes, machine learning is most effective when feature data is preprocessed using statistical techniques, which yields more accurate models.

An incredibly useful statistics-based process for optimizing a model’s feature space is Analysis of Variance, or ANOVA for short. ANOVA compares the variance between classes to the variance within each class and condenses the comparison into a single unitless metric, dubbed the “f-score”. When a feature’s f-score is HIGH, the class means are far apart relative to the spread inside each class, making it easier for a classifier to assign feature observations to their respective class/label values.

The f-score is produced by computing the ratio of the mean sum of squares between classes to the mean sum of squares within each class, which can be pictorially represented by the following PDF curves:

Within Group: Variance of the Class with respect to itself | Between Group: Variance of the Classes with respect to each other

Mathematically, this can be represented by the following:
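
The original figure for this formula is not reproduced here. Based on the code that follows, the one-way f-score can be written as below, where K is the number of classes, N the total number of observations, n_k the size of class k, x̄_k its mean, and x̄ the overall mean. The symbols are chosen here for illustration and may differ from the original figure.

```latex
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{\dfrac{1}{K-1}\sum_{k=1}^{K} n_k\,(\bar{x}_k - \bar{x})^2}
         {\dfrac{1}{N-K}\sum_{k=1}^{K}\sum_{i=1}^{n_k} (x_{k,i} - \bar{x}_k)^2}
```

The numerator grows as the class means move away from the overall mean, and the denominator grows as observations spread out around their own class mean.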

Since this is largely straightforward arithmetic, the computation is easy to implement as a function in your favorite programming language. I wrote the following code in Python, using the NumPy library:

import numpy as np

def anova_1d(values, labels):
    # get label values and compute degrees of freedom (DoF)
    label_values = np.unique(labels)
    dof_between, dof_within = (label_values.shape[0] - 1, labels.shape[0] - label_values.shape[0])

    # calculate per class sum of squares
    ss_between, ss_within = 0, 0
    for i in range(label_values.shape[0]):
        # grab indices for this class/group/label value
        lab_idx = (labels == label_values[i])

        # calculate sum of the squares of feature values between classes
        ss_between += len(values[lab_idx]) * np.square(np.abs(np.mean(values[lab_idx]) - np.mean(values)))

        # calculate sum of the squares of feature values within classes
        ss_within += np.sum(np.square(np.abs(values[lab_idx] - np.mean(values[lab_idx]))))

    # compute F score
    f_score = (ss_between / dof_between) / (ss_within / dof_within)
    return f_score
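
As a sanity check, a hand-written f-score can be compared with SciPy's `scipy.stats.f_oneway`, which computes the one-way ANOVA F statistic. The sketch below is a minimal, self-contained comparison: `anova_f` is a compact stand-in for the function above (a hypothetical name, not part of any library), and the toy data is made up purely for illustration.

```python
import numpy as np
from scipy.stats import f_oneway

def anova_f(values, labels):
    """Compact one-way ANOVA F statistic (same arithmetic as anova_1d)."""
    classes = np.unique(labels)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for c in classes:
        group = values[labels == c]
        ss_between += group.size * (group.mean() - grand_mean) ** 2
        ss_within += np.sum((group - group.mean()) ** 2)
    dof_between = classes.size - 1
    dof_within = values.size - classes.size
    return (ss_between / dof_between) / (ss_within / dof_within)

# toy data: three classes with different means (illustrative values only)
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(m, 1.0, 200) for m in (0.0, 0.5, 1.0)])
labels = np.repeat([0, 1, 2], 200)

ours = anova_f(values, labels)
# f_oneway takes one array per group
reference = f_oneway(*(values[labels == c] for c in np.unique(labels)))
print(ours, reference.statistic)
```

If the two numbers agree to floating-point precision, the manual implementation is doing the same arithmetic as the library routine. `f_oneway` also returns a p-value, which the hand-written function does not compute.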

With our one-way ANOVA function written, let’s examine some simple examples on how we can interpret the f-score!

Below, we will create a dataset with two classes that are very well separated, meaning the class means are far from each other. As you can see, it produces a VERY high f-score (51367.76 in the run shown here), meaning a machine learning classifier would quite easily learn how to classify this dataset.


import numpy as np

# create first dataset
data1 = np.random.randint(10, 20, size=(1000,))
label1 = np.ones_like(data1)
# create second dataset
data2 = -1*data1
label2 = -1*label1
# combine into one dataset
data = np.hstack((data1, data2))
label = np.hstack((label1, label2))
# compute ANOVA, print f-score
f_score = anova_1d(data, label)
print(f_score) # 51367.76 in one run; no seed is set, so the exact value varies

In most cases, we will not see datasets this well separated. Noise and conflicting datapoints are all too common in most data, as shown by the example below, which creates two closely spaced Gaussian distributions:

np.random.seed(0)

# create first dataset
data1 = np.random.normal(.1, 1, 10000) # gaussian with peak at 0.1
label1 = np.ones_like(data1)
# create second dataset
data2 = np.random.normal(-.1, 1, 10000) # gaussian with peak at -0.1
label2 = -1*np.ones_like(data2)
# combine into one dataset
data = np.hstack((data1, data2))
label = np.hstack((label1, label2))
# compute ANOVA, print f-score
f_score = anova_1d(data, label)
print(f_score) # 148.3311

To visualize the overlap between these two Gaussians, we can plot their histograms:

import matplotlib.pyplot as plt

plt.hist(data1, bins=50, label="class +1")
plt.hist(data2, bins=50, alpha=.75, label="class -1")
plt.legend()
plt.show()
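
To see how the f-score responds to class separation more generally, the sketch below sweeps the gap between two unit-variance Gaussian classes and records the resulting f-score. The helper `f_score_two_classes` and the chosen gap values are assumptions made for this illustration; the helper repeats the one-way ANOVA arithmetic for the two-class case.

```python
import numpy as np

def f_score_two_classes(a, b):
    """One-way ANOVA F statistic for two 1-D samples."""
    values = np.concatenate([a, b])
    grand_mean = values.mean()
    ss_between = (a.size * (a.mean() - grand_mean) ** 2
                  + b.size * (b.mean() - grand_mean) ** 2)
    ss_within = np.sum((a - a.mean()) ** 2) + np.sum((b - b.mean()) ** 2)
    dof_between = 1                  # number of classes - 1
    dof_within = values.size - 2     # N - number of classes
    return (ss_between / dof_between) / (ss_within / dof_within)

rng = np.random.default_rng(0)
gaps = [0.0, 0.5, 1.0, 2.0, 4.0]     # distance between the two class means
f_scores = []
for gap in gaps:
    a = rng.normal(+gap / 2, 1.0, 10_000)
    b = rng.normal(-gap / 2, 1.0, 10_000)
    f_scores.append(f_score_two_classes(a, b))

for gap, f in zip(gaps, f_scores):
    print(f"gap={gap:.1f}  F={f:.1f}")
```

With the within-class spread held fixed, the f-score climbs steeply as the class means move apart, which is why the overlapping Gaussians score so much lower than the cleanly separated integer dataset.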

By utilizing ANOVA, a machine learning scientist can employ a straightforward mathematical process to express how well a feature separates the classes in their dataset, summarized by the f-score. This is especially useful with large datasets and complicated machine learning models.

While the ANOVA process takes <1 second to run in most cases, training a machine learning model can take minutes to hours depending on the size of the model and dataset. ANOVA is therefore a quick way to assess how learnable your feature space is, and it can help fine-tune any feature space hyperparameters to maximize each feature’s f-score.
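
One way to apply this to a multi-feature dataset is to compute an f-score per column and rank the features by it. The sketch below is a minimal illustration under assumptions: `anova_per_feature` is a hypothetical helper (not a library function), and the three synthetic features are constructed so that one is strongly class-dependent, one weakly, and one is pure noise. scikit-learn's `sklearn.feature_selection.f_classif` offers a comparable per-feature ANOVA F-value if you prefer a library routine.

```python
import numpy as np

def anova_per_feature(X, y):
    """Return one F-score per column of X (shape: n_samples x n_features)."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        group = X[y == c]
        group_mean = group.mean(axis=0)
        ss_between += group.shape[0] * (group_mean - grand_mean) ** 2
        ss_within += np.sum((group - group_mean) ** 2, axis=0)
    dof_between = classes.size - 1
    dof_within = X.shape[0] - classes.size
    return (ss_between / dof_between) / (ss_within / dof_within)

rng = np.random.default_rng(0)
n = 2_000
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(0.0, 1.0, n) + 2.0 * y,   # strongly separated by class
    rng.normal(0.0, 1.0, n) + 0.2 * y,   # weakly separated by class
    rng.normal(0.0, 1.0, n),             # pure noise, unrelated to y
])

scores = anova_per_feature(X, y)
ranking = np.argsort(scores)[::-1]       # feature indices, best first
print(scores, ranking)
```

Features at the top of the ranking are the most promising candidates to keep; a per-feature f-score looks at each column in isolation, so it will not reveal features that are only useful in combination.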

I hope you, the reader, can add something new into your machine learning pipelines to produce more efficacious models!
