# F-tests and ANOVAs — Examples with the Iris dataset

I had a colleague ask me recently what the purpose of an F-test was, and I wanted to provide him with some additional resources along with my answer. After diving a bit deeper into the subject, I thought a post on the assumptions of the F-test and how to digest its results would provide a better answer.

**One-Sided F-test**

Just as you would perform a t-test to determine if a sample mean (test group) came from another distribution (control group) with the same mean, an F-test can compare the means of several groups and determine if they are equal by looking at their variances. We can do this by checking the variation *between* the groups and *within* the groups.

Consider the three sample means of `sepal_width_cm`, one per group, from the `Iris` dataset. Our goal is to determine whether each group's mean is statistically different from the others', and to do this we need to evaluate the variability between the mean values.

From the above we want to create a null and alternative hypothesis to check whether the group means are equal.

`H0: x_bar1 = x_bar2 = x_bar3`

`H1: at least one group mean differs from the others`

By rejecting our null (`H0`) we know there are at least two group means that are statistically significantly different from each other. Let's take a look at those values.

```python
# Get the names of each target group for later
names = iris.get('target_names')

# What is the mean of each group?
print(df.groupby('target').mean())

#             sepal_width_cm
# target
# setosa               3.418
# versicolor           2.770
# virginica            2.974

# Check the overall mean
print(df.mean())

# sepal_width_cm    3.054000
```

Now that we have some baseline numbers to work with, let's consider the statistic we need to evaluate the null hypothesis. Unsurprisingly, this is called the F-statistic, and it can be thought of like this:

`F-statistic = variance between groups / variance within groups`

If our null hypothesis is correct and all the group means are equal, the numerator and denominator should be roughly the same, and the statistic should be near 1.0. A higher ratio, where the F-statistic is far greater than 1.0, implies the samples are drawn from populations with quite different mean values.
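To make the ratio concrete, here is a small sketch (using made-up groups, not the Iris data yet) that computes the statistic by hand and checks it against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

# Three hypothetical groups, purely for illustration
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (0.0, 0.0, 1.0)]

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total observations
grand_mean = np.concatenate(groups).mean()

# Variance between groups vs. variance within groups
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ssb / (k - 1)) / (sse / (n - k))

f_scipy, p_value = stats.f_oneway(*groups)
print(np.isclose(f_manual, f_scipy))  # True: the hand calculation matches
```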

**Assumptions of the F-test**

Before we jump into the F-test, make sure a few assumptions are met by the data, as this test is highly sensitive to non-normal data. Echoing Glen_b's post here, if your data were drawn from heavier- or lighter-tailed groups, the F-statistic would give you incorrect results.

This is because the sampling distribution of the sample variance is directly affected by the kurtosis of the underlying distribution. The lighter- or heavier-tailed our parent distribution, the more likely the usual chi-squared approximation for the sample variance will give an incorrect answer.

Glen_b’s post does a great job of explaining more of the underlying assumptions with sample variance, which I highly suggest going over for a more technical understanding why non-normal data affects the test.
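As a quick illustration of that sensitivity, the simulation below (my own, not from Glen_b's post) draws repeated samples from a normal distribution and from a heavy-tailed t(5) distribution rescaled to the same variance, then compares how much the sample variances themselves vary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, df_t = 20, 20_000, 5

# Sample variances from a normal parent (excess kurtosis 0)
normal_vars = rng.normal(size=(reps, n)).var(axis=1, ddof=1)

# Sample variances from a t(5) parent rescaled to unit variance
# (Var[t_df] = df / (df - 2), so divide by that standard deviation)
t_draws = rng.standard_t(df_t, size=(reps, n)) / np.sqrt(df_t / (df_t - 2))
t_vars = t_draws.var(axis=1, ddof=1)

# Heavier tails -> the sampling distribution of the variance is wider
print(normal_vars.std() < t_vars.std())  # True
```

Even though both parents have variance 1, the heavy-tailed groups produce far more erratic sample variances, which is exactly what throws off the F-statistic.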

Back to our assumptions, it is necessary for these properties to hold when performing an F-test.

- Data is normally distributed;
- The samples are independent of one another, and;
- The *population* standard deviations of the groups are homoscedastic (equal).

Let's test these assumptions on `sepal_width_cm` for each `target`, using a Shapiro-Wilk test to look at the normality of our data. If a p-value is below .05, we reject the hypothesis that the data is normally distributed.

```python
for name in names:
    print("{}: {}".format(name, stats.shapiro(df['sepal_width_cm'][df['target'] == name])))

# setosa:     (0.968691885471344, 0.20465604960918427)   Not rejected
# versicolor: (0.9741330742835999, 0.33798879384994507)  Not rejected
# virginica:  (0.9673910140991211, 0.1809043288230896)   Not rejected
```

Nothing here has been rejected, so our first assumption holds.

Let’s look at the QQ-plots for some visual representation of our data.
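A sketch of how those plots could be generated with `scipy.stats.probplot` (the DataFrame construction here is my assumption about how `df` was built earlier in the post):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
from sklearn.datasets import load_iris

# Rebuild the DataFrame used throughout the post
iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length_cm', 'sepal_width_cm',
                                      'petal_length_cm', 'petal_width_cm'])
df['target'] = [iris.target_names[i] for i in iris.target]

# One QQ-plot per flower type, comparing sample quantiles to normal ones
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, name in zip(axes, iris.target_names):
    stats.probplot(df.loc[df['target'] == name, 'sepal_width_cm'],
                   dist='norm', plot=ax)
    ax.set_title(name)
fig.savefig('qq_plots.png')
```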

The QQ-plots above each form lines that are roughly straight. As a quick refresher, a QQ-plot compares the theoretical quantiles of a normal distribution against the actual quantiles for each of our flowers; if the data is drawn from a normal distribution, the points should lie on the red line. While there are some deviations around the tails for setosa and virginica, the points mostly fall along the line. This suggests our data came from a normal distribution, with some variance around the tails, and backs up the Shapiro-Wilk tests from above.

Finally let’s check for homoscedasticity using a Bartlett test. If we reject the null hypothesis, we can also reject the assumption of homoscedasticity.

```python
stats.bartlett(df['sepal_width_cm'][df['target'] == names[0]],
               df['sepal_width_cm'][df['target'] == names[1]],
               df['sepal_width_cm'][df['target'] == names[2]])

# BartlettResult(statistic=2.2158125491551637, pvalue=0.3302496898960959)
```

What about Levene's test, which tests the same null hypothesis as Bartlett's?

```python
stats.levene(df['sepal_width_cm'][df['target'] == names[0]],
             df['sepal_width_cm'][df['target'] == names[1]],
             df['sepal_width_cm'][df['target'] == names[2]])

# LeveneResult(statistic=0.6475222363405327, pvalue=0.5248269975064537)
```

Both of these tests have failed to reject the null hypothesis that our samples are homoscedastic.

Keep in mind, the Bartlett test can provide incorrect results if the data is non-normal, while Levene’s test is more robust against that type of data. This is why it is possible to get different results from the two tests.
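To see that difference in robustness, here is a small simulation of my own (not from the original post): both tests are run repeatedly on heavy-tailed groups that truly have equal variances, counting how often each one falsely rejects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
reps, n = 2_000, 50

bartlett_rejections = levene_rejections = 0
for _ in range(reps):
    # Three heavy-tailed t(3) groups with identical population variances
    groups = [rng.standard_t(3, size=n) for _ in range(3)]
    bartlett_rejections += stats.bartlett(*groups).pvalue < 0.05
    levene_rejections += stats.levene(*groups).pvalue < 0.05

# Bartlett's false-rejection rate is inflated by the heavy tails,
# while Levene's stays much closer to the nominal 5%
print(bartlett_rejections > levene_rejections)  # True
```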

**On To Our F-test**

Now that we have validated our assumptions, we can finally run the F-test. We'll be using `f_oneway` from the `scipy` package to test our null hypothesis that two or more groups have the same population mean.

We need to evaluate the F-Statistic against a critical point on the F-Distribution to determine if our result is significant. We’ll get to a more detailed example of how to evaluate that when we get to the ANOVA test.

```python
import scipy.stats as stats

stats.f_oneway(df['sepal_width_cm'][df['target'] == names[0]],
               df['sepal_width_cm'][df['target'] == names[1]],
               df['sepal_width_cm'][df['target'] == names[2]])

# F_onewayResult(statistic=47.36446140299382, pvalue=1.3279165184572242e-16)
```

Our F-statistic of 47.36 tells us the between-group variance is about 47 times the within-group variance. Now we know the ratio of our variances doesn't equal one, and the null hypothesis of equal mean values is rejected, since the p-value is `< 0.05`.

**F-Distribution**

Let's quickly go over the F-distribution, as we're using it to evaluate our F-statistic. We'll be evaluating the F-statistic at the 95% level (α = 0.05), and to determine the shape of our distribution we need the dfn (degrees of freedom of the numerator) and dfd (degrees of freedom of the denominator).

```
# F-distribution parameters
dfn = k - 1 = 3 - 1 = 2
dfd = N - k = 150 - 3 = 147
```

Where `k` is the number of comparison groups (three in our case) and `N` is the total number of observations in the analysis, which is `150`.
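The critical value for this distribution can be pulled straight from `scipy`:

```python
from scipy.stats import f

# 95th percentile of F(2, 147): reject H0 when the statistic exceeds this
critical_value = f.ppf(0.95, dfn=2, dfd=147)
print(round(critical_value, 2))  # 3.06
```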

So for our data, the rejection decision is: reject `H0` in favour of `H1` if the statistic is greater than 3.06. And, from our work previously, we know our statistic is quite large, so we will reject `H0`.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f

def plot_f_distribution():
    # Set figure
    plt.figure(figsize=(12, 8))

    # Set degrees of freedom
    dfn, dfd = 2, 147
    rejection_reg = f.ppf(q=.95, dfn=dfn, dfd=dfd)
    mean, var, skew, kurt = f.stats(dfn, dfd, moments='mvsk')
    x = np.linspace(f.ppf(0.01, dfn, dfd),
                    f.ppf(0.99, dfn, dfd), 100)

    # Plot values
    plt.plot(x, f.pdf(x, dfn, dfd), alpha=0.6,
             label=' X ~ F({}, {})'.format(dfn, dfd))
    plt.vlines(rejection_reg, 0.0, 1.0,
               linestyles="dashdot",
               label="Crit. Value: {:.2f}".format(rejection_reg))
    plt.legend()
    plt.title('F-Distribution dfn:{}, dfd:{}'.format(dfn, dfd))

plot_f_distribution()
```

It's also important to note that you'll only ever perform a one-sided F-test, never a two-sided one. The F-statistic is computed from the ratio of two variances, so it can never be negative. Even if the `variance within groups` term were quite large, the resulting F-statistic would be close to zero, but never below it.

**Why Not Multiple t-tests?**

If we were comparing just *one* set of flowers in our `Iris` dataset to a *single* grouping of the other two, we could still use a t-test at a 95% CI to compare the two values. Meaning, our Type I error rate would be constant at 5%.

However, as soon as we add another comparison and use an additional t-test to make it, we increase the chance of a Type I error. We are adding another test which *also* has a 5% probability of incorrectly rejecting the null hypothesis.

Consider if we want to compare each of the three groups in our Iris dataset (`setosa`, `versicolor`, and `virginica`) using a t-test for each pairing. This gives us three total pairs to compare:

- Setosa vs Versicolor;
- Setosa vs Virginica, and;
- Versicolor vs Virginica.

Because we're comparing three distinct groups with three tests instead of a single t-test, the probability of reaching at least one incorrect conclusion increases to `1-(1-.05)³ = 0.1426` instead of just 5%. So the probability of making a `Type-I` error is just over 1 in 7, rather than 1 in 20.
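That arithmetic is easy to verify:

```python
# Probability of at least one false positive across m independent
# tests, each run at alpha = 0.05
alpha, m = 0.05, 3
fwer = 1 - (1 - alpha) ** m
print(round(fwer, 4))  # 0.1426
```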

By using an F-test / ANOVA we can control for these knock-on `Type-I` errors by performing a single test, which keeps the probability of incorrectly rejecting `H0` at 5%. As well, we can be more confident that any statistically significant result found is not the product of running test after test.

**Further Calculations with ANOVA**

Performing `f_oneway` leaves much to be desired, as much of the analysis and understanding behind the calculation is left in a black box.

For instance, do you know:

- What the sums of squares are?
- How the coefficient values interact with each other?
- The significance of each group?

The Analysis of Variance (`ANOVA`) will assign the total variation to each independent variable and test each for its significance against the dependent variable. It will also illuminate some of the abstraction our one-way F-test hid from us.

To do this we run a regular `ols` regression with `sepal_width_cm` as the dependent variable and our `target` flower types as the categorical independent variables.

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

results = ols('sepal_width_cm ~ C(target)', data=df).fit()
print(results.summary())

# The really important numbers:
# F(df model, df residuals) = F(2, 147) = 47.36
# Prob (F-statistic): 1.33e-16
```

We get the same F-statistic as before (we're still rejecting `H0`, so nothing has changed there), and we can also break the calculation down further, as we now have more information on how the result was computed.

```
F-statistic = MSB / MSE

MSB = SSB / (k - 1)   # mean square between groups
MSE = SSE / (n - k)   # mean square error (within groups)

SSB = sum of squared differences between each group mean and the overall mean
SSE = sum of squared differences between each observation and its own group mean
k   = number of groups
n   = number of observations
```

But we still don't have the `MSB` and `MSE` from the `ANOVA` table to compute the `F-statistic` ourselves. For that we'll need `anova_lm` from `statsmodels`.

```python
aov_table = sm.stats.anova_lm(results, typ=2)
print(aov_table)

#              sum_sq     df          F        PR(>F)
# C(target)   10.9776    2.0  47.364461  1.327917e-16
# Residual    17.0350  147.0        NaN           NaN

# Calculate our F-statistic by hand:
# MSB = 10.9776 / 2.0, MSE = 17.0350 / 147.0
print((10.9776 / 2.0) / (17.0350 / 147.0))
# 47.364461403
```

Nice. Now we know how to correctly compute this statistic.

**Further Analysis**

Let's take a look at the `coef` for each of our target flowers. The `setosa` value of `3.418` is the intercept of our model, and the baseline from which the mean values of the other flowers differ. If you check the box plot earlier in the article, you can see just how much greater it is than the other two flower types.

To see the impact of `versicolor`, take the intercept and add its coefficient: `3.418 + (-.648) = 2.77`, which matches the group mean we calculated earlier. The same can be done for `virginica`.
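A sketch of that bookkeeping, rebuilding the model from scratch (the `C(target)[T.versicolor]` name is how `statsmodels` labels treatment-coded categories; the DataFrame construction is my assumption about the earlier setup):

```python
import pandas as pd
from sklearn.datasets import load_iris
from statsmodels.formula.api import ols

iris = load_iris()
df = pd.DataFrame({'sepal_width_cm': iris.data[:, 1],
                   'target': [iris.target_names[i] for i in iris.target]})

results = ols('sepal_width_cm ~ C(target)', data=df).fit()

# The intercept is the setosa mean; each coefficient is an offset from it
intercept = results.params['Intercept']
versicolor_mean = intercept + results.params['C(target)[T.versicolor]']
observed_mean = df.loc[df['target'] == 'versicolor', 'sepal_width_cm'].mean()
print(abs(versicolor_mean - observed_mean) < 1e-9)  # True
```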

Furthermore, the regression summary reports the Durbin-Watson (`DW`) statistic as `1.858`. This checks whether the residuals are serially correlated; if a trend occurred within the errors, the assumption of independent errors would be violated and we would need to reevaluate our model. Generally, if the `DW` stat falls outside `1.5 < DW < 2.5` we should look into whether the assumption was violated, but in this case we're fine, as our result is within that range.
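For reference, the statistic itself is available directly in `statsmodels`; here is a toy check of my own on uncorrelated residuals:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
resid = rng.normal(size=200)   # uncorrelated, well-behaved residuals

# DW is computed from successive residual differences; values near 2
# indicate little first-order autocorrelation
dw = durbin_watson(resid)
print(1.5 < dw < 2.5)  # True: no sign of autocorrelation
```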

The Cond. No., or condition number, checks for multicollinearity within our data; a result over 20 is cause for concern. Luckily, `statsmodels` will print a warning with your output when it's large, so just keep an eye out for it. Our result is `3.73`, so we don't need to worry.

Finally, the Jarque-Bera (`JB`) number further checks our previous assumption of normality. I know we went through this before with the Shapiro-Wilk test, but this is one final check. The closer the `JB` number is to 0, the more consistent the data is with a normal distribution, with a skew of 0 and an excess kurtosis of 0. Our number is `1.565`, which is small and comes with a large p-value, so it fails to reject the normality assumption, agreeing with the work we did previously. Keep in mind, though, that the `JB` test does not perform well on small sample sizes, so take its result with some skepticism.
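`scipy` exposes the same test, and a quick comparison of my own shows how the statistic grows as data departs from normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
jb_normal, _ = stats.jarque_bera(rng.normal(size=500))
jb_skewed, _ = stats.jarque_bera(rng.exponential(size=500))

# Skewed, heavy-tailed data pushes the JB statistic far from zero
print(jb_normal < jb_skewed)  # True
```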

**Further Questions to Think About**

Now that we’ve gone through the analytical process of determining how an F-test works, think about these questions.

Is a negative F-statistic possible?

No. Because the F-statistic is the ratio of two terms that are themselves sums of squares, it is not possible for it to be negative. It would be possible to get 0, if every group mean matched the overall mean and the between-group variability were zero.

What does a very low F-statistic mean?

It implies there is low variability *between* the group means relative to the variability *within* each group. It would be harder to reject the null hypothesis of equal means, as our critical value would most likely be larger than our F-statistic.

What does a high F-statistic mean?

The opposite of the above: the variability *between* group means is much larger than the variability *within* the groups. It will be much easier to reject the null hypothesis that the group means are equal, as our F-statistic will be quite high.

**Where Do I Go From Here?**

Post hoc tests (a posteriori tests) should be used to confirm the `coef` values from our `target` groups, but only if the overall model was significant, and ours was.

Here are just a few post hoc tests we could perform to further verify our results:

- Bonferroni Procedure / Bonferroni Type Adjustment;
- Tukey’s Honestly Significant Difference;
- Fisher’s Least Significant Difference;
- Newman-Keuls Method;
- Dunnett’s Multiple Comparison, and;
- Duncan Multiple Range.

I’m not going to go over each of them in detail, but I do want to quickly go over the Bonferroni Procedure.

We want to guard against alpha inflation in our results, also known as controlling the familywise error rate (`FWER`), since with multiple comparisons we have an increased probability of coming to *at least* one false conclusion from our hypothesis testing.

Given we declare tests significant for a p-value < .05, we need to update our alpha level to α / k, so any post hoc analysis needs to have a p-value below this level.

`α / k = .05 / 3.0 = 0.01667`

Now that we have our updated alpha value, we can go back to the `ANOVA`

, get our original p-values and do our post hoc check using the Bonferroni Procedure to see if we can still trust our results.

```python
import statsmodels.stats.multitest as smm

# raw_pvalues holds the original p-value for each group from the regression
reject, pvals_corrected, alphacSidak, alphacBonf = smm.multipletests(
    raw_pvalues, .05, method='b')

for rej, val, name in zip(reject, pvals_corrected, names):
    print("{:<10}: {} {:.4}".format(name, rej, val))

# Results
# setosa    : True 3.533e-115
# versicolor: True 1.494e-16
# virginica : True 3.15e-09
```

This tells us that, after the adjustment, all of our groupings are still considered significant.

**Final Thoughts**

That went on for far longer than I expected, and I hope it gave some insight into conducting an F-test and the additional points to consider. There is more to an F-test than simply performing it; it is important to evaluate your testing while you're performing it, or else you're bound to fall into common pitfalls and incorrect conclusions.

As always, I hope you’ve found this helpful and learned something new.

**Additional Reading**