F-tests and ANOVAs — Examples with the Iris dataset

I had a colleague ask me recently what the purpose of an F-test was, and I wanted to provide him with some additional resources with my answer. After diving in a bit deeper on some of the subjects, I thought a post around the assumptions of the F-test and the digesting of its results, would provide a better answer.

One-Sided F-test

Just as you would perform a t-test to determine if a sample mean (test group) came from another distribution (control group) with the same mean, an F-test can compare the means of various groups and determine if they are equal by looking at their variances. We can do this be checking the variations between the groups and within the groups.

Consider a group of three samples means for sepal_width_cm from the Iris dataset. Our goal is to determine that each group’s mean value are statistically different from the other’s, and to do this we need to evaluate the variability between each of the mean values.

Image for post
Image for post
Different mean lengths for different flowers in the IRIS dataset

From the above we want to create a null and alternative hypothesis, to check the inequality of the means.

By rejecting our null (H0) we know there are least two group means that are statistically significantly different from each other. Let’s take a look at those values.

Now that we have some baseline numbers to work with, let’s consider the statistic we need to evaluate the null hypothesis. Unsurprisingly, this is called the F-statistic and can be thought of like this.

If our null hypothesis is correct that all the means of the groups are equal, the numerator and denominator should be roughly close to the same and the statistic should be near 1.0. And, with a higher ratio — when the F-statistic is far greater than 1.0 — implies the samples are drawn from populations with quite different mean values.

Assumptions of the F-test

Before we jump into the F-test make sure a few assumptions are met from the data, as this test is highly sensitive to non-normal data. Echoing Glen_b’s post here, if your data was drawn from heavier or lighter tailed groups the F-statistic would give you incorrect results.

This is because the sampling distribution of a sample variance is directly affected by kurtosis of the underlying distribution. The flatter or skinnier our sample distribution, the more likely the chi-squared distribution will provide an incorrect answer.

Glen_b’s post does a great job of explaining more of the underlying assumptions with sample variance, which I highly suggest going over for a more technical understanding why non-normal data affects the test.

Back to our assumptions, it is necessary for these properties to hold when performing an F-test.

  1. Data is normally distributed;
  2. The samples are independent of one another, and;
  3. The population standard deviations of the groups are homoscedastic.

Let’s test these assumptions on sepal_width_cm for each target using a Shapiro-Wilks test to look at the normality of our data. If our p-value is > .05, we would reject the normality of our data.

Nothing here has been rejected, so we can keep our first assumption held.

Let’s look at the QQ-plots for some visual representation of our data.

Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Have some slight skewing but the visual check is looking good

The QQ plots above are each forming lines that are roughly straight. Just as a quick refresher, the QQ-Plot checks the theoretical points of a normal distribution vs. the actual points for each of our flowers. If they are drawn from a normal distribution, the points should lie on the red line. While there are some deviations around the tails for setosa and virginica, the points are mostly along the curve. This is suggesting our data came from a normal distribution, but with some variance around the tails. The results are backing up our Shapiro-Wilks test from above.

Finally let’s check for homoscedasticity using a Bartlett test. If we reject the null hypothesis, we can also reject the assumption of homoscedasticity.

What about our Levene test, which has the same assumption as Bartlett.

Both of these test have failed to reject the null hypothesis that our samples are homoscedastic.

Keep in mind, the Bartlett test can provide incorrect results if the data is non-normal, while Levene’s test is more robust against that type of data. This is why it is possible to get different results from the two tests.

On To Our F-test

Now that we have validated our assumptions, we can finally run the F-test. We’ll be using the scipy package f_oneway to test our null hypothesis that two or more groups have the same population mean.

We need to evaluate the F-Statistic against a critical point on the F-Distribution to determine if our result is significant. We’ll get to a more detailed example of how to evaluate that when we get to the ANOVA test.

Our F-statistic of 47.36 suggests the between-groups variance is 47x the within-group variance. Now we know the ratio of our variances doesn’t equal one, and the null hypothesis of equal mean values is rejected due to the p value being < 0.05.


Let’s quickly go over the F-distribution as we’re using it to evaluate our F-statistic. We’re going to be evaluating the F-statistic at a 95% CI, and to determine the shape of our distribution we need the dfn (degrees of freedom numerator) and dfd (degrees of freedom denominator).

Were k is the number of comparison groups — in our case three — and N is the total number of observations in the analysis, which is 150.

So for our data, the rejection decision is: Reject H0 in favour of H1 if statistic > 3.06. And, from our work previously, we know our statistic is quite large so we will reject H0.

Image for post
Image for post
Our F-Distribution with the rejection region

It’s also important to note, when performing an F-test you’ll ever perform a single sided F-test and not a two sided test. The F-statistic will always be positive, as it is not possible for it to be negative as it is computed from the ratio of two variances. Even if the variance within groups number was quite large, the resulting F-statistic would be close — but not below — zero.

Why Not Multiple t-tests?

If we were comparing just one set of flowers in our Iris dataset to a single grouping of the other two, we could still use a t-test at a 95% CI to compare the two values. Meaning, our Type I error would be constant at 5%.

However as soon as we add another comparison and used an additional t-test to make we’re increasing the chance of a Type I error. We are adding another test which also has a 5% probability of incorrectly rejecting the null hypothesis.

Image for post
Image for post
Credit to JetLag on CrossValidated

Consider if we want to compare each of the 3 groups in our Iris dataset (setosa, versicolor, and virginica) using a t-test for each pairing. This would give us three total pairs to compare, which are:

  • Setosa vs Versicolor;
  • Setosa vs Virginica, and;
  • Versicolor vs Virginica.

Because we’re comparing 3 distinct groups with three tests, instead of just a single t-test, the probability of us calculating a result which is incorrect increase by 1-(1-.05)³ = 0.1426 instead of just 5%. So the probability of creating a Type-I error is just over 1 in 7, rather than 1 in 20.

By using an F-test / ANOVA we can control for these knock-on Type-I errors, by performing a single test. This keeps the probability of incorrectly rejecting H0 at 5%. As well, we can be more confident that any statistically significant results found are not the result of running of test after test.

Further Calculations with ANOVA

Performing f_oneway leaves much to be desired, and much of the analysis and understanding of the calculation are put in a black box.

For instance, do you know:

  • What are the sum of squares?
  • How do the coefficient values interact with each other?
  • The significance of each group?]

The Analysis of Variance (ANOVA) will assign the total variation of each independent variable, and tests each for its significance against the dependent variables. And it will illuminate some of the abstraction our one way F-test hid from us.

To do this we run a regular ols regression of with sepal_width_cm as the dependent variable, and our target flower types as the categorical independent variables.

Image for post
Image for post
What are the results telling us?

We get the same F-statistic as before — we’re still rejecting H0 so nothing has changed there — and we’re also able to break it down more, as we have more information on how our result calculated.

But we still don’t have the MSB and MSE from the ANOVA table to compute the F-statistic ourself. For that we’ll need anova_lm from the stats package.

Nice. Now we know how to correctly compute this statistic.

Further Analysis

Let’s take a look at coef for each of our target flowers. Our setosa is 3.4 is the intercept of our model, and the base difference in mean values for our other flowers. If you check the box plot earlier in the article, you can see just how much greater it is than the other two flower types.

To see the impact of versicolor, take of the coef values and difference them3.418 + (-.648) = 2.77 which is quite similar to the mean value we calculated earlier. The same can be done for virginica.

Furthermore, the ANOVA reports the Durban-Watson (DW) result at 1.858. This determines if our errors are normally distributed with a mean of 0, and if the errors are stationary. If trend were to occur within the errors, the previous two assumptions would be violated, and we would need to reevaluate our model. Generally if our DW stat is not 1.5 < DW < 2.5 we should look into if the assumptions were violated, but in this case we’re fine as our result is within this range.

The Cond. No. or Condition Number checks for multicollinearity within our data if the results is over 20. Luckily the results from statsmodels will print this warning with your output, so just keep an eye out for this. Our result is 3.73 so we don’t need to worry.

Finally, the Jarque-Bera (JB) number further checks our previous assumption of normality with our data. I know we went through this before with the Shapiro-Wilks test, but this is one final check. As the JB number trends to 0, the more our data is assumed to be normally distributed with a skew of 0 and a kurtosis of 0. Our number is 1.565, which rejects the normality assumption.

Is all the work we did previously wrong?

Well it seems the JB test actually does not perform well on small sample sizes, so take this result with some skepticism.

Further Questions to Think About

Now that we’ve gone through the analytical process of determining how an F-test works, think about these questions.

Is a negative F-statistic possible?

No. Because the F-statistic is the ratio of two error terms, that are themselves the result of a sum of squares it would not be possible to get a negative F-statistic. It would be possible to get a 0, if one of the values was perfect and produced no errors.

What does a very low F-statistic mean?

It implies there is a low variability between the means of each group, relative to the variability within each group. It would be harder to reject the null hypothesis of different means, as our critical value would most likely be larger than our F-statistic.

What does a high F statistic mean?

Opposite from the above, the variability between group means is much larger than within the group. It will be much easier to reject the null hypothesis that the group means are equal, as our F-statistic will be quite high.

Where Do I Go From Here?

Post hoc tests (posteriori tests) should be used to confirm coef values from our target groups, but only the overall model was significant and ours was.

Here are just a few post hoc test we could perform, to further verify our results:

  • Bonferroni Procedure / Bonferroni Type Adjustment;
  • Tukey’s Honestly Significant Difference;
  • Fisher’s Least Significant Difference;
  • Newman-Keuls Method;
  • Dunnett’s Multiple Comparison, and;
  • Duncan Multiple Range.

I’m not going to go over each of them in detail, but I do want to quickly go over the Bonferroni Procedure.

We want to test for alpha inflation with our results — also known as the familywise error rate or FWER — such that we may have an increased probability of coming to at least one false conclusion from our hypothesis testing.

Given we have declare tests significant for our p-value < .05, we need to update our alpha level such that it is α / k. so, any post hoc analysis needs to have a p-value below this level.

Now that we have our updated alpha value, we can go back to the ANOVA, get our original p-values and do our post hoc check using the Bonferroni Procedure to see if we can still trust our results.

Which tells us again, after this adjustment, all the values of our groupings are still considered significant.

Final Thoughts

That went on for far longer than I expected, and I hope it showed all the insight into conducting an F-test and the additional points to consider. There is more that goes into an F-test than simply performing the test and it is important to consider the evaluation of your testing while you’re performing it. Or else, you’re bound to fall into some common pitfalls and incorrect conclusions.

As always, I hope you’ve found this helpful and learned something new.

Additional Reading

Sometimes Chef ◦ Sometimes Data Scientist ◦ Sometimes Developer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store