ANOVA: The Basics

Last week I walked you through a multiple regression analysis. This week is a little different because Analysis of Variance (ANOVA) is a little more complicated in my book. I’m going to do my best to explain what ANOVA is, what its assumptions are, and how to conduct an ANOVA using R. As in last week’s post, I added links to all of the websites I referenced while interpreting the data analysis.

Let’s start with the basics. What is ANOVA?

A one-way ANOVA is used to determine whether there is a statistically significant difference between the means of the compared groups. The compared groups should be three or more unrelated (independent) groups from your data set (Ralph, 2010). (You can find data sets to practice with here. We will be using this one today.)

##We will use this data set from R today##
> data("InsectSprays")
> attach(InsectSprays)
##I am going to follow this example throughout the blog post today##

An ANOVA will tell you whether or not there is a significant difference between at least two groups, but it will not tell you which two groups were different. In order to determine the groups, you need to run a post-hoc test (more on that later).

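SIDE NOTE: You may be wondering why we aren’t using a t-test to compare the groups. The reason is that a single ANOVA controls the Type-1 error rate. Comparing six groups with t-tests would mean running 15 separate pairwise tests, and each extra test adds another chance of a false positive, so the overall error rate climbs well above 5%. ANOVA answers the question with one test and keeps the error rate at 5%. For more information on Type-1 (and Type-2) errors, click here.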
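To see how quickly the error grows, here is a rough sketch of the arithmetic. It assumes the pairwise tests are independent of each other, which is only an approximation, and the variable names are my own.

```r
# Number of groups in InsectSprays (sprays A to F)
k <- 6

# Number of pairwise t-tests needed to compare every pair of groups
n_tests <- choose(k, 2)

# Type-1 error rate used for each individual test
alpha <- 0.05

# Chance of at least one false positive across all tests,
# ASSUMING (as an approximation) that the tests are independent
familywise_error <- 1 - (1 - alpha)^n_tests

print(n_tests)
print(round(familywise_error, 3))
```

Under this approximation, 15 tests at 5% each give a little over a 50% chance of at least one false positive.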

OK, back to the ANOVA.

##Since an ANOVA works to compare the means, let's find the means##
##There are 6 groups in this data set (A, B, C, D, E, F)##
##Find the mean--> you'll need to do this for each group (A, B, ...)##
> mean(count[spray=="A"])
##View all##
> tapply(count, spray, mean)
        A         B         C         D         E         F
14.500000 15.333333  2.083333  4.916667  3.500000 16.666667
##We can also view the sample size##
> tapply(count, spray, length)
 A  B  C  D  E  F
12 12 12 12 12 12
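If you would rather not use attach(), the same numbers can be gathered into one small table. This is just a sketch; the names group_means, group_sds, group_n and group_summary are my own, and I added the standard deviation because it becomes useful when we check the variances later.

```r
data("InsectSprays")

# with() looks up count and spray inside the data frame, so attach() is not needed
group_means <- with(InsectSprays, tapply(count, spray, mean))
group_sds   <- with(InsectSprays, tapply(count, spray, sd))
group_n     <- with(InsectSprays, tapply(count, spray, length))

# Collect everything into one table, one row per spray
group_summary <- data.frame(
  spray = names(group_means),
  mean  = as.vector(group_means),
  sd    = as.vector(group_sds),
  n     = as.vector(group_n)
)

print(group_summary)
```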

So far we have gathered the means and sample sizes for each group, and the data look good: every group has 12 observations. However, multiple references (1, 2, 3) recommend creating a box plot to further investigate the data. Let’s do this.

##Create a box plot##
##The example I am following also provided R commands to change the order of the groups; unfortunately, these did not work for me. That is not a problem for this example, but it is important to know for your own data analysis##
##We can see that there are differences between the groups##
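For reference, here is one way the box plot could be drawn, along with one possible way to change the group order. This is a sketch and not the commands from the example I was following; the axis labels and the names means, ordered_levels and sprays_sorted are my own choices.

```r
data("InsectSprays")

# Default box plot: groups appear in alphabetical order (A to F)
boxplot(count ~ spray, data = InsectSprays,
        xlab = "Spray", ylab = "Insect count")

# To change the order, rebuild the factor with the levels in the order you want.
# Here the groups are ordered from the lowest mean count to the highest.
means <- with(InsectSprays, tapply(count, spray, mean))
ordered_levels <- names(sort(means))

sprays_sorted <- InsectSprays
sprays_sorted$spray <- factor(sprays_sorted$spray, levels = ordered_levels)

boxplot(count ~ spray, data = sprays_sorted,
        xlab = "Spray (ordered by mean)", ylab = "Insect count")
```

Working on a copy (sprays_sorted) leaves the original data set untouched for the rest of the analysis.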

Now, it’s time to conduct an ANOVA to determine what all of this means.

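First, let’s run a one-way test. Note that oneway.test() does not assume equal variances by default, as the first line of the output says.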

> oneway.test(count~spray)
One-way analysis of means (not assuming equal variances)
data:  count and spray
F = 36.065, num df = 5.000, denom df = 30.043, p-value = 7.999e-12
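The F value here differs slightly from the one aov() reports because of the equal-variance setting. As a quick sketch, the var.equal argument of oneway.test() switches to the classic version; the names welch and classic are my own.

```r
data("InsectSprays")

# Default: Welch's version, which does not assume equal variances
welch <- oneway.test(count ~ spray, data = InsectSprays)

# Classic one-way ANOVA, which assumes equal variances in every group
classic <- oneway.test(count ~ spray, data = InsectSprays, var.equal = TRUE)

print(welch$statistic)    # F from the Welch version
print(classic$statistic)  # F from the classic version
print(classic$parameter)  # numerator and denominator degrees of freedom
```

The classic version gives the same F and degrees of freedom as the aov() table, F(5, 66) = 34.7.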

Next, let’s fit the model with the aov() function and store the result in aov.out.

> aov.out = aov(count~spray, data=InsectSprays)
> summary(aov.out)
            Df Sum Sq Mean Sq F value Pr(>F)
spray        5   2669   533.8    34.7 <2e-16 ***
Residuals   66   1015    15.4
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## F(5, 66) = 34.7, p < .001##
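If you are curious where the numbers in that table come from, here is a sketch that rebuilds them by hand from the sums of squares. The variable names are my own; aov() does all of this for you.

```r
data("InsectSprays")
counts <- InsectSprays$count
sprays <- InsectSprays$spray

grand_mean <- mean(counts)

# ave() returns, for every observation, the mean of its own group
group_mean_per_obs <- ave(counts, sprays)

# Between-group variation: how far the group means sit from the grand mean
ss_between <- sum((group_mean_per_obs - grand_mean)^2)

# Within-group variation: how far each observation sits from its group mean
ss_within <- sum((counts - group_mean_per_obs)^2)

df_between <- nlevels(sprays) - 1               # number of groups minus 1
df_within  <- length(counts) - nlevels(sprays)  # observations minus groups

# F is the ratio of the two mean squares
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(ss_between = ss_between, ss_within = ss_within))
print(c(F = f_value, p = p_value))
```

The two sums of squares match the Sum Sq column, and dividing each by its Df gives the Mean Sq column.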

We now need to run a post-hoc test (I told you we would come back to this). Let’s determine which groups differ significantly.

##Multiple Comparisons##
> TukeyHSD(aov.out)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = count ~ spray, data = InsectSprays)
diff lwr upr p adj
B-A 0.8333333 -3.866075 5.532742 0.9951810
C-A -12.4166667 -17.116075 -7.717258 0.0000000
D-A -9.5833333 -14.282742 -4.883925 0.0000014
E-A -11.0000000 -15.699409 -6.300591 0.0000000
F-A 2.1666667 -2.532742 6.866075 0.7542147
C-B -13.2500000 -17.949409 -8.550591 0.0000000
D-B -10.4166667 -15.116075 -5.717258 0.0000002
E-B -11.8333333 -16.532742 -7.133925 0.0000000
F-B 1.3333333 -3.366075 6.032742 0.9603075
D-C 2.8333333 -1.866075 7.532742 0.4920707
E-C 1.4166667 -3.282742 6.116075 0.9488669
F-C 14.5833333 9.883925 19.282742 0.0000000
E-D -1.4166667 -6.116075 3.282742 0.9488669
F-D 11.7500000 7.050591 16.449409 0.0000000
F-E 13.1666667 8.467258 17.866075 0.0000000
##This table shows the comparison between each pair of groups; pairs with "p adj" below 0.05 differ significantly##
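With 15 rows it is easy to lose track, so here is a small sketch that pulls out only the significant pairs. The names fit, tukey and significant are my own, and 0.05 is the cutoff I am assuming.

```r
data("InsectSprays")
fit <- aov(count ~ spray, data = InsectSprays)

# TukeyHSD() returns a list with one table per factor; ours is called "spray"
tukey <- TukeyHSD(fit)$spray

# Keep only the pairs whose adjusted p-value is below 0.05
significant <- tukey[tukey[, "p adj"] < 0.05, , drop = FALSE]
print(round(significant, 4))

# The pairs that do NOT differ significantly
not_significant <- rownames(tukey)[tukey[, "p adj"] >= 0.05]
print(not_significant)
```

In the table above, this leaves sprays A, B and F on one side and sprays C, D and E on the other, with no significant differences inside either set.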
##summary.lm will give us an overview of the model coefficients; each sprayX row compares that spray with spray A (the intercept)##
> summary.lm(aov.out)
Call:
aov(formula = count ~ spray, data = InsectSprays)
Residuals:
   Min     1Q Median     3Q    Max
-8.333 -1.958 -0.500  1.667  9.333
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  14.5000     1.1322  12.807  < 2e-16 ***
sprayB        0.8333     1.6011   0.520    0.604
sprayC      -12.4167     1.6011  -7.755 7.27e-11 ***
sprayD       -9.5833     1.6011  -5.985 9.82e-08 ***
sprayE      -11.0000     1.6011  -6.870 2.75e-09 ***
sprayF        2.1667     1.6011   1.353    0.181
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.922 on 66 degrees of freedom
Multiple R-squared: 0.7244, Adjusted R-squared: 0.7036
F-statistic: 34.7 on 5 and 66 DF, p-value: < 2.2e-16
## F(5, 66) = 34.7, p < .001##
##we have significance##
##We can use bartlett.test to check whether the groups have equal variances (homogeneity of variances)##
> bartlett.test(count~spray, data = InsectSprays)
Bartlett test of homogeneity of variances
data:  count by spray
Bartlett's K-squared = 25.96, df = 5, p-value =
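##To interpret the result, look at the p-value: a p-value below 0.05 suggests the group variances are not equal. Here K-squared = 25.96 on 5 df is far above the 5% critical value of about 11.07, so the variances differ between sprays. This article is a great reference to understand bartlett.test##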
##Visualizations of group differences##
> plot(aov.out)
Hit <Return> to see next plot: #each time you hit return, R will show the next diagnostic plot for your model

Now let’s test assumptions. What are assumptions?

ANOVA makes three main assumptions: (1) the dependent variable is approximately normally distributed within each group, (2) the group variances are homogeneous, and (3) the observations are independent. For more information on the assumptions click here; for more information on independence of observations, click here. If your data fails the assumptions it’s OK; there are solutions here. You can follow this link to learn more about assessing assumptions.

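##testing assumptions##
##Note: this snippet uses R's built-in PlantGrowth data set (weight by group), not InsectSprays##
> with(PlantGrowth, tapply(weight, group, mean))
> with(PlantGrowth, tapply(weight, group, var))
> with(PlantGrowth, bartlett.test(weight~group))
##lm.out stores the fitted model; plot(lm.out) draws its diagnostic plots##
> lm.out = with(PlantGrowth, lm(weight~group))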
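To stay with the data set we have been following, here is a sketch of the same kind of checks on InsectSprays. The Shapiro-Wilk test on the residuals is my own addition as one common way to look at normality, and the names fit, group_vars, bt and sw are mine.

```r
data("InsectSprays")
fit <- aov(count ~ spray, data = InsectSprays)

# Homogeneity of variances: compare the variance of each group
group_vars <- with(InsectSprays, tapply(count, spray, var))
print(round(group_vars, 2))

# Bartlett's test: a small p-value suggests the variances are not equal
bt <- bartlett.test(count ~ spray, data = InsectSprays)
print(bt$statistic)
print(bt$p.value)

# Normality: Shapiro-Wilk test on the model residuals
# (a small p-value suggests the residuals are not normally distributed)
sw <- shapiro.test(residuals(fit))
print(sw$p.value)
```

Independence cannot be tested this way; it depends on how the data were collected.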

Interpreting the graphs

Residuals vs Fitted: If the residuals bounce randomly around the 0 line, we can safely assume the relationship is linear. If the residuals form a roughly horizontal band around the 0 line, we can assume the variances of the error terms are equal. If no residual stands out, we can assume there are no outliers. You can find more information on residuals vs fitted models here.

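Normal Q-Q: This plot checks whether the residuals are normally distributed. If the points follow the theoretical line, the normality assumption is reasonable; where the points stray from the line, the residuals depart from what a normal distribution would produce. More information on interpreting Normal Q-Q plots here.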

Scale-Location Plot: This shows how the residuals are spread across the fitted values. You can check the assumption of equal variance through this plot: the spread should look about the same for every group. Here is more information on Scale-Location plots.

Constant Leverage: This plot shows us how influential certain data points are, and can help you determine whether your results were heavily influenced by one data point. More information on constant leverage here.
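Instead of hitting return between plots, the four plots described above can be drawn on one page. This is a sketch; the cutoff of 2 for a "large" standardized residual is a rule of thumb I am assuming, not something from the references, and the names fit, std_res and flagged are my own.

```r
data("InsectSprays")
fit <- aov(count ~ spray, data = InsectSprays)

# Arrange the plotting area as a 2 x 2 grid, draw the plots, then restore it
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Standardized residuals put every point on the same scale,
# which makes it easier to list the points that stand out
std_res <- rstandard(fit)
flagged <- which(abs(std_res) > 2)  # ASSUMED cutoff, a common rule of thumb

print(InsectSprays[flagged, ])
```

The rows printed at the end are the observations worth a second look in the plots.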

Final Thoughts

Running an ANOVA in R is a tedious task. There seem to be inconsistencies between online sources, which made completing an ANOVA more challenging. I found different code and explanations for the steps of an ANOVA throughout my research, which you likely noticed if you clicked on the additional links throughout this post.

While running an ANOVA in R was beneficial, in the future I think I would prefer to run an ANOVA using SPSS, mostly because the results are clearer and easier to interpret than in R.

Here is a link to a video tutorial for a few different R analyses. I didn’t use the tutorials for this assignment, but I think many of you will find them helpful in the future.

Additional References

AERD Statistics (n.d.). One-way ANOVA. Retrieved from

Ralph (2010). One-way analysis of variance (ANOVA). Retrieved from

Like what you read? Give Kyrsten Novak a round of applause.
