You Know What Happens When You Assume…

Earl Radina
Published in Human Systems Data
Apr 4, 2017

…you make improper statistical assumptions and become the laughing stock of the scientific community and are forced to leave your university and get a job at FedEx and every holiday your parents don’t say anything but they heavily imply that you wasted your potential.

Personal anecdotes aside, ANOVA and the like are not new to me, but they are definitely unfamiliar. The last time I even discussed them was back in 2012, when I was fresh out of high school and in a lower-level Research Methods course. Even then it was only in passing, and any detail was only deemed necessary if your group project required you to perform one. So I decided to take this opportunity to teach myself the things I probably should have learned by this point. Consider it a brief crash course in ANOVA, something I may even be able to refer back to one day.

My initial data analysis probably looked similar to anyone else’s. I ran it just to see if I could:

> datafilename="http://personality-project.org/r/datasets/R.appendix2.data"
> data.example2=read.table(datafilename,header=T)
> data.example2
> aov.ex2 = aov(Alertness~Gender*Dosage,data=data.example2)
> summary(aov.ex2)
              Df Sum Sq Mean Sq F value Pr(>F)
Gender         1  76.56   76.56   2.952  0.111
Dosage         1   5.06    5.06   0.195  0.666
Gender:Dosage  1   0.06    0.06   0.002  0.962
Residuals     12 311.25   25.94
> print(model.tables(aov.ex2,"means"),digits = 3)
Tables of means
Grand mean

14.0625

 Gender
Gender
    f     m
16.25 11.88

 Dosage
Dosage
    a     b
13.50 14.62

 Gender:Dosage
      Dosage
Gender a     b
     f 15.75 16.75
     m 11.25 12.50

Given enough time most anybody will be able to tell you what’s going on here. But what’s happening underneath? To reference my previous post, R will not do everything for you. And while sometimes I wish it would, it’s important to know what it’s doing and why.

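So let’s begin with what exactly an ANOVA is. At its most basic it’s an ANalysis Of VAriance. It tests whether the means of two or more groups differ by comparing the variation between the group means to the variation within each group. But the generalities pretty much stop there. An ANOVA also assumes that the observations are independent and that the scores within each group (strictly speaking, the residuals) are roughly normally distributed. Randomizing your participants into conditions at the onset of the experiment helps with independence and keeps the groups from differing in systematic ways. Normality is a separate matter: with enough subjects in each group (a common rule of thumb is more than 30) the test tolerates moderate departures from it, and with smaller groups it is worth checking directly.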
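For a data set this small (four subjects per cell) the rule of thumb doesn't apply, so here is a rough sketch of how one might check the normality assumption on the residuals. The scores are typed in by hand and are my assumption about what the file contains (they reproduce the cell means and residual sum of squares printed above); the names `alert`, `fit`, `res` and `sw` are made up for this example.

```r
# Alertness scores typed in by hand; ASSUMED to match R.appendix2.data
alert <- data.frame(
  Gender    = factor(rep(c("m", "f"), each = 8)),
  Dosage    = factor(rep(rep(c("a", "b"), each = 4), times = 2)),
  Alertness = c(8, 12, 13, 12,  6,  7, 23, 14,
                15, 12, 22, 14, 15, 12, 18, 22)
)

# Same model as in the transcript above
fit <- aov(Alertness ~ Gender * Dosage, data = alert)

# The normality assumption is about the residuals, not the raw scores
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
print(sw)
```

With only 16 observations the test has little power, so a non-significant result is weak reassurance at best. Running `qqnorm(res)` followed by `qqline(res)` gives the visual version of the same check.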

After taking a look at your subjects, it’s time to take a look at your variables. ANOVAs require specific kinds of variables to work as intended. Your independent variables must be categorical, meaning that each value falls into one of a fixed set of categories. In the earlier example, that would be both gender and dose. Gender (although it should be clear that this really should be written as “sex”, given that it’s 2017) is for the purposes of this study divided into “male” and “female”: two categories that don’t allow for movement between them. Dose is also a categorical variable, as it was administered at one of two levels.
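In R, "categorical" translates to a factor. This matters most when a category is coded with numbers, because `aov()` treats a plain numeric predictor as a continuous one and fits a slope instead of comparing groups. A small sketch with made-up vectors (`gender`, `dose_mg` and the 10/20 coding are hypothetical, not from the data set):

```r
# Hypothetical codes, only to illustrate the conversion
gender <- c("m", "m", "f", "f")
gender_f <- factor(gender)
levels(gender_f)     # the categories R found, sorted alphabetically
nlevels(gender_f)    # how many categories there are

# A dose recorded as a number is NOT categorical until you say so
dose_mg <- c(10, 20, 10, 20)
dose_f  <- factor(dose_mg, labels = c("a", "b"))
is.numeric(dose_mg)  # TRUE: would be treated as a continuous predictor
is.factor(dose_f)    # TRUE: would be treated as two groups
```

In the example data the doses are already labelled a and b, so R has no way to mistake them for numbers.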

This is important. So important it deserves its own little paragraph here. The biggest difference between an ANOVA and the t-test you are likely familiar with is that ANOVAs still work when more than two groups are being compared. In the above example, there are four groups: male dose a, male dose b, female dose a, and female dose b. The test weighs how far the group means sit from the mean of everyone together (referred to as the “grand mean”) against how much the scores vary within each group.
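To see what R is doing underneath, here is a sketch that rebuilds the Gender row of the summary table by hand. As before, the scores are typed in and assumed to match the file, and every variable name is invented for the example. The shortcut for the between-group sum of squares relies on the design being balanced (four subjects in every cell).

```r
# Alertness scores typed in by hand; ASSUMED to match R.appendix2.data
alert <- data.frame(
  Gender    = factor(rep(c("m", "f"), each = 8)),
  Dosage    = factor(rep(rep(c("a", "b"), each = 4), times = 2)),
  Alertness = c(8, 12, 13, 12,  6,  7, 23, 14,
                15, 12, 22, 14, 15, 12, 18, 22)
)
y <- alert$Alertness

grand       <- mean(y)                             # grand mean
gender_mean <- ave(y, alert$Gender)                # each subject's gender mean
cell_mean   <- ave(y, alert$Gender, alert$Dosage)  # each subject's cell mean

# Between: how far the gender means sit from the grand mean
ss_gender <- sum((gender_mean - grand)^2)          # 76.5625
# Within: how far each score sits from its own cell mean
ss_resid  <- sum((y - cell_mean)^2)                # 311.25

df_gender <- nlevels(alert$Gender) - 1             # 1
df_resid  <- length(y) -
  nlevels(alert$Gender) * nlevels(alert$Dosage)    # 16 - 4 = 12

# F is the ratio of the two mean squares
f_gender <- (ss_gender / df_gender) / (ss_resid / df_resid)
p_gender <- pf(f_gender, df_gender, df_resid, lower.tail = FALSE)
round(c(F = f_gender, p = p_gender), 3)
```

The sums of squares, degrees of freedom and F value line up with the Gender and Residuals rows that `summary(aov.ex2)` printed.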

The dependent, or response, variable, on the other hand, should be continuous. This is reminiscent of most data as we know it: numbers measured on a scale that can take any value within a range and can always be broken down into smaller pieces.

So when is an ANOVA useful? It seems to me at this point that it would be most useful for experiments that emphasize subject variables (male/female, tall/short), as long as they are broken up into discrete categories. It also fits experiments involving “types” of things, like types of carpet or types of treatments.

Now let’s do a quick aside on MANOVAs. And no, MANOVAs are not just ANOVAs that come in a darker color and make $1.00 for every $0.75 an ANOVA makes. The M stands for Multivariate. These are primarily done when there are multiple dependent variables. Simple as that, really. A MANOVA would be done in favor of performing a separate ANOVA on each dependent variable, which is not recommended because every additional test you perform increases the chance of making at least one Type I error.
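The Type I error point is easy to put numbers on. Assuming the tests are independent (a simplification) and each uses an alpha of 0.05, the chance of at least one false positive across k tests is 1 - (1 - alpha)^k. The sketch below computes that and then runs a MANOVA. The example data has only one dependent variable, so the MANOVA uses R's built-in `iris` data purely as a stand-in; it has nothing to do with the alertness study.

```r
# Chance of at least one Type I error across k independent tests
alpha <- 0.05
k     <- c(1, 3, 5, 10)
fwer  <- 1 - (1 - alpha)^k
round(fwer, 3)   # roughly 0.05, 0.143, 0.226, 0.401

# MANOVA sketch on the built-in iris data (a stand-in, NOT the study data):
# two dependent variables bound together with cbind(), one categorical IV
fit_m <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(fit_m)   # reports Pillai's trace by default
```

So three separate ANOVAs at 0.05 each already carry about a 14% chance of at least one false alarm, which is the problem a single MANOVA is meant to sidestep.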

So what have we learned today? Well, the TL;DR of it is that ANOVAs are nice when you have very specific conditions: one or more categorical independent variables, each with two or more levels, and a continuous dependent variable. MANOVAs are the same, save that they look at several dependent variables at once.

Works Cited

Telke, S. [Powerpoint slides]. Retrieved from http://www.biostat.umn.edu/~susant/Fall11ph6414/Section13_Part1.pdf

Understanding MANOVA. Retrieved from http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/basics/understanding-manova/

What is ANOVA? Retrieved from http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/basics/what-is-anova/
