ANOVA in [R] vs SPSS
What is ANOVA?
What are its assumptions?
When should it be applied, and when should it not?
In 1918, Ronald Fisher developed an extension of the t-test to overcome a limitation shared by the t-test and the Z-test: both can compare only two levels of a variable (1). The analysis of variance is named after Fisher and is referred to as the Fisher analysis of variance, ANOVA, or F-test. It is a collection of statistical models used to compare the means of a dependent variable across three or more groups of an independent variable (2). For example, when we want to compare the mean weight (DV) of female students across three counties (IV), Maricopa, Coconino, and Yuma, we apply ANOVA.
In ANOVA, the dependent variable should be measured on an interval or ratio scale, and the independent variable has to be nominal or ordinal. Beyond this, there are three assumptions for running ANOVA. For each group:
1. The dependent variable should be normally distributed (have a bell-shaped density (3)).
2. The variances of the dependent variable should be equal (homogeneity of variance). We usually use Levene's statistic to assess this assumption (2); the assumption holds if the significance (p-value) of Levene's test is larger than alpha.
3. The cases are randomly selected and are independent of each other.
Usually in ANOVA, the null hypothesis and the alternative hypothesis are as follows:
H0: μ1 = μ2 = μ3 = ...
HA: Not all means are equal
If the significance of F (its p-value) is smaller than alpha, we reject the null hypothesis and conclude that there is a significant difference among the groups. In this case, the next step is to run a post-hoc comparison, which identifies the pairs of groups that are significantly different. Tukey’s Honestly Significant Difference (Tukey’s HSD) is one of the methods for running a post-hoc comparison.
Running ANOVA in [R]:
To run ANOVA in SPSS and [R], we need a data set. I use the Motor Trend Car Road Tests data (mtcars) from the datasets package in [R], which has 32 observations on 11 variables (see figure 1).
Let’s say I want to study how cars’ fuel consumption differs with the number of cylinders. Therefore, I take the number of cylinders (cyl) as my independent variable and the fuel consumption (mpg) as my dependent variable. In the original data set, there are three groups of cars based on the number of cylinders they have: four, six, and eight. Therefore, we have three groups in our study (k=3).
DV: mpg Ratio
IV: cyl Ordinal (k=3)
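A quick look at the three groups can be sketched as follows. This is only a minimal check on the built-in mtcars data; the object names group_sizes and group_means are my own choices for illustration.

```r
# Number of cars in each cylinder group (k = 3 groups)
group_sizes <- table(mtcars$cyl)
print(group_sizes)

# Mean fuel consumption (mpg) in each cylinder group
group_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(group_means)
```

The groups are not the same size, which is worth keeping in mind when reading the ANOVA table.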
First, let’s check the required assumptions of ANOVA for our data set. Although this data set does not actually meet them, for now let’s assume that we ran the following lines and found that it did. Note that to run Levene’s test in [R], we need to install the “car” or “Rcmdr” package.
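# Input data
library(ggplot2)  # histograms
library(car)      # leveneTest()
df <- mtcars
# Assumed definitions: logical indices for the 4-, 6-, and 8-cylinder cars
L4 <- df$cyl == 4
L6 <- df$cyl == 6
L8 <- df$cyl == 8
# Check normality (one histogram of mpg per group)
C1 <- ggplot(data = df[L4, ], aes(mpg)) +
  geom_histogram(binwidth = 5)
C2 <- ggplot(data = df[L6, ], aes(mpg)) +
  geom_histogram(binwidth = 2)
C3 <- ggplot(data = df[L8, ], aes(mpg)) +
  geom_histogram(binwidth = 5)
# Run Levene's test (cyl is numeric in mtcars, so treat it as a factor)
leveneTest(mpg ~ factor(cyl), data = df)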
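The same two assumptions can also be checked with base R only, which may help to see what the tests actually do. The sketch below is my own illustration, not part of the original analysis: levene_by_hand is a hypothetical helper name, and it uses deviations from the group median (the Brown-Forsythe variant, which I believe is also the default in car's leveneTest); pass center = mean for the classical version.

```r
# Normality: Shapiro-Wilk p-value of mpg within each cylinder group
shapiro_p <- tapply(mtcars$mpg, mtcars$cyl,
                    function(x) shapiro.test(x)$p.value)
print(shapiro_p)

# Homogeneity of variance: Levene's test computed by hand.
# It is a one-way ANOVA on the absolute deviations of each
# observation from the center of its own group.
levene_by_hand <- function(y, g, center = median) {
  g <- factor(g)
  abs_dev <- abs(y - ave(y, g, FUN = center))
  anova(lm(abs_dev ~ g))
}

lev <- levene_by_hand(mtcars$mpg, mtcars$cyl)
print(lev)

# The equal-variance assumption holds if this p-value is larger than alpha
lev_p <- lev[["Pr(>F)"]][1]
```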
Then I run ANOVA in [R]; figure 2 shows the result. The result indicates that there is a significant difference in fuel consumption among the three groups (4, 6, and 8 cylinders).
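# One-way ANOVA (completely randomized design)
# cyl is numeric in mtcars; factor() makes R treat it as three groups
fit <- aov(mpg ~ factor(cyl), data = df)
# Show result
summary(fit)
# Diagnostic plots
layout(matrix(1:4, 2, 2))
plot(fit)
# Alternative way, same result
fit2 <- lm(mpg ~ factor(cyl), data = df)
anova(fit2)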
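To see where the F statistic and its significance come from, the computation can be reproduced by hand. This is a minimal sketch on mtcars, assuming alpha = 0.05; the variable names (ss_between, ss_within, and so on) are mine.

```r
y <- mtcars$mpg
g <- factor(mtcars$cyl)   # grouping variable with k = 3 levels
k <- nlevels(g)
N <- length(y)

# Between-group sum of squares: group means vs. the grand mean
group_means <- tapply(y, g, mean)
group_n     <- tapply(y, g, length)
ss_between  <- sum(group_n * (group_means - mean(y))^2)

# Within-group sum of squares: observations vs. their own group mean
ss_within <- sum((y - ave(y, g))^2)

# F is the ratio of the two mean squares
f_stat  <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_value <- pf(f_stat, k - 1, N - k, lower.tail = FALSE)

# Decision rule: reject H0 when the p-value is smaller than alpha
alpha     <- 0.05
reject_h0 <- p_value < alpha

# Cross-check against aov()
aov_tab <- summary(aov(y ~ g))[[1]]
print(c(by_hand = f_stat, aov = aov_tab[1, "F value"]))
```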
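Then we should run a post-hoc comparison to identify which groups are significantly different. The code for this test is below. Called as TukeyHSD(x=fit, 'df$cyl', conf.level=0.95) on a model in which cyl is numeric, R does not show this result. The reason is that cyl is stored as a number in mtcars, and TukeyHSD() only compares the levels of factors; the term also has to be named as it appears in the model formula, not as df$cyl. With a numeric cyl, R only indicates that the number of cylinders is an important factor, and we cannot see the details (for instance, whether only the difference between 4 cyl and 6 cyl is significant, or also the one between 4 cyl and 8 cyl). Fitting the model with factor(cyl) gives the pairwise comparisons.
fit_tukey <- aov(mpg ~ factor(cyl), data = df)
TukeyHSD(fit_tukey, "factor(cyl)", conf.level = 0.95)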
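One possible way to turn the Tukey output into a direct answer about which pairs differ is sketched below. It assumes cyl is treated as a factor and alpha = 0.05; the names fit_factor, tukey_mat, and significant are my own.

```r
# Fit the one-way model with cyl as a grouping factor
fit_factor <- aov(mpg ~ factor(cyl), data = mtcars)

# Tukey's HSD for all pairs of cylinder groups
tukey <- TukeyHSD(fit_factor, conf.level = 0.95)

# The first (and only) element is a matrix with one row per pair
# and columns diff, lwr, upr, and "p adj"
tukey_mat <- tukey[[1]]
print(tukey_mat)

# Flag the pairs whose adjusted p-value is below alpha
alpha <- 0.05
significant <- tukey_mat[, "p adj"] < alpha
print(names(significant)[significant])
```

Each row name has the form "6-4", meaning the mean of the 6-cylinder group minus the mean of the 4-cylinder group.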
Running ANOVA in SPSS:
To run ANOVA in SPSS, I use the same data set with the same aim. The syntax and the result are provided below. The result is the same as the R result. In SPSS, the post-hoc comparison can be requested directly as part of the same procedure. According to figure 4, fuel consumption differs significantly between each pair of the three groups.
DATASET ACTIVATE DataSet1.
ONEWAY mpg BY cyl
/STATISTICS DESCRIPTIVES HOMOGENEITY