# OLAP Cube — Check yo’ data before you wreck yo’ data

An ANOVA is one of the most common statistical analyses performed in the social sciences. It allows you to compare two or more group means to see whether they differ. However, many people forget to check the assumptions behind the test, which can lead to incorrect or even misleading results. While there are many assumptions your data could violate in a one-way ANOVA, we will look at three main ones: skewness, outliers, and homogeneity of variance.

Many people who have taken an introductory statistics course understand the basic idea of skewness. Skewness is a measure of the asymmetry of a dataset: the tail of the distribution can stretch to the right (positive skew) or to the left (negative skew). To make sure this assumption is not violated, we can check the data with the skewness() function.
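A minimal sketch of that check (the numeric vector here is illustrative; in the post it would be a numeric column from the OKCupid data file):

```r
# skewness() lives in the e1071 package
# install.packages("e1071")  # if not already installed
library(e1071)

# Illustrative numeric data; a few large values pull the tail to the right
x <- c(22, 23, 24, 25, 26, 27, 45, 60)

# A positive value indicates a right tail (positive skew),
# a negative value indicates a left tail (negative skew)
skewness(x)
```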

When a dataset is skewed, it is also important to look at whether its distribution is heavy-tailed or light-tailed. To observe the kurtosis of the dataset we can use the kurtosis() function. It is important to note that the skewness() function and the kurtosis() function are both in the R package e1071. The data used was the OKCupid data file from class.
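The kurtosis check looks much the same (again with an illustrative vector standing in for a column of the class data):

```r
library(e1071)

x <- c(22, 23, 24, 25, 26, 27, 45, 60)

# kurtosis() in e1071 reports excess kurtosis by default:
# values above 0 suggest heavy tails, values below 0 light tails
kurtosis(x)
```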

Once we have looked at the skewness and kurtosis of the data, we can check whether it contains outliers. If one subject is an outlier on several dependent variables, you might want to consider removing that subject's data. Outliers can affect both the assumptions and the results of your research (for more examples of when you should drop data points, read the blog post Outliers: To Drop or Not to Drop in the references section). To find where the outliers are within the data, you can use the aq.plot() function from the mvoutlier package.
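A sketch of that check, with made-up columns standing in for variables from the class data (aq.plot() expects a numeric data frame or matrix with more than one column, and the `outliers` component of its return value is my assumption about the package's interface):

```r
# aq.plot() is in the mvoutlier package
# install.packages("mvoutlier")
library(mvoutlier)

# Two illustrative numeric columns; the last row is a deliberate outlier
dat <- data.frame(age    = c(22, 23, 24, 25, 26, 27, 28, 60),
                  height = c(64, 66, 65, 70, 68, 69, 71, 90))

# Draws adjusted quantile plots and flags multivariate outliers;
# the returned list includes a logical vector of flagged observations
res <- aq.plot(dat)
which(res$outliers)
```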

When comparing means across different groups, it is important to check the homogeneity of variance, that is, whether the comparison groups have the same variance. If this assumption is violated, the F statistic will be biased and significance levels will be either underestimated or overestimated. To make sure this assumption is not violated we can use either the bartlett.test() function or the fligner.test() function; Bartlett's test is sensitive to departures from normality, while the Fligner-Killeen test is rank-based and more robust.
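Both tests are in base R's stats package and accept a formula, so the check might look like this (the outcome and grouping variable are illustrative):

```r
# Illustrative grouped data: a numeric outcome and a grouping factor
outcome <- c(5, 6, 7, 8, 9, 4, 6, 8, 10, 12, 3, 5, 7, 9, 11)
group   <- factor(rep(c("a", "b", "c"), each = 5))

# Bartlett's test of equal variances (assumes roughly normal groups)
bartlett.test(outcome ~ group)

# Fligner-Killeen test (rank-based, robust to non-normality)
fligner.test(outcome ~ group)
```

In both cases a small p-value suggests the groups do not share a common variance, so the homogeneity assumption would be in doubt.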

One thing that needs to be stressed when performing statistical analyses is that if your data violates one of the above assumptions, or others not listed here, you could be reporting results that are incorrect. I really hope some of you got the Ice Cube reference in the title.

References:

The Assumption of Homogeneity of Variance. (2017, March 30). Retrieved from http://www.statisticssolutions.com/the-assumption-of-homogeneity-of-variance/