Advanced Statistics Using R — 6

Vivekanandan Srinivasan
Published in Analytics Vidhya · Nov 23, 2019 · 9 min read

If you have not read part 5 of the R data analysis series kindly go through the following article where we discussed Introduction To Statistics Using R — 5.

The contents of this article are a gist of a couple of books I was introduced to during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in this article are available as R Markdown at the GitHub link.

To see all the articles written by me kindly use the link, Vivek Srinivasan.

In traditional statistics classes, the t-test — invented by William Gosset while working at the Guinness brewery — is taught for conducting tests on the mean of data or for comparing two sets of data. To illustrate this we continue to use the tips data from ggplot2.

data(tips,package="reshape2")
head(tips)

One-Sample T-Test

First, we conduct a one-sample t-test on whether the average tip is equal to $2.50. This test essentially calculates the mean of the data and builds a confidence interval around it. If the value we are testing falls within that confidence interval, we fail to reject the null hypothesis that it is the true mean; otherwise, we reject the null hypothesis and conclude that it is not the true mean.

t.test(tips$tip, alternative="two.sided", mu=2.50)

The null hypothesis is what is assumed to be true; in this case, that the mean is equal to $2.50. The output very nicely displays the setup and results of the hypothesis test. It prints the t-statistic, the degrees of freedom and the p-value, and also provides the 95 percent confidence interval and mean for the variable of interest. The p-value indicates that the null hypothesis should be rejected, and we conclude that the mean is not equal to $2.50.
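The object returned by t.test is a list (of class htest), so each piece of this printed output can also be pulled out programmatically; a small sketch, assuming the tips data is loaded as above:

```r
data(tips, package = "reshape2")

tt <- t.test(tips$tip, alternative = "two.sided", mu = 2.50)
tt$statistic  # the t-statistic
tt$p.value    # the p-value
tt$conf.int   # the 95 percent confidence interval
tt$estimate   # the sample mean of tip
```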

We encountered a few new concepts here. The t-statistic is the ratio of the difference between the estimated mean and the hypothesized mean to the standard error of the estimated mean:

t = (x̄ − μ₀) / (s / √n)

where x̄ is the sample mean, μ₀ is the hypothesized mean, s is the sample standard deviation and n is the number of observations.
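This ratio can be computed by hand and compared with the statistic that t.test reports:

```r
data(tips, package = "reshape2")

xbar  <- mean(tips$tip)                       # estimated mean
se    <- sd(tips$tip) / sqrt(NROW(tips$tip))  # standard error of the mean
tStat <- (xbar - 2.50) / se                   # t-statistic, by hand
tStat

# matches the statistic computed by t.test
unname(t.test(tips$tip, mu = 2.50)$statistic)
```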

If the hypothesized mean is correct, then we expect the t-statistic to fall somewhere in the middle of the t-distribution, within about two standard deviations of the mean. In the figure given below, we see that the thick black line, which represents the t-statistic, falls so far outside the distribution that we must conclude that the mean is not equal to $2.50.

## build a t-distribution
randT <- rt(30000, df = NROW(tips) - 1)

## get the t-statistic and other information
tipTTest <- t.test(tips$tip, alternative = "two.sided", mu = 2.50)

## plot it
library(ggplot2)
ggplot(data.frame(x = randT)) +
  geom_density(aes(x = x), fill = "grey", color = "grey") +
  geom_vline(xintercept = tipTTest$statistic) +
  geom_vline(xintercept = mean(randT) + c(-2, 2) * sd(randT), linetype = 2)
t-distribution and t-statistic for tip data. The dashed lines are two standard deviations from the mean in either direction. The thick black line, the t-statistic, is so far outside the distribution that we must reject the null hypothesis and conclude that the true mean is not $2.50.

The p-value is an often misunderstood concept. Despite all the misinterpretations, a p-value is the probability, assuming the null hypothesis is true, of observing a result as extreme as, or more extreme than, the one actually observed. It is a measure of how extreme the statistic, in this case the estimated mean, is. If the statistic is too extreme, we conclude that the null hypothesis should be rejected.

The main problem with p-values, however, is determining what should be considered too extreme. Ronald A. Fisher, the father of modern statistics, decided we should consider a p-value that is smaller than 0.10, 0.05 or 0.01 to be too extreme. While those p-values have been the standard for decades, they were arbitrarily chosen, leading some modern data scientists to question their usefulness. In this example, the p-value is 5.0799885 × 10⁻⁸; this is smaller than 0.01, so we reject the null hypothesis.

Degrees of freedom is another difficult concept to grasp but is pervasive throughout statistics. It represents the effective number of observations. Generally, the degrees of freedom for some statistic or distribution is the number of observations minus the number of parameters being estimated. In the case of the t-distribution, one parameter, the mean, is being estimated. In this example, there are nrow(tips)-1=243 degrees of freedom.
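The degrees of freedom are stored in the htest object as well, so this relationship can be checked directly:

```r
data(tips, package = "reshape2")

tt <- t.test(tips$tip, mu = 2.50)
unname(tt$parameter)  # degrees of freedom used by the test
NROW(tips) - 1        # observations minus one estimated parameter
```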

Next, we conduct a one-sided t-test to see if the mean is greater than $2.50.

t.test(tips$tip, alternative="greater", mu=2.50)

Once again, the p-value indicates that we should reject the null hypothesis and conclude that the mean is greater than $2.50, which coincides nicely with the confidence interval.
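Note that for a one-sided test the reported confidence interval is one-sided as well, extending to infinity on the upper end; this can be seen by inspecting the conf.int component:

```r
data(tips, package = "reshape2")

ttGreater <- t.test(tips$tip, alternative = "greater", mu = 2.50)
ttGreater$conf.int  # lower bound only; the upper bound is Inf
```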

Two-Sample T-Test

More often than not the t-test is used for comparing two samples. Continuing with the tips data, we compare how female and male diners tip. Before running the t-test, however, we first need to check the variance of each sample. A traditional t-test requires both groups to have the same variance, whereas the Welch two-sample t-test can handle groups with differing variances.

To examine whether both groups have the same variance, we can use either the standard F-test (via the var.test function) or the Bartlett test (via the bartlett.test function). Both of these, however, require the data to follow a normal distribution, so we first check normality, numerically and visually.
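For reference, both variance tests use the same formula interface as t.test; the calls would look like this (though, as the normality checks that follow show, the normality assumption does not hold for the tips data, so their results should not be trusted here):

```r
data(tips, package = "reshape2")

var.test(tip ~ sex, data = tips)       # standard F-test for equal variances
bartlett.test(tip ~ sex, data = tips)  # Bartlett test for equal variances
```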

## Normality test
shapiro.test(tips$tip)
## Normality test for female
shapiro.test(tips$tip[tips$sex == "Female"])
## Normality test for male
shapiro.test(tips$tip[tips$sex == "Male"])

All the numerical tests clearly show that the data is not normally distributed, so we cannot use the standard F-test or the Bartlett test to compare variances. We can also check normality visually, as follows; the visual inspection likewise suggests that the data is not normally distributed.

# all the tests fail so inspect visually
ggplot(tips, aes(x=tip, fill=sex)) + geom_histogram(binwidth=.5, alpha=1/2)

Since the data is not normally distributed, we can use the nonparametric Ansari-Bradley test to examine the equality of variances.
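The Ansari-Bradley test is available in base R as ansari.test, which also accepts the formula interface:

```r
data(tips, package = "reshape2")

ansari.test(tip ~ sex, data = tips)
```

A large p-value here means there is no evidence against equal variances, which is what licenses the var.equal=TRUE t-test that follows.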

This test indicates that the variances are equal, meaning we can use the standard two-sample t-test.

t.test(tip ~ sex, data=tips, var.equal=TRUE)

According to this test, the results were not significant, and we should conclude that female and male diners tip roughly equally. While all this statistical rigor is nice, a simple rule of thumb would be to see if the two means are within two standard deviations of each other. As usual, we prefer visualizing the results rather than comparing numerical values. This requires reshaping the data a bit.

library(plyr)
tipSummary <- ddply(tips, "sex", summarize,
                    tip.mean = mean(tip), tip.sd = sd(tip),
                    Lower = tip.mean - 2 * tip.sd / sqrt(NROW(tip)),
                    Upper = tip.mean + 2 * tip.sd / sqrt(NROW(tip)))
tipSummary

A lot happened in that code. First, ddply was used to split the data according to the levels of sex. It then applied the summarize function to each subset of the data. This function applied the indicated functions to the data, creating a new data.frame.
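The same split-apply-combine summary can be sketched in base R with tapply, without loading plyr (shown here only for comparison):

```r
data(tips, package = "reshape2")

# mean and standard deviation of tip, split by sex
tapply(tips$tip, tips$sex, mean)
tapply(tips$tip, tips$sex, sd)
```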

ggplot(tipSummary, aes(x = tip.mean, y = sex)) + geom_point() +
  geom_errorbarh(aes(xmin = Lower, xmax = Upper), height = .2)

The results clearly show that the confidence intervals overlap, suggesting that the mean tips for the two groups are roughly equivalent.

Paired Two-Sample T-Test

For testing paired data (for example, measurements on twins, before and after treatment effects, father and son comparisons) a paired t-test should be used. This is simple enough to do by setting the paired argument in t.test to TRUE. To illustrate, we use data collected by Karl Pearson on the heights of fathers and sons that are located in the UsingR package. Heights are generally normally distributed, so we will forgo the tests of normality and equal variance.

data(father.son, package='UsingR')
head(father.son)
t.test(father.son$fheight, father.son$sheight, paired=TRUE)

This test shows that we should reject the null hypothesis and conclude that fathers and sons (at least for this dataset) have different heights. We visualize this data using a density plot of the differences, as shown in the figure. We see a distribution with a mean not at zero and a confidence interval that excludes zero, which agrees with the test.

heightDiff <- father.son$fheight - father.son$sheight
ggplot(father.son, aes(x = fheight - sheight)) +
  geom_density() +
  geom_vline(xintercept = mean(heightDiff)) +
  geom_vline(xintercept = mean(heightDiff) +
               2 * c(-1, 1) * sd(heightDiff) / sqrt(nrow(father.son)),
             linetype = 2)
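A paired t-test is numerically identical to a one-sample t-test on the differences. A quick check of that equivalence, using R's built-in sleep data (paired measurements of extra sleep under two drugs) rather than father.son, so the example needs no extra packages:

```r
# sleep ships with base R: extra sleep under two drugs, paired by subject ID
g1 <- sleep$extra[sleep$group == "1"]
g2 <- sleep$extra[sleep$group == "2"]

paired  <- t.test(g1, g2, paired = TRUE)
oneSamp <- t.test(g1 - g2, mu = 0)

# the two t-statistics are identical
c(paired = unname(paired$statistic), oneSample = unname(oneSamp$statistic))
```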

ANOVA

After comparing two groups, the natural next step is comparing multiple groups. Every year, far too many students in introductory statistics classes are forced to learn the ANOVA (analysis of variance) test and memorize its formula:

F = [ Σᵢ nᵢ(ȳᵢ − ȳ)² / (K − 1) ] / [ Σᵢⱼ (yᵢⱼ − ȳᵢ)² / (N − K) ]

where nᵢ is the number of observations in group i, ȳᵢ is the mean of group i, ȳ is the overall mean, yᵢⱼ is observation j in group i, K is the number of groups and N is the total number of observations.

Not only is this a laborious formula that often turns off a lot of students from statistics; it is also a bit of an old-fashioned way of comparing groups. Even so, there is an R function — albeit rarely used — to conduct the ANOVA test. This also uses the formula interface where the left side is the variable of interest and the right side contains the variables that control grouping. To see this, we compare tips by day of the week, with levels Fri, Sat, Sun, Thur.

tipAnova <- aov(tip ~ day - 1, tips)
summary(tipAnova)

The ANOVA tests whether any group is different from any other group but it does not specify which group is different. So printing a summary of the test just returns a single p-value. Since the test had a significant p-value, we would like to see which group differed from the others.

tipsByDay <- ddply(tips, "day", plyr::summarize,
                   tip.mean = mean(tip), tip.sd = sd(tip),
                   Length = NROW(tip),
                   tfrac = qt(p = .90, df = Length - 1),
                   Lower = tip.mean - tfrac * tip.sd / sqrt(Length),
                   Upper = tip.mean + tfrac * tip.sd / sqrt(Length))

ggplot(tipsByDay, aes(x = tip.mean, y = day)) + geom_point() +
  geom_errorbarh(aes(xmin = Lower, xmax = Upper), height = .3)
Means and confidence intervals of tips by day. This shows that Sunday tips differ from Thursday and Friday tips.

The figure shows that tips on Sunday differ (just barely, at the 90 percent confidence level) from both Thursday and Friday. To confirm the results from the ANOVA, individual t-tests could be run on each pair of groups.
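One convenient way to run all those pairwise comparisons at once, with a correction for multiple testing, is pairwise.t.test; TukeyHSD on an aov fit (with an intercept) is another standard follow-up:

```r
data(tips, package = "reshape2")

# all pairwise t-tests with Holm's adjustment for multiple comparisons
pairwise.t.test(tips$tip, tips$day, p.adjust.method = "holm")

# Tukey's honest significant differences from an aov fit
TukeyHSD(aov(tip ~ day, data = tips))
```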

Whether computing simple numerical summaries or conducting hypothesis tests, R has functions for all of them. Means, variances and standard deviations are computed with mean, var and sd, respectively. Correlation and covariance are computed with cor and cov. For t-tests, t.test is used, while aov is used for ANOVA.
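As a quick reference, on toy vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

mean(x)    # 3
var(x)     # 2.5
sd(x)      # sqrt(2.5)
cor(x, y)  # Pearson correlation of x and y
cov(x, y)  # covariance of x and y
```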

So far in the data science series using R, we have discussed how to use R for data preparation, building statistical plots and performing basic to advanced statistical tests. From the next article, we will turn our discussion towards understanding machine learning algorithms using R.

Simple Linear Regression Using R — 7

Do share your thoughts and support by commenting and sharing the article among your peer groups.
