Hypothesis testing — one sample t-test and comparing groups in R

Data analysis using R in Six Sigma style — Part 3

Rafal Burzynski
May 22, 2023

This article is part of the series titled “Data analysis using R in Six Sigma style”. If you have not read Part 1 and Part 2, here are the links to them:

All script files are available in this GitHub repo:

If you want to run R scripts in VS Code, you can read a dedicated article that will show you how to set things up.

In this part we will look at comparing groups, which allows us to compare samples taken from two processes (before and after improvements) and to check whether there is a statistically significant difference between them. A few statistical tests are used for this check, and they all rely on what is called hypothesis testing.

Hypothesis testing — what is it about?

Hypothesis testing, when encountered for the first time, can be a bit confusing. The way I often explain it is the following: let’s imagine that there is a commonly accepted fact, a status quo, something that many believe to be true. This status quo is called the null hypothesis. In order to question the status quo, and thus propose an alternative view of the fact, some data must be collected and then checked against the status quo. The alternative view connected with the data I am collecting is called the alternative hypothesis. Finally, I must run a specified test and draw conclusions based on its results.
It may turn out that the null hypothesis is actually true and cannot be rejected. On the other hand, my test may show that I can reject the null hypothesis and accept the alternative. Of course, by rejecting the null hypothesis based on the evidence my test provided, I am taking some risk: there is a small chance that the null hypothesis I have rejected is actually true. Obviously, I want to keep the risk of committing such an error as low as possible. In the world of statistical testing the accepted risk is usually 5 %, or 0.05, and for that reason the threshold we will see in hypothesis testing (and actually saw in Part 1 of this article series) is a p-value of 0.05.
The risk of rejecting the null hypothesis when it is actually true is called the alpha error (also known as a Type I error).
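To make the alpha error tangible, here is a small simulation sketch (not part of the article's script): we repeatedly test a true null hypothesis at the 0.05 threshold and count how often it gets wrongly rejected.

# a sketch: testing a true null hypothesis many times
# rejects it in roughly 5 % of runs, which is exactly the alpha error
set.seed(1)
p_values <- replicate(10000, t.test(rnorm(30, mean = 50), mu = 50)$p.value)
mean(p_values < 0.05)  # close to 0.05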

If you want deeper insight into hypothesis testing and the interpretation of the p-value, I recommend a few videos by Cassie Kozyrkov from her series Making Friends with Machine Learning (the other videos in the series are worth watching too). For hypothesis testing I suggest starting with MFML 072 (linked below) and continuing through MFML 075.

As I already mentioned, in Part 1 of this article series we used the Shapiro-Wilk test for normality, and this is an example of hypothesis testing. The null hypothesis in this test is that the data follows a normal distribution; the alternative hypothesis is that it does not. That means that as long as the p-value from this test is above the threshold of 0.05, I should not reject the null hypothesis and can treat the data as normally distributed. As we will see later in this article, that information is crucial when comparing groups, because it determines which test to use for comparing centering.
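As a quick illustration on simulated data (a sketch, not our dataset), the test behaves as expected:

# Shapiro-Wilk test on simulated data
set.seed(2)
shapiro.test(rnorm(50))  # normal data: p-value usually well above 0.05
shapiro.test(rexp(50))   # skewed data: p-value usually far below 0.05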

In this article we will use hypothesis tests whose null and alternative hypotheses are already defined. However, when we have to formulate the null hypothesis ourselves, it should represent the unfavorable outcome: we then collect data to support the alternative hypothesis, which is the favorable result we want to see.

In this article we will focus on the one sample t-test and the two sample t-test, as well as the F-test for equal variances.

One sample t-test

The one sample t-test is used to check whether a certain number can be treated as the mean value of the population represented by our measurements.
When we take a sample out of a population and measure it, we can describe the population with a mean and a standard deviation, provided our measurement data is normally distributed. Because we are describing the population with numbers (mean, standard deviation) obtained from a sample of that population, the mean value would change slightly if we repeated the same experiment many times (there is a nice way to show this with the bootstrapping technique, and I will write a separate article about it). To account for this variation caused by representing a population with a sample, we give confidence intervals for both the mean and the standard deviation. Typically, a 95 % confidence interval is calculated.
For the mean, the confidence interval is calculated using the t-distribution (also known as Student’s distribution). Hence the name t-test.
A one sample t-test means we have a distribution from one sample and a confidence interval for its mean, and we compare a certain number against this confidence interval.
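To make this concrete, here is a minimal sketch of how such an interval is built from the t-distribution (ci_mean is a hypothetical helper, not part of the article's script):

# a sketch: confidence interval for the mean based on the t-distribution
ci_mean <- function(x, conf = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)  # standard error of the mean
  mean(x) + qt(c((1 - conf) / 2, (1 + conf) / 2), df = n - 1) * se
}

As we will see below, t.test() reports the same interval.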

Now let’s look at the code implementation. We will use a different dataset than in Parts 1 and 2 of the article series.

> head(df)
Sample_1 Sample_2 Sample_3
1 55.80 67.88 66.70
2 55.71 64.87 61.00
3 53.88 64.92 60.20
4 55.58 62.86 59.99
5 54.54 63.40 57.18
6 54.22 61.67 62.08

We will use Sample_1 for the one sample t-test. The other two will be used for comparing groups.

A one sample t-test is fairly easy to run in R, as there is a dedicated function for it. But first we need to check whether Sample_1 has a normal distribution, so that we can run the t-test as the next step. Here is the R code implementation:

# 1. One sample t-test
# ==========================
M <- df$Sample_1

# normality test
shapiro.test(M)

# check if the sample mean is 55
t.test(M, mu = 55)

# check if the sample mean is 52
t.test(M, mu = 52)

First we assign the column Sample_1 to the variable M and run the Shapiro-Wilk test for normality:

> shapiro.test(M)

Shapiro-Wilk normality test

data: M
W = 0.97176, p-value = 0.2726

The p-value is bigger than 0.05, so the data can be treated as normally distributed.
The first one sample t-test will check whether 55 is the mean value of the Sample_1 dataset. The second t-test will do the same for 52.

> t.test(M, mu = 55)

One Sample t-test

data: M
t = -10.06, df = 49, p-value = 1.652e-13
alternative hypothesis: true mean is not equal to 55
95 percent confidence interval:
51.63274 52.75406
sample estimates:
mean of x
52.1934

In the printout of the first one sample t-test we can see the alternative hypothesis: true mean is not equal to 55. The null hypothesis in this test is therefore: true mean is equal to 55. The p-value is so low that it can practically be taken as 0, so we can reject the null hypothesis; the result of this test is that 55 is not the true mean of the data. Another thing worth noticing is the 95 % confidence interval for the mean of the dataset: it runs from 51.63 to 52.75, so 55 is definitely outside of it. Finally, at the bottom of the printout we can see the mean value as a point estimate. It is the same as the value returned by the mean(M) function.
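As a side note, t.test() returns an object whose elements can be accessed directly (these names come from R's standard htest object), which is convenient in scripts:

res <- t.test(M, mu = 55)
res$p.value   # the p-value
res$conf.int  # the 95 % confidence interval
res$estimate  # the sample mean, same as mean(M)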

The second test deliberately uses 52, which lies within the confidence interval. The result is the following:

> t.test(M, mu = 52)

One Sample t-test

data: M
t = 0.6932, df = 49, p-value = 0.4915
alternative hypothesis: true mean is not equal to 52
95 percent confidence interval:
51.63274 52.75406
sample estimates:
mean of x
52.1934

Here the p-value is bigger than 0.05, so the null hypothesis cannot be rejected: 52 can be treated as the mean of the population.

Comparing groups

Very often we collect data before we change something in a process (hopefully improving it) and again after the change. Naturally, we want to compare the two sets of measurements and conclude whether the two samples are statistically different; in other words, whether our actions changed something in the process or we are still operating within common cause variation.
Comparing groups is the type of comparison where the output (response) is continuous but the input is discrete (sample A and sample B).

There are also other possibilities:
- the output is continuous and the input is also continuous: we use linear regression;
- the output is discrete and the input is continuous: we use logistic regression;
- the output and the input are both discrete: we use the chi-square test.

However, in our case the data we collected (the output) is continuous, and the input is either Sample_2 or Sample_3, which we want to compare against each other. For that reason, in this article we will focus on comparing groups.

Comparing groups follows a certain workflow, with steps that need to be completed before drawing conclusions from the comparison. Comparing two samples is the simplest case, but it gives a good overview of how to approach the task.

As the first step, we need to check the stability of the data. This can be done with an I-MR chart, which was introduced in Part 2 of this article series. One important note: the dataframe we loaded for this script contains some NA values, which must be dropped before running the analysis. Fortunately, this can be done quite easily with one line of code using a pipe, which in this script also requires loading the tidyverse library (see the full script at the end of the article).

# 2. Two sample t-test
# ==========================

library(tidyverse)  # provides %>% and drop_na()

# cleaning data to drop NA
df2 <- df[2:3] %>% drop_na()

# sample 1
s1 <- df2$Sample_2
# sample 2
s2 <- df2$Sample_3

# Step A - stability of data
# I_MR() is the custom charting function introduced in Part 2
I_MR(s1)
I_MR(s2)
(Figure: I-MR chart for Sample_2)
(Figure: I-MR chart for Sample_3)

Looking at the I-MR charts for both Sample_2 and Sample_3, one can see that the data is not perfect. There are points outside of the control limits, and in the case of Sample_2 a trend can be observed. Normally this would require further investigation to understand what causes the trend and the out-of-control points. However, for the purpose of this article we will accept the data as it is and proceed to the next step, which is checking the normality of the data for Sample_2 and Sample_3.
Here the code is extended a bit compared to Part 1, to show both qq plots and histograms for the samples:

# Step B - checking for normality
shapiro.test(s1)
shapiro.test(s2)
par(mfrow = c(2,2))
qqnorm(s1)
qqline(s1, col="#750505")
qqnorm(s2)
qqline(s2, col="#750505")
hist(s1, col = "steelblue", breaks = 15)
hist(s2, col = "steelblue", breaks = 15)
par(mfrow = c(1,1))

-----------

> shapiro.test(s1)

Shapiro-Wilk normality test

data: s1
W = 0.97181, p-value = 0.6098

> shapiro.test(s2)

Shapiro-Wilk normality test

data: s2
W = 0.95443, p-value = 0.2381

Based on the results of the normality tests, we can conclude that both samples can be described with a normal distribution, although the qq plots and histograms do not look perfect. In real life we will very often collect data like this.
The normality test was needed to confirm that both samples can be described by a normal distribution, so that a confidence interval can be calculated for each mean and the two can be compared with each other. This is how the two sample t-test works. Its use is also pretty straightforward, but before we run it we need to check one more thing: comparing the variances, which can be done with an F-test.

# Step C - checking for equal variances
# F-test
var.test(s1, s2)

---------
> var.test(s1, s2)

F test to compare two variances

data: s1 and s2
F = 1.0962, num df = 28, denom df = 28, p-value = 0.8097
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5146733 2.3348554
sample estimates:
ratio of variances
1.096215

The F-test compares the ratio of two variances. We have not introduced the term variance before, but it is easy to understand: it is simply the squared standard deviation. If the variances are equal, their ratio will be 1, and that is what the F-test checks. As in the case of the one sample t-test, the printout states the alternative hypothesis, from which we can infer the null hypothesis: the true ratio of variances is equal to 1, in other words, the variances are equal. The p-value of our test is much bigger than 0.05, so we cannot reject the null hypothesis. This is the last piece of information needed to check centering (comparison of mean values) with the two sample t-test.
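This relationship is easy to verify in R (a quick check, not part of the original script):

sd(s1)^2            # variance is the squared standard deviation, same as var(s1)
var(s1) / var(s2)   # the ratio reported by var.test(), about 1.096 here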

If the samples did not have a normal distribution, we would use a different type of test for comparing centering, such as Mood’s median test. But this is not our case, and the code for us is the following (note the parameter var.equal):

# Step D - checking centering
t.test(s1, s2, var.equal = TRUE)

boxplot(s1, s2, col="steelblue")

-----------
> t.test(s1, s2, var.equal = TRUE)

Two Sample t-test

data: s1 and s2
t = 4.0886, df = 56, p-value = 0.0001401
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.035431 5.945948
sample estimates:
mean of x mean of y
60.40138 56.41069

In this case the p-value is practically 0, so the null hypothesis can be rejected and we can conclude that the mean values of the two samples are different. We can see the means of both samples at the bottom of the printout, and the confidence interval is given for the difference between the means; even its lower bound is greater than 0. The boxplot (which marks the medians) also shows that the second compared sample (Sample_3 in the original dataset) is centered lower than the first one (Sample_2 in the original dataset).
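A side note: if the F-test had rejected equal variances, we would simply drop var.equal = TRUE and let t.test() run Welch’s test, which is its default behavior in R:

# Welch two sample t-test, the default when var.equal is not set
t.test(s1, s2)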

Conclusion

This article introduced two important and frequently used things: the concept of hypothesis testing and the workflow of comparing groups.

Hypothesis testing might seem hard when you start testing your data this way, but after a while you become more and more accustomed to it. And as we showed in this article, it is used quite often in data analysis.

Comparing groups is another useful workflow for comparing data gathered from a process before and after a change; it allows us to check whether the change introduced an improvement that is statistically significant.

The next article (Part 4) will focus on linear regression.

And, as always, the full script is available below:


Rafal Burzynski

I like to learn stats, data science and coding in Python and R. Then I publish what I have learnt. See my other work at https://rafburzy.github.io