Data Scientist Must Know — A quick guide to two sample tests
Two sample tests made easy
As its name suggests, the two sample mean test is used when the researcher is interested in two populations. You take one sample from each of two different populations and test whether the populations have the same mean.
For two sample parametric tests, there are four different tests, each with different requirements and methods. This post will serve as your guide to picking the correct test and using it.
The Hypothesis
Just like the single sample test, before we do the test, we need to determine the hypothesis. Since this is a two sample test, the hypothesis is slightly different from that of the single sample test. However, you should notice the similarities. Below are the hypotheses of the two sample tests:
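In standard notation, the hypotheses for a two sample mean test are usually written as follows. This is a reconstruction of the usual textbook form, not a copy of the original figure; the symbol d_0 for the hypothesized difference is an assumption here, and in most applications d_0 = 0 (equal means).

```latex
\begin{aligned}
\text{Two-sided:}    \quad & H_0: \mu_1 - \mu_2 = d_0 \quad \text{vs.} \quad H_1: \mu_1 - \mu_2 \neq d_0 \\
\text{Right-tailed:} \quad & H_0: \mu_1 - \mu_2 = d_0 \quad \text{vs.} \quad H_1: \mu_1 - \mu_2 > d_0 \\
\text{Left-tailed:}  \quad & H_0: \mu_1 - \mu_2 = d_0 \quad \text{vs.} \quad H_1: \mu_1 - \mu_2 < d_0
\end{aligned}
```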
Notice that just like the single sample test, there are three different hypotheses that you can choose from. And just like the single sample test, the test you use does not depend on the hypothesis. After determining your hypothesis, we can proceed to the tests.
The Z-test
Just like the single sample mean test, this Z-test relies on the central limit theorem. The assumptions for this test are:
1. The distribution of the populations may be known or unknown (it doesn't really matter)
2. The variance of both populations is known
3. The sample size of both samples is reasonably large
Now that we know the requirements, here are the steps for doing the test:
1. Calculate the average from the two samples
2. Calculate the test statistic (denoted as z) with the formula below
3. Compare the z value with the appropriate critical value from the standard normal distribution at the significance level (the error rate) chosen for the test
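The three steps above can be sketched in a few lines of Python. The standard form of the statistic is z = ((mean1 - mean2) - d0) / sqrt(var1/n1 + var2/n2), where var1 and var2 are the known population variances. The function name, the parameter names, and the choice of a two-sided test with alpha = 0.05 are assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_sample_z_test(sample1, sample2, var1, var2, d0=0.0, alpha=0.05):
    """Two-sided two sample Z-test with known population variances.

    var1 and var2 are the known population variances, d0 is the
    hypothesized difference in means (assumed 0 by default).
    """
    n1, n2 = len(sample1), len(sample2)

    # Step 1: average of each sample
    mean1, mean2 = mean(sample1), mean(sample2)

    # Step 2: test statistic
    z = ((mean1 - mean2) - d0) / sqrt(var1 / n1 + var2 / n2)

    # Step 3: two-sided critical value from the standard normal distribution
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    reject_h0 = abs(z) > z_crit
    return z, z_crit, reject_h0
```

For a one-sided hypothesis, the critical value would instead be taken at 1 - alpha and compared with z directly rather than with its absolute value.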
The T-test
Again, as the counterpart of the single sample T-test, this test relies on the sample variances when the variances of both populations are unknown. However, unlike the single sample T-test, the two sample counterpart is more complicated. The two sample T-test has two different types, the equal variance test and the unequal variance test. As their names suggest, the choice between them depends on whether the population variances are equal. Here are the two different T-tests that you can do.
The equal variance T-test
This test is also called the pooled T-test. The assumptions of this test are as follows:
1. Both samples come from a normal distribution
2. The variances of both populations are unknown and equal
3. The sample sizes of both samples are irrelevant (however, the larger the better)
Notice that this test requires both samples to come from normal populations, which is a stricter requirement than in the single sample T-test.
The steps for the test are as follows:
1. Calculate the average of the two samples
2. Calculate the pooled variance with the formula
3. Calculate the test statistic (denoted as t) with the formula
4. Compare the test statistic with the appropriate critical value from the t-distribution at the significance level chosen for the test. The degrees of freedom equal the denominator of the pooled variance formula, n1 + n2 - 2
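As a minimal sketch of these steps, the pooled variance in its standard form is sp^2 = ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2), and the statistic is t = ((mean1 - mean2) - d0) / sqrt(sp^2 * (1/n1 + 1/n2)). The function and parameter names below are illustrative assumptions. The sketch stops at the statistic and the degrees of freedom, because the Python standard library has no t-distribution; the critical value has to come from a t-table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_test(sample1, sample2, d0=0.0):
    """Equal variance (pooled) two sample T-test statistic.

    Returns the t statistic, the degrees of freedom, and the pooled variance.
    d0 is the hypothesized difference in means (assumed 0 by default).
    """
    n1, n2 = len(sample1), len(sample2)

    # Step 1: average of each sample
    mean1, mean2 = mean(sample1), mean(sample2)

    # Step 2: pooled variance from the two sample variances (n - 1 denominators)
    s1_sq, s2_sq = variance(sample1), variance(sample2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df

    # Step 3: test statistic
    t = ((mean1 - mean2) - d0) / sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Step 4 (not done here): compare t with the critical value of the
    # t-distribution with df degrees of freedom.
    return t, df, pooled_var
```

If SciPy is installed, scipy.stats.ttest_ind(sample1, sample2, equal_var=True) runs the same pooled test and also reports a p-value, which is a convenient way to cross-check a hand calculation.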
The unequal variance T-test
Most of the time, we cannot assume that the variances of both populations are equal. In this situation we use the unequal variance T-test. The assumptions of this test are as follows:
1. Both samples come from a normal distribution
2. The variances of both populations are unknown and unequal
3. The sample sizes of both samples are irrelevant (however, the larger the better)
The steps for the test are as follows:
1. Calculate the average of the two samples
2. Calculate the degrees of freedom of the t-distribution (denoted as v) with the formula
3. Calculate the test statistic (denoted as t) with the formula
4. Compare the test statistic with the appropriate critical value from the t-distribution with v degrees of freedom (calculated at step 2 of the test) at the significance level chosen for the test
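These steps can be sketched as follows. The degrees of freedom use the usual Welch-Satterthwaite approximation, v = (s1^2/n1 + s2^2/n2)^2 / ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1)), and the statistic is t = ((mean1 - mean2) - d0) / sqrt(s1^2/n1 + s2^2/n2). The function and parameter names are assumptions for illustration, and the critical value lookup is again left to a t-table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def unequal_variance_t_test(sample1, sample2, d0=0.0):
    """Unequal variance two sample T-test statistic.

    Returns the t statistic and the approximate degrees of freedom v.
    d0 is the hypothesized difference in means (assumed 0 by default).
    """
    n1, n2 = len(sample1), len(sample2)

    # Step 1: average of each sample
    mean1, mean2 = mean(sample1), mean(sample2)

    # Squared standard error contributed by each sample: s^2 / n
    se1_sq = variance(sample1) / n1
    se2_sq = variance(sample2) / n2

    # Step 2: approximate degrees of freedom (generally not an integer)
    v = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1)
    )

    # Step 3: test statistic
    t = ((mean1 - mean2) - d0) / sqrt(se1_sq + se2_sq)

    # Step 4 (not done here): compare t with the critical value of the
    # t-distribution with v degrees of freedom.
    return t, v
```

Because v is usually fractional, tables require rounding it down to the nearest integer, while software can use it directly. With SciPy installed, scipy.stats.ttest_ind(sample1, sample2, equal_var=False) performs this version of the test.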
Notice that both the equal and unequal variance tests assume the normality of both samples. If the samples are not normal, I suggest using a non-parametric test instead.
The Paired Observations T-test
The previous tests have all assumed that the samples are independent. However, what if the samples are dependent, or more specifically, what should we do if they are paired? Paired observations, as the name suggests, are observations that come in pairs and are usually taken from a single sample. Most of the time, paired observations are used in testing a before-and-after effect. The assumptions for this test are as follows:
1. The differences between the paired observations come from a normal distribution
2. The variance of the differences is unknown
3. The number of pairs is irrelevant (however, the larger the better)
The steps for the test are as follows:
1. Calculate the difference within each pair of observations, and denote the difference as d
2. Calculate the average and the sample variance of d
3. Calculate the test statistic (denoted as t) with the formula
4. Compare the test statistic with the appropriate critical value from the t-distribution with n-1 degrees of freedom, where n is the number of pairs
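The paired test reduces to a single sample T-test on the differences, which a short sketch makes clear. The standard statistic is t = (d_mean - d0) / (s_d / sqrt(n)), where d_mean and s_d are the average and sample standard deviation of the differences. The function name, the parameter names, and the convention d = first - second are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_test(first, second, d0=0.0):
    """Paired observations T-test statistic.

    first and second hold the two measurements of each pair, in the same
    order. Returns the t statistic and the degrees of freedom (n - 1).
    """
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")

    # Step 1: difference within each pair
    d = [x - y for x, y in zip(first, second)]
    n = len(d)

    # Step 2: average and sample standard deviation of the differences
    d_mean = mean(d)
    s_d = stdev(d)

    # Step 3: test statistic
    t = (d_mean - d0) / (s_d / sqrt(n))

    # Step 4 (not done here): compare t with the critical value of the
    # t-distribution with n - 1 degrees of freedom.
    return t, n - 1
```

With SciPy installed, scipy.stats.ttest_rel(first, second) runs the same test and reports a p-value.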
A simple-to-follow flowchart
To make things easier, you can use this simple flowchart to determine which test to use.
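The same decision logic can also be written as a small function. This is a sketch rebuilt from the assumptions listed for each test in this post, not a transcription of the flowchart image; the function name, the argument names, and the order of the checks are all assumptions.

```python
def choose_two_sample_test(*, paired, variances_known, large_samples,
                           normal, equal_variances):
    """Suggest a two sample mean test from yes/no answers.

    All arguments are booleans. For paired data, `normal` refers to the
    differences; otherwise it refers to both samples.
    """
    if paired:
        # Dependent samples: work with the differences
        return "paired observations T-test" if normal else "non-parametric test"

    if variances_known and large_samples:
        # Central limit theorem applies, population variances are known
        return "Z-test"

    if not normal:
        # The T-tests in this post assume normality
        return "non-parametric test"

    if equal_variances:
        return "equal variance T-test"
    return "unequal variance T-test"
```

For example, independent normal samples with unknown and unequal variances lead to the unequal variance T-test:

```python
choose_two_sample_test(paired=False, variances_known=False,
                       large_samples=False, normal=True,
                       equal_variances=False)
```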