In-depth intuition about T-test

Andrew Ryabchenko · Published in Nerd For Tech · May 3, 2021

The T-test is a powerful technique for hypothesis testing, well known to virtually every statistician, analyst, and data scientist. It belongs to the set of basic statistical tools, and developing an in-depth intuition for this famous technique pays off, because it helps you apply the T-test with greater confidence. In this article, I will try to help you develop a deep understanding of the T-test and illustrate the logic behind it.
I am a very curious person and I don’t like to take things for granted; instead, I try to dig deeper and see why things work the way they do. When I first encountered the T-test during my journey through a data science program, I immediately needed to understand the logic behind this formula:
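For a one-sample test, the t-statistic is computed as

t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean, μ is the population mean, s is the sample standard deviation, and n is the sample size.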

It was not intuitive to me why the population mean is subtracted from the sample mean to obtain the t-statistic, so I decided to do my own research. After further exploration, I realized that grasping the logic of the T-test requires an understanding of its fundamental building block.

Central Limit Theorem

The formal statement of the central limit theorem says that if you have a population with mean μ and take sufficiently large random samples from that population with replacement, then the distribution of the sample means will be approximately normal, with mean equal to the population mean μ. To illustrate this, I put together a simple visual demonstration in Python. Let’s first test the claim on a uniform distribution, where all numbers have an equal chance of being drawn.

The uniform distribution I used for this purpose consists of a thousand numbers ranging from 0 to 100. Let’s draw 15, 50, and 200 random samples of ten numbers each and visualize the distributions of their means.
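A minimal sketch of this experiment (the population values and seed here are my own, so the exact numbers will differ slightly from the figures):

```python
import random
import statistics

random.seed(42)

# A population of 1,000 numbers drawn uniformly from [0, 100],
# mirroring the setup described above.
population = [random.uniform(0, 100) for _ in range(1000)]
pop_mean = statistics.mean(population)

def sample_means(population, n_samples, sample_size):
    """Draw n_samples random samples of sample_size numbers each,
    with replacement, and return the list of their means."""
    return [
        statistics.mean(random.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

for n_samples in (15, 50, 200):
    means = sample_means(population, n_samples, 10)
    print(f"{n_samples:>3} sample means: mean = {statistics.mean(means):6.2f} "
          f"(population mean = {pop_mean:.2f})")
```

Swapping the population for `random.gauss(79.25, 5)` draws reproduces the normal-distribution experiment below in exactly the same way.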

As you can see, the larger the number of sample means, the more closely their distribution resembles a normal distribution, and the closer its mean gets to the population mean of 49.53.

Let’s also test the central limit theorem on a normal distribution of a thousand numbers with a mean of 79.25 and a standard deviation of 5.

Using the same setup, we draw 15, 50, and 200 random samples of ten numbers each and visualize the distributions of their means.

The same observation holds here as well: the larger the number of sample means, the more closely their distribution resembles a normal distribution, and the closer its mean gets to the population mean of 79.25.

Using our “building block” to understand the logic behind the T-test

It is useful to apply what we have learned so far to a practical example. Suppose we are tasked with testing how effective a client restaurant’s new menu is at increasing gross earnings. In particular, we want to see whether the change in the average check is statistically significant (statistical significance implies that the change is most likely not a result of sampling randomness). Before the new menu was introduced, the average check per customer table was 66.9 dollars. This number will be our population mean μ. Some time after the new menu was introduced, we go into the database and take a random sample of 100 customer checks. The average check is now 81.3 dollars with a standard deviation of 8.47 dollars, so our sample mean is 81.3 dollars. Let’s calculate the t-statistic using the T-test formula.

To keep the article short, I’ll skip the arithmetic. The resulting t-statistic I obtained is 17.00. Now we need to refer to a t-table to determine the likelihood of obtaining such a value.
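For completeness, here is the skipped arithmetic as a quick check in Python, plugging in the numbers from the example above:

```python
import math

mu = 66.9     # population mean: average check before the new menu
x_bar = 81.3  # sample mean: average check after the new menu
s = 8.47      # sample standard deviation
n = 100       # sample size

# One-sample t-statistic: how many standard errors the sample mean
# lies from the population mean.
standard_error = s / math.sqrt(n)
t_statistic = (x_bar - mu) / standard_error

print(f"{t_statistic:.2f}")  # 17.00
```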

Our t-statistic is so large that it is not even present in the table; therefore, the likelihood of obtaining a sample mean this far from the population mean is extremely small. We can state with at least 99.9% confidence that the change in the average check is statistically significant and is not a result of sampling randomness.
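Instead of a printed t-table, the tail probability can be computed directly; a sketch using SciPy (assuming it is installed):

```python
from scipy import stats

t_statistic = 17.00
df = 99  # degrees of freedom: n - 1 for a one-sample test

# Two-sided p-value: the probability of observing a t-statistic at
# least this extreme under the null hypothesis.
p_value = 2 * stats.t.sf(t_statistic, df)
print(p_value)  # vanishingly small, far below 0.001
```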

We are now getting to the point of this article: the intuition behind the entire process. Let’s get back to the distribution of sample means. We know that it is approximately normal and has a mean equal to the population mean. Therefore, taking the difference between the sample mean and the population mean in the t-statistic formula amounts to taking the difference between the sample mean and the mean of the distribution of sample means. Since the distribution of sample means is approximately normal, the probability of its values is well defined, as it is for any normal distribution. When the magnitude of the t-statistic is large, the sample mean is far from the mean of the corresponding distribution of sample means; therefore, the sample mean we are studying is not likely to originate from that distribution, and it logically follows that the sample is not likely to have been drawn from the population.

But how can the sample not be likely to have been drawn from the population if it was in fact drawn from it? The answer is simple: after the new menu was introduced, the hypothetical population of all future customer checks changed in some way, and it no longer resembles the old population of checks from before the menu update.

I truly hope that this article helps you to better understand the T-test. Thank you for reading!
