Understanding Significance Levels
Statistical tests are an integral part of a data scientist's repertoire. Every day, we clean, sort, and model data with the assumption that the differences we find in the numbers actually matter.
Do the salaries of Segment A and Segment B of the population really differ enough to matter? Can we explain the two-dollar difference by sheer luck in sampling? That is why we have to do statistical tests: to show that the differences we see are unlikely to be the product of chance alone.
Hypothesis Tests Overview
When we use data, we know that the numbers we work with are not every example of the wider world. We work with samples because it is impossible to get 7 billion surveys or 7 billion responses or 7 billion of anything from everyone on the planet.
The samples we work with stand in for the bigger population, and because of this we have to make certain that our sampling doesn't lead us to untrue conclusions about that population. That is why we do statistical tests and why significance levels are so important.
Statistical tests allow us to determine whether the differences we see between two segments of the sample reflect a real difference in the population. And that's where significance levels come into play. When we conduct hypothesis tests we are testing one statement, known as the null hypothesis. That statement is: “There is no difference between Segment A and Segment B.”
All of our results either allow us to reject this statement or mean we fail to reject it. Our significance level is usually set at 0.05, or 5%. So if a p-value comes in at 0.12, or 12%, we say we fail to reject the null hypothesis: the data do not give us enough evidence to conclude that the two segments differ, which is not the same as showing that they are identical. But if the p-value is 0.03, or 3%, we say we reject the null hypothesis in favor of the alternative hypothesis, which states that there is a difference between Segment A and Segment B.
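The decision rule above can be sketched in a few lines of Python. The article does not name a specific test, so this example swaps in a permutation test on the difference in means, chosen only because it needs nothing beyond the standard library. The salary figures and the names `permutation_p_value` and `decide` are hypothetical.

```python
import random

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the segment labels are interchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (extreme + 1) / (n_permutations + 1)

def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# Hypothetical salaries (in thousands) for the two segments.
segment_a = [50, 52, 49, 51, 53, 50, 48, 52]
segment_b = [58, 60, 57, 61, 59, 62, 58, 60]

p_value = permutation_p_value(segment_a, segment_b)
print(f"p-value: {p_value:.4f} -> {decide(p_value)}")
```

With the p-values used in the text, `decide(0.12)` fails to reject the null hypothesis and `decide(0.03)` rejects it.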
The significance level, also known as alpha, is the amount of risk we are willing to accept when conducting hypothesis tests. It is the probability of making a Type I error, that is, rejecting the null hypothesis when it is actually true. However, there is a school of thought that the usual 5% is an arbitrary number that doesn't translate well across different data sets and that alpha should instead scale with the number of observations.
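One way to see the link between alpha and Type I errors is a small simulation: draw both segments from the same distribution, so the null hypothesis is true by construction, and count how often a test rejects it anyway. This is a rough sketch that assumes a two-sample z-test with a known standard deviation. The function names and the simulation settings are illustrative choices, not something taken from the article.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_a, sample_b, sd):
    """Two-sided two-sample z-test p-value, assuming a known common sd."""
    mean_a = sum(sample_a) / len(sample_a)
    mean_b = sum(sample_b) / len(sample_b)
    standard_error = sd * sqrt(1 / len(sample_a) + 1 / len(sample_b))
    z = (mean_a - mean_b) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def type_i_error_rate(alpha=0.05, n=50, n_trials=2000, seed=1):
    """Share of trials that reject the null even though it is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Both samples come from the same population: no real difference.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        if z_test_p_value(a, b, sd=1.0) <= alpha:
            rejections += 1
    return rejections / n_trials

rate = type_i_error_rate()
print(f"Share of true nulls rejected at alpha = 0.05: {rate:.3f}")
```

The printed share should land close to the chosen alpha, which is what "probability of a Type I error" means in practice.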
If I had two data sets, one with 200 observations and another with 30,000 observations, the statistical differences would have to be judged differently. A difference between segments in a sample of 200 is very different from the same difference in a sample of 30,000. The larger data set could fool the hypothesis test into flagging a trivially small difference as statistically significant, which is closer to a false positive than a false negative.
As the number of observations grows, even a small difference between the segments becomes statistically significant and can lead to very low p-values. Lowering the alpha in these situations compensates for the naturally lower p-values.
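To make this concrete, here is a minimal sketch that holds the difference between segments fixed and changes only the sample size. It assumes a two-sample z-test with a known standard deviation and equal group sizes. The two-dollar gap echoes the salary example earlier, while the standard deviation of 50 and the reading of 200 and 30,000 as observations per segment are made-up assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a fixed mean difference at a given sample size."""
    # The standard error shrinks as the sample grows.
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same two-dollar difference and same spread, only the sample size changes.
p_small = two_sample_z_p_value(mean_diff=2.0, sd=50.0, n_per_group=200)
p_large = two_sample_z_p_value(mean_diff=2.0, sd=50.0, n_per_group=30_000)

print(f"n = 200 per segment:    p = {p_small:.3f}")
print(f"n = 30,000 per segment: p = {p_large:.2e}")
```

Under these assumptions the same two-dollar gap is nowhere near significant with 200 observations per segment and falls far below 0.05 with 30,000, even though the size of the difference never changed.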
In the two cases above, the sample sizes were a couple of orders of magnitude apart. So we need a way to determine a new alpha that scales with the data.
Below is a formula that is inversely related to the number of observations. This formula will deliver a smaller significance level as the number of observations increases.
The use of 100 as the constant by which the number of observations is divided is an arbitrary choice made in several studies and can be replaced with other numbers. The point is to use the number of observations as a reducing factor when a data set is much larger than the constant.
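The formula itself is not reproduced in the text here, so the sketch below shows only one possible form that matches the description: the base alpha divided by the square root of the number of observations over 100. The square root, the cap at the base alpha for small data sets, and the names `scaled_alpha`, `base_alpha`, and `reference_n` are all assumptions, not the article's own definition.

```python
from math import sqrt

def scaled_alpha(n_observations, base_alpha=0.05, reference_n=100):
    """Shrink alpha as the data set grows past reference_n (assumed form)."""
    if n_observations <= 0:
        raise ValueError("n_observations must be positive")
    adjusted = base_alpha / sqrt(n_observations / reference_n)
    # Assumption: never loosen alpha for data sets smaller than reference_n.
    return min(base_alpha, adjusted)

for n in (100, 200, 30_000):
    print(f"n = {n:>6}: alpha = {scaled_alpha(n):.4f}")
```

Under this assumed form, 200 observations give an alpha of roughly 0.035 and 30,000 observations give roughly 0.0029. Changing `reference_n` shifts the point at which the reduction starts, which mirrors the remark that 100 can be replaced with other numbers.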
The common flaw in the argument for automatically switching to a smaller fixed significance level, such as 5% instead of 10%, or 1% instead of 5%, is that these numbers are also arbitrarily set. An alpha of 5% applied to a data set of 1,000 observations and to one of 1 million observations will flag very different sizes of difference as significant, so the same alpha shouldn't be used interchangeably across them.