What Data Scientists Should Know About Statistical Experiments and Significance Testing

Published in

Data Folks Indonesia

5 min read · Sep 26, 2021

Machine learning, supervised learning, unsupervised learning, ensembles, stacking, boosting, deep learning: name any concept that pops into your mind, and it is probably one of the things most people talk about. I am not saying those aren’t important, but as data scientists we also need to understand the fundamentals, and one of those fundamentals is statistical experimentation and testing. This article is inspired by Practical Statistics for Data Scientists. If you come from a CS background and need to understand the basic components of statistics, you should read the book.

There are, of course, many statistical tests. In this article, I only list and explain some of the most commonly used tests and popular concepts.

1. A/B Testing

A/B testing is the most popular kind of experiment, whether it is applied to website design, product placement, order flow, and so on. The aim of an A/B test is to set up two groups, A and B, and determine which one is more successful. Usually one of the groups is the current version, called the control group, and the other receives the proposed change, called the treatment group. We typically hypothesize that the proposed approach is better than the status quo.

Users, visitors, customers, animals, or anything else you want to test (the subjects) are assigned to one group or the other. The assignment is usually random, and many A/B testing platforms, such as Google Analytics, have this feature built in. We randomize to keep the test fair: with random assignment, a difference between the groups has only two possible explanations, either the proposed treatment makes a real difference or the difference is just due to chance. Sometimes the randomization is not that random, especially when you run a campaign or promotion. In that case you also need to look at the subjects’ behavior: is a particular demographic driving the result, or is it purely the proposed treatment?
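To make random assignment concrete, here is a minimal sketch in Python with NumPy. The function name `assign_groups` and the user IDs are made up for illustration; a real A/B testing platform handles this step for you, and probably in a different way.

```python
import numpy as np

def assign_groups(subject_ids, seed=None):
    """Randomly split subjects into two groups of (almost) equal size."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(subject_ids)  # random order of all subjects
    half = len(shuffled) // 2
    return {"A": shuffled[:half].tolist(), "B": shuffled[half:].tolist()}

# Hypothetical user IDs 1..10
groups = assign_groups(list(range(1, 11)), seed=42)
print(groups)
```

Because the split depends only on the random shuffle, no characteristic of the subject (age, device, location) can systematically push someone into A or B.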

Some of the targets we measure when we conduct A/B test experiments:

Clicks/interactions
Engagement across a sequence of steps (funnel)
Session time
Purchases

We need to choose these metrics up front, based on the goal of the experiment. If we instead track every metric we can and decide later which one to report, we introduce researcher bias.

2. Hypothesis Tests

Formulating a hypothesis is a crucial part of a statistical experiment. You have probably heard of the null hypothesis and the alternative hypothesis:

Null Hypothesis: The hypothesis that there is no real difference between two or more groups, so any observed difference is due to chance.
Alternative Hypothesis: The hypothesis we want to support, namely that there is a real difference between two or more groups.

Every time we start an experiment, a good hypothesis plays an important role. Without one, the experiment loses direction and you will not know what you are trying to find out. An example of a hypothesis is that design A leads to longer session times than design B. The hypothesis you choose also shapes the rest of your analysis.

3. Resampling

Resampling in statistics means to repeatedly sample values from observed data, with a general goal of assessing random variability in a statistic. It can also be used to assess and improve the accuracy of some machine learning models (e.g., the predictions from decision tree models built on multiple bootstrapped data sets can be averaged in a process known as bagging).
There are two main types of resampling procedures: the bootstrap and permutation test. The bootstrap is used to assess the reliability of an estimate; it was discussed in the previous chapter. Permutation tests are used to test hypotheses, typically involving two or more groups. (Practical Statistics for Data Scientists, Bruce et al.)
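As a rough illustration of the bootstrap side, the sketch below resamples a small data set with replacement to estimate the standard error and a percentile interval for the mean. The session times are made-up numbers and `bootstrap_ci` is a hypothetical helper, not code from the book.

```python
import numpy as np

def bootstrap_ci(values, stat=np.mean, n_resamples=2000, level=0.95, seed=None):
    """Bootstrap standard error and percentile interval for a statistic."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    stats_ = np.empty(n_resamples)
    for i in range(n_resamples):
        # Draw a sample of the same size, with replacement
        resample = rng.choice(values, size=len(values), replace=True)
        stats_[i] = stat(resample)
    lower = np.percentile(stats_, 100 * (1 - level) / 2)
    upper = np.percentile(stats_, 100 * (1 + level) / 2)
    return stats_.std(ddof=1), (lower, upper)

# Made-up session times in minutes
session_times = [12.1, 9.8, 15.3, 11.0, 8.7, 14.2, 10.5, 13.9, 9.1, 12.8]
std_error, (ci_low, ci_high) = bootstrap_ci(session_times, seed=0)
print(f"SE of mean: {std_error:.2f}, 95% interval: ({ci_low:.2f}, {ci_high:.2f})")
```

A narrow interval suggests the estimate is stable; a wide one suggests more data is needed before trusting it.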

How to run a permutation test (resampling)

Combine the results from all groups (A, B, and any others) into one data set.
Note the size of each group, say A has 32 observations and B has 30.
Shuffle the combined data.
Split the shuffled data into new groups of the original sizes (the first 32 values for A, the remaining 30 for B).
Calculate the difference in the metric between the new groups.
Repeat N times to build the permutation distribution.
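The steps above can be sketched in a few lines of Python with NumPy. This is only an illustration under my own assumptions: the metric is the difference in means, the data is simulated, and `permutation_diffs` is a name I made up.

```python
import numpy as np

def permutation_diffs(a, b, n_permutations=5000, seed=None):
    """Return the permutation distribution of mean(A) - mean(B)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled = np.concatenate([a, b])          # combine all results
    n_a = len(a)                             # size of group A (the rest is B)
    diffs = np.empty(n_permutations)
    for i in range(n_permutations):          # repeat N times
        shuffled = rng.permutation(pooled)   # shuffle the combined data
        new_a, new_b = shuffled[:n_a], shuffled[n_a:]  # split by original sizes
        diffs[i] = new_a.mean() - new_b.mean()         # difference in the metric
    return diffs

# Simulated data: 32 observations for A and 30 for B, same underlying mean
rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, size=32)
group_b = rng.normal(10.0, 2.0, size=30)

observed = group_a.mean() - group_b.mean()
diffs = permutation_diffs(group_a, group_b, n_permutations=2000, seed=2)
print(f"observed difference: {observed:.3f}")
print(f"permutation distribution: mean {diffs.mean():.3f}, std {diffs.std():.3f}")
```

The permutation distribution shows how large a difference we would see if group labels did not matter at all, which is exactly what the null hypothesis claims.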

4. Statistical Significance and p-Values

“Statistically significant” is a phrase I often hear in statistics videos, but what does it mean? Statistical significance is how we measure whether the result of the proposed approach differs from the control by more than chance alone might produce. If the observed difference lies outside the range of chance variation, it is said to be statistically significant.

p-Value

Simply looking at a plot of the permutation distribution is not a very precise way to measure statistical significance, so the p-value is of more interest. This is the frequency with which the chance model produces a result at least as extreme as the observed result. We can estimate a p-value from our permutation test by taking the proportion of permutations that produce a difference equal to or greater than the observed difference.
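Here is a small sketch of that calculation. The permutation differences are ten made-up numbers so the arithmetic can be checked by hand, and `permutation_p_value` is a hypothetical helper. The two-sided option is my own addition for the case where a difference in either direction counts as extreme.

```python
import numpy as np

def permutation_p_value(perm_diffs, observed_diff, two_sided=False):
    """Proportion of permuted differences at least as extreme as the observed one."""
    perm_diffs = np.asarray(perm_diffs, dtype=float)
    if two_sided:
        extreme = np.abs(perm_diffs) >= abs(observed_diff)
    else:
        extreme = perm_diffs >= observed_diff
    return float(extreme.mean())

# Ten made-up permutation differences (a real test would use thousands)
perm_diffs = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5, -0.9, 0.2, -0.1, 0.6]

p_one = permutation_p_value(perm_diffs, 0.8)                  # 0.8 and 1.5 qualify
p_two = permutation_p_value(perm_diffs, 0.8, two_sided=True)  # also -1.2 and -0.9
print(p_one, p_two)  # 0.2 0.4
```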

Alpha

For me, alpha is a threshold that we set depending on the use case. It is common to set alpha to 5%, that is, 0.05. It means that we call a result significant only if chance alone would produce a result that extreme no more than 5% of the time. Put differently, if there is truly no difference, about 5 out of 100 experiments would still be flagged as significant by chance. If the p-value is above alpha, we cannot conclude that the proposed method differs significantly from the status quo.
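One way to get a feel for alpha is to simulate many “A/A” experiments, where both groups come from the same distribution, and count how often a test still reports significance. The sketch below assumes SciPy is available and uses its two-sample t-test; the group sizes and number of experiments are arbitrary choices of mine.

```python
import numpy as np
from scipy import stats

def false_positive_rate(alpha=0.05, n_experiments=2000, n_per_group=50, seed=None):
    """Share of no-difference experiments that are still flagged as significant."""
    rng = np.random.default_rng(seed)
    # Each row is one experiment; both groups share the same distribution
    a = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
    b = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
    p_values = stats.ttest_ind(a, b, axis=1).pvalue
    return float(np.mean(p_values < alpha))

rate = false_positive_rate(seed=0)
print(f"false positive rate: {rate:.3f}")  # should land close to alpha
```

Lowering alpha reduces these false alarms, at the cost of making real effects harder to detect.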

5. t-Tests

The t-test is a statistical test used to determine whether there is a significant difference between the means of two groups. It is used within the hypothesis testing framework, so you need to set a null hypothesis and an alternative hypothesis. If the p-value of the t-test is below alpha, we reject the null hypothesis in favor of the alternative; otherwise, we fail to reject the null hypothesis.
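As a quick sketch, here is a two-sample t-test with SciPy’s `ttest_ind` on simulated session times for two designs. The data is generated, not real, and I use `equal_var=False` (Welch’s t-test) as a cautious default because it does not assume both groups have the same variance.

```python
import numpy as np
from scipy import stats

# Simulated session times in minutes; design A is given a higher true mean
rng = np.random.default_rng(0)
design_a = rng.normal(loc=12.0, scale=3.0, size=200)
design_b = rng.normal(loc=10.0, scale=3.0, size=200)

alpha = 0.05
result = stats.ttest_ind(design_a, design_b, equal_var=False)
reject_null = bool(result.pvalue < alpha)

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4g}")
print("reject the null hypothesis" if reject_null else "fail to reject the null hypothesis")
```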

6. ANOVA

Analysis of variance (ANOVA) is a statistical procedure that tests whether there is any significant difference among the means of several groups. Imagine you have 3 treatments and 1 control group, e.g. an A/B/C/D test. You want to see whether the differences between those groups are larger than chance alone would explain.
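For an A/B/C/D setup like this, a one-way ANOVA could look roughly as follows with SciPy’s `f_oneway`. The four groups are simulated, with only variant D given a different true mean, so the numbers are purely illustrative.

```python
import numpy as np
from scipy import stats

# Simulated metric for one control group and three treatments
rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, size=100)
variant_b = rng.normal(10.0, 2.0, size=100)
variant_c = rng.normal(10.0, 2.0, size=100)
variant_d = rng.normal(12.0, 2.0, size=100)  # the only group with a shifted mean

f_stat, p_value = stats.f_oneway(control, variant_b, variant_c, variant_d)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

A small p-value only says that at least one group differs. It does not say which one, so a follow-up comparison between pairs of groups is still needed.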

Two-Way ANOVA

The A/B/C/D test just described is a “one-way” ANOVA, in which we have one factor (group) that is varying. We could have a second factor involved, say, “weekend versus weekday”, with data collected on each combination (group A weekend, group A weekday, group B weekend, group B weekday, and so on). This would be a “two-way ANOVA”, and we would handle it in similar fashion to the one-way ANOVA by identifying the “interaction effect”. After identifying the grand average effect and the treatment effect, we then separate the weekend and weekday observations for each group and find the difference between the averages for those subsets and the treatment average. (Practical Statistics for Data Scientists, Bruce et al.)

7. Chi-Square Test

The methods above are for numeric data such as session time. What about count (frequency) data? For that we use the chi-square test, which checks how well observed counts fit some expected distribution. The chi-square test has two main uses. The chi-square test for goodness of fit checks whether the sample counts match an expected (population) distribution. The chi-square test for independence checks whether two categorical variables are related, that is, whether the distribution of one variable differs across the categories of the other.
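Both uses can be sketched with SciPy. The counts below are invented for illustration: clicks on six headlines that we expect to perform equally (goodness of fit), and conversions for two page variants (independence). Note that `chi2_contingency` applies a continuity correction to 2x2 tables by default.

```python
import numpy as np
from scipy import stats

# Goodness of fit: do clicks on six headlines match an equal split?
clicks = np.array([18, 22, 16, 25, 19, 20])  # 120 clicks, 20 expected per headline
gof = stats.chisquare(clicks)                # expected counts default to uniform
print(f"goodness of fit: chi2 = {gof.statistic:.2f}, p = {gof.pvalue:.3f}")

# Independence: is conversion related to the page variant?
#                 converted  not converted
table = np.array([[200, 1800],    # variant A
                  [260, 1740]])   # variant B
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"independence: chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```

In the first test a large p-value means the counts are consistent with an equal split. In the second, a small p-value suggests the conversion rate depends on the variant.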

Conclusion

So, yep, that’s it. Those are the statistical tests you should know to conduct experiments. Data science is not only about building state-of-the-art models, but also about bringing actionable insights out of the data.
