Hypothesis Testing

Do Nguyen Ngoc Minh
May 26, 2024


I’m currently playing the role of a data analyst for a product that monitors and enhances restaurant standards. Honestly, I’m not satisfied with the current state of our product development, where features are continuously developed and pushed to production without decisions being grounded in a broad user base during user testing (UT). Decisions to change something in the UI are often made on gut feeling: someone feels it should be done, so they do it, or a few users feel it should be done, so it gets developed.

I’m planning to write a short article on Hypothesis Testing, or simply A/B testing, and present it to the product owner so they can consider adding A/B testing to our product development process. But before that, I think it’s important to solidify my own knowledge before proposing it to everyone, even though I’ll only cover the tip of the iceberg. And so, I’m writing this article. I’ve actually had this intention for a while, but I hope to deliver it before I find a better opportunity elsewhere (here I go again, hhh).

1. What is hypothesis testing?

Let’s drop a definition here and not talk too much about it, as I hope everyone has already got that:

Hypothesis Testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. (*)

2. Common hypothesis tests

In this article, I will share how to choose the right type of test to meet the needs of our product/business and highlight some pitfalls related to hypothesis testing. I won’t delve deeply into the underlying mechanisms but will focus on explaining the usability of these types of tests.
As usual, let’s set H0 as the Null hypothesis and H1 as the alternative hypothesis — the hypothesis that you want to prove.

2.1 Parametric tests

What is a parametric test?
It is a test used under the assumption that your data is normally distributed (*).
The main parametric tests include the z-test, the t-test (one-sample, paired and unpaired) and ANOVA.

Let’s talk about the z-test and t-test in detail.

Test directionality:

  • One-tailed test: Choose this when you have a specific direction of interest. E.g. H1: scheme A attracts more users to the platform than scheme B.
  • Two-tailed test: Choose this when you are interested in any significant difference, regardless of direction. E.g. H1: the new scheme has an impact on click-through rate compared with the current scheme.

Basically, we make changes and of course expect them to have a positive impact, or at least an impact in the direction we want, not merely to know whether there is any impact at all. So in practice the one-tailed test is used more frequently in customer, marketing and operations analytics, to highlight some.
Depending on the test design and objective, we can determine which kind of test to use.
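
To make the directionality concrete, here is a minimal sketch in Python with made-up numbers for the scheme A/B example above; scipy’s `alternative` argument (available in reasonably recent scipy versions) is what switches between a two-tailed and a one-tailed unpaired t-test.

```python
# Minimal sketch with made-up data: scheme A vs scheme B, unpaired t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scheme_a = rng.normal(loc=105, scale=15, size=200)  # e.g. daily new users under scheme A
scheme_b = rng.normal(loc=100, scale=15, size=200)  # e.g. daily new users under scheme B

# Two-tailed: H1 is "the means differ", in either direction.
_, p_two_sided = stats.ttest_ind(scheme_a, scheme_b, alternative="two-sided")

# One-tailed: H1 is "scheme A's mean is greater than scheme B's".
_, p_greater = stats.ttest_ind(scheme_a, scheme_b, alternative="greater")

print(f"two-tailed p = {p_two_sided:.4f}, one-tailed p = {p_greater:.4f}")
```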

So when you say you will do hypothesis testing, it’s not just about splitting users into groups (which is A/B testing); it could also be about testing the same group before and after a change, or two related groups under a change. However, in business, splitting tests, i.e. unpaired tests, are widely used for the following reasons:

Ease of implementation:

  • A/B testing is technically straightforward, and the test design is easy to conduct because you don’t need to track users one by one (which you have to do in paired testing).
  • Groups are selected randomly, so both samples can be observed simultaneously, which reduces the time needed to gather results (unlike paired testing, where you have to wait for the before and after periods of the experiment).
  • Because observations and judgements happen simultaneously, the business can make quick decisions, enabling a lean and fast approach to optimization.

Reduced risk of bias: As groups are selected randomly, we can assume that the two groups share the same characteristics and are independent of each other, since there is no overlap between the groups and each subject receives only one treatment.

Avoid carryover effects: In paired testing, where the same user experiences both versions (e.g., before and after scenarios), there is a risk of carryover effects. Users’ experience with the first version might influence their reaction to the second version. A/B testing eliminates this issue by ensuring users are exposed to only one version.

Let me take an example from last-mile delivery: suppose you launch a new rewarding scheme for drivers in order to, ultimately, increase the number of trips per driver.
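
A minimal sketch of that unpaired setup, with entirely made-up numbers: randomly assign drivers to the old and new schemes, then compare trips per driver with an independent two-sample (Welch) t-test.

```python
# Hypothetical sketch: random 50/50 assignment of drivers, then an unpaired test
# on weekly trips per driver (all numbers are simulated).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_drivers = 1000
in_treatment = rng.random(n_drivers) < 0.5  # random assignment to the new scheme

trips_control = rng.poisson(lam=38, size=int((~in_treatment).sum()))  # old scheme
trips_treatment = rng.poisson(lam=41, size=int(in_treatment.sum()))   # new scheme

# Welch's t-test (equal_var=False) is a safer default when group variances may differ.
# One-tailed, because we expect the new scheme to increase trips per driver.
result = stats.ttest_ind(trips_treatment, trips_control,
                         equal_var=False, alternative="greater")
print(f"t = {result.statistic:.2f}, one-tailed p = {result.pvalue:.4f}")
```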

Hey, but that doesn’t mean the paired test should never be used. When you want to observe the effect on exactly the same set of users, a before-after comparison is the suitable choice: the characteristics of the sample stay unchanged and only the treatment changes. With an A/B test, we tell each other that the groups were selected randomly, but we’re not sure whether those groups really share the same characteristics. Even after a chi-square test, can you be sure you have listed all the traits of those groups to test for independence? And even if you have listed them all and successfully proved it, are you sure the groups were not exposed to other schemes in other tests? In a business where new products and features are constantly released, many tests run at the same time. Can you monitor them all?

Hypothesis testing is a vulnerable task :)

So what can we even do? :)) Make the assumptions that bear the least risk. If you want to focus on a single user/object and you can monitor them one by one, choose a paired test; otherwise, go with an unpaired one. There is one more approach I also want to introduce to you, Causal ML, but I’m not going to cover it here, so I’ll just drop a link in the references.
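
As for the paired option, here is a minimal sketch of a before/after comparison on the same set of drivers, again with made-up numbers: the same drivers are measured twice, so the test works on the per-driver differences.

```python
# Hypothetical paired (before/after) sketch: same drivers, two measurements each.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_drivers = 150
trips_before = rng.poisson(lam=38, size=n_drivers).astype(float)
trips_after = trips_before + rng.normal(loc=2.0, scale=5.0, size=n_drivers)  # simulated lift

# Paired t-test on the per-driver differences; one-tailed for an expected increase.
result = stats.ttest_rel(trips_after, trips_before, alternative="greater")
print(f"t = {result.statistic:.2f}, one-tailed p = {result.pvalue:.4f}")
```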

2.2 Non-parametric tests

We usually jump straight to measuring the test effect without considering the data distribution, the data characteristics and the purpose of the test itself. In real-world problems, not all data is continuous or normally distributed. In some cases you have to run the test on ordinal, discrete, or categorical data, which does not satisfy the conditions for the t-test, z-test, etc. mentioned above. So how do we deal with those cases? Here come the non-parametric tests.

What are non-parametric tests?

Non-parametric tests are methods of statistical analysis that do not require the data to follow a particular distribution (especially useful when the data is not normally distributed). For this reason, they are sometimes referred to as distribution-free tests.

They are the traditional alternative when you have little or no idea about the data or population distribution. Let’s take an example from customer analytics in the restaurant operations domain: you might be working with customer satisfaction, collected directly from review surveys (after customers have used your services). What did you ask customers about their experience? Yes, the rating, usually on a Likert scale, a typical ordinal data type that is all too easily treated as continuous (I’ve made that mistake myself :v). Anyway, let’s talk about the common non-parametric tests and their purposes.
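
The most common ones are the Mann-Whitney U test (two independent groups), the Wilcoxon signed-rank test (paired samples), the Kruskal-Wallis test (three or more groups) and the chi-square test (categorical data). As a minimal sketch, here is how you might compare made-up Likert satisfaction ratings between an old and a new service flow with the Mann-Whitney U test, which doesn’t assume the ratings are normally distributed.

```python
# Sketch with simulated 1-5 Likert ratings for two independent groups of customers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ratings_old = rng.choice([1, 2, 3, 4, 5], size=300, p=[0.05, 0.10, 0.30, 0.35, 0.20])
ratings_new = rng.choice([1, 2, 3, 4, 5], size=300, p=[0.03, 0.07, 0.25, 0.35, 0.30])

# H1: ratings under the new flow tend to be higher (one-tailed).
result = stats.mannwhitneyu(ratings_new, ratings_old, alternative="greater")
print(f"U = {result.statistic:.0f}, one-tailed p = {result.pvalue:.4f}")
```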

3. Not-frequently asked questions — with messages:)

How do we calculate the sample size needed for a test?

Let me introduce the parameters needed to calculate the sample size:

  1. Significance Level (α): The probability of rejecting the null hypothesis when it is true. Commonly set at 0.05.
  2. Power (1-β): The probability of correctly rejecting the null hypothesis when it is false. Commonly set at 0.80 or 0.90.
  3. Effect Size (Δ): The minimum difference between the two groups that you want to detect. This is often expressed as a proportion or percentage.

The sample size is then calculated per specific test. Since discussing this requires mathematical background that is beyond the scope of this blog, I will leave a link here FYI: https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_power/bs704_power_print.html. I usually plug the parameters into an online tool for a quick calculation, for example https://www.statskingdom.com/sample_size_t_z.html or https://www.evanmiller.org/ab-testing/sample-size.html.
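
If you prefer code over an online calculator, here is a minimal sketch using statsmodels (the 10% to 12% conversion-rate lift is just an assumed example): it converts the two proportions into an effect size and solves for the per-group sample size of a two-sided z-test at α = 0.05 and power = 0.80.

```python
# Sketch: sample size per group for detecting a lift from 10% to 12% conversion.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.12, 0.10)  # Cohen's h for the two proportions

n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,              # significance level
    power=0.80,              # 1 - beta
    alternative="two-sided",
)
print(f"~{n_per_group:.0f} users needed per group")
```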

Should we switch to another test once our sample size is over 30?

Not really. Let me rephrase the question: when do we need to change from a t-test to a z-test? It’s when our data complies with a normal distribution. According to the central limit theorem, as the number of records grows, the sampling distribution of the mean approaches a bell-shaped curve, which is what lets us lean on the z-distribution. When your sample size is just over 30 but the data is not normally distributed (very high chance, believe me), you should consider increasing the sample size to reach a more trustworthy conclusion. Don’t always bet on the number 30. Try to acquire more observations if possible.
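
If you want to check that assumption rather than trust the magic number 30, here is a quick sketch (with deliberately skewed, simulated data) using the Shapiro-Wilk test, whose null hypothesis is that the sample comes from a normal distribution:

```python
# Sketch: testing normality before relying on t/z machinery.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.exponential(scale=10, size=40)  # made-up, clearly right-skewed data

stat, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk p = {p_value:.4f}")
# A small p-value suggests the sample deviates from normality,
# so n > 30 alone isn't a license to assume a bell curve.
```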

Why don’t we “accept the null hypothesis”?

Let’s get to why we can only “reject” or “fail to reject” the null hypothesis. Simply put, we cannot examine every possible case to prove the null hypothesis true; we have only the current observation. A single observation that contradicts the null hypothesis is enough to reject it, but a single observation consistent with it cannot prove it true, because proving it true would require every possible case to agree with it. You get me.

Isn’t that a bit loose anyway though?

No, because we’re actually targeting the H0 hypothesis. In other words, let’s call hypothesis testing the “Null hypothesis testing” since all we can do is try to make some conclusions on it. Then, the alternative hypothesis will be automatically concluded based on the results.

The counter metric is also usually forgotten. What is a counter metric?

Let’s say you’re running a cash-back scheme to bring users back to your platform. H1 is “the conversion rate in the experiment group is higher than the conversion rate in the control group”. You successfully reject the null hypothesis and tell your team to launch the scheme at scale. Then your department runs out of budget :) What needs to be considered is not only the main metric but also the counter metric, which in this case is the cost per user. Don’t just focus on whether to adopt the alternative hypothesis; focus on how feasible the whole test is.

Don’t waste your dataaaa. Do you think it’s over once you get the outcome and report it to the decision maker? Well, you may be right. But what about the rest of the data? You have collected a lot about which user profiles converted and which did not. If you fail to reject H0, you should probably think of other tactics to achieve the business objective. That’s when the data you collected becomes soooo valuable, even for further plans. So try to explore the post-test data as well.

Hypothesis testing is both powerful and vulnerable, so I believe we should understand it the right way in order to use it the right way. I’m glad I have finally jotted down this blog to share some different aspects of it from a high-level view. Part 2 will talk about multiple testing, so please stay tuned.

APPENDIX

Companies that run experiments at scale have a decent idea of the rate of success of their experiments, for instance (based on A/B Testing Intuition Busters):

  • Microsoft: 33%
  • Bing: 15%
  • Booking.com: 10%
  • Google Ads: 10%
  • Netflix: 10%
  • Airbnb Search: 8%

Reference

Using Causal ML Instead of A/B Testing | by Samuele Mazzanti | Towards Data Science

Unlocking the Mystery: Dive Deep Into the World of Hypothesis Testing in Statistics! | by Mirko Peters | Mirko Peters — Data & Analytics Blog

Conducting Actuarial Studies — Part 5: Statistical Inference: Estimation and Hypothesis Testing | by Roi Polanitzer | Medium

Mastering Hypothesis Testing: A Comprehensive Guide for Researchers, Data Analysts and Data Scientists | by Nilimesh Halder, PhD | Analyst’s corner | Medium

Numeracy, Maths and Statistics — Academic Skills Kit

Getting faster to decisions in A/B tests — part 2: misinterpretations and practical challenges of classical hypothesis testing — @aurimas

What is Hypothesis Testing in Statistics? Types and Examples | Simplilearn

Paired vs. Unpaired T-Test: Comparison Chart

Parametric and Non-Parametric Tests: The Complete Guide

Power and Sample Size Determination

T-test sample size calculator, and z-test sample size calculator

Sample Size Calculator (Evan’s Awesome A/B Tools)

