In this article, I will show you how to carry out A/B Testing in a planned and controlled manner, using a real-world example.
(Quick note: You might want to check out Part 1 of this tutorial series which explains the basics of A/B Testing and the key terminologies used.)
Identify where there is the most opportunity:
In any business, there is always room for experimentation and improvement; in fact, there will be plenty of such opportunities. Does that imply all of them should be worked on? Remember that A/B Testing is time-consuming and has costs associated with it, so a good ROI (Return on Investment) is a must to convince senior management to invest in it. For instance, suppose you are trying to increase the sales of your e-commerce site, and to do so you buy ad space for, say, $1,000, which yields an increase of $250 in sales. Is that effort worth the spend and the experiment? The next question, then, is: ‘How do we decide or identify the right opportunity?’
Try starting from any one of these:
- The top-level goal: which is the prime objective of your organization such as the number of leads generated, revenue generated, etc.
- Dig into performance metrics: Investigate your operations and check how well the business has been performing recently across various functional areas and divisions. Start with the one that requires the most attention (the need of the hour).
- Do user research: Gather user inputs and feedback on how they like the services provided and what kind of changes would make their experience better.
- Do market research: Check how your competitors are performing, their new products and services.
Know your Business and Customers:
Before you begin an experiment, be very sure about how your business currently performs. For instance, if you are running an e-commerce site, you need an understanding of metrics such as the following (among others):
- Website Traffic
- Traffic source
- Revenue per Traffic Source
- Average Order value
- Customer Lifetime Value
- New Customers and Returning Customers
- Conversion Rate, etc.
Select your target metric:
While designing an experiment, there could be two or more metrics we are interested in:
- Metrics that are directly impacted by the experiment
- Metrics that are at the bottom of the funnel (our top-level goal)
Let us consider the example of making the category menu of our e-commerce site more prominent, which would directly impact the number of interactions made by users. With increased engagement, one would also expect higher revenue. It is always better to choose a metric that is directly impacted by the experiment: choosing our top-line goal might lead to incorrect inferences, or might require a longer duration to detect. In general, the various stages customers go through on the site are treated as a funnel, an example of which is shown below.
Define your Target Audience:
While performing experimentation, it is essential to be specific about the target audience. In our e-commerce example, we might want to define characteristics such as geographical region, type of user (new or existing), and age group for the sample of visitors, so that the experiment is more impactful.
State your hypothesis:
A scientific hypothesis is an idea or a proposal that can be tested scientifically: something more than a wild guess and less than an established truth. The hypothesis must be clear and concise for the experiment to be successful.
In our case, ‘Revamping the Category menu of our e-commerce site could lead to an increased conversion rate (Directly impacted metric)’ could be one way to start.
Just for the sake of better understanding our target metric selection, let us also test if ‘Revamping the Category Menu leads to increased revenue per customer (Our Top line goal)’.
Testing Conversion Rate Hypothesis with Chi-Square Test:
As shown below, we check for statistical significance using Evan’s Awesome A/B Tools Chi-Square Test on the number of clicks made by users. Why the Chi-Square Test? Recalling from the previous article: since each user either clicks or does not click (a boolean outcome), the Chi-Square Test is the appropriate choice.
We see that there were a total of 300 clicks made by users from the control (Sample 1) group and 350 clicks made by users from the treatment (Sample 2) group. With 95% confidence, the conversion rate of users from the treatment group is significantly higher (p = 0.017). OK, what is this p-value here? To find out, just be a little more patient and keep reading!
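The same check can be reproduced in Python with SciPy. The article does not state the group sizes behind the screenshot, so 1,000 users per variant is assumed here for illustration; with that assumption, the uncorrected chi-square test happens to reproduce a p-value of about 0.017.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: 1,000 users per variant (assumed),
# with 300 clicks in control and 350 clicks in treatment, as in the article.
#            [clicked, did not click]
control   = [300, 700]
treatment = [350, 650]

# correction=False gives the plain (uncorrected) chi-square statistic
chi2, p, dof, expected = chi2_contingency([control, treatment], correction=False)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")

# At a 5% significance level, p < 0.05 means the difference in
# conversion rates between the two groups is statistically significant.
if p < 0.05:
    print("Significant difference between control and treatment")
```

Tools like Evan's calculator do the same computation behind the scenes; running it yourself is useful when your data lives in a database rather than a web form.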
Testing Revenue per user Hypothesis with 2-Sample T-Test:
Estimating the Sample Size:
Suppose you get 1000 visitors for your e-commerce site every day. Out of those 1000 visitors, 200 interact with the category menu and the conversion rate for the same is 40%. The conversion rate of the remaining 800 users who don’t make use of the category menu is 20%.
The Overall Conversion Rate of the site is calculated as 24%. Since our Hypothesis is on the number of visitors who make use of the category menu, we can expect a lift on the conversion rate of those 200 visitors only. Let’s say we are anticipating a conversion rate of 80% as a result of the UI change. Our new conversion rate summary looks like:
We see that the overall conversion rate has increased from 24% to 32%. The difference can be expressed as an absolute or a relative percentage difference.
Absolute % difference = 32% - 24% = 8%
Relative % difference = [(32 - 24) / 24] × 100 = 33.33%
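The arithmetic above can be checked in a few lines:

```python
# Baseline: 1,000 daily visitors; 200 use the category menu (40% conversion),
# the other 800 do not (20% conversion).
menu_users, other_users = 200, 800
baseline = (menu_users * 0.40 + other_users * 0.20) / (menu_users + other_users)

# Anticipated: menu users convert at 80% after the UI change.
expected = (menu_users * 0.80 + other_users * 0.20) / (menu_users + other_users)

absolute_lift = (expected - baseline) * 100             # percentage points
relative_lift = (expected - baseline) / baseline * 100  # percent

print(round(baseline * 100), round(expected * 100))   # 24 32
print(round(absolute_lift), round(relative_lift, 2))  # 8 33.33
```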
Why do we need to calculate these differences? To determine the sample size required for our experiment. Here is how you use the percentage difference to calculate the sample size with Evan’s Sample Size Calculator.
As we can see, we get the same number of samples when trying both the absolute and the relative % lift. By the way, don’t be bothered by the column chart or any additional details you see on the page. However, we do need to understand what the significance level means.
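If you prefer to compute the sample size yourself, the standard two-proportion formula can be sketched with only the Python standard library. Note that different calculators use slightly different approximations, so the result here (494 per variant) is close to, but not exactly, the figure the tool reports; the 80% power setting below is an assumption, as the article does not state it.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per variant to detect a change from
    baseline rate p1 to target rate p2 with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 5%
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Baseline 24%, anticipated 32% (an 8-point absolute lift)
n = sample_size_per_group(0.24, 0.32)
print(n)  # 494 with this formula
```

The key intuition survives any formula variant: the smaller the lift you want to detect, the larger the sample you need, since the required size grows with 1/(p1 - p2)².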
Mathematically, confidence level and significance level add up to 1.
confidence level = 1-significance level
(or) significance level = 1-confidence level
So the significance level of 5% displayed on the Sample Size Calculator (as shown above) implies that the confidence level is 95%. Anyone with even a brief introduction to hypothesis testing will have come across this phrase:
If the P value is less than your significance (alpha) level, the hypothesis test is statistically significant.
Back to the cricket example, but in this scenario there are 4 friends (say A, B, C, and D) playing together who want to decide who should keep wickets. Each of their names is written on a piece of paper and put inside a bowl, and one piece of paper is drawn at random to decide who keeps wickets.
Let’s say in the first three matches, A never got to keep wickets as the paper with his name was never drawn from the bowl. The question is,
Did that happen by chance? Or is something not right?
Our Null Hypothesis, or Status Quo, is ‘The draw of names made from the bowl is a fair event’.
Let us test it out by using probability.
P(A not getting a chance to keep wickets in one match) = 3/4
P(A not getting a chance to keep wickets in 3 matches)
= (3/4) × (3/4) × (3/4) = 27/64 ≈ 0.422
From the above probabilistic calculation, we see that there is a 42% chance of A not getting to keep wickets at all in the first 3 matches which seems fair enough.
What if A never gets to keep wickets in all 10 matches played? Can it also happen by chance?
P(A not getting chance to keep wickets in 10 matches)
= (3/4)¹⁰ ≈ 0.056, which is 5.6%.
Assuming a significance level of 5%, we still conclude that A not getting to keep wickets in all 10 matches happened by chance and is fair enough, since the p-value (5.6%) > significance level (5%).
If A never gets to keep wickets in all 15 matches played, then
P(A not getting a chance to keep wickets in 15 matches)
= (3/4)¹⁵ ≈ 0.0134, which is about 1.3%.
In this case, A not getting to keep wickets does not appear to be happening by chance, since the p-value (1.3%) < significance level (5%). Hence we reject the Null Hypothesis.
The significance level, in general, is a threshold set by the statistician for drawing conclusions from hypothesis testing. A threshold of 5% is conventionally used in most cases, and in some cases a significance level of 1% is used instead.
The p-value, on the other hand, is the probability of obtaining test results at least as extreme as the results actually observed, assuming the null hypothesis is true. In our example, the probabilities we calculated of A not getting to keep wickets are p-values.
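The wicket-keeping p-values above take one line each to verify; we can also ask at what number of matches the null hypothesis would first be rejected at the 5% level:

```python
def p_value(n_matches: int, p_not_drawn: float = 3 / 4) -> float:
    """Probability of A's name never being drawn in n independent draws,
    assuming a fair draw among 4 players."""
    return p_not_drawn ** n_matches

print(round(p_value(3), 3))   # 0.422 -> plausibly chance
print(round(p_value(10), 3))  # 0.056 -> still above 5%
print(round(p_value(15), 3))  # 0.013 -> below 5%, reject the null

# Smallest number of matches at which we would reject the null (alpha = 5%)
n = next(k for k in range(1, 100) if p_value(k) < 0.05)
print(n)  # 11
```

So eleven straight misses is already enough evidence, at the 5% level, to suspect the draw is not fair.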
How long should we run the experiments?
For our example, we calculated the sample size required for each variant (control and treatment) to be 460. Since we know the site gets 1,000 visitors every day, does that mean running the test for a single day would suffice?
Often, the number of visits to the site varies across weekdays, and one has to be aware of this trend. Hence the experiment should run for at least a week to reduce the variability caused by day-of-week effects. It is also crucial to ensure that all other factors surrounding the control and treatment groups remain constant: there shouldn’t be any promotions or big sale events happening during the experiment, as they would introduce confounding variability.
Understanding the right usage of A/B tests:
Once the experimentation process is complete, checking for statistical significance is made simple with Evan’s Awesome A/B Tools Two-Sample T-Test, since we are experimenting on revenue generated, which is a continuous value. You can paste either your raw sample data or a summary of it by selecting the appropriate option in the tool. In our example, we have the revenue generated by each user in both the control and treatment groups; all we have to do is check whether their average values differ significantly.
As shown above, the mean, standard deviation, and count (number of samples) for Sample 1 (control) and Sample 2 (treatment) are given as input. The hypothesis selection d = 0 implies that our null hypothesis, or status quo, is ‘there is no significant difference; the averages of both samples are equal’. Once we have all the inputs in place, we get the verdict (which is ‘No significant difference’ in our example).
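The same summary-statistics t-test can be reproduced with SciPy's `ttest_ind_from_stats`. The revenue figures below are made up for illustration, since the article's actual inputs come from a screenshot:

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical summary statistics (mean revenue per user, standard
# deviation, sample count) mirroring the tool's input fields.
t_stat, p = ttest_ind_from_stats(
    mean1=50.0, std1=12.0, nobs1=460,  # Sample 1: control (illustrative)
    mean2=51.0, std2=13.0, nobs2=460,  # Sample 2: treatment (illustrative)
    equal_var=False,                   # Welch's t-test: unequal variances
)
print(f"t = {t_stat:.3f}, p = {p:.3f}")

# Null hypothesis (d = 0): the two means are equal.
if p >= 0.05:
    print("No significant difference")
```

With these made-up numbers the small difference in means is well within sampling noise, so the test fails to reject the null, mirroring the article's verdict.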
As discussed earlier, it is always best to select the metric that is directly impacted by the change as the target metric instead of going for the top-line goal. If you do want to experiment on the top-line goal, you may need a much larger sample size to detect the anticipated effect.
Remember, Statistics is just a tool for gathering evidence and not the ultimate truth.
“There are three types of lies — lies, damn lies, and statistics.”
― Benjamin Disraeli
Even after you have conclusive evidence that your new feature or variant performs significantly better, you still need to devise a proper execution plan to go further. Review meetings with cross-functional teams and having decision plans in place also help eliminate bias during the decision-making process.
I know this might be a lot of information to take in, but we haven’t discussed in detail what a Chi-Squared Test or a 2-Sample T-Test is, for a good reason: the main purpose of this article is to help professionals who are not well versed in statistics leverage A/B testing to make and validate better business decisions.