A/B Test Call-out Framework
Calling out an A/B test result is not straightforward. We still don't have a single model or formula that makes the call-out effortless without hurting decision quality. The wait for the AI revolution continues!
A standard t-test on key metrics alone is not enough to make decisions: your use case may not meet all the assumptions of the test, and the test is easy to misuse. There is a reason statisticians have started debating whether it is time to move past it.
In your day-to-day work, you must have come face to face with dilemmas like these:
- Your initiative gives you a lift of 1%, and the dollar value of that lift runs into millions. The only problem is that the north-star metric is not statistically significant (74% confidence).
- You tested a product feature. The benefit is 2% at 97% statistical significance. You push it to production only to find the benefit is not there. Do you write this off as a limitation of probabilistic decision making? In fact, a lot may be going on here that you missed. Experience fatigue could be a factor, but it could also be the result of a not-so-robust analytical framework for the test read-out. This piece covers only the analytical framework.
A many-model approach is a handy tool for navigating these decision quagmires. Using many models instead of a single model improves decision quality; one model makes up for the gaps in another.
Here is how we can think about it in our context: four lenses, unpacked one by one below.
1: Statistical Significance
We need a user-friendly platform that reports statistical significance on all relevant metrics. Two points are worth keeping in mind here.
1. We should have stat sig handy not only for our north-star metric but for all the metrics that lead to it. For example, suppose we need to make a call on launching a product feature for Amazon, and the metric we want to move is sales volume. We must break sales volume down into its components (for example, traffic, conversion rate, and average selling price) and run significance tests on those alongside sales volume itself.
Breaking a metric down into its components improves the quality of decision making and gives us more confidence in the call. Here is why it helps: 1) we get to understand the drivers of movement in the north-star metric; 2) some metrics meet the assumptions of the t-test better than others; 3) some metrics reach significance faster than others.
2. Sequential testing is a little-known statistical method that deserves more attention. It is not pragmatic to expect that no one will look at the experiment results before the stipulated duration and sample size are reached. Peeking hurts decision quality, and sequential testing is a way to avoid the peeking problem.
We are working towards integrating this into our decision-making framework for experiment call-outs. We'll have a separate post on it later.
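As a minimal sketch of the component-level read-out from point 1, here is what running significance tests on a north-star metric and its components might look like. All metric names and numbers below are made up for illustration, and `scipy` is assumed to be available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated per-user metrics for each arm (hypothetical numbers).
# "sales_volume" is the north star; "orders" and "asp" are its components.
n = 5000
control = {
    "orders": rng.poisson(2.0, n).astype(float),
    "asp": rng.normal(25.0, 5.0, n),
}
test = {
    "orders": rng.poisson(2.06, n).astype(float),  # ~3% lift in orders
    "asp": rng.normal(25.0, 5.0, n),               # no change in selling price
}
control["sales_volume"] = control["orders"] * control["asp"]
test["sales_volume"] = test["orders"] * test["asp"]

def read_out(control, test):
    """Welch t-test and relative lift for every metric, not just the north star."""
    rows = {}
    for metric in control:
        t_stat, p_value = stats.ttest_ind(test[metric], control[metric],
                                          equal_var=False)
        lift = test[metric].mean() / control[metric].mean() - 1
        rows[metric] = {"lift": round(float(lift), 4),
                        "p_value": round(float(p_value), 4)}
    return rows

for metric, row in read_out(control, test).items():
    print(metric, row)
```

Reading the components side by side with the north star shows where the movement actually comes from, and which metric is closest to (or furthest from) significance.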
2: Growth Model Consistency (GMC)
Philosophers may call it epistemic coherentism. For us, it means that when we see a test result, we must see it in the context of our overall understanding of the business. Reforge's analogy is spot on: "Your business is an ecosystem. Your metrics are the key species." The health of an ecosystem depends on all the species and the complex relationships between them. Likewise, the health of a company is the result of a delicate balance among its different elements, and we must strive to understand the relationships between all the elements of our business.
If the result doesn't fit our understanding of user behavior and the ecosystem, it is most probably noise. Conversely, we can go ahead with a launch on a statistically insignificant movement in the north-star metric if the result fits our understanding of the ecosystem.
Here are a few concrete steps that help put GMC into action:
- Spend time understanding your ecosystem and its components, and work on a growth model for your product. It is a lot of work, but it is worth the time: it makes the team smarter at forming hypotheses and makes decision making less resource- and time-intensive. Not having a growth model is no excuse to ignore GMC in experiment call-outs. Moreover, once we start analyzing experiments with GMC in mind, the insights we develop help us build and refine the growth model.
- While investigating results, think in terms of user flows. If users must take two actions before the final action, we should see a proportionate movement in those actions for both test and control.
- Segment. Look at the numbers and ask why. Say you know that the product feature you launched helps repeat users more than new users. Do you see that in the results? If not, either you need to update your understanding of new-versus-repeat user behavior, or the result you see for this test is noise.
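To make the user-flow check concrete, here is a small standard-library-only sketch that compares each funnel step between arms with a two-proportion z-test. The funnel steps and counts are hypothetical; a real effect should move the intermediate actions roughly in proportion, not just the final one:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (success_b / n_b - success_a / n_a) / se
    # Normal-approximation p-value via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical per-arm funnel counts: users who reached each step.
funnel_control = {"viewed": 10000, "added_to_cart": 1500, "purchased": 600}
funnel_test = {"viewed": 10000, "added_to_cart": 1680, "purchased": 672}

# Test each step's conversion from the top of the funnel in both arms.
for step in ("added_to_cart", "purchased"):
    z, p = two_proportion_z(
        funnel_control[step], funnel_control["viewed"],
        funnel_test[step], funnel_test["viewed"],
    )
    rate_c = funnel_control[step] / funnel_control["viewed"]
    rate_t = funnel_test[step] / funnel_test["viewed"]
    print(f"{step}: control {rate_c:.3f}, test {rate_t:.3f}, p = {p:.4f}")
```

The same function applies unchanged to a segment cut: run it per segment (new vs. repeat users, for instance) and check whether the lift lands where your model of user behavior says it should.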
3: Qual Data
We have all heard a million times that quantitative data answers the 'what,' 'when,' 'where,' and 'who' questions, while qualitative data answers the 'how' and 'why.'
'How' and 'why' are equally important questions. We can't do an excellent job of understanding the ecosystem or developing a growth model if we don't understand the motivations and needs of our user base. Without a good understanding of the ecosystem, we will do a pretty ordinary job on GMC; and without a handle on GMC, we may never navigate the dilemmas we set out to resolve.
4: Change in Business priority/product strategy
Shit happens. For a growing business, it happens more often than we may like to admit. You may learn that you are not allowed to launch the experience for all users. At the last moment, you are told to change the messaging per legal feedback. A change in business priority or a new business constraint is nothing but a change in the test design and the success criterion of the experiment.
A good understanding of the business ecosystem gives us the ability to navigate these constraints.
The experiment call-out strategy for a business in flux is to keep a good handle on all the components of the success metric and on how each was affected by the treatment variables.
Sometimes you may want to go back and experiment again. In other cases, you could launch the experience for specific segments and, for the remaining segments, keep only the treatment variables that legal is okay with. But have ideas ready to influence those users down the funnel with alternative messaging too.
Resolving the dilemma
To get past the dilemmas presented at the beginning, we must resort to many-model thinking and look at experiment results from different vantage points.
For dilemma 1, if we see a positive lift across the ecosystem even when the north-star metric is not significant, we can go ahead with the launch.
For dilemma 2, once we look at the lift in the context of the ecosystem, we will realize that we cannot attribute the whole 2% lift to our experience. Some of the lift may have come from users who were not directly exposed to the test, or from users who don't get direct value out of it.
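A back-of-the-envelope sketch of that attribution, with assumed numbers throughout:

```python
# Hypothetical decomposition of the observed 2% lift.
observed_lift = 0.02

# Suppose segment analysis suggests only 60% of the lift is traceable to
# users who were directly exposed to (and got value from) the feature; the
# rest came from spillover that may not survive a full launch.
direct_share = 0.6

expected_post_launch_lift = observed_lift * direct_share
print(f"expected post-launch lift: {expected_post_launch_lift:.1%}")
```

The arithmetic is trivial; the discipline is in estimating `direct_share` from segment and user-flow cuts before setting expectations for production.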
Take-aways for high-quality decision making
As an organization, to make high-quality decision making the norm rather than a one-off event, we must productionalize the framework: make all the relevant information available at the touch of a keyboard, and present it in a way that even an inexperienced user can make sense of.
- Stat sig: Automate stat-sig data not only for the north-star metric but also for all of its leading indicators.
- GMC: This data is contextual, but it is not hard to come up with the necessary data points, funnels, segments, and user flows to look at for each section of the product. We should work towards automating them and having those views ready alongside the stat-sig data on success metrics.
- Work towards a growth model for your product or your organization. Champion the cause that not every test should try to move the north-star metric. We can break the north-star metric into components and test ideas that move the needle on those components, working not only on north-star optimization but also on funnel optimization and loop optimization.
- Be more thoughtful when defining the hypothesis and success criterion. If you have a good handle on your growth loops and define the success criterion well, you may not need much post-launch work for GMC. An hour of work on the hypothesis, test design, and success criterion is worth ten hours of work after the test launches.
- Make relevant qual data available side by side with the test results. It makes folks more thoughtful before they make decisions.