Running a Great A/B Test: 6 Ways NOT to Do It

Madhawan Misra
AirAsia MOVE Tech Blog
6 min read · May 19, 2021

At AirAsia, our number one priority is our customers, which means ensuring that our products keep the simplicity, usability and intuitiveness that users love. As a growing organization, we pursue a combination of business and product goals, all tied to our long-term vision of serving our users with the best possible features. However, the thinking and hypotheses behind these changes might not always align with our user base, so how do we ensure constant alignment?

Image credit: Convertize

We at AirAsia embrace experimentation and believe in the classic saying, “Seeing is believing”. As the old-timers would put it: I won't believe your hypothesis until you prove it. Yes, we adhere to the same philosophy.

In our previous sprint, we ran an A/B test to challenge one of our hypotheses. Without giving away too many of our secrets: we wanted to find out whether our proposed design changes to the app home page would bring the desired effect, without hurting any counter metrics. At this point, if the term counter metrics puzzles you, let me explain the jargon.

A counter metric is something you measure to ensure that you haven't over-optimized your north star metric to the detriment of your customers and your business. Take YouTube as an example: its north star metric should ideally be average watch time, since watch time directly relates to engagement, which in turn drives growth and revenue. Now imagine watch time is indeed very high, but only across a specific genre of videos. Even worse, imagine watch time climbing past the point where users start ignoring their calendar invites and email, or where kids' moms start asking them to uninstall YouTube. Another example: investing in a flashy banner might increase CTR, but if the landing page is not optimized properly, conversions, and hence revenue, might fall.
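
The idea is easier to reason about with a concrete sketch. Below is a minimal example in Python of how one might pair a north star metric with a set of counter metrics when deciding whether to ship a variant. The metric names, thresholds and numbers are entirely hypothetical and only illustrate the idea; they are not AirAsia's (or YouTube's) actual setup.

```python
# Hypothetical sketch: pair a north star metric with counter metrics
# before declaring a variant the winner. Metric names, thresholds and
# numbers are illustrative only.

NORTH_STAR = "avg_watch_time_min"

# Counter metrics and the maximum relative drop we are willing to tolerate.
COUNTER_METRICS = {
    "genre_diversity_index": 0.02,  # watch time shouldn't come from one genre only
    "next_day_retention": 0.01,     # users shouldn't binge, burn out and churn
    "landing_page_cvr": 0.02,       # a flashy banner shouldn't hurt conversion
}


def relative_change(control: float, variant: float) -> float:
    """Relative change of the variant versus the control."""
    return (variant - control) / control


def should_ship(control: dict, variant: dict) -> bool:
    """Ship only if the north star improves AND no counter metric
    regresses beyond its allowed threshold."""
    if relative_change(control[NORTH_STAR], variant[NORTH_STAR]) <= 0:
        return False
    for metric, max_drop in COUNTER_METRICS.items():
        if relative_change(control[metric], variant[metric]) < -max_drop:
            print(f"Counter metric regressed: {metric}")
            return False
    return True


if __name__ == "__main__":
    control = {"avg_watch_time_min": 41.0, "genre_diversity_index": 0.62,
               "next_day_retention": 0.38, "landing_page_cvr": 0.051}
    variant = {"avg_watch_time_min": 44.5, "genre_diversity_index": 0.55,
               "next_day_retention": 0.37, "landing_page_cvr": 0.050}
    print("Ship variant?", should_ship(control, variant))
```

The point of the sketch is the shape of the decision, not the numbers: the variant wins on the north star but is rejected because a counter metric regressed beyond its threshold.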

A simple A/B test where we examined the impact on conversion, click-through rate and purchases. Deeper analysis included testing for certain micro-metrics.

And that brings us to the title of this write-up: how NOT to run an experiment.

  1. Running an A/B test without a hypothesis — Far too often I have seen teams run an A/B test just to measure how well the control and the variant perform, without a base hypothesis. Your A/B test might be accurate, but unless you have a hypothesis to prove right or wrong, running it is fruitless. The important findings from an A/B test aren't limited to which version did better, but rather: why did the variant/control do better? What next? Are there hidden interpretations? The list of questions can be endless, but only once you have a hypothesis.
  2. Measuring the wrong metrics while running an A/B test — This is a classic: you have a hypothesis, it is proved right or wrong, and you roll out, only to find that all this time you were measuring a vanity metric. For example, focusing on revenue before engagement, or counting sessions without looking at conversion or click-through rates. This results in a flawed setup, with you validating or invalidating an already flawed hypothesis.
  3. Overemphasizing the primary/macro metrics — Far too often an experiment is assumed to yield an uplift to the bottom line, i.e. the macro conversions (purchase, repeat purchase, etc.). While that may be true in some cases, the reality is that an array of variables can affect these macro metrics, and consequently the experiment is left running for quite some time before it reaches statistical significance, or shows any meaningful change in the conversion rate at all.

That was the case with our recent experiment on changing the home page layout: the primary metric we had hoped to move was conversion, and much to our disappointment the conversion rate stayed flat. What we discovered later (an after-the-fact lesson) was a massive delta in the behavioral metrics: one variant recorded a 30% higher tile click-through rate, and statistical significance was reached almost twice as fast as for the macro conversion. This means that had we not looked beyond the macro/primary metric, we would probably have called the test inconclusive and moved on. Luckily, we looked at the other metrics and were able to draw important conclusions about user click-through behavior.

Lessons from this: diversify your metrics, set macro and micro conversions, and don't be ashamed if your experiment doesn't affect purchase behavior! (A quick back-of-the-envelope sketch of why micro metrics move faster follows below.)
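
The reason a micro metric such as tile click-through reaches significance so much faster than a macro conversion is that its baseline rate is usually much higher, so far fewer samples are needed to detect the same relative lift. The sketch below uses the standard two-proportion sample-size approximation; the baseline rates, lift and traffic figures are hypothetical, not our actual numbers.

```python
# Rough sketch of why micro metrics reach significance faster than macro
# conversions. Baseline rates, lifts and traffic below are hypothetical.
from scipy.stats import norm


def samples_per_variant(baseline: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate samples needed per variant to detect a relative lift
    on a baseline proportion (two-sided two-proportion test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1


# Macro metric: purchase conversion with a low baseline rate.
n_macro = samples_per_variant(baseline=0.02, relative_lift=0.05)
# Micro metric: tile click-through with a much higher baseline rate.
n_micro = samples_per_variant(baseline=0.20, relative_lift=0.05)

print(f"Samples per variant for a 5% lift on conversion: {n_macro:,}")
print(f"Samples per variant for a 5% lift on tile CTR:   {n_micro:,}")

# With, say, 20,000 users per variant per day, the corresponding run time:
daily_users = 20_000
print(f"Days needed: conversion ~{n_macro / daily_users:.0f}, "
      f"CTR ~{n_micro / daily_users:.0f}")
```

With these made-up numbers the macro conversion needs roughly an order of magnitude more traffic than the click-through metric, which is why a test can look "inconclusive" on purchases long after the behavioral metrics have already settled.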

  4. Running an A/B test with a preconceived notion — “I am running this test to prove all my haters wrong; my fancy new hero banner will lead to a better CTR, and if I fail, then I am going to show that despite my lower CTR, our DAU grew (:D).”

If you are running a test just to prove yourself right, you are approaching A/B testing with the wrong mindset. A/B testing exists to validate or invalidate your hypothesis, and it takes courage to admit you were wrong. A/B testing works well in teams that believe in the “fail fast” approach, and this is where an organization that promotes the idea that failing early is good can reap the benefits of experimentation.

  5. Reaching conclusions far too early — I am not a data scientist, but you surely want to run an A/B test long enough to rule out false positives and negatives. I am not going to explain statistical significance, MDE, p-values and so on, because they are beyond the scope of this write-up. However, it is important to know beforehand how much time you can devote to an experiment, the minimum difference you want to detect, and the amount of traffic you wish to allocate (the sample-size sketch under point 3 above shows one way to estimate this). There is a ton of information available to help you avoid calling a statistically insignificant test conclusive. Two of the online calculators I used were Evan Miller's and SurveyMonkey's.

In the picture above (an actual result from our A/B test), don't let the 2% confuse you. Take a look at the 85% confidence: it might sound like a good number, but is it statistically significant enough to call Variant A the winner?
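
For the curious, the number behind a "confidence" readout like that is just the p-value of a two-proportion test. The sketch below uses made-up counts, chosen so the result lands around the 85% mark; they are not our real data.

```python
# Sanity-checking an "85% confidence" readout with a two-proportion z-test.
# The counts below are made up for illustration only.
from statsmodels.stats.proportion import proportions_ztest

clicks = [1000, 1064]           # control, Variant A (hypothetical successes)
impressions = [50_000, 50_000]  # users exposed to each version

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
confidence = 1 - p_value        # loosely, how many A/B tools report "confidence"

print(f"z = {z_stat:.2f}, p = {p_value:.3f}, confidence ~ {confidence:.1%}")

# Only call a winner if p clears the alpha you committed to before the test.
ALPHA = 0.05
print("Statistically significant?", p_value < ALPHA)
```

An 85% confidence level corresponds to a p-value of roughly 0.15, well short of the conventional 95% bar (p < 0.05), so calling Variant A a winner on that basis alone would be premature.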

  6. Not trying to find insights after a flat A/B test — We have all been through this: we start with our hopes high, collect data, can't wait to draw conclusions and boom! The test is flat, statistically insignificant and inconclusive. What do you do next? Move on and stick with your control? Blame yourself for coming up with the wrong hypothesis? These things are bound to happen, but if you are willing to go deeper, there are always insights that can be super useful. Our own test was pretty flat too; however, we decided to look at counter and micro metrics, which led to deep insights. The primary metric wasn't affected by either of the two patterns, but some counter metrics were, and it was only through deeper analysis that we reached this conclusion. So don't fret when you have a flat A/B test. Instead, try digging deeper, or simply keep asking why (I am a die-hard fan of the “5 whys”: asking why five times will get you to the root cause of the problem).
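
One lightweight way to do that digging is to sweep the secondary and counter metrics with the same significance test and see which of them actually moved. The sketch below does exactly that; the metric names and counts are hypothetical, not the real data from our experiment.

```python
# After a flat primary metric, sweep secondary/counter metrics for real movement.
# Metric names and counts are hypothetical, for illustration only.
from statsmodels.stats.proportion import proportions_ztest

# (control_successes, control_n, variant_successes, variant_n) per metric
metrics = {
    "purchase_conversion": (1_000, 50_000, 1_010, 50_000),   # primary: flat
    "tile_click_through":  (9_800, 50_000, 12_700, 50_000),  # micro metric
    "search_usage":        (6_100, 50_000, 5_600, 50_000),   # counter metric
}

ALPHA = 0.05
for name, (c_succ, c_n, v_succ, v_n) in metrics.items():
    lift = (v_succ / v_n) / (c_succ / c_n) - 1
    _, p = proportions_ztest(count=[c_succ, v_succ], nobs=[c_n, v_n])
    verdict = "significant" if p < ALPHA else "flat"
    print(f"{name:22s} lift={lift:+7.1%}  p={p:.3f}  -> {verdict}")
```

Even when the primary row comes back flat, a sweep like this surfaces which behavioral and counter metrics genuinely moved and are worth a deeper "why".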

My mantra in experimentation is to firmly believe that I could be wrong, and to remind myself that it is always better to trust the data than to rely on a gut feeling.

These are the insights I wanted to share from my experience running A/B tests. I hope you will be able to run better, more meaningful A/B tests after reading this!
