The A/B testing cookbook

Ibtesam Ahmed
4 min read · Aug 16, 2023

Wondering about the title? Think of this series as a recipe book that guides you step by step through the world of A/B testing, much as a cookbook guides you to culinary masterpieces. Just as cooking blends ingredients to create the perfect dish, A/B testing blends science and business understanding to help you make the right decision.

While the world is still talking about ChatGPT (even nine months after its release), I want to shed some light on A/B testing. Why? Either A/B testing is not as common a practice as it should be, or it is not being done correctly. While big tech companies and successful tech-oriented startups have it sorted (Microsoft and Google each conduct more than 10,000 A/B tests annually), most mid-sized companies and non-tech-oriented startups struggle with it.

Building an A/B testing culture

Although A/B tests take time compared to intuition-based decision making, they are a scientific, evidence-driven way to make decisions. I will quote a real incident from an amazing article I read about the power of these online experiments.

In 2012 a Microsoft employee working on Bing had an idea about changing the way the search engine displayed ad headlines. Developing it wouldn’t require much effort — just a few days of an engineer’s time — but it was one of hundreds of ideas proposed, and the program managers deemed it a low priority. So it languished for more than six months, until an engineer, who saw that the cost of writing the code for it would be small, launched a simple online controlled experiment — an A/B test — to assess its impact. Within hours the new headline variation was producing abnormally high revenue, triggering a “too good to be true” alert. Usually, such alerts signal a bug, but not in this case. An analysis showed that the change had increased revenue by an astonishing 12% — which on an annual basis would come to more than $100 million in the United States alone — without hurting key user-experience metrics. It was the best revenue-generating idea in Bing’s history, but until the test its value was underappreciated.

There are many such stories and stats.

If you are a product-based company in the online space with a sizeable user base, any new feature should be A/B tested before being rolled out to all of your users, so you know whether it is doing you good or not. Roughly speaking, you show one set of users the new version of the app and another set the older version (keeping everything else constant), then measure the difference in the metrics that matter to you (conversion rate, click-through rate, etc.). And it is not just changes in the app that should be tested: marketing strategies like email campaigns and push notifications should also be tested so you can reach customers more efficiently.
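To make this concrete, here is a minimal sketch of how the split and the measurement could look in Python. The hash-based bucketing, the experiment name, and the log data are my own illustrative choices, not a prescription:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "new_feature") -> str:
    """Deterministically bucket a user into 'control' or 'treatment'.

    Hashing (experiment name + user id) keeps the split stable across
    sessions and independent across other experiments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Hypothetical (user_id, converted) logs
logs = [("u1", True), ("u2", False), ("u3", True), ("u4", False)]

counts = {"control": [0, 0], "treatment": [0, 0]}  # [conversions, users]
for user_id, converted in logs:
    bucket = counts[assign_variant(user_id)]
    bucket[0] += int(converted)
    bucket[1] += 1

for variant, (conv, users) in counts.items():
    print(f"{variant}: {conv} conversions out of {users} users")
```

Deterministic hashing, rather than flipping a coin on every visit, ensures a returning user always sees the same variant, which keeps the experience consistent and the measurement clean.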

In companies that use A/B tests heavily, they are run by data scientists, marketers, designers, software engineers, and product managers, among others, not just to make more user-driven decisions but also to conduct complex experiments that study network effects, which help in understanding how online services affect user behaviour offline and how users influence each other. As you might have guessed by now, these network-effect tests are mostly run by social media companies such as Meta.

Although A/B testing can be done for even the most minor change in your app, it should especially be done before scaling up ML models in production. A data scientist's job is not just to build ML models but also to test them online against the baseline or existing approach. Offline testing is not enough because, in most cases, the training environment is not a perfect representation of the production environment.

Decoding statistical significance: the right way to run an A/B test

You might wonder, “Isn't A/B testing just about dividing populations and comparing outcomes?” Not quite. The difference you see when you compare results directly between variants might be purely due to chance. How can you be certain that the difference you saw was real?

This is the point where statistics (read: the science of quantifying uncertainty) enters. Statistical significance is summarized by the p-value: the probability of seeing a difference at least as large as the one you observed if, in reality, there were no difference between the two groups. If this probability is small (the conventional threshold is below 5%), the difference is unlikely to be due to chance alone, and you can treat it as real.
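Here is what that looks like in code, using the two-proportion z-test from statsmodels; the conversion numbers are made up for illustration:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and visitors for control vs. treatment
conversions = [480, 530]
visitors = [10_000, 10_000]

# Null hypothesis: both variants have the same conversion rate
z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")

if p_value < 0.05:  # the conventional 5% significance level
    print("The difference is statistically significant.")
else:
    print("The observed difference could plausibly be due to chance.")
```

A z-test on proportions suits binary metrics like conversion; for continuous metrics such as revenue per user, a t-test is the usual choice. We will dig into picking the right test later in the series.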

As a manager or someone from the business side, if you are still not convinced that statistical significance matters to you and why you should know about it, I'd encourage you to read this Harvard Business Review article.

How to run these tests?

Some leading tech companies have dedicated entire groups to building, managing, and improving experimentation infrastructure that can be employed by many teams, not just data scientists. Such a capability can be an important competitive advantage, but the companies this article is meant for shouldn't start by building their own frameworks right away. That only makes sense once A/B tests are run at scale and add a lot of value.

Until then, companies either buy A/B testing tools or have their data scientists and engineers develop A/B testing pipelines. In either case, you will want to know the steps involved in running a test and the statistical concepts behind them: how to formulate a hypothesis, how to calculate sample sizes, what statistical significance and confidence intervals are, and which statistical test to choose. In this series, I aim to cover all of those.
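As a preview of the sample-size step, here is one common way to estimate how many users you need per variant, using a power analysis from statsmodels. The 5% baseline conversion rate and the one-percentage-point lift are assumptions I picked for illustration:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative assumptions: 5% baseline conversion, and we want to
# reliably detect a lift to 6%
baseline, target = 0.05, 0.06
effect_size = proportion_effectsize(baseline, target)

# Conventional choices: 5% significance level, 80% power
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Users needed per variant: {n_per_variant:,.0f}")
```

Run the test until each variant has roughly that many users; stopping early the moment the p-value dips below 5% inflates your false-positive rate.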

If you are interested in learning more, here's the link to the next article.


Ibtesam Ahmed

A Data Scientist on a quest to make machines smarter and more human-like. Avid Reader. Moody Writer. Amateur Cook.