AB testing for better decision-making

Learn why we use AB testing to help develop our games with data-driven decisions and how we go about it

etermax technology · 7 min read · Aug 19, 2022


By Magdalena Sella, Data Analyst at etermax

In recent years, everyone has been saying that you have to harness the power of data to succeed and that companies should become data-driven if they want to survive. However, have you ever wondered what being data-driven looks like, and what the hype is all about? Is it only about machine learning and AI, or is there more to it?

There is undoubtedly much more to it. This article dives into AB testing, one of the most widely used tools, not from a technical angle but from a product perspective. Of course, you need to grasp the basic technicalities, but what I’m trying to get at here is why you should care about AB testing, and how it helps businesses make better decisions.

Being data-driven, in my opinion, is all about asking the correct questions and then looking for data that either confirms or disproves your hypothesis. If it sounds sciency, it is because it kind of is. To be sure, there are several stages to analysis and you may not need to begin with a hypothesis; instead, you may perform an exploratory analysis that lets you get a sense of what you are looking at. But the question is always there.

Consider the following scenario: You believe the new gameplay for the game is fantastic (and not only because you helped design it). But get this: the old one, according to some coworkers, was superior because more people completed it. How do you choose which one to keep, then?

Once you start searching for answers, perhaps the evidence tells you that your gut was right, or perhaps you reach some counterintuitive insight that you don’t like. But whether you like it doesn’t matter; what matters is that the team can make decisions based on facts. What’s the alternative? We can always ask management which one they prefer, but that probably won’t get us where we want to go: making the best product for our users. And that’s what AB testing is all about.

In product development, it always comes to a point where you ask yourself: well, which alternative is better? Should we further develop A or B? Which one brings more engagement? Which one brings more revenue? With that question in mind, you can form a hypothesis and start hunting for answers. The problem is that, for questions like these, there is usually no data yet.

Wait a second. How can we harness the power of data without data? What happens when you don’t have it? Well, we create it. Don’t get me wrong, we don’t create it as in making it up, more as in collecting it. And that’s when AB testing enters the picture.

AB testing, then. We collect data, you say? Exactly. AB testing is useful when the question we want to answer is which alternative is better. The metric you use to measure the “better” one depends on the business case. Turnover, revenue, CTR — you name it. The strategy is the same.

AB testing in a nutshell

It all begins with a hypothesis: “alternative A is better than B, and it will raise metric X by 5%”. The product team then develops alternatives A and B (or perhaps only B if A is already in production). Additionally, you have to decide how many users will enter the experiment (more on this in a bit). You show alternative A to half of them and alternative B to the other half. Finally, you check which group did better. Easy, right? Now that we understand the overall concept, let’s break it down a little bit.
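To make that split a bit more concrete, here is a minimal, hypothetical sketch in Python of how users could be assigned deterministically to one of the two groups (the experiment name, user ID and 50/50 split are illustrative assumptions, not our actual assignment logic):

```python
import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user into group 'A' or 'B' for one experiment.

    Hashing user_id together with the experiment name keeps the assignment
    stable across sessions and independent between different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100           # a pseudo-random number in [0, 100)
    return "A" if bucket < 50 else "B"       # 50/50 split

# Hypothetical usage: decide which gameplay this user sees
print(assign_variant(user_id="user_42", experiment="new_gameplay_test"))
```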

Who is involved?

Well, many people are involved in an experiment. First of all, someone has to come up with a question or a hypothesis. This can come from anyone involved in a project, but it usually comes either from the product owner (PO) or the data analysts.

Then the two (or more) alternatives to be tested must be built by the engineers and the art team (since we are a mobile game company).

The users can now engage with the options and give feedback.

Lastly, it’s the analyst’s turn. It’s their job to decide which alternative did better and should be continued, or whether the experiment was inconclusive and needs another iteration.

How do we decide which one did better? What metrics do we look at?

Well, since we are in the gaming industry, there are some KPIs that we constantly strive to raise. However, individual businesses may have different goals. Retention, stickiness, and revenue are the key performance indicators for us.

However, one of these target metrics can rise while another takes a hit. Because of this, we must stay vigilant for unintended consequences of the experiment. Did retention drop while revenue rose? That may be a problem, or it may not; you need to look at the overall picture.

Checking for improvement

So, let’s say we target a metric and check whether it performed better for one group than for the other. But how? Imagine that we intended to increase revenue with the new gameplay, and that one group earned us $1,000 while the other generated $1,050. Does this imply the second group performed better? Not necessarily.

There is this thing called statistical significance that will beg to differ. We must achieve the appropriate statistical significance before claiming that A performed better than B. What does that mean? It means that we are reasonably sure the results did not happen just by chance, but that there is an underlying difference in the behavior of the groups.
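To get a feel for how easily a gap like $1,000 vs. $1,050 can appear by pure chance, here is a small, hypothetical simulation: both groups are drawn from exactly the same toy spending model (the payer share and spend amounts are invented for illustration), yet their totals still drift apart noticeably on many runs.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

N_USERS = 5_000   # users per group (hypothetical)
N_RUNS = 10_000   # how many times we replay the "experiment"

def simulate_group_revenue() -> float:
    # Toy spending model: ~5% of users pay anything, and paying users spend
    # ~$4 on average, so 5,000 users bring in roughly $1,000 in total.
    pays = rng.random(N_USERS) < 0.05
    return float((rng.exponential(scale=4.0, size=N_USERS) * pays).sum())

gaps = np.array([abs(simulate_group_revenue() - simulate_group_revenue())
                 for _ in range(N_RUNS)])

print(f"Average gap between two identical groups: ${gaps.mean():.0f}")
print(f"Runs where the gap is $50 or more: {np.mean(gaps >= 50):.0%}")
```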

And how do we check for statistical significance? Well, there are two main steps to it: defining the sample size (how many users) you need in your test, and then comparing the metrics for the two groups.

Step 1: Sample size

You must run your experiment on a sufficient number of users before anything else. Typically, testing it with only 10 participants per group is not helpful. The exact number will depend on the metric you want to target (yeah, again with the metric thing), and how much improvement you need to see to justify your efforts.

If you want to see a 50% increase, then perhaps you can test it with 1000 users per group. However, that will probably not be enough if you want to check for a 5% increase. And if you want to see a 50% increase from 20 to 30%, you will need fewer users than if you wanted to check for a jump from 2% to 3%.

It sounds complicated but there are many calculators out there that can help you figure out your sample size. The general idea is this: the smaller the impact you are looking for, or the smaller the value of the metric you want to impact, the greater the number of users you will need for your test.
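To give an idea of what such a calculator does under the hood, here is a rough sketch using the standard sample-size formula for comparing two proportions (the 5% significance level and 80% power are conventional defaults, and the conversion rates are the ones from the example above):

```python
import math
from scipy.stats import norm

def users_per_group(p1: float, p2: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed PER GROUP to detect a change from rate p1 to rate p2
    with a two-sided test at significance level alpha and the given power."""
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# The examples from the text: the same relative 50% lift needs far more
# users when the baseline rate is small.
print(users_per_group(0.20, 0.30))   # 20% -> 30%
print(users_per_group(0.02, 0.03))   # 2% -> 3%
```

Running this gives roughly 290 users per group for the 20% to 30% case versus roughly 3,800 per group for the 2% to 3% case, which is exactly the point above.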

Word of caution: if you do not have a big enough sample size, you won’t reach statistical significance even if there is an underlying difference! This is to say that if 10,000 users were needed and you only got 6,000, then whatever conclusion you reach won’t be solid, since the difference you’re looking at could just have happened by chance.

Step 2: Testing for statistical significance and analyzing the results

So, we did it. We laid out a hypothesis, prepared everything for the test, defined the number of users that we needed, tagged all of them, and now we have the data. We see a difference, but by now we know that a difference like this can show up by chance alone. How do we know this isn’t the case? Now comes the second step, and, once again, statistics come to the rescue.

There are statistical tests specifically designed to tell you if the difference you see depicts a real underlying difference, or if it could be the result of mere chance. Once again, there are calculators out there that can help you with this.
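As a rough sketch of what such a test can look like, here is Welch’s t-test applied to per-user revenue, with simulated stand-in data sized to roughly match the $1,000 vs. $1,050 example (in practice the right test depends on the metric, e.g. a proportions test for conversion rates):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=42)

def simulate_revenue(n_users: int, avg_payer_spend: float) -> np.ndarray:
    # Stand-in data: ~5% of users pay anything at all. In a real experiment
    # this array would come from your tracking events, one value per user.
    pays = rng.random(n_users) < 0.05
    return rng.exponential(scale=avg_payer_spend, size=n_users) * pays

revenue_a = simulate_revenue(5_000, avg_payer_spend=4.0)   # ~$1,000 in total
revenue_b = simulate_revenue(5_000, avg_payer_spend=4.2)   # ~$1,050 in total

t_stat, p_value = ttest_ind(revenue_a, revenue_b, equal_var=False)  # Welch's t-test

print(f"Group A: ${revenue_a.sum():,.0f}  |  Group B: ${revenue_b.sum():,.0f}")
print(f"p-value: {p_value:.3f}")
if p_value < 0.05:
    print("Statistically significant at the 5% level: the gap is unlikely to be pure chance.")
else:
    print("Not significant: a gap like this could easily happen by chance alone.")
```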

What should you do if the test indicates that there is no statistical significance? It means the difference you’re observing is not large enough to prove that one alternative performed better than the other, even though the raw numbers suggest otherwise ($1,000 vs. $1,050, remember?). It boils down to this: if you were to repeat the experiment, you could easily see a different result, because the gap could have happened by chance alone.

A final word

Being data-driven nowadays is key to developing products that can succeed in an overcrowded market. AB testing can get you closer to that, but it is crucial to ask the right questions and to run the right experiments to get the most out of it.

If we want to increase retention, perhaps focusing our energy and resources on testing the shop’s layout is not the best way to go. Analyzing when we are losing more users and testing alternatives there may be more productive.

So, what’s the key takeaway from all of this? AB testing is not just another tool in our toolbox. It acts as a compass, pointing you in the right direction. And you can get the most out of it by following these steps: define an experiment, test, analyze and repeat. Only by doing this can you be certain that the changes you are making to your product are getting you closer to where you want to be.
