Unlocking A/B Testing — Part one: setup and execution

Claudia Minardi
NEXT Engineering
5 min read · Mar 31, 2016


If you have worked hard to build a new feature, you definitely want it to be successful in the eyes of your users. But how do you measure the success of your hard work against the preferences of your users? How do you make sure you are in fact making their experience better, and not introducing something harmful?

You might have heard somewhere that A/B testing is the way to go, but to you it just looks like a black box: you have no idea how to get started, what you need, or what the result will be!

Let’s take a look at how we did it!

What’s A/B Testing?

We developed a recommender system for a set of items and evaluated it against a simple recency-based approach: the most recent item is shown first. Which one is more effective? Did we improve the situation at all, or did we make it worse?

We set up a controlled experiment in the form of an A/B test, described by Kohavi et al. (2009) as the randomized assignment of users to one of two variants: the control (typically the existing version, which in our case corresponds to a simple ranking by recency) and the treatment (usually the new version being evaluated, in our case the recommendation-based ranking). Once the experiment was set up, we collected metrics for evaluation.

Be careful here! The difficulty in evaluating the impact of a change lies in choosing the correct success metrics: these could be based on increased user traffic, exploration of new items, responsiveness of the system to the user’s preferences, and so on. As Shani and Gunawardana (2011) suggest, a feature needs to be evaluated in the context of a specific application, by identifying for that context the properties that may influence its success.

Once we have all the metrics, we have to perform statistical tests to determine whether there is a statistically significant difference between the control and the treatment. Student’s t-test is widely used for the evaluation of A/B tests: it determines whether two sets of data are significantly different from each other, under the assumption that the evaluation data is drawn from a normal distribution, which holds for a large enough sample thanks to the Central Limit Theorem.
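As an illustration (not the exact code we ran), a two-sample t-test on a per-variant metric can be as simple as a call to SciPy; the numbers below are made up:

```python
# A minimal sketch of a two-sample (Welch's) t-test for an A/B experiment.
# The arrays below are hypothetical example values, not real experiment results.
import numpy as np
from scipy import stats

# Click-through rate per daily bucket, one array per variant (made-up data)
control = np.array([0.112, 0.098, 0.105, 0.121, 0.093, 0.108])
treatment = np.array([0.131, 0.119, 0.127, 0.142, 0.115, 0.124])

# equal_var=False uses Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference between control and treatment is statistically significant.")
else:
    print("No statistically significant difference detected.")
```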

Experimental Settings

Now that we know what A/B testing is and how we are going to evaluate it, let’s take a closer look at what we need to set it up.

Be careful: experimental settings can be boring and hard to figure out, but if this step is not done right, the whole experiment will be useless!

According to Kohavi et al. (2009) and Ranjit (2001), there are three fundamental ingredients you need to have before starting the experiments:

  • Overall Evaluation Criterion (OEC). Also known as the dependent variable, it’s a quantitative measure of your objective. For example, are you testing whether a blue button in your user interface attracts more attention if colored red? How do you measure this ‘attention’? Your OEC could be the Click-through Rate: of all the times this button has been viewed, how many times has it been clicked?
  • Factor, as in the variable that influences the OEC. In A/B tests there is only one factor, and its values — variants — are A and B. Take the button example again: the factor will be the button itself. The blue colored button will be your variant A (called control), the red colored button will be your variant B (called treatment).
  • Experimental Unit, that is, the entity over which metrics are calculated. In simpler words, it’s what you are observing as the experiment goes on. Let’s go back to our blue/red button example, and let’s say that the button is on the homepage of your website. You won’t be interested in what happens on any other page, because the button isn’t there. Your experimental unit can be a page view on your homepage. A small sketch of how these three ingredients fit together follows this list.
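To make these three ingredients concrete, here is a minimal, hypothetical sketch for the button example: each impression of the homepage is an experimental unit, it records which variant of the factor it was exposed to, and the OEC (click-through rate) is computed per variant. The field names and data are made up.

```python
# Hypothetical sketch tying the three ingredients together; fields and data are made up.
from dataclasses import dataclass
from typing import List

@dataclass
class Impression:
    """Experimental unit: one page view of the homepage."""
    variant: str   # factor value: "A" (blue button, control) or "B" (red button, treatment)
    clicked: bool  # whether the button was clicked during this page view

def click_through_rate(impressions: List[Impression], variant: str) -> float:
    """OEC: clicks on the button divided by views of the button, for one variant."""
    views = [i for i in impressions if i.variant == variant]
    if not views:
        return 0.0
    return sum(i.clicked for i in views) / len(views)

log = [Impression("A", True), Impression("A", False),
       Impression("B", True), Impression("B", True)]
print(click_through_rate(log, "A"))  # 0.5
print(click_through_rate(log, "B"))  # 1.0
```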

So, what changes when the test is on something more complicated than just a button? Let’s take into consideration what we did to test our recommender system.

Since the success of a complex algorithm can be driven by multiple factors, it’s good practice to use more than one OEC; in our case we went for metrics commonly used in Information Retrieval (a minimal sketch of two of them follows the list):

  1. Satisfied Click-through Rate: the fraction of clicks that resulted in satisfaction from the user’s point of view.
  2. NDCG (Normalized Discounted Cumulative Gain): the usefulness of a document based on its position in the result list, normalized against the ideal ordering.
  3. Time to click: time passed between the start of a session and the click on a recommended media item.
  4. Mean reciprocal rank: the average of the reciprocal ranks of the items that have been correctly predicted (the inverse of the harmonic mean of their ranks).
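For the curious, here is a minimal sketch of how two of these metrics can be computed; it is a generic textbook implementation, not our production pipeline:

```python
# Minimal sketch of NDCG and mean reciprocal rank; example values are made up.
import math

def dcg(relevances):
    """Discounted cumulative gain: each item's gain is discounted by log2 of its rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize DCG by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mean_reciprocal_rank(ranked_lists, relevant_items):
    """Average of 1/rank of the first relevant item in each ranked list."""
    reciprocal_ranks = []
    for ranking, relevant in zip(ranked_lists, relevant_items):
        rr = 0.0
        for rank, item in enumerate(ranking, start=1):
            if item in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Hypothetical relevance judgments for one recommended list (3 = very relevant, 0 = not relevant)
print(ndcg([3, 2, 0, 1]))  # close to 1.0 means a near-ideal ordering
print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y", "z"]],
                           [{"b"}, {"x"}]))  # (1/2 + 1/1) / 2 = 0.75
```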

Our factor is the Learning to Rank algorithm that backs the recommender: its A variant (control) is a plain ranking by recency, where the most recently published document shows up first; its B variant (treatment) is our supercharged implementation, which shows first what it predicts is best for the user.
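Conceptually, the two variants are just interchangeable ranking functions. The sketch below is a hypothetical illustration; the item fields and the predicted score are assumptions, not our actual code:

```python
# Hypothetical illustration of the two variants as interchangeable ranking functions.
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List

@dataclass
class Item:
    id: str
    published_at: datetime
    predicted_score: float  # assumed output of the learning-to-rank model

def rank_by_recency(items: List[Item]) -> List[Item]:
    """Variant A (control): most recently published first."""
    return sorted(items, key=lambda i: i.published_at, reverse=True)

def rank_by_model(items: List[Item]) -> List[Item]:
    """Variant B (treatment): highest predicted relevance first."""
    return sorted(items, key=lambda i: i.predicted_score, reverse=True)

# The experiment swaps one ranker for the other depending on the user's assigned variant.
rankers: Dict[str, Callable[[List[Item]], List[Item]]] = {
    "A": rank_by_recency,
    "B": rank_by_model,
}

items = [
    Item("old-but-great", datetime(2016, 1, 1), predicted_score=0.9),
    Item("fresh", datetime(2016, 3, 30), predicted_score=0.2),
]
print([i.id for i in rankers["A"](items)])  # ['fresh', 'old-but-great']
print([i.id for i in rankers["B"](items)])  # ['old-but-great', 'fresh']
```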

Our experimental unit (also referred to as an impression) is a page view on any of the pages where the recommended items are visible to the user.

Now we’re all set up and ready to go!

Execution

Good news: this is definitely the easiest part of your experiment. There are only three best practices you want to follow, according to Shani and Gunawardana (2011):

  • It is important to sample users randomly, so that the comparison between A and B is fair (a sketch of a stable random assignment follows this list);
  • It is important to single out the different aspects of the recommender (e.g. if we are interested in evaluating the impact of user interface, keep the same underlying algorithm — and vice versa);
  • Experiments need to be carefully planned, as they can have a negative impact on the system which might be unacceptable in commercial applications.
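A common way to get the stable random split that the first point asks for is to hash a user identifier into a bucket, so the same user always sees the same variant. A minimal sketch, where the experiment name and the 50/50 split are assumptions:

```python
# Minimal sketch of stable, pseudo-random user assignment; the experiment name,
# user ids and 50/50 split are assumptions for illustration.
import hashlib

def assign_variant(user_id: str, experiment: str = "recommender-ab",
                   treatment_share: float = 0.5) -> str:
    """Hash the user id so the same user always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "B" if bucket < treatment_share else "A"

print(assign_variant("user-42"))    # always returns the same variant for this user
print(assign_variant("user-1337"))
```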

Did you check all three of these best practices off your TODO list? You’re good to go! Release variant A to your control group and variant B to your treatment group, and start recording all the data that you need.

You’re almost done! Now that the data is being collected, you just need to figure out what to do with it. Only two more steps are left to complete your tests: power analysis and evaluation. We’ll guide you through them in the second part of this journey, coming soon!

Happy coding!
