A/B… C? Lessons learnt from experimentation

Amna Askari
Reach Product Development
6 min read · Jul 1, 2019

Experimentation, including the infamous A/B test, is a ubiquitously adopted strategy in digital product development today. This is mainly due to a) the increasing ease of recording, storing and analysing usage data and b) our obsession with using that data to inform and drive our decision making.

For those who aren’t familiar with the practice, don’t fret. The idea has been around since the mid-18th century and you’ve probably already been part of a test, as the biggest and best product companies — Google, Netflix, Facebook, Amazon — are constantly testing and don’t ship new features or redesigns until they have confirmed that they deliver a quantifiable positive change. In a nutshell: instead of delivering one version of your product to everyone, you divide your users randomly into different groups that each receive a tweaked version or ‘variant’, along with a group that receives what you’re testing against, i.e. the ‘control’. Visually:

You then analyse how each variant did according to the numbers and integrate the winning tweak into your product if it successfully outperforms the rest. If the control wins, you were probably wrong about your initial hypothesis and need to rethink.
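To make the split concrete, here is a minimal sketch of one common way to assign users: hash the user and experiment IDs so each user consistently lands in the same group. The bucket names, weights and function are illustrative, not a description of any particular tool.

```python
import hashlib

# Illustrative buckets and traffic weights (must sum to 1.0)
BUCKETS = [("control", 0.50), ("variant_a", 0.25), ("variant_b", 0.25)]

def assign_bucket(user_id: str, experiment: str) -> str:
    """Deterministically map a user to a bucket for a given experiment.

    Hashing user_id together with the experiment name means a user always
    sees the same variant within a test, but is re-randomised across tests.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    point = int(digest, 16) / 16 ** len(digest)  # uniform-ish value in [0, 1)
    cumulative = 0.0
    for name, weight in BUCKETS:
        cumulative += weight
        if point < cumulative:
            return name
    return BUCKETS[-1][0]  # guard against floating-point rounding

print(assign_bucket("user-123", "share-button-redesign"))  # e.g. 'variant_b'
```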

Testing is pretty much a prerequisite for introducing anything new here at Reach and hence is a core part of our product development lifecycle. We’ve used in-house and third-party tools (Optimizely, Google Optimize) to carry out both client- and server-side testing of everything from optimisations, such as changing the font colour for links, to ambitious new features, such as delivering a Snapchat Stories-esque interface for football video content. We’ve succeeded (occasionally), failed (sometimes) and often been underwhelmed by the impact of our ideas. Below are nuggets of learning from this ever-evolving rollercoaster of a journey.

1. Have clear goals 🎯

Don’t even start with testing if you’re not sure what you want to achieve. Unclear objectives translate into murky test criteria, which means your test is either inconclusive, unhelpful or potentially quite damaging to overall product success. There are many frameworks, including the famous OKRs, that help you extract and structure product goals, but even stepping back and asking “Why are we doing this?” at the beginning of a brainstorm session can be enough to steer your process in a helpful direction.

2. Distribute your traffic wisely 🚗

Dividing your user base up is harder than it sounds. Your variants need to be big enough for the results to be statistically significant, yet not so big that if your test breaks or underperforms you lose a sizeable chunk of your users and revenue. If you are unsure how severely a test could affect impressions or engagement, start small and increase variant size in a follow-up test. However, if you’re implementing a low-risk change, are willing to incur the potential loss from a high-risk one, or have a reliable testing setup that is integrated into your architecture, you can feel confident starting big. We’re lucky to have a diverse portfolio of news websites at Reach, so we usually run an initial test on a smaller site where the stakes are lower. If all goes well and we need to reaffirm our hypothesis with a larger sample size, we roll out to larger sites.
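To put a number on “big enough to be statistically significant”, here is a sketch of the standard two-proportion sample-size calculation. The baseline click-through rate and the lift we hope to detect are invented figures, and the formula is the textbook approximation rather than anything specific to our setup.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect an absolute lift
    of `mde` over a baseline conversion (or click-through) rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # desired statistical power
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return ceil(n)

# e.g. a 2% click-through baseline and a hoped-for 0.5 percentage-point lift
print(sample_size_per_variant(baseline=0.02, mde=0.005))  # roughly 14,000 per variant
```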

3. Choose your metrics even more wisely 💯

There is no point testing if you’re unsure about what you’re testing for. Metrics are the vehicle through which we interpret user behaviour and hence it’s very important to know what you want to get out of them.

I’ll use an example to illustrate. Say you’re an e-voucher company and you want to increase the number of discount codes that get shared. To do this, you decide to run an A/B test on your web platform with three variants, each containing a different design of the share button. The main deciding factor for success is the number of interactions with the share button, so a reasonable primary metric is clicks on that button. Other things you want to monitor include the time spent on the page and maybe whether the user actually redeems the discount code. Along with the metrics that are key for your test, it is vital to monitor other metrics that you consider important for product success, because if your test jeopardises those, it might be more harmful than beneficial overall (this is why having clear objectives is important). Here is a good way of looking at it:

(taken from the Optimizely blog)
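Framing the same example in code, a test definition might separate the primary metric that decides the winner from the secondary metrics you watch for context and the guardrail metrics that must not regress. Everything here (metric names, thresholds, results) is invented for illustration; it is a sketch of the idea, not our actual tooling.

```python
# Hypothetical test definition; metric names and thresholds are illustrative.
share_button_test = {
    "variants": ["control", "icon_only", "icon_and_label"],
    "primary_metric": "share_button_clicks",                    # decides the winner
    "secondary_metrics": ["time_on_page", "code_redemptions"],  # watched for context
    "guardrail_metrics": {                                      # must not regress
        "page_views_per_session": -0.02,  # tolerate at most a 2% relative drop
        "checkout_conversion": 0.0,       # no drop tolerated
    },
}

def is_shippable(results: dict, test: dict) -> bool:
    """A variant ships only if it wins on the primary metric without
    breaching any guardrail threshold."""
    primary = results[test["primary_metric"]]
    primary_win = primary["significant"] and primary["lift"] > 0
    guardrails_ok = all(
        results[metric]["lift"] >= threshold
        for metric, threshold in test["guardrail_metrics"].items()
    )
    return primary_win and guardrails_ok

# Invented results for one variant vs. the control (relative lifts)
results = {
    "share_button_clicks": {"lift": 0.12, "significant": True},
    "page_views_per_session": {"lift": -0.01, "significant": False},
    "checkout_conversion": {"lift": 0.003, "significant": False},
}
print(is_shippable(results, share_button_test))  # True
```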

4. Watch out for the ‘novelty effect’ 😱

Usually, introducing a change or adding a shiny new feature is followed by a short-lived inflation in engagement. This is commonly diagnosed as the ‘novelty effect’ and occurs simply because the change is new and users want to know what it is. It doesn’t reflect the actual usability of that feature and should be discounted from your overall results. Third-party tools such as Google Optimize take this into account, but you should watch for it if you’re doing your own analysis. Its occurrence also means that you need to give enough time for your test to settle — you need to let users get used to the change, and then see if they actually engage with it. It is recommended to run tests for around 2 weeks, but this can change depending on your variant sizes and how long it takes for statistically significant differences to emerge. Another handy feature in Google Optimize is that it tells you when to end your test.
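One rough way to spot the novelty effect in your own analysis is to compare the lift in an early window against a later one: if most of the early lift evaporates, the spike was curiosity rather than genuine adoption. A minimal sketch with invented numbers:

```python
def relative_lift(variant_clicks, variant_users, control_clicks, control_users):
    """Relative improvement of the variant's click rate over the control's."""
    v_rate = variant_clicks / variant_users
    c_rate = control_clicks / control_users
    return (v_rate - c_rate) / c_rate

# Invented weekly figures for one variant vs. the control
week1 = relative_lift(310, 12_000, 240, 12_000)   # right after launch
week2 = relative_lift(265, 12_000, 238, 12_000)   # once the shine wears off
print(f"week 1 lift: {week1:+.1%}")   # inflated by curiosity clicks
print(f"week 2 lift: {week2:+.1%}")   # closer to steady-state behaviour
# A large gap between the two suggests the early numbers are mostly novelty;
# the later window is a better estimate of long-term engagement.
```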

5. This time, don’t go with your gut 🤖

Quite a lot of the time, what you think will work doesn’t work — you are not your user. As smart as you and your team may be, you cannot predict how your users will behave. We tend to get attached to the things we build, design and strategise and often that makes it hard to let go and evolve at a quick pace. You must wholeheartedly accept what the tests tell you (assuming your analytics are reliable, of course) and make decisions based on that, even if it goes against the outcome that you wanted. Often, even if the results are inconclusive, experimentation is a great way to learn more about your users and how they interact with your product — we are limited by the way that we think and should entertain a bit of data-driven enlightenment.

6. Break it up ⚒️

Don’t try to put too much into one test, because it’ll be hard to separate what exactly made one variant win over the other. For example, if you’re doing a redesign of a button but also want to test whether relocating it would lead to more clicks, carry out two tests — one with the redesign and one with the relocation. Although it’s not always this easy to be atomic with your hypotheses, try your best, because it allows you to independently analyse the impact of each proposed change. If you bundled the redesign and the relocation into one test and the results weren’t promising because one of them had a negative effect, you’d end up discarding a change that, on its own, could have had a positive impact.

And that’s all for now, folks! One of the most exciting things about software is that our users are changing all the time and we are in a position to record and analyse immediate feedback from most of the ways they interact with what we build (using GDPR-compliant approaches, obviously). This makes experimentation a lot of fun and I would encourage you to give it a go if you haven’t and share your scare and success stories if you have!

Happy testing 🎉
