How A/B Testing Can Save Your (Company’s) Life

Karlina Oktaviana
Published in Tokopedia Product
Sep 13, 2019 · 6 min read

When it comes to defining the right objective for Product Development, various data points, including quantitative and qualitative research, are needed to ensure we are heading in the right direction. With user responses and behavior constantly changing, measuring the effectiveness of our products can be tricky. So how do we, as Product Managers, make sure we are shipping the right product to the majority of users?

That’s where A/B testing plays a vital role.

In February 2018, Instagram was testing a feature that sent a notification when a user took a screenshot of an Instagram Story, similar to the one on Snapchat. However, the majority of Instagram users never even knew this feature existed, because they were not part of the sampling population for the test. If you cannot see the feature today, it is most probably because Instagram decided not to ship it to 100% of users after seeing poor results from the test.

Instagram’s notification after you took a screenshot of an Instagram Story

This testing method is called A/B testing: we isolate control and experiment segments consisting of randomly sampled users to test two or more variants, then use statistical analysis to determine which variation performs better on a given success metric (or metrics).
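As a minimal sketch of that statistical analysis step, assuming a simple conversion metric and entirely made-up numbers (Python here is just an illustration, not Tokopedia's actual tooling), a two-proportion z-test is one common way to check whether the gap between control and variant is bigger than random chance:

```python
import math

# Hypothetical counts, purely for illustration (not from any real experiment).
control_visitors, control_conversions = 10_000, 820
variant_visitors, variant_conversions = 10_000, 905

p_control = control_conversions / control_visitors   # 8.20%
p_variant = variant_conversions / variant_visitors   # 9.05%

# Pooled two-proportion z-test: is the gap between the two conversion rates
# larger than what random assignment alone would be likely to produce?
p_pooled = (control_conversions + variant_conversions) / (control_visitors + variant_visitors)
standard_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / control_visitors + 1 / variant_visitors))
z_score = (p_variant - p_control) / standard_error
p_value = math.erfc(abs(z_score) / math.sqrt(2))      # two-sided p-value

print(f"control {p_control:.2%} vs variant {p_variant:.2%}, z={z_score:.2f}, p={p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) suggests the variant's lift is unlikely to be noise; a larger p-value means the test is inconclusive rather than proof that the feature has no effect.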

Imagine how big the losses and complaints would have been if this feature had gone to all Instagram users without testing. To this point, the test result can be the final layer of decision making on whether or not to ship a feature to all users.

At Tokopedia, you might have once seen a feature that no longer exists or has been heavily improved by now. Those features may have performed poorly in A/B testing, so we iterated continuously to provide the best browsing experience for users and fulfill our #1 DNA: Focus on Consumer.

The Importance of A/B Testing

On the left graph shown above, we might conclude that a feature delivers an impactful result when we compare the before-vs-after deployment metrics, without considering the external factor of a Big Promo after the deployment date that contributes significantly to the metrics' growth.

With A/B testing, we can isolate the sampling populations over the same testing period to measure the accurate impact of a feature. On the right graph, both Control and Variant show an upward pattern due to the Big Promo, while the real impact of the new feature is shown by the gap between Control and Variant.

Executing A/B Testing

At Tokopedia, we run sets of continuous experiments on a weekly basis (yes, even when we don't have any new features to release that week, we still run sets of optimization experiments) to ensure that all users always have the best experience in our apps.

Continuous experimentation needs to be paired with smooth execution to get accurate results. Here is a real-life example of how we execute A/B testing at Tokopedia:

What:

  • To execute an A/B test, we need to define the right problem. Example: among the whole Flash Sale funnel, the Product Detail Page (PDP) has the highest drop-off rate.
  • After defining the right problem, we can generate hypotheses. Example: the PDP is not compelling and relatable to the Flash Sale (FS).
  • Hypotheses need to be broken down into a set of user stories. Example: as a user visiting the FS PDP, I should be able to feel the sense of urgency and benefits created by the campaign.
  • Set quantifiable metrics as end goals for the experiment. Example: Conversion Rate (CVR) to Add to Cart (ATC); a small worked example follows this list.
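As a small worked example of what this framing produces, using entirely made-up funnel counts, the drop-off rate quantifies the problem while CVR to ATC is the metric the experiment aims to move:

```python
# Made-up funnel counts, just to make the problem and the success metric concrete.
flash_sale_pdp_views = 50_000   # hypothetical sessions reaching the FS PDP
add_to_cart_events = 4_100      # hypothetical ATC actions from those sessions

drop_off_rate = 1 - add_to_cart_events / flash_sale_pdp_views   # the problem: 91.8% leave
cvr_to_atc = add_to_cart_events / flash_sale_pdp_views          # the goal metric: 8.2%

print(f"FS PDP drop-off: {drop_off_rate:.1%}, CVR to ATC: {cvr_to_atc:.1%}")
```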

Who:

We pick a random population of users and divide them into several groups. Then, we choose which group gets tested, targeting only the relevant users (e.g., FS users).

Illustration: Random population segmentation

In this illustration, the green dot is the group we pick and the red dots are the FS users. The group that we pick has to be tested against a control group of the same size; hence the two dots in the picture.
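A minimal sketch of how this kind of random-but-stable group assignment can work, hashing the user ID so each user always lands in the same group; the experiment name and the 50/50 split below are assumptions for illustration, not Tokopedia's actual setup:

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    """Deterministically map a user to one of `buckets` groups.

    Hashing (experiment, user_id) gives a split that is effectively random
    across users but stable for each user, so the same person always sees
    the same version for the whole test.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def variant_for(user_id: str, is_fs_user: bool) -> str:
    # Only relevant users (e.g., FS users) enter the experiment;
    # everyone else keeps the default experience.
    if not is_fs_user:
        return "default"
    bucket = assign_bucket(user_id, "fs_pdp_urgency_test")  # hypothetical experiment name
    return "variant" if bucket < 50 else "control"          # equal-sized 50/50 groups

print(variant_for("user-123", is_fs_user=True))
```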

When:

The correct testing period also contributes to result accuracy. We usually follow a good rule of thumb: run experiments for 2 weeks to avoid cyclical effects.
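As a rough sketch of how that rule of thumb can be sanity-checked, the standard sample-size formula for comparing two proportions estimates how many users each group needs to detect a minimum lift; every input below is hypothetical:

```python
import math

# All inputs are hypothetical, just to show the shape of the calculation
# (normal approximation, two-sided alpha = 0.05, power = 0.80).
baseline_cvr = 0.08           # current CVR to ATC
minimum_relative_lift = 0.10  # smallest change worth detecting (+10% relative)
z_alpha, z_beta = 1.96, 0.84  # critical values for alpha = 0.05 and power = 0.80

p1 = baseline_cvr
p2 = baseline_cvr * (1 + minimum_relative_lift)
p_bar = (p1 + p2) / 2

n_per_group = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
               / (p2 - p1) ** 2)

daily_eligible_users = 20_000  # hypothetical traffic entering the experiment per day
days_for_sample = math.ceil(2 * n_per_group / daily_eligible_users)
print(f"~{n_per_group:,.0f} users per group, reached in about {days_for_sample} day(s) of traffic")
```

Even when the required sample size would be reached within a few days, running across full weekly cycles (hence the 2-week rule) helps average out weekday-versus-weekend effects.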

How:

1. Result accuracy depends heavily on how you monitor the execution.

We once had a bad experience where we ran an experiment for 2 weeks, only to find that just 3 days of data were reliable because of a tracking error on the remaining days.

As a cost of that monitoring gap, we had to repeat the experiment to get more accurate data. Thus, health-check monitoring is important. We can have several checkpoints, such as the ones below (a minimal monitoring sketch follows the list):

  • Ensure the right user population is receiving the right feature (control or experiment)
  • Ensure the trackers are working properly throughout the testing period
  • Ensure the feature being tested is working perfectly
  • Do detailed analysis to understand the context
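One concrete check that covers the first two points is a sample ratio mismatch (SRM) test: if the observed control/variant split drifts far from the intended allocation, the bucketing or the tracker is probably broken. A minimal sketch with hypothetical counts:

```python
import math

# Hypothetical daily counts of users actually exposed to each version.
control_users, variant_users = 49_312, 50_720
total = control_users + variant_users
expected = total / 2            # the intended split is 50/50

# Chi-square goodness-of-fit test against the intended allocation (1 degree of freedom).
chi_sq = ((control_users - expected) ** 2 + (variant_users - expected) ** 2) / expected
p_value = math.erfc(math.sqrt(chi_sq / 2))

if p_value < 0.001:
    print(f"Sample ratio mismatch (p={p_value:.6f}): check bucketing/tracking before trusting results")
else:
    print(f"Split looks healthy (p={p_value:.3f})")
```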

2. To present the data better, so that every stakeholder reaches the same understanding, we can turn the data into a story. For example:

  • “CVR to ATC shows an XX% improvement on the Variant version”
  • “As a result of optimizing the above-the-fold info, X more users spend less than 15 seconds deciding to ATC”

Various Types of A/B Testing

A/B testing can be done to check the performance of a new feature or to optimize existing ones. Here are a few examples of low-hanging-fruit improvements we have made through A/B testing on existing features at Tokopedia: Layout, Content (Text or Image), and Algorithm.

1. Layout

  • User Problem: the majority of landing pages have rich content in order to provide thorough and inspiring information to users. Yet, given the low Click-Through Rate (CTR) and Conversion Rate (CVR), we assume this rich content distracts users from the high-quality products that we actually offer.
  • Hypothesis: by reducing the amount of content and displaying products straight away, users can stay focused on the campaign's purpose.
  • Result: reducing unnecessary content brings an xx% increase in CVR and an xx% increase in CTR.
Comparison between two differing layouts

2. Content

  • User Problem: CTR on promotional banners is low.
  • Hypothesis: optimizing the space on the banner by reducing the copy and simplifying the Key Visual (KV) will make it easier for users to digest, thus improving the CTR.
  • Result: simplified content delivers an XX% improvement in CTR.

Case A:

A simple KV can reduce friction in the user's browsing journey and improve engagement (clicks/session)

Case B:

A simple KV can reduce friction in the user's browsing journey and improve engagement (clicks/session)

3. Algorithm

  • User Problem: users need to scroll far down the product section to find products relevant to them.
  • Hypothesis: providing relevant (personalized) products to each user will reduce the buying decision time and thus improve CVR.
  • Result: providing personalized products instead of curated products improves CVR by XX%.

Those are some successful cases of A/B testing at Tokopedia. But unlike a Disney movie, we don't always get a happy ending from our testing. Some tests show positive impact, while others tell us to work harder.

When we put our best effort into developing a new feature yet the result falls below our expectations, a bad A/B testing result could be seen as a downside. At the same time, it is a big relief to figure it all out and make room for improvements before we ship the feature to 100% of users.

A/B testing is a good exercise for Product Managers to get to know their existing and upcoming products better, to keep learning continuously, and to share that learning with other stakeholders. Specifically for us at Tokopedia, it lets us get to know our consumers better and continuously live our #1 DNA: Focus On Consumer.

— — — — —

Special thanks to Gabriella Amanda Kawilarang and Harvey Tjiupek for your contributions to this blog!❤️
