Conquering A/B Testing (1): When should I conduct A/B testing?

Jamie Lee
Published in Hackle Blog
4 min read · Mar 21, 2022

This article is the first post in Hackle’s ‘Conquering A/B Testing’ series. The series addresses common questions that come up throughout the A/B testing process: test design, preparation, result interpretation, and final decision-making. To check out the next post in the series, “How should I set my metrics for an A/B test?”, click here.

Hackle’s Conquering A/B Testing covers the following topics:

1. When should I conduct A/B testing?

2. How should I set my metrics for an A/B test?

3. How long should an A/B test run for? What should my sample size be?

4. How should I set the user identifiers?

5. Can we deduce a causal relationship from an A/B test?

6. When is the right time to stop an A/B test?

7. How can I reach my conclusions with ambiguous A/B testing results?

8. What should I do if I want to restart an A/B test?

9. How should I conduct review processes within my organization?

***

Launching a new feature for a web or app service is always exciting, but also nerve-wracking. Product owners and developers worry about how customers will react to the new feature and whether errors will surface during deployment.

The right way to deploy features

When deploying a new feature to your product, there are three options to choose from:

  1. Deploy right away to production.
  2. Deploy with traffic control via feature flags, so the new feature can be released gradually and rolled back if errors occur.
  3. Deploy the winning version after A/B testing the existing and new versions of the feature against each other.

Deploying right away to production and deploying with feature flags may seem different, but they are essentially the same in that both launch a new version of the feature in production. The difference is that feature flags reduce risk: they let the team roll back a faulty feature with a kill switch, release it incrementally to a growing share of traffic, and target a desired segment of users. In the end, both options replace the existing feature with a new one.
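To make this concrete, here is a minimal sketch of what a feature-flag gate might look like in application code. It is purely illustrative and assumes a made-up client class, flag key, and user ID rather than Hackle’s actual SDK: a rollout percentage decides how much traffic sees the new feature, and a kill switch turns it off instantly.

```python
# Minimal, hypothetical feature-flag sketch (not Hackle's SDK): a rollout
# percentage plus a kill switch in front of a new feature.
import hashlib


class SimpleFlagClient:
    def __init__(self):
        # flag_key -> {"rollout": percent of traffic, "enabled": kill switch}
        self._flags = {}

    def set_flag(self, key, rollout_percent, enabled=True):
        self._flags[key] = {"rollout": rollout_percent, "enabled": enabled}

    def kill(self, key):
        """Kill switch: turn the feature off for everyone immediately."""
        if key in self._flags:
            self._flags[key]["enabled"] = False

    def is_on(self, key, user_id):
        flag = self._flags.get(key)
        if flag is None or not flag["enabled"]:
            return False
        # Stable hash so the same user always lands in the same bucket (0-99).
        digest = hashlib.sha256(f"{key}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < flag["rollout"]


flags = SimpleFlagClient()
flags.set_flag("new_checkout", rollout_percent=10)   # start with 10% of traffic

if flags.is_on("new_checkout", user_id="user-123"):
    show_new_checkout = True    # serve the new version
else:
    show_new_checkout = False   # fall back to the existing version
```

Raising the rollout percentage step by step gives a gradual release, and calling `kill()` rolls the feature back without a redeploy; either way, the new version eventually replaces the old one for everyone.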

The third option, deploying the winning version after A/B testing the existing and new versions against each other, lets you use data to validate which version to deploy. This means that, depending on the results of the A/B test, you may end up keeping the existing version rather than shipping the new one.
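In code, the main difference from a flag rollout is that both versions stay live while data is collected: each user is deterministically assigned to the existing version (A) or the new version (B), usually with an even split. A minimal sketch along the same lines as above, with a made-up experiment key and user ID:

```python
# Hypothetical 50/50 variant assignment for an A/B test (not Hackle's SDK).
# Same stable-bucketing idea as above, but traffic is split evenly and both
# versions stay live while metrics are collected.
import hashlib


def assign_variant(experiment_key: str, user_id: str) -> str:
    """Deterministically bucket a user into 'A' (existing) or 'B' (new)."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 100 < 50 else "A"


variant = assign_variant("checkout_redesign", "user-123")
# Serve the assigned version, log the exposure, and compare the two groups'
# key metrics before deciding which version to keep.
```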

Can’t I just deploy first and then roll back the feature if a problem occurs?

You can also choose to keep a newly deployed feature after you see your key metrics improve upon release. The problem in this case is that you cannot know whether the positive movement comes from the new feature or from external factors that happened to occur at the same time.

For example, consider a case where sales increased by 23% after a new feature was released to users. The product team that built the feature may celebrate a successful launch. However, the increase in sales may have come from external factors, such as a rise in online shopping due to COVID-19 or a promotion the marketing team was running at the same time, rather than from the feature change.

Once the effect of the feature itself is measured separately, it may turn out that the new feature actually hurt usability and caused a 7% decrease in sales. This finding could not have been picked up if the product team had focused only on the overall increase in sales.
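To see why an A/B test catches this, here is a small, purely hypothetical calculation; the numbers are invented to mirror the scenario above, not real data. Because the control group (existing feature) and the treatment group (new feature) run over the same period, external factors such as a pandemic-driven shopping surge or a parallel promotion lift both groups equally, so comparing them isolates the feature’s own effect:

```python
# Purely hypothetical numbers illustrating how an A/B comparison isolates the
# feature's effect even when overall sales rise for external reasons.
from math import sqrt

# Both groups are measured over the same period, so external lift hits both.
control   = {"users": 50_000, "purchases": 6_150}   # existing feature (A)
treatment = {"users": 50_000, "purchases": 5_720}   # new feature (B)

p_c = control["purchases"] / control["users"]       # 12.3% conversion
p_t = treatment["purchases"] / treatment["users"]   # 11.44% conversion

# Relative change caused by the feature itself (external factors cancel out).
print(f"Feature effect: {(p_t - p_c) / p_c:+.1%}")  # -> -7.0%

# Two-proportion z-test (normal approximation) to check the drop is not noise.
p_pool = (control["purchases"] + treatment["purchases"]) / (
    control["users"] + treatment["users"]
)
se = sqrt(p_pool * (1 - p_pool) * (1 / control["users"] + 1 / treatment["users"]))
z = (p_t - p_c) / se
print(f"z = {z:.2f}")   # about -4.2; |z| > 1.96 means significant at the 5% level
```

Looking only at the overall sales trend, which includes the external lift, this roughly 7% drop would have stayed invisible.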

The sheer number of external factors at play is why experiments and A/B tests are crucial when deciding whether to release one version of a feature over another.

Creating the right deployment process for your organization

There is no one-size-fits-all answer. When the problem to be fixed is clear and urgent, a straight deployment or a feature-flag rollout can save time and resources. When it is hard to predict how customers will respond to a new feature, monitor the data from an A/B test to make sure key metrics do not degrade before deciding to fully deploy it.

Therefore, it is necessary to establish an agreed-upon deployment process within the organization so that the right deployment method is chosen for various situations.

  • In general, A/B test features that directly affect the customer experience. Product teams may think they know their customers well, but nothing beats data that clearly reflects customers’ intentions and behavior.
  • Release features that do not directly affect the customer experience (e.g. internal API or infrastructure changes) gradually through feature flags. If an unexpected problem surfaces while monitoring system latency and failures, the gradual rollout gives the team time to respond.
  • A/B testing is unnecessary for changes such as bug fixes and typo corrections, and when there is too little user traffic, the team can agree internally to skip the A/B testing phase.

Check out Hackle at www.hackle.io to start creating your own A/B tests and release the right features to maximize your customers’ experience.
