Leveraging Feature Flags and Google Analytics for Effective A/B Testing

Gian Sotelo
Published in SSENSE-TECH · 10 min read · Oct 20, 2023
Photo by Jason Dent on Unsplash

Customer experience is paramount at SSENSE. We employ different strategies to make data-driven decisions when creating products or implementing features that could introduce new experiences for our customers.

Our global technology platform is constantly evolving and stakeholders must rely on quantitative data to determine the impact of new changes, allowing them to make decisions based on evidence rather than assumptions.

That’s where A/B testing comes into play. It’s a powerful technique that allows businesses to experiment with different variations of a product, feature, or design and measure their impact on customer behavior.

This article provides an A/B testing roadmap and shows how to implement it, from an engineering perspective, using feature flags and Google Analytics.

Understanding A/B Testing

At its core, A/B testing involves comparing two versions (A and B) to determine which one performs better. By simultaneously exposing a modified version of a feature to one sample of customers (the treatment group) and the existing version to another sample (the control group), we can extract valuable insights about user preferences and behaviors.

Monitoring one or more chosen metrics is important during the testing phase to identify differences between versions. The version that moves a metric in the positive direction is the ‘winner’ and ideally gets rolled out, resulting in a successful treatment. If results are negative, businesses can still extract insights and learn from them to decide whether to revise and relaunch the feature or consider a different approach.

The goal of a test is not to get a lift, but rather to get a learning — Dr. Flint McGlaughlin

A/B testing is a task that involves a collaboration of multiple teams. Analytics provides the groundwork for informed decisions, Product conceptualizes variations and hypotheses that align with business goals, and Engineering ensures a smooth implementation. Regular sync-ups ensure that insights flow seamlessly from Analytics to Product and are correctly implemented by Engineering.

A/B testing comes with its own set of advantages and disadvantages, which are essential to consider when implementing this methodology in your development process.

Benefits:

  • Eliminates guesswork and enables informed decision-making based on real user interactions.
  • Reduces the risk of implementing changes that might have adverse effects on the overall user experience.
  • Results are not biased by customer awareness, since customers don’t know they are being tested.
  • Fosters a culture of constant learning and adaptation.
    – Organizations can continuously refine their offerings based on real-world data, leading to ongoing enhancements.

Drawbacks:

  • Demands time, resources (Product, Engineering, and Data), and technical infrastructure to set up and execute properly.
  • Prioritizes short-term gains, which might not capture the long-term effects of changes.
  • Could lead to wrong conclusions if not conducted properly.

Designing the Test

We’ll delve into the steps of designing an effective A/B testing strategy using a fictional example. Let’s say we have a call to action button on our checkout page, as the last step in the conversion funnel, with the text “Place your order”.

And we want to validate that changing the possessive determiner from “your” to “my” will make customers more likely to click: “Place my order”.

Of course, this is just a basic example but it’ll give us an idea of how we can go through each step of an A/B testing plan.

Defining a Goal

Before embarking on the A/B testing journey, it’s important to outline clear goals. What specific metrics are you aiming to improve? Is it conversion rates, click-through rates, engagement, or some other key performance indicator? A well-defined objective provides focus and ensures that the testing process remains aligned with your overarching goals.

👉 Our goal is to see a positive impact on conversion rate, with no negative impact on customer experience, when changing the text from “Your” to “My”.

Choosing a Metric

Selecting the appropriate metric is one of the most critical aspects of A/B testing, as this metric will determine how well the feature performs between the treatment and control groups. Choosing poor or too many metrics might result in false positives that lead to wrong conclusions. The more you’re measuring, the higher the probability of encountering random fluctuations.

Some metrics that are often used in A/B testing are the Click Through Rate, Click Through Probability, Revenue per Session, and Conversion Rate.

👉 Conversion rate is a metric that shows how many users completed an action out of the total number of users. In our example, this would be our chosen metric.

Conversion Rate = (users who completed the action ÷ total users) × 100%

Hypothesis

Typically, the hypothesis comes from collaborative brainstorming between both the Product and Data teams. It states the expected outcome of a test and provides a direction for experimentation.

👉 Our hypothesis is that by changing the call to action button text on the checkout page, we’ll increase our conversions, because it will incentivize customers to place orders by making them feel more involved in the action.

Splitting Users

The success of an A/B test depends on the fairness of group distribution. That is, the traffic has to be split evenly between the treatment and control groups. Depending on the use case, the splitting technique will be different. Some of them are:

  • Event-based: On an event like a page view, users are assigned to a group randomly. The experience is not consistent because the same user may end up in both groups over time. It’s mainly used when the change is not visible to the user, so it doesn’t affect their experience.
  • Cookie-based: When a user visits the website, a cookie is set in their browser indicating whether they belong to the treatment or control group. The experience is inconsistent across devices and browsers: since the cookie is assigned per browser, the same user could see different variations of the feature.
  • User-based: The most consistent option, but users need to be logged in so their IDs can be used to assign groups. When they switch browsers or platforms, the experience is maintained (they see the same variation every time).

👉 For our example, we’ll use a user-based diversion because this change would be applied on the last step of the checkout flow where the user is already logged in.
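
To make the user-based split concrete, here is a minimal sketch of deterministic bucketing, assuming a logged-in user ID is available. The hash function and the assignBucket helper are illustrative, not from any specific tool; most feature flag providers implement this logic for you.

// Deterministically assign a logged-in user to a bucket based on their ID.
// The hash is illustrative; any stable hash function works just as well.
function hashUserId(userId) {
  let hash = 0;
  for (const char of String(userId)) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0; // keep it an unsigned 32-bit integer
  }
  return hash;
}

function assignBucket(userId) {
  // The same user ID always maps to the same bucket, across devices and sessions
  return hashUserId(userId) % 2 === 0 ? 'test' : 'control';
}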

Collecting Data

We have to make sure that the right data is collected and that there are no errors or outliers. In order to provide insights, reports, and dashboards that can be used to analyze user behavior, we have to track their interactions such as page views, clicks, events, and more.

One of the existing tools to track users is Google Analytics (GA). It offers dozens of features but we’ll focus on two of them that can serve the purpose of A/B testing.

👉 GA Events: allow us to track an occurrence on our website such as clicks, views, downloads, etc. We want to trigger an event when a user clicks on our call to action button to place an order.

👉 GA Custom Dimensions: used to collect data that is not automatically tracked. We can say that the group of a user, which we will call “bucket”, is a custom dimension. We would have two possible values for the bucket: test (treatment) or control (status quo).

Monitoring

Regular monitoring of the experiment’s progress helps identify any anomalies or technical issues that could skew the results. It’s crucial to avoid premature conclusions and run experiments for an appropriate duration.

Usually, you run a test until you reach statistical significance; how long that takes depends on sample size, traffic, and goals. On sites with low traffic and low conversion, it may take weeks or months to gather enough traffic to reach statistical significance. Moreover, daily statistical assessment is recommended for the duration of the test.

👉 For the conversion lift we’re looking for, we’ll run and monitor our test over a 4 week period.
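
As a rough sanity check on that duration (not a substitute for a proper power analysis by the Analytics team), Lehr’s rule of thumb approximates the required sample size per group for 80% power at a 5% significance level. The baseline rate and minimum detectable lift below are illustrative numbers.

// Lehr's rule of thumb: n ≈ 16 * variance / delta², per group,
// for 80% power and a two-sided 5% significance level.
function sampleSizePerGroup(baselineRate, minDetectableRelativeLift) {
  const expectedRate = baselineRate * (1 + minDetectableRelativeLift);
  const pooledRate = (baselineRate + expectedRate) / 2;
  const variance = pooledRate * (1 - pooledRate); // variance of a Bernoulli metric
  const delta = expectedRate - baselineRate;      // absolute difference to detect
  return Math.ceil((16 * variance) / (delta * delta));
}

// e.g. a ~14% baseline conversion rate and a 3% relative lift
console.log(sampleSizePerGroup(0.14, 0.03)); // on the order of 100,000 users per group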

Implementation

Once the preparation work has been completed, the Engineering team must ensure the integrity of the split between the control and treatment groups is preserved. So let’s see how this can be translated into the actual product.

Feature Flags

Since we chose a user-based splitting technique, we’ll use feature flags (for a detailed explanation, you can take a look at this article). First, because we want to hide the new implementation when initially deployed to all users, and second because it’s always recommended to do canary releases when rolling out new features.

There are different open-source and paid options to handle feature flags; this example covers the essentials of any option you choose.

// Choose the CTA text based on the feature flag: treatment vs. control
if (featureFlags.isEnabled('newCTATextButton')) {
  button.text = 'Place my order'; // treatment
} else {
  button.text = 'Place your order'; // control
}
return button;

Now, depending on the tool you’re using, you’ll have to set up a gradual roll-out strategy. When the canary release reaches 50% of users, that’s when the A/B test kicks off and we start collecting and monitoring the data.
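
Conceptually, the roll-out plan could look like the hypothetical configuration below; the field names are illustrative and not taken from any specific feature flag provider, which would expose this through its own UI or API.

// Hypothetical, tool-agnostic roll-out plan for the flag
const rolloutPlan = {
  flag: 'newCTATextButton',
  stages: [
    { percentage: 5 },  // canary: validate that nothing breaks
    { percentage: 25 }, // ramp-up: watch metrics and error rates
    { percentage: 50 }, // A/B test kick-off: even split between buckets
  ],
};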

A/B testing kick-off

Google Analytics

Disclaimer: 1. Setting up Google Analytics is out of the scope of this article. 2. The following example uses GA4; if you haven’t migrated from Universal Analytics yet, the steps may differ.

Let’s configure a custom dimension that will be referenced in our code. In GA, go to Admin → Custom Definitions, click on the Create Custom Dimensions button, and then enter:

  • Name of the parameter: How the dimension will be displayed in your report
  • Scope: Since we want to apply this dimension to all the events of the same user, we select User (to aggregate all the metrics and filter per bucket)
  • User property: This is the value we’ll be using in our code when referring to this custom dimension
Custom Dimension creation

Then, we have to set the dimension in our code, based on how buckets are assigned. Update the previous code snippet by adding a GA set call.

if (featureFlags.isEnabled('newCTATextButton')) {
  button.text = 'Place my order';
  window.gtag('set', 'user_properties', {
    experiment1: 'test', // test bucket is the treatment
  });
} else {
  button.text = 'Place your order';
  window.gtag('set', 'user_properties', {
    experiment1: 'control', // control bucket is the status quo
  });
}
return button;

When the page is loaded, GA should emit a page view event with the bucket assigned. You can check the network call to make sure the user property is set.

GA page view event with a custom dimension

When the user clicks on the CTA button, we trigger an event that carries the additional user property we set when the checkout page was loaded:

function onPlaceOrderButtonClick() {
  ...
  // Custom event that has the user bucket in its context
  window.gtag('event', 'Click on complete payment', {
    event_category: 'Checkout',
    event_label: 'Place order',
  });

  // This is the actual request that will create the order in the backend
  this.placeOrder();
}

Most of the engineering work is now done. We have to wait for the results and sync with other teams to adjust the feature if needed, or proceed with the roll-out.

Analyzing Results

Google Analytics won’t tell you anything about statistical significance, p-values, error margins, etc.; you have to pull that data into another analytics tool. If you’re streaming data to BigQuery, you can do the analysis there, since GA4 exports custom dimensions and they’ll be visible in BigQuery.

There are multiple ways to find and get custom dimension insights in GA4, one of them is the following:

Go to Reports → Engagement → Events report.

Number of events by the custom dimension created

Assessing User Split

No tool will split users exactly 50/50; it depends on several factors, so we have to take that difference into account when doing calculations.

👉 Continuing with our example, the number of users assigned to each bucket was almost even.

Split in Users table

Evaluating the Metric

We have to calculate the conversion rate of each bucket per day, then compare control and treatment to see which one is winning. The Analytics team will employ techniques, such as Frequentist or Bayesian approaches, to interpret the results and will present a report containing the information needed to make decisions.

Conversion Rate Daily Difference %

👉 The positive values correspond to our treatment bucket. We can see that there’s a favorable trend after 4 weeks and it’s relatively stable.
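
To illustrate the frequentist side of that analysis, here is a minimal sketch of a two-proportion z-test comparing the overall conversion rates of the two buckets. The counts are placeholders; in practice the Analytics team runs this with a proper statistics library.

// Two-proportion z-test: is the difference in conversion rates between
// control and treatment statistically significant?
function twoProportionZTest(conversionsA, usersA, conversionsB, usersB) {
  const rateA = conversionsA / usersA;
  const rateB = conversionsB / usersB;
  const pooled = (conversionsA + conversionsB) / (usersA + usersB);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / usersA + 1 / usersB));
  const z = (rateB - rateA) / standardError;
  // |z| > 1.96 corresponds to p < 0.05 for a two-sided test
  return { controlRate: rateA, treatmentRate: rateB, z, significant: Math.abs(z) > 1.96 };
}

// Placeholder counts, for illustration only
console.log(twoProportionZTest(13890, 100000, 14340, 100000));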

Making the Decision

Decision Metric table

Our final report looks like this: “Control: 13.89%, Treatment: 14.34%”. This means that ~14% of the users who saw the new variation clicked through, and the conversion rate is better than that of our current variation. The relative difference can also be expressed as a 3.23% lift in CR. In this case, switching to the new version is most likely the right decision.

👉 We will switch to our new version after a successful A/B test.

Conclusion

A/B testing stands as a cornerstone of innovation and growth. At SSENSE, it is part of our development workflow, allowing us to roll out new products and features — to improve our customer experience — with confidence, precision, and a commitment to excellence.

Although there are different tools that offer out-of-the-box features to implement and run A/B tests, they can be expensive and may not accommodate your needs. We’ve seen how feature flags and Google Analytics can be synchronized during A/B testing to provide a not-too-complex but powerful strategy that you can consider adding to your toolbox.

Editorial reviews by Catherine Heim & Mario Bittencourt

Want to work with us? Click here to see all open positions at SSENSE!
