Monitoring and Alerting for A/B Testing — Detecting Problems in Real Time


At WalmartLabs we are building Expo, an A/B testing framework for our sites and apps. One of its critical components is monitoring and alerting for the different aspects of the A/B testing environment. The system needs to detect when there's a problem with assignment or with the behavior of users in an experiment and alert the appropriate people.

Overview

In this article, I will describe two alerting applications built by the A/B testing team to watch our real-time reporting system and make sure that both the system and the experiments configured in the A/B testing framework work as expected.

Before going further, I would first like to describe the big picture of how A/B testing works so that you have a better understanding of the flow.

In A/B testing, each user is assigned to one of the variations of an experiment. An experiment can be a UI change, a backend algorithm change (such as a new search algorithm), or anything else that might impact user behavior in a measurable way. A simple example of UI A/B testing would be two different designs for the buy button: one design is a big red button, the other a small blue button. Each of these designs is a variation. The variation currently running in production is called the "control". After the variations are set up, Expo splits the traffic between the two (or more) variations and, based on user behavior, generates metrics for each one. For example, one metric could be conversion rate, which in our case is the percentage of visitors to our site who make a purchase. If the difference between the metric values for the variations is statistically significant, we choose the winner and use it in production.
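As a concrete illustration, here is a minimal sketch of how a conversion rate and a standard two-proportion z-test for significance can be computed. This is not Expo's actual statistics engine, and the counts used in the demo are made up:

```python
from math import sqrt

def conversion_rate(purchases, visitors):
    """Fraction of visitors who made a purchase."""
    return purchases / visitors

def two_proportion_z(purchases_a, visitors_a, purchases_b, visitors_b):
    """z-statistic for the difference between two conversion rates,
    using a pooled standard error (a textbook two-proportion z-test)."""
    p_a = purchases_a / visitors_a
    p_b = purchases_b / visitors_b
    pooled = (purchases_a + purchases_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return (p_a - p_b) / se

# Hypothetical counts: 120/4000 conversions for control, 150/4000 for treatment.
# |z| > 1.96 corresponds to significance at the 5% level.
z = two_proportion_z(120, 4000, 150, 4000)
```

With these numbers |z| is below 1.96, so the difference would not yet be declared significant and the experiment would keep running.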

To generate these metrics, we have two jobs. One is a real-time job that gives us quick, raw metric data in real time. The other runs daily and produces a more refined calculation from more accurate data sources.

Anomaly Detection for the Number of Visitors

One of the metrics we track is the number of unique users assigned to an experiment who visit the site per minute. The real-time job keeps track of this number and pushes the metric to a time-series database. We noticed that a problem in assignment or in the real-time job can cause this metric to spike or drop, and we wanted to be notified when that happens. One naïve solution for finding the discrepancy is to take the average over a short period of time (average_s_time), such as the last few minutes, and compare it with the average over a longer period of time (average_l_time), such as the last 3 hours; we would expect average_s_time not to differ from average_l_time by too much. Assuming these values follow a normal distribution, we can evaluate average_s_time with a 3-sigma test against the values of the longer period and decide whether to raise an alert for the past few minutes. This might seem like a good idea at first, but the approach ends up producing a lot of false positive alerts, which makes the alert system less credible.
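The naive approach described above can be sketched as follows. The window sizes and the demo data are illustrative, not Expo's actual settings:

```python
from statistics import mean, stdev

def naive_spike_check(counts_per_minute, short_window=5, sigmas=3.0):
    """Compare the short-window average against the mean and standard
    deviation of the longer history. Returns True when the recent average
    is more than `sigmas` standard deviations from the long-run mean."""
    long_values = counts_per_minute[:-short_window]
    avg_s = mean(counts_per_minute[-short_window:])
    avg_l = mean(long_values)
    return abs(avg_s - avg_l) > sigmas * stdev(long_values)

# Steady traffic with mild noise, then a sudden spike in the last 5 minutes.
steady = [100, 102, 98, 101, 99] * 35
spiked = steady + [500] * 5
```

On genuinely periodic traffic this check fires constantly, which is exactly the false-positive problem discussed next.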

The reason for this behavior is that the data is periodic, so the average fluctuates depending on the size of the window and the time at which we calculate it. Therefore, we needed a way to normalize the data and remove the periodic factor from it.

One useful algorithm that has been around for a while is seasonal and trend decomposition of time series using Loess, or STL (https://www.otexts.org/fpp/6/5). This algorithm decomposes time-series data into three components: seasonal, trend, and remainder. The following image shows a sample time series and what these three components look like.

International airline passengers: monthly totals in thousands for 1949–1960 (https://tinyurl.com/yb45ca2n)

As you can see, the remainder is the original data with the seasonality and trend removed, and sudden changes are much easier to track in this graph. By applying the previous 3-sigma approach to the remainder data, we are now able to catch problems in real time while minimizing false positive alerts.
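To illustrate the idea end to end, here is a deliberately simplified classical decomposition, not the Loess-based STL used in practice. The synthetic daily-periodic series and the injected spike are made up for the demo:

```python
import math
from statistics import mean, stdev

def remainder_of(series, period):
    """Toy decomposition: trailing moving-average trend plus per-phase
    seasonal means. Real STL uses Loess smoothing and is far more robust;
    this only illustrates what the 'remainder' component is."""
    n = len(series)
    # Trailing moving average over exactly one period (defined from index period-1).
    trend = {i: mean(series[i - period + 1:i + 1]) for i in range(period - 1, n)}
    detrended = {i: series[i] - trend[i] for i in trend}
    # Seasonal component: average detrended value at each phase of the period.
    seasonal = {k: mean(d for i, d in detrended.items() if i % period == k)
                for k in range(period)}
    return [detrended[i] - seasonal[i % period] for i in sorted(detrended)]

def remainder_alert(remainder, short_window=3, sigmas=3.0):
    """The same 3-sigma check as before, applied to the remainder."""
    history = remainder[:-short_window]
    recent = mean(remainder[-short_window:])
    return abs(recent - mean(history)) > sigmas * stdev(history)

# A clean series with a daily cycle sampled hourly: remainder stays near zero.
series = [100 + 30 * math.sin(2 * math.pi * i / 24) for i in range(240)]
# The same series with a sudden jump in the last 3 hours: remainder flags it.
spiked = series[:-3] + [v + 300 for v in series[-3:]]
```

The point is that the spike survives into the remainder while the daily cycle does not, so the 3-sigma test no longer confuses ordinary periodicity with a real problem.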

Real Time Experiment Metric Alerting (RTEMA)

When an experimenter designs an experiment and divides traffic between its variations, it's critical to know as soon as possible if something is wrong with the experiment. For example, there might be a bug in one of the variations that prevents customers from reaching the checkout page. We want to ensure that we catch this type of error, and that the system sends an alert if one of the variations' data is off by a large amount.

As mentioned in the overview, Expo tracks different metrics in real time. Most of them are ratios, meaning they come from dividing two numbers. For example, the conversion rate metric is the number of customers who bought something divided by the number of all visitors. Some metrics, however, such as the assigned visitor count, are not ratios. The following algorithm runs every 15 minutes. Each variation has its own metric values, generated by the real-time reporting system by minute, hour, and day.
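The original pseudocode listing is not reproduced here, so the following is only a rough sketch of the kind of per-metric check described: the relative-difference rule, function names, and threshold values are my assumptions, not Expo's actual implementation:

```python
def find_anomalous_metrics(control, variation, thresholds):
    """Flag any metric whose value in a variation deviates from the
    control value by more than that metric's own threshold.
    `control` and `variation` map metric name -> current value;
    `thresholds` maps metric name -> allowed relative difference."""
    anomalies = []
    for metric, ctrl_value in control.items():
        var_value = variation.get(metric)
        if var_value is None or ctrl_value == 0:
            continue  # no comparable data for this metric
        relative_diff = abs(var_value - ctrl_value) / ctrl_value
        if relative_diff > thresholds[metric]:
            anomalies.append(metric)
    return anomalies

# Hypothetical snapshot: conversion rate has collapsed in the variation,
# while add-to-cart rate is within its per-metric tolerance.
flagged = find_anomalous_metrics(
    control={"conversion_rate": 0.04, "add_to_cart": 0.20},
    variation={"conversion_rate": 0.01, "add_to_cart": 0.21},
    thresholds={"conversion_rate": 0.3, "add_to_cart": 0.3},
)
```

A per-metric threshold is what allows a naturally noisy metric to be held to a looser standard than a stable one.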

The algorithm itself is easy enough to follow, so I won't elaborate on it further. The other thing this system watches is the number of visitors in each variation: if it isn't close to the proportion of traffic configured for that variation, the system sends an alert.
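The traffic-split check can be sketched like this. The tolerance value and function names are assumptions for illustration:

```python
def split_looks_wrong(visitors, configured_split, tolerance=0.1):
    """Compare observed visitor fractions per variation against the
    configured traffic proportions. `visitors` maps variation name ->
    visitor count; `configured_split` maps variation name -> expected
    fraction of traffic. Returns True if any variation's observed share
    is off by more than `tolerance`."""
    total = sum(visitors.values())
    if total == 0:
        return False  # no traffic yet; nothing to compare
    for name, expected_fraction in configured_split.items():
        observed = visitors.get(name, 0) / total
        if abs(observed - expected_fraction) > tolerance:
            return True
    return False
```

For a 50/50 experiment, an 80/20 observed split would trip this check immediately, which usually points to an assignment bug rather than user behavior.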

To make the algorithm more practical and reduce the number of false positive alerts, we tweaked the original algorithm with these enhancements:

  • As you can see in line 9, each metric has its own threshold instead of a global one. However, we had to somehow determine the correct value for it. To do so, we selected some healthy experiments for sampling and applied the same algorithm from line 8 to find a suitable value for Threshold(M).
  • The other tweak was ignoring low-traffic experiments: if the number of visitors is not above a particular minimum, we don't go any further.
  • The initial implementation looked back over the last 15 minutes and sent an alert if it found any anomalies. We changed the algorithm to run in two rounds. In the first round it finds possibly anomalous metrics for each experiment in the last 15 minutes; it then looks back over a longer period of time with a smaller threshold (by default Threshold(M)/2) and sends an alert only for the metrics from the first round that pass the second round as well.
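The two-round confirmation in the last tweak might be sketched as follows. The callback shape and the 3-hour second window are assumptions:

```python
def confirmed_anomalies(find_anomalies, threshold):
    """Two-round anomaly confirmation. `find_anomalies(window_minutes,
    threshold)` is an assumed callback that returns the set of anomalous
    metric names over the given look-back window. A metric only alerts
    if it is flagged over the short window AND over a longer window
    checked with half the threshold."""
    first_round = find_anomalies(15, threshold)
    if not first_round:
        return set()  # nothing suspicious; skip the second pass
    second_round = find_anomalies(180, threshold / 2)  # longer window, smaller threshold
    return first_round & second_round
```

A transient blip tends to show up in only one of the two windows, so requiring agreement between them is what cuts the false-positive rate.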

With these changes, we reduced false positive alerts dramatically, and the system has saved us on several occasions by sending true alerts before the experimenter even noticed the problem.

Conclusion

Alerting is a critical part of the WalmartLabs ecosystem, since even small problems can result in a significant loss of customers or revenue. While we have various tools for general performance and error monitoring, some systems need extra consideration to find unusual behavior. In this article we presented two examples of such needs that we encountered in Expo, and how we implemented monitoring and alerting to detect problems in visitor assignment or variation metrics in order to maintain the quality of the system.

To close out this article, I would like to mention three important principles that we followed while developing these algorithms.

  • Keep it simple: this is the principle we follow when the approach is heuristic and we want to develop an algorithm that works with real data. For both of the algorithms above, we started by exploring the data and prototyping a simple version of the algorithm; after finding its weaknesses, we iterated to improve it. This makes development much easier, since it's more practical to check the algorithm's output against real data, and easier to make the right decision about the next step.
  • Make it credible: when you develop a custom alerting system, you have to manage false positive alerts. A certain number of false positives is unavoidable because of the nature of real data; however, you need to keep this factor in check, or people will start questioning the credibility of the entire alerting system.
  • Externalize the algorithm's parameters: it is probably obvious that the parameters should be extracted out of the code and supplied at runtime. The reason I mention it here is that it relates to the previous point: we need to deal with true positive alerts in real time. We may want to tune parameters on the fly, or be able to silence an overwhelming stream of alerts and then address the root cause of the problem.