Monitoring and Alerting for A/B Testing — Detecting Problems in Real Time

Reza Esfandani
Jun 4, 2018 · 6 min read
Photo credit: a_roesler

Overview

In this article, I will describe two alerting applications that the A/B testing team built to watch our real-time reporting system and to make sure that the system, and the experiments running in the A/B testing framework, work as expected.

Anomaly detection for the number of Visitors

One of the metrics that we track is the number of unique users per minute who visit the site and are assigned to an experiment. The real-time job keeps track of this number and pushes the metric to a time-series database. A problem in assignment or in the real-time job can cause this metric to spike or drop, and we wanted to be notified when that happens. One naïve way to find the discrepancy is to take the average over a short period (average_s_time), such as the last few minutes, and compare it with the average over a longer period (average_l_time), such as the last 3 hours; we would expect average_s_time not to differ much from average_l_time. Assuming these values follow a normal distribution, we can evaluate average_s_time with a 3-sigma test against the values of the longer period and decide whether to raise an alert for the past few minutes. This might seem like a good idea at first, but this approach ends up producing a lot of false positive alerts, which makes the alert system less credible.
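The naïve check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the window sizes (5 and 180 minutes) and the input format (a plain list of per-minute visitor counts, oldest first) are hypothetical and not taken from our real-time job.

```python
from statistics import mean, stdev


def naive_three_sigma_alert(per_minute_counts, short_window=5,
                            long_window=180, n_sigma=3.0):
    """Return True when the short-window average lies more than
    n_sigma standard deviations away from the long-window average.

    per_minute_counts: visitor counts per minute, oldest first
    (hypothetical input format).
    """
    if len(per_minute_counts) < long_window:
        raise ValueError("not enough history for the long window")

    long_values = per_minute_counts[-long_window:]
    short_values = per_minute_counts[-short_window:]

    average_l_time = mean(long_values)
    average_s_time = mean(short_values)
    sigma = stdev(long_values)  # sample standard deviation of the long window

    if sigma == 0:
        # Perfectly flat history: any difference at all is a deviation.
        return average_s_time != average_l_time

    return abs(average_s_time - average_l_time) > n_sigma * sigma
```

Because the check compares against a single mean and standard deviation for the whole long window, traffic that rises or falls naturally within that window can look like an anomaly, which is one way the false positives mentioned above can arise.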

International airline passengers: monthly totals in thousands for 1949–1960 (https://tinyurl.com/yb45ca2n)

Real Time Experiment Metric Alerting (RTEMA)

When an experimenter designs an experiment and divides traffic between its variations, it is critical to know as soon as possible if something is wrong with the experiment. For example, there might be a bug in one of the variations that prevents customers from going to the checkout page. We want to ensure that we catch this type of error and that the system sends an alert if one of the variations’ data is off by a large amount.

  • The other tweak that we made was ignoring low-traffic experiments. If the number of visitors is not greater than a particular number, we don’t go any further.
  • The initial implementation of the algorithm looked back over the last 15 minutes and sent an alert if it found any anomalies. We changed the algorithm to run in two rounds. In the first round it finds the possibly anomalous metrics for each experiment in the last 15 minutes; then it looks back again over a longer period of time with a smaller threshold (by default Threshold(M)/2) and sends an alert only if the metrics flagged in the first round pass the second round as well.
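The low-traffic filter and the two-round check can be sketched as follows. This is a minimal illustration, not the RTEMA implementation: the WindowStats type, the deviation score (relative difference from a baseline value) and the min_visitors default are assumptions made for the example. Only the structure comes from the description above: skip low-traffic experiments, test the short window against Threshold(M), then confirm over a longer window against Threshold(M)/2.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates for one metric of one variation over a time window
    (hypothetical structure)."""
    visitors: int          # unique visitors seen in the window
    metric_value: float    # observed value of the metric
    baseline_value: float  # value we compare against, e.g. another variation


def deviation(stats: WindowStats) -> float:
    """Relative deviation from the baseline (an assumed anomaly score)."""
    if stats.baseline_value == 0:
        return 0.0
    return abs(stats.metric_value - stats.baseline_value) / stats.baseline_value


def should_alert(short: WindowStats, long: WindowStats,
                 threshold: float, min_visitors: int = 1000) -> bool:
    # Ignore low-traffic experiments: too little data to judge.
    if short.visitors <= min_visitors:
        return False
    # Round 1: is the metric anomalous in the short (e.g. 15-minute) window?
    if deviation(short) <= threshold:
        return False
    # Round 2: confirm over the longer window with half the threshold.
    return deviation(long) > threshold / 2
```

The second round acts as a confirmation step: a short-lived blip that clears the first threshold is dropped unless the longer window also shows a deviation, at the cost of alerting slightly later on real problems.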

Conclusion

Alerting is a critical part of the WalmartLabs ecosystem, since even small problems can result in a significant loss of customers or revenue. While we have different tools for general performance and error monitoring, some systems need extra consideration to find unusual behavior. In this article we presented two examples of such needs that we encountered in Expo, and how we implemented monitoring and alerting to detect problems in visitor assignment or variation metrics so that we can maintain the quality of the system.

  • Make it credible: When you develop a custom alerting system, you should take care of false positive alerts. A certain number of false positive alerts is unavoidable because of the nature of real data; however, we need to keep this factor in check, otherwise people will start questioning the credibility of the entire alerting system.
  • Externalize the algorithm’s parameters: It is probably obvious that we need to extract the parameters out of the code and supply them at runtime. I mention it here because it is related to the previous point. We need to deal with true positive alerts in real time as well. We may want to tune parameters in real time, or be able to stop alerts from overwhelming us while we address the root cause of the problem.
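One simple way to externalize parameters is to read them from a small JSON file on every evaluation cycle. The sketch below is an assumption for illustration only: the parameter names, their defaults and the JSON file are hypothetical, and a real deployment could use a configuration service instead.

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass(frozen=True)
class AlertParams:
    """Tunable alerting parameters (hypothetical names and defaults)."""
    threshold: float = 0.5
    min_visitors: int = 1000
    lookback_minutes: int = 15
    alerts_enabled: bool = True  # switch to silence alerts temporarily


def load_params(path) -> AlertParams:
    """Load parameters from a JSON file, falling back to the defaults.

    Calling this on every evaluation cycle lets an edit to the file
    take effect without redeploying the alerting job.
    """
    file = Path(path)
    if not file.exists():
        return AlertParams()
    raw = json.loads(file.read_text())
    allowed = {f.name for f in fields(AlertParams)}
    # Ignore unknown keys so an unexpected entry cannot break alerting.
    return AlertParams(**{k: v for k, v in raw.items() if k in allowed})
```

With something like this in place, raising a threshold or setting alerts_enabled to false is a configuration change rather than a code change, which helps when a real incident produces a flood of alerts.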

WalmartLabs

Using technology, data and design to change the way the world shops. Learn more about us - http://walmartlabs.com/

Thanks to Anthony Tang.

Written by Reza Esfandani, Engineer @WalmartLabs
