Monitoring errors in your A/B tests

Matvii Strechen
Preply Engineering Blog
7 min read · Jan 13, 2022

A/B testing is an essential tool for improving the product. Here at Preply, we launch hundreds of tests quarterly, iterating on our product at incredible velocity. But launching a test always comes with some risk: you can never be sure that you have tested every single corner case and won’t introduce any issues, especially if you actually “move fast.” What’s more, some issues appear because of interactions between different A/B tests, which cannot always be predicted. What’s the solution? Proper monitoring. And I’m not talking about a “wait till someone contacts customer support” style of monitoring, but about a more automated, data-driven approach.

I’m Matvii, a Software Engineer at Preply.com. Let me explain how, together with the Experimentation team, we built an in-house tool for error monitoring in A/B tests. So, without further ado, let’s get started!

What do we measure?

To start with, we need to choose a good proxy metric that defines a “broken experiment.” Our first MVP used request status codes for that purpose, which is why this article mainly focuses on that way of measuring. At the same time, different things might work for you. Depending on your architecture and the data available, you may consider frontend errors in Sentry, backend errors with a high log level, or even the number of visits to the “I need help” page, so be creative here 😉

Source of truth

As I already mentioned, we are going to focus on using status codes. How do we retrieve them and attribute them to users? Welp, first let me show you the architecture we have:

And we have logs everywhere… But:

  • Cloudflare logs are limited in terms of the information that can be used for user attribution. It is not possible to log cookies there, and implementing any other fingerprinting is not that easy, so this is not an option;
  • CloudFront allows logging cookies, but either none or all of them. That would be a privacy risk (having personal data in logs is never a good idea), so this is not an option either;
  • Per-service access logs might be useful (taking into account that we can add any information we want there), but:
    — Some errors are not logged as errors on the application level: for example, 499 stands for “client did not wait for the response,” but the application can actually return 200 in this case;
    — Some service-level errors cannot be adequately logged (a worker is OOMing or something like that);
    — If a service is not available / the Ingress has no rule for the URL / etc., no application will serve the request; thus, there is no one to log the error;
  • GraphQL Federation logs cannot cover all the use cases (we still have some REST APIs).

Finally, Nginx Ingress logs. They seem like the most reliable source of data: we have access logs for everything under our domain, plus we can change the log format however we want.

Once we’ve chosen the data source, it’s easy to come up with the data transformation path. Of course, we designed it based on the tools we already had and what was the easiest solution for us, so in your case some things might be different. Our current setup looks like this:

User attribution

The easiest way to attribute logs to users is to put a cookie value into the access logs. Let’s use the user_id cookie. We use fluentd for parsing and saving logs to our Elasticsearch (we use the ELK stack for most logs), so first, let’s adjust the parsing rule. We want it to be backward compatible, so we will just add an optional regex group. Let’s say it was:

format /(?<remote_addr>[^ ]*) - (?<remote_user>[^ ]*) ... (?<upstream_status>[^ ]*) (?<msec>[^ ]*)/

Now it should be:

format /(?<remote_addr>[^ ]*) - (?<remote_user>[^ ]*) ... (?<upstream_status>[^ ]*) (?<msec>[^ ]*)( user_id:"(?<cookie_user_id>\w*)")?/

We are ready to adjust the ingress log format. We had:

log-format-upstream: '$remote_addr - $remote_user ... $upstream_status $msec'

Now we will have:

log-format-upstream: '$remote_addr - $remote_user ... $upstream_status $msec user_id:"$cookie_user_id"'
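
If you want to sanity-check the parsing rule before deploying it, here is a quick local test of the same optional-group idea. Note that this is just an illustration: the pattern is shortened to its tail, and fluentd’s Ruby-style (?<name>...) groups are translated to Python’s (?P<name>...) syntax.

import re

# Shortened tail of the parsing rule, with the optional user_id group at the end.
pattern = re.compile(
    r'(?P<upstream_status>[^ ]*) (?P<msec>[^ ]*)'
    r'( user_id:"(?P<cookie_user_id>\w*)")?$'
)

old_line = '200 1642075200.123'                 # produced by the old log format
new_line = '200 1642075200.123 user_id:"4242"'  # produced by the new log format

print(pattern.search(old_line).groupdict())
# {'upstream_status': '200', 'msec': '1642075200.123', 'cookie_user_id': None}
print(pattern.search(new_line).groupdict())
# {'upstream_status': '200', 'msec': '1642075200.123', 'cookie_user_id': '4242'}

Old-format lines still match, so nothing breaks while the new log format rolls out.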

Cool, now we have everything in our Elasticsearch. I’ll skip the boring synchronization part between Elasticsearch and Redshift (our data warehouse), as it may differ based on what storage you use. I’ll just share the table we get after all the transformations:

CREATE TABLE production_logs_transformed (
    user_id INTEGER,
    datetime TIMESTAMP,
    code INTEGER ENCODE az64,
    referrer VARCHAR(1000),
    path VARCHAR(1000),
    proxy_upstream_name VARCHAR(100)
);

Note that referrer, path, and proxy_upstream_name are not used in the monitor itself, but can be helpful for debugging purposes.

We also have a table with A/B testing events, which can be simplified to look like this:

CREATE TABLE test_entrances (
    user_id INTEGER,
    test_entrance_date TIMESTAMP,
    experiment_name VARCHAR(100),
    test_group VARCHAR(1)
);

Data transformation

Raw logs tell us [almost] nothing by themselves, so let’s transform them into numbers. First, we need to calculate per-user, per-experiment error statistics. The relation for this data looks like this:

CREATE TABLE aggregated_error_stats (
    user_id INTEGER,
    test_group VARCHAR(1),
    experiment_name VARCHAR(100),
    test_entrance_date TIMESTAMP,
    errors_5xx INTEGER
    -- <any other error types you want>
);
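
The transformation that fills this table is essentially a join between test_entrances and production_logs_transformed: for every user and experiment we count the errors that happened after the test entrance. In our setup this is done in SQL inside the warehouse, but here is a minimal pandas sketch of the same logic, just to make the idea concrete (column names follow the tables above):

import pandas as pd

def aggregate_error_stats(entrances: pd.DataFrame, logs: pd.DataFrame) -> pd.DataFrame:
    """Count the 5xx responses each user saw after entering each experiment.

    entrances: user_id, test_entrance_date, experiment_name, test_group
    logs:      user_id, datetime, code
    """
    joined = entrances.merge(logs, on="user_id", how="left")
    # Only errors that happened after the user entered the experiment count.
    is_5xx_after_entrance = (
        (joined["datetime"] >= joined["test_entrance_date"])
        & joined["code"].between(500, 599)
    )
    joined["errors_5xx"] = is_5xx_after_entrance.astype(int)
    group_keys = ["user_id", "test_group", "experiment_name", "test_entrance_date"]
    return joined.groupby(group_keys, as_index=False)["errors_5xx"].sum()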

Now that we have per-user data, let’s apply some math. We can analyze both binary (is there any 5xx?) and continuous (how many errors did the user face?) metrics. Note that binary metrics have higher sensitivity, but at the same time they are not always applicable. For example, if you are analyzing 404s and all the users somehow manage to see at least one 404, you won’t be able to detect any changes, since the base conversion rate is already 100%.

The next steps should be pretty familiar to those who work with A/B testing. Binary metrics are typically analyzed with the chi-squared test, for which we need the conversion rate in both groups. Continuous metrics can be analyzed with a t-test, which requires the means and standard deviations in both groups. Luckily, all of this can be easily expressed in SQL.

SELECT
    experiment_name,
    COUNT(CASE WHEN test_group = 'A' THEN 1 END) AS participants_a,
    COUNT(CASE WHEN test_group = 'B' THEN 1 END) AS participants_b,
    SUM(CASE WHEN test_group = 'A' THEN errors_5xx ELSE 0 END) AS errors_5xx_a,
    SUM(CASE WHEN test_group = 'B' THEN errors_5xx ELSE 0 END) AS errors_5xx_b,
    STDDEV(CASE WHEN test_group = 'A' THEN errors_5xx END) AS errors_5xx_a_stddev,
    STDDEV(CASE WHEN test_group = 'B' THEN errors_5xx END) AS errors_5xx_b_stddev,
    COUNT(CASE WHEN test_group = 'A' THEN NULLIF(errors_5xx, 0) END) AS conversion_to_errors_5xx_a,
    COUNT(CASE WHEN test_group = 'B' THEN NULLIF(errors_5xx, 0) END) AS conversion_to_errors_5xx_b
FROM aggregated_error_stats
GROUP BY experiment_name;

Now let’s compute p-values based on the aggregated data. The SciPy library has everything we need: chi2_contingency and ttest_ind_from_stats perfectly suit that purpose.
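
As a rough sketch (with made-up numbers, and column names following the query above), turning one aggregated row into two p-values looks roughly like this:

from scipy.stats import chi2_contingency, ttest_ind_from_stats

# One row of the aggregation above, with made-up numbers.
row = {
    "participants_a": 10000, "participants_b": 10000,
    "errors_5xx_a": 180, "errors_5xx_b": 260,
    "errors_5xx_a_stddev": 0.21, "errors_5xx_b_stddev": 0.29,
    "conversion_to_errors_5xx_a": 120, "conversion_to_errors_5xx_b": 170,
}

# Binary metric: 2x2 table of users who saw at least one 5xx vs those who did not.
contingency_table = [
    [row["conversion_to_errors_5xx_a"], row["participants_a"] - row["conversion_to_errors_5xx_a"]],
    [row["conversion_to_errors_5xx_b"], row["participants_b"] - row["conversion_to_errors_5xx_b"]],
]
_, chi2_p_value, _, _ = chi2_contingency(contingency_table)

# Continuous metric: t-test from per-group means and standard deviations.
ttest_result = ttest_ind_from_stats(
    mean1=row["errors_5xx_a"] / row["participants_a"],
    std1=row["errors_5xx_a_stddev"],
    nobs1=row["participants_a"],
    mean2=row["errors_5xx_b"] / row["participants_b"],
    std2=row["errors_5xx_b_stddev"],
    nobs2=row["participants_b"],
)

print(chi2_p_value, ttest_result.pvalue)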

Note that everything described above happens automatically: we have a dedicated pipeline in Airflow that runs every 10 minutes.
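
The DAG itself is nothing special. A minimal sketch of what it can look like (the dag_id, task names, and callables here are invented for illustration, not our actual pipeline code):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables: the real tasks run the SQL aggregation above, compute
# p-values with scipy, and push the results to the monitoring tools.
def aggregate_error_stats(**context): ...
def compute_p_values(**context): ...
def push_metrics_and_alerts(**context): ...

with DAG(
    dag_id="ab_test_error_monitoring",
    start_date=datetime(2022, 1, 1),
    schedule_interval=timedelta(minutes=10),  # the monitor runs every 10 minutes
    catchup=False,
) as dag:
    aggregate = PythonOperator(task_id="aggregate_error_stats", python_callable=aggregate_error_stats)
    analyze = PythonOperator(task_id="compute_p_values", python_callable=compute_p_values)
    alert = PythonOperator(task_id="push_metrics_and_alerts", python_callable=push_metrics_and_alerts)

    aggregate >> analyze >> alert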

What to do with statistics

Incredible, we can now get p-values. Let’s establish some automated alerts based on that.

Our primary tool for monitoring is Datadog. Let’s just send the metrics there and build some fancy dashboards :)
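
Pushing the metrics is the simple part. Here is a minimal sketch using DogStatsD from the datadog Python package; the metric names and tags are ones I made up for illustration, so adapt them to your own conventions:

from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

def report_experiment_stats(experiment_name: str, p_value: float,
                            conversion_a: float, conversion_b: float) -> None:
    # Gauges are enough here: every pipeline run simply overwrites the previous value.
    tags = [f"experiment:{experiment_name}", "metric:errors_5xx"]
    statsd.gauge("ab_monitoring.p_value", p_value, tags=tags)
    statsd.gauge("ab_monitoring.conversion_to_errors", conversion_a, tags=tags + ["group:A"])
    statsd.gauge("ab_monitoring.conversion_to_errors", conversion_b, tags=tags + ["group:B"])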

Part of the experiment-related dashboard we have in Datadog

I also mentioned automatic alerts: currently, we send a Slack message to the channel if the p-value is less than 1%. This threshold is lower than the typical 5–10% significance level used for A/B tests because, instead of checking the p-value once, we are constantly comparing it against the threshold. Making the threshold lower allows us to keep the false-positive rate at an acceptable level. As an alternative, it’s possible to use sequential A/B testing, but it’s way harder to implement.
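
If you are curious why the threshold has to be stricter, a quick A/A simulation shows the effect of this constant “peeking”: even when there is no real difference between the groups, checking the p-value after every new batch of users triggers an alert far more often than the nominal 5%, while a 1% threshold keeps the rate much lower. The numbers below are purely illustrative.

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n_experiments, n_checks, batch_size, error_rate = 1000, 50, 200, 0.02

alerts_at_5pct = alerts_at_1pct = 0
for _ in range(n_experiments):
    users = errors_a = errors_b = 0
    min_p_value = 1.0
    for _ in range(n_checks):  # a "peek" after every batch of new users
        users += batch_size
        errors_a += rng.binomial(batch_size, error_rate)  # both groups share the
        errors_b += rng.binomial(batch_size, error_rate)  # same rate: an A/A test
        if errors_a and errors_b:  # skip degenerate tables
            table = [[errors_a, users - errors_a], [errors_b, users - errors_b]]
            min_p_value = min(min_p_value, chi2_contingency(table)[1])
    alerts_at_5pct += min_p_value < 0.05
    alerts_at_1pct += min_p_value < 0.01

print(f"false alert rate with p < 0.05: {alerts_at_5pct / n_experiments:.1%}")
print(f"false alert rate with p < 0.01: {alerts_at_1pct / n_experiments:.1%}")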

Currently, an on-call person within the Experimentation team is responsible for checking these alerts during business hours. Later, as we make this tool more reliable and advanced, we expect to enable product teams to monitor their A/B tests by themselves; such an approach seems more scalable for us as an organization.

Things worth mentioning

Finally, here are some recommendations I would like to give if you are going to follow this article and implement such a monitor in your organization:

  • Not all “error” status codes are created equal! Sometimes an additional 400 or 404 is expected in the experiment, while an increased number of 499s might tell you that things have slowed down dramatically. Pay attention to which status codes you consider dangerous;
  • The number of page visits correlates with the number of errors, naturally. If, for instance, you are testing an email newsletter (the A group does not have any, and the B group does), it may bring more users to your website, but expect an increased number of errors as well. Consider using metrics like “average number of 5xx per page visit”;
  • Calculating the number of 4xx/5xx as a standalone metric (without any alert) might also make sense. Such a metric can give you ideas when it comes to surprisingly negative tests (for example, a really cool experiment that shows a significant drop in your North Star metric might have problems with its implementation, not with the feature itself).
