How to be data-driven when you aren’t Netflix (or even if you are) — part 2: AB Testing

Axel Delmas
Lumen Engineering Blog
Jun 25, 2018

In the previous article of this series, we covered the first steps of our journey towards becoming more data-driven and explained why time series alone are simply not enough, because of the noise introduced by environmental factors. In this article we’ll explain how we got past this by setting up an AB test workflow.

AB tests don’t remove noise from your dataset, but they do ensure that both populations, A and B, are subject to the same environmental variations, by testing two different configurations over the same time period. When your viewers are randomly allocated to each population, and with enough payloads in each dataset, the noise will average out; you can be reasonably sure that if you’re seeing a major difference in your metrics between the two populations, it’s due to the change you’re measuring rather than an environmental change. Basically, it provides a reference point while allowing you to separate signal from noise, and therefore make better-informed decisions.

Let’s go over a simple example to illustrate how AB testing eliminates background noise. Our customers’ video rebuffering rate tends to decrease as the number of concurrent users rises; as more users connect to a stream, more local sources become available to the viewer, bringing the content closer to the individual user’s device. Let’s say we want to test a configuration change that is supposed to decrease rebuffering rates even further. If we tried the new configuration on all users, it would be hard to pinpoint the new configuration’s exact effect on rebuffering, especially in cases where the number of concurrent users is volatile. And that’s without even mentioning other environmental factors that could affect rebuffering, such as geographical location and internet connection speed. Dividing the test pool into two groups randomly — group A with the old configuration, and group B with the new one — and comparing rebuffering results from the same time period means that we can safely assume that the difference in rebuffering rate we witness is due to our configuration change. With enough data this allows us to filter out the environmental noise, as any condition that might affect rebuffering would be equally present in both populations.
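To make the random split concrete, here is a minimal sketch (not our actual implementation) of how viewers could be allocated to populations A and B. Hashing a viewer or session identifier keeps the assignment random across the audience but stable for a given viewer:

```python
import hashlib

def assign_population(viewer_id: str, split: float = 0.5) -> str:
    """Deterministically assign a viewer to population A or B.

    Hashing the viewer ID (rather than flipping a coin on every request)
    keeps the same viewer in the same population for the whole test.
    """
    digest = hashlib.sha256(viewer_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to [0, 1]
    return "A" if bucket < split else "B"

# Example: a 50-50 split; the result is stable for this ID across sessions.
print(assign_population("viewer-42"))
```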

Putting it all together: getting rid of noise

When we started leveraging AB testing, we stopped looking at time series altogether to determine which population was behaving better. Although the time series for both populations are subject to the same environmental noise during the AB test, the two curves will keep intersecting — unless the change you’re testing has a tremendous impact — making it rather hard to determine whether one of the populations is “on top”. Let’s take a look at a time series showing the percentage of P2P offload during an AB test. Each point in the time series represents the average value over a 5-minute interval:

To overcome this representation issue, we started comparing a single aggregate value for each population:

We thereby remove the time variable from the equation. When running an AB test for long enough, the aggregate value will start converging to a more stable value and successfully average out the environmental noise.

Let’s take the same data set, but this time let’s plot the cumulative average value of the entire data set over time:

This clearly shows that when we don’t have enough data points at the beginning of an AB test, the result is still very noisy. That’s pretty normal since we split the two populations randomly; an impactful environmental factor might still be over-represented in one of the two populations at this point. But as we let our AB test run and gather more and more data points for both populations, the environmental factors become more uniformly represented within each population, eventually converging into stable, legible, noise-free results.
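For the curious, the cumulative average described above is easy to reproduce. Below is a minimal sketch with simulated data (not our actual metrics) that illustrates how the running mean of a noisy metric settles down as data points accumulate:

```python
import random

def cumulative_average(samples):
    """Return the running mean after each new data point."""
    averages, total = [], 0.0
    for i, value in enumerate(samples, start=1):
        total += value
        averages.append(total / i)
    return averages

# Simulated P2P offload samples for two populations with the same true
# mean but different noise draws: the running means converge as data grows.
random.seed(1)
pop_a = [70 + random.gauss(0, 10) for _ in range(2000)]
pop_b = [70 + random.gauss(0, 10) for _ in range(2000)]

avg_a, avg_b = cumulative_average(pop_a), cumulative_average(pop_b)
for n in (10, 100, 2000):
    print(n, round(avg_a[n - 1], 2), round(avg_b[n - 1], 2))
```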

So now that we’ve established that it’s important to gather enough data points during an AB test, you might ask yourself: how much data is enough? Well, that’s the tricky part. It really depends on how noisy your data is: how much the metric you’re looking at is affected by different environmental factors, and to what extent these factors vary within the test population.

Let’s take an example: say we’re looking at the rebuffering ratio for viewers in Scandinavia. Since most internet connections in this region enjoy plenty of bandwidth, and users generally experience similar conditions, we expect very low noise levels in our data and therefore require less of it. When looking at the rebuffering ratio of a given stream for viewers across the whole world, it’s a different story. Bandwidth conditions will vary widely from country to country, creating a pretty noisy data set that requires us to run our AB test for a longer period of time and gather more data points before we get relevant results.

So how can we determine the quantity of data points required to trust AB test results? The simplest way is to run an AA test (serving the exact same configuration to both populations) and see when the value for both populations becomes consistently identical: the two time series should be almost superimposed from the moment you have gathered enough data.
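As a rough illustration of that idea (a sketch, not our production tooling), an AA test check can boil down to watching the relative gap between the two running averages and calling the sample size sufficient once that gap stays below a small threshold:

```python
def aa_test_converged(avg_a, avg_b, threshold=0.01, window=50):
    """Return the index from which the two running averages stay within
    `threshold` (relative difference) for `window` consecutive points,
    or None if they never do."""
    streak = 0
    for i, (a, b) in enumerate(zip(avg_a, avg_b)):
        rel_diff = abs(a - b) / max(abs(a), 1e-9)
        streak = streak + 1 if rel_diff < threshold else 0
        if streak >= window:
            return i - window + 1
    return None

# Using the running averages from the previous sketch:
# print(aa_test_converged(avg_a, avg_b))
```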

Another option is to use statistical analysis tools on your data set in order to determine a level of confidence. We’ll cover those in the next article of our series.

AB Testing at Streamroot

We implemented two different levels of AB testing in our workflow: Release AB Testing, which allows us to compare two different versions of our library; and Configuration Injection AB Testing, which enables us to inject parameters to fine-tune our algorithms and to toggle features on and off.

We perform Release AB Testing by using a reverse proxy: whenever a user requests our client-side library at the beginning of a video session, the request first hits a reverse proxy that can be configured to return different files according to different rules. We use Release AB Testing whenever we roll out a new release of our product. You can configure the reverse proxy to randomly return files A and B with a 50–50 distribution to do AB testing, or with a 95–5 distribution to do “canary testing”, i.e. providing the new version to a small fraction of users and then ramping up those numbers to distribute the new version gradually as you collect more data.
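Conceptually, the proxy’s routing rule boils down to a weighted random choice between library builds. Here is a minimal sketch of that logic (the file names and weights are hypothetical, and our actual proxy is a dedicated piece of infrastructure rather than this code):

```python
import random

# Hypothetical builds and traffic split: 95-5 would be a canary rollout,
# 50-50 a classic AB test.
RELEASES = {
    "streamroot-lib-v2.3.0.js": 95,  # current stable version
    "streamroot-lib-v2.4.0.js": 5,   # new release under canary test
}

def pick_release() -> str:
    """Return a library file name according to the configured weights."""
    files, weights = zip(*RELEASES.items())
    return random.choices(files, weights=weights, k=1)[0]

# Each incoming request for the client library would be answered with
# whichever file pick_release() selects.
print(pick_release())
```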

We perform Configuration Injection to test small changes in configuration parameters, as well as to toggle specific features on and off while rolling out a new version of our product. In order to perform Configuration Injection AB Testing, our library loads a tiny configuration file from our servers. In the same fashion as for Release AB Testing, our servers can be configured to return different configuration files according to any distribution.

This configuration file contains different values for parameters that change how our algorithms behave. By AB testing different values for these parameters we can find those that perform best and fine-tune our algorithms. The configuration file also allows us to toggle specific features on and off, so we can compare two different flavors of an algorithm.
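To make this concrete, here is a hypothetical example of what such an injected configuration could look like and how the client could read it (the parameter and feature names below are made up for illustration; they are not our real configuration keys):

```python
import json

# A hypothetical injected configuration: tuning parameters plus feature
# toggles, everything the client needs to switch behaviour without a
# new release.
CONFIG_B = json.loads("""
{
    "parameters": {
        "maxPeerConnections": 12,
        "bufferTargetSeconds": 30
    },
    "features": {
        "newPeerSelectionAlgorithm": true,
        "aggressivePrefetch": false
    }
}
""")

def is_enabled(config: dict, feature: str) -> bool:
    """Feature toggles default to off when absent from the config."""
    return bool(config.get("features", {}).get(feature, False))

if is_enabled(CONFIG_B, "newPeerSelectionAlgorithm"):
    print("Running flavor B of the peer selection algorithm")
else:
    print("Running the default flavor")
```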

These two mechanisms have truly empowered us in our journey to make our technology even more efficient. This process has enabled something we consider crucial when rolling out changes in a complex technology, where real-world results can fall far from expectations and from what you observe in a lab environment: the ability to test changes atomically.

When we prepare a new release, we use both Release AB Testing and Configuration Injection AB Testing. In every new release, we make sure every new feature is togglable, and deactivated by default. That way, the new release with its default configuration is identical to the previous one. We then use Release AB Testing to run both releases side by side and make sure all our metrics are identical, ensuring that no regression is hidden in our build. Once the new release is validated, we gradually deploy it to 100% of the viewers. Once the new version is fully deployed, we use Configuration Injection to AB test its new features one by one, activating toggles one at a time to verify that each new feature generates the expected results compared to populations where the feature hasn’t been turned on yet. If a change doesn’t produce the improvements we expect, even after some parameter tweaking, we disable it and set out to figure out why it did not work as expected.
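Summed up as data, a release cycle then looks like a sequence of traffic distributions and toggle activations. The sketch below is a simplified, hypothetical timeline (version numbers and feature names invented for the example), not a transcript of our actual rollout tooling:

```python
# A simplified release timeline: each step is (release weights, toggles
# activated for population B). New features ship disabled by default,
# so step 1 compares two functionally identical builds.
ROLLOUT_PLAN = [
    # 1. Release AB test: old vs new build, all new features off.
    ({"v2.3.0": 50, "v2.4.0": 50}, []),
    # 2. New build validated, ramp it up to everyone.
    ({"v2.4.0": 100}, []),
    # 3. Configuration Injection AB tests: enable features one at a time
    #    for half of the viewers and compare against the other half.
    ({"v2.4.0": 100}, ["featureX"]),
    ({"v2.4.0": 100}, ["featureX", "featureY"]),
]

for step, (weights, toggles) in enumerate(ROLLOUT_PLAN, start=1):
    print(f"step {step}: distribution={weights}, toggles on for B={toggles}")
```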

AB Testing: An effort well worth making

This approach may look cumbersome, but quality is paramount to us. We believe this is the most robust method to roll out changes for our customers. We have a rigorous engineering process including code reviews, automated testing and manual testing, but software being software, bugs may still find a way to slip through the cracks. This holds true even in mission-critical software you’d think would be tested inside and out, and thus bug-proof. Like this glitch that almost started World War III some 35 years ago.

And if you’re still thinking this is overkill, here’s another thing to consider: without this process, every regression means a complete rollback that can stall new releases and product improvements. With this mechanism in place, we minimize the risk of a bug in the first place, and in the event of an issue in one of the new togglable features, we simply toggle it back off and fix it in the next release, saving time and money and working far more efficiently.

Finally, testing changes atomically has another benefit: imagine we released several features in a single release, without a toggle mechanism. Let’s say each of the first three features mildly improves QoS, whereas the fourth degrades it, just about cancelling the effect of the first three improvements. Looking at the stats, we might think that none of these features changed anything, when in fact their effects averaged out. We might decide not to keep any of them, as they don’t seem to improve QoS, losing three genuine improvements in the process. Or we might decide to keep all of them, with the goal of tweaking them in the next release, letting a regression slip into our product without being aware of it. Either way is far from ideal.

AB testing is a powerful, indispensable tool that allows us to be better informed and make the right data-driven decisions. In this article, we discussed mostly how to test different product versions and configurations. This brings us to the next, equally important question — what data do we test? To reach the right conclusion, we have to take into consideration the right data, make sure that we have enough of it, and make the proper data splits. We’ll try to answer all those questions in part 3 of this series, so stay tuned!

At Streamroot, we are always on the lookout for new tech talent to join our team. Check out open positions on our career page or send us your CV directly to contact@streamroot.io.


Axel Delmas

Co-founder and CTO of Streamroot, a leading provider of peer-accelerated streaming and CDN optimization solutions.