In this post, we will describe how we built a framework using Apache Spark and AWS Athena to make our standard A/B test analysis scalable, reliable and easy to access. We’ll mainly focus on the technical requirements, the system architecture, and the implementation choices we made.
A/B testing at Teads
Why A/B testing?
Each day, the Teads platform makes it possible for advertisers to display their ads on the websites of the best publishers around the world. This way, advertisers find a quality environment for their ads, while publishers get paid for the content they produce. To do so, we heavily rely on machine learning. For each opportunity to display an ad on a publisher website, an entire process that leads to the selection and serving of the best ad runs in a few milliseconds.
This process is triggered up to 6 million times per second and relies on many components. These components, among which we find our prediction models and bidding strategies, are maintained and improved on a daily basis to ensure the best possible quality of service for our clients. In this quest to keep improving our platform, it is of the utmost importance to have a strong and reliable framework to test any change we want to bring to production and analyze all the impacts it has on us and our clients.
As an example, we regularly test new features to be added to our prediction models in order to get more accurate predictions. Also, we always try to improve the performance of the bidding strategies that are based on these models in order to optimize the delivery as much as possible. All these changes can have huge impacts in production and should be evaluated carefully before being deployed.
So far, the standard A/B testing approach has been the natural way to go when it comes to assessing the effects of a change in our platform. A/B testing aims at comparing the effects of two setups (sometimes more) using a randomized experiment: the users are randomly split into groups, called populations, and each of these groups is exposed to one specific setup. On average, Teads teams run about 10 to 15 A/B tests per month. For each of these A/B tests, key metrics are computed by populations to decide which version of the code should be deployed in production.
Teads A/B test platform
Let’s start by describing the structure of the operational platform that is used to define and operate A/B tests at Teads. This platform is composed of two key components:
- A population split function
- A population behavior definition
The population split function defines a deterministic way to decide, for each incoming request, which population it belongs to. The split is usually based on a unique identifier of the underlying user. The split function is configurable, although we commonly choose hash-based functions.
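A minimal sketch of such a deterministic, hash-based split function (the function name, the use of SHA-256, and the bucketing logic are illustrative assumptions, not Teads’ actual implementation):

```python
import hashlib

def assign_population(user_id: str, ab_test_id: str, proportions: list) -> int:
    """Deterministically map a user to a population index.

    Hashing the user id together with the A/B test id makes the split
    independent across tests; proportions are assumed to sum to 1.
    """
    digest = hashlib.sha256(f"{ab_test_id}:{user_id}".encode()).hexdigest()
    # Map the hash to a point in [0, 1) and find the population bucket.
    point = int(digest, 16) / 16 ** len(digest)
    cumulative = 0.0
    for population_id, share in enumerate(proportions):
        cumulative += share
        if point < cumulative:
            return population_id
    return len(proportions) - 1  # guard against floating-point rounding
```

Because the assignment depends only on the hash, the same user always lands in the same population for the lifetime of a given A/B test.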
To define the behavior for each population, the A/B test platform provides what we call dynamic parameters. These are runtime parameters that can have different values depending on the population. It can be viewed as a function, evaluated upon each request:
(parameter_name, population_id) => parameter_value
The A/B testable configurations are expressed in dynamic parameters, for example prediction models to use, or some feature flags to activate.
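Conceptually, a dynamic parameter evaluation can be sketched as a lookup with per-population overrides (all names and values here are hypothetical, for illustration only):

```python
# Hypothetical in-memory representation of dynamic parameters:
# (parameter_name, population_id) => parameter_value, with a default
# used by populations that do not override the parameter.
DEFAULTS = {"prediction_model": "model_v1", "new_bidder_enabled": False}
OVERRIDES = {
    ("prediction_model", 1): "model_v2",   # population 1 tests the new model
    ("new_bidder_enabled", 1): True,
}

def dynamic_parameter(name: str, population_id: int):
    """Evaluate a dynamic parameter for the given population."""
    return OVERRIDES.get((name, population_id), DEFAULTS[name])
```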
All these elements are captured inside a common library used across all Teads engineering teams. A dedicated user interface allows the different teams to easily create and operate (start, edit, stop, deploy) the A/B tests. The different populations and their proportions simply have to be declared along with a set of dynamic parameters and their corresponding values in each population.
Existing analysis tools
Once our A/B tests have been defined and launched, data starts to be gathered and we can begin to monitor and analyze the results. The main ways we had to analyze an A/B test were either running Jupyter notebooks with a custom library, or using BigQuery. While the former gives us great flexibility, using notebooks requires coding skills. BigQuery provides a better user experience, but it is much more complicated to compute confidence intervals, which are essential to assess the statistical significance of the results. On top of that, these two methods are both error-prone, slow and difficult to standardize.
All these limitations led us to design another, much more efficient tool for standard analyses. The purpose of this new analytical framework is to enhance our operational platform in order to make day-to-day A/B test monitoring and analysis faster, more reliable and more accessible to anyone involved in the decision-making process.
In the following, we will describe this framework in depth, as well as our engineering choices.
A/B test analysis framework
All our A/B test analysis can be defined by 3 elements:
- A log: the data to be analyzed
- Some metrics: the quantities of interest when comparing the populations
- Some projections: the dimensions to aggregate the metrics on when comparing the populations
An analysis should then simply take the data of a log and aggregate the metrics of interest along all the specified combinations of projections. To make analyses more reliable, confidence intervals must be available for all the computed metrics so that we can assess the significance of the observed results.
From an engineering perspective, we want the analysis framework to be scalable and fast in order to avoid slowing down the decision-making process. The volume of data we are facing prevents us from running all the computations right at the analysis time. This also encourages us to avoid repeating the same computations if the analysis is run several times (for example on consecutive days).
The natural solution is to precompute the analysis data hourly and store it in a format that is generic enough to ease later querying across different tools. In our case, we wanted the data to be queryable from analytics BI tools for standard needs and from Jupyter notebooks for more specific needs.
A word on confidence intervals
Confidence intervals make it possible to quantify the uncertainty coming from the randomness of the observed data when analyzing the metrics of an A/B test. They are important tools that must be used during the decision-making process. The need for confidence intervals at query time has a non-negligible impact on the design of our framework.
The bootstrapping method is widely used to compute confidence intervals. The global idea of bootstrapping is to generate artificial data samples (called bootstrap samples) by sampling with replacement from the original data samples. The bootstrap samples can then be used to compute bootstrapped values of the metric of interest (just as we compute the raw metric from the raw sample). These bootstrapped values will in turn serve to derive a confidence interval at a given level.
In practice, bootstrap generation by resampling is not possible in our setting as we precompute data hourly without having a global view of the sample. A standard approach in such a setting is to use the well-known online Poisson bootstrapping that makes it possible to handle bootstrapping for “streaming data”.
For a given sample of data, Poisson bootstrapping consists in generating artificial samples by reweighting the lines with a Poisson distribution of parameter 1. The weights correspond to the number of times each line should be counted in the new artificial sample. Various bootstrapped metrics can then be computed taking into account these weights. More details can be found in the appendix of this article.
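As an illustration, here is how Poisson bootstrapping can be used to derive a confidence interval for a simple sum. This is a self-contained Python sketch using only the standard library (the Poisson(1) sampler uses Knuth’s multiplication algorithm); function names are ours, not part of the production system:

```python
import math
import random

def poisson_1(rng: random.Random) -> int:
    """Sample from Poisson(1) using Knuth's multiplication algorithm."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def bootstrap_ci_for_sum(values, n_bootstraps=1000, level=0.90, seed=0):
    """Percentile confidence interval for sum(values) via Poisson bootstrap."""
    rng = random.Random(seed)
    sums = [0.0] * n_bootstraps
    for v in values:  # a single pass over the data, as in a streaming job
        for b in range(n_bootstraps):
            # Each line is counted Poisson(1) times in bootstrap sample b.
            sums[b] += poisson_1(rng) * v
    sums.sort()
    alpha = (1.0 - level) / 2.0
    lo = sums[int(alpha * n_bootstraps)]
    hi = sums[int((1.0 - alpha) * n_bootstraps) - 1]
    return lo, hi
```

For 100 values all equal to 1.0, the raw sum is 100 and the resulting 90% interval brackets it, with a width driven by the Poisson reweighting.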
Taking all the requirements above into account, we’ve proposed the following architecture.
On the first level, we have Apache Spark jobs that are scheduled each hour to pre-aggregate some metrics, grouped by A/B test populations and various projection dimensions. For each metric, we use the Poisson bootstrapping technique to generate an array of bootstrapped values that will be required to compute confidence intervals at query time.
The pre-aggregated data are then processed by a query engine at analysis time to obtain the final result. To simplify the system design, we prefer to do as much computation as possible with the query engine to avoid coding custom logic in the dashboard system, which is more error-prone and less practical to test. Notice that the chosen query engine has to be flexible enough to let us compute confidence intervals from the arrays of pre-aggregated bootstrapped values.
Not all query engines satisfy our needs. We finally chose AWS Athena. In addition to the query capability that handles all our computation requirements, Athena also comes with other advantages:
- Ready to use, with no ops needed
- Runs queries directly against files stored on S3, making the ETL process very simple
- Charges only for the amount of data scanned, making it easy to optimize the cost
The last module is the dashboard. We use our company-wide BI dashboard system. Each dashboard is composed of a series of charts and tables, backed by dynamic queries to Athena.
In the following sections, we’ll discuss different challenges we faced and some design choices we made.
One common challenge in reporting/BI systems is projection dimensions with very large cardinality. This is even more true for A/B test analysis, as the aggregation computations are a lot more complicated.
Our solution is based on the fact that the projections with high cardinality are typically used for very specific analyses. In other words, 95% of the time they are not used. As a result, we chose to isolate those specific analyses from the common ones by introducing the concept of analysis type. Each analysis type has its own set of projection dimensions and its own workflow, from the hourly aggregation job to the Athena table, so that they do not interfere with each other.
Each A/B test can be associated with multiple analysis types, configured at the start of the A/B test. The idea is that most of the time, simple analysis types would be used and if there’s a specific need to analyze a large projection dimension, specific analysis types can be included. This special purpose analysis may result in slow queries for itself, but it will not impact other analyses.
The outputs of the hourly jobs are the pre-aggregated data that all share the same pattern across the different analyses: for all the combinations of population and projections, we store the raw values and some bootstrapped values of each metric.
At this step, we only consider simple additive metrics. We can, for example, pre-aggregate impression counts, revenues, costs and so on. Due to the additive nature of the metrics, the basic part of the pre-aggregation simply consists in grouping the data by populations and projections before summing the different metrics to get the raw values.
To answer the need for confidence intervals we also compute bootstrapped values of the metrics. As we briefly mentioned above, if the raw metric is a simple sum (as it is our case here), a bootstrapped value of this metric is a weighted sum whose weights are drawn from a Poisson distribution of parameter 1.
The way we store the intermediate pre-aggregated data has a huge impact on query performance and cost. We also want this data to be flexible to consume, either by other applications or by future needs.
As stated earlier, the choice of Athena enables us to store these data as files on s3, which is a very flexible approach as they can be easily read by any other tool.
To optimize performance and cost, we use the Parquet data format, combined with a partitioning scheme. Parquet is a columnar storage file format that enables the query engine to read only the columns needed by the query, and it also speeds up processing by storing each column’s values contiguously. A good partitioning scheme lets Athena target only the right subset of the data upon each query. We find the most discriminating partition keys to be the A/B test id and the hour:
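With a Hive-style partitioning scheme on those two keys, the layout on S3 would look like the following (bucket, prefix and file names are illustrative assumptions):

```
s3://analytics-bucket/ab-test-analysis/simple/ab_test_id=1234/hour=2020-06-01-13/part-00000.snappy.parquet
```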
Also, we make sure to coalesce output files for each hour to ensure a reasonable ratio between file count and data size.
In the pre-aggregated data, the bootstrapped metrics are in the form of arrays. When we aggregate these arrays in the query engine, they should be piecewise summed because each index corresponds to the metric computed on a particular resampled dataset.
Athena provides useful primitives for piecewise aggregation over arrays. The following query snippet serves as a mini example:
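A minimal reconstruction of such a query (table and column names are assumptions):

```sql
-- Piecewise sum of the bootstrapped metric arrays:
-- each index of the array identifies one bootstrap sample.
SELECT idx, SUM(m) AS bootstrapped_metric
FROM pre_aggregated
CROSS JOIN UNNEST(bootstrapped_values) WITH ORDINALITY AS t(m, idx)
GROUP BY idx
```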
We have lines with an array of metrics that should be aggregated. Notice that the clause UNNEST(col) WITH ORDINALITY associates each value in the array with its index. We can then sum the metric, grouped by the array index: this is essentially a piecewise sum. In the case of a composed metric, say CTR, we only need to replace SUM(m) with SUM(clicks) / SUM(impressions) in the above query.
Another useful clause is MIN(col, n). It collects the n smallest values of a group into a sorted array:
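As a hypothetical illustration (table and column names are assumptions), applied on top of a subquery that produced the piecewise sums, it yields a sorted lower tail from which a percentile can be read:

```sql
-- Collect the 50 smallest bootstrapped sums; with 1,000 bootstraps,
-- the last element of the array approximates the 5th percentile.
SELECT MIN(bootstrapped_metric, 50) AS lower_tail
FROM piecewise_sums
```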
Now we have a piecewise summed, sorted metrics array. We just need to apply one of the percentile methods to derive the confidence interval.
The choice of computing the data hourly implies that only simple additive metrics like revenues, costs or counts can be defined in an analysis to be pre-aggregated. Once computed hourly, such simple metrics can easily be aggregated again at analysis time depending on the user query. Other metrics, like ratios, are called composed metrics as they can be derived from simple additive metrics. These metrics have their components pre-aggregated hourly and are computed directly at query time.
For example, someone interested in comparing population’s click-through rates (CTR) should define an analysis whose metrics include both count of clicks and count of impressions (that are both simple additive metrics) and derive the CTR metric at query time as the ratio of these two metrics.
Bootstrapping makes it also very easy to handle confidence interval computation for such combined metrics. At query time, we compute for each bootstrap the value of the combined metrics based on the values of the simple metrics that compose them. This way, we get the bootstrapped values for the combined metrics that can be used to derive confidence intervals.
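Conceptually, the query-time computation for a composed metric like CTR amounts to the following plain-Python sketch (the function name is ours; the percentile extraction mirrors what the SQL queries do):

```python
def ctr_confidence_interval(click_bootstraps, impression_bootstraps, level=0.90):
    """Bootstrapped CI for a composed metric: compute the ratio for each
    bootstrap index, then take percentiles of the resulting values."""
    ratios = sorted(c / i for c, i in zip(click_bootstraps, impression_bootstraps))
    alpha = (1.0 - level) / 2.0
    n = len(ratios)
    # Percentile method: read the interval bounds off the sorted ratios.
    return ratios[int(alpha * n)], ratios[int((1.0 - alpha) * n) - 1]
```

The key point is that the ratio is taken index by index, so each bootstrapped CTR is consistently built from the clicks and impressions of the same bootstrap sample.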
Up to this point, the analysis framework can be up and running: pre-aggregation will be computed each hour and we can query metrics and their confidence intervals via the Athena UI. Then, the last part to handle in order to make these analyses easily accessible is to create for each one the related dashboard within our company-wide BI tool.
Although this is not something that has to be done regularly (only when creating new dashboards or updating charts), composing the queries is quite tedious. While they are pretty intuitive to write, the queries are still very verbose and require users to have a certain level of knowledge in both statistics and SQL. That’s why we developed a query generator as a final touch to the system, to help abstract away the query details and the repetition.
To generate a query, the user simply specifies an analysis type, some filters, some projections and some metrics, along with a confidence level for CIs. Metrics can be defined by basic arithmetic compositions between simple additive metrics, if needed. For example, the click-through rate can be defined by composing two simple additive metrics as follows:
nb_clicks / nb_impressions
A series of composite or simple metrics can be specified to the generator in this way. Once everything has been well defined, a query template is generated, ready to be plugged in the BI dashboards.
System performance and cost
The A/B test analysis framework has been deployed to production for a while now and is used to analyze all A/B tests across the different teams at Teads. Several analysis types have been created, as well as a number of BI dashboards.
We are quite satisfied with its overall performance. For a history of 7 days, it takes about 30 to 50 seconds for each individual query to complete. With a dashboard backed by dozens of queries, the estimated cost would be around 5 cents for each refresh. These are fairly reasonable numbers considering the scale of Teads today.
This A/B test analysis framework has been in production for several months and it proves to be a valuable addition to our system from many points of view:
- Analyses are more standardized. Some common analyses have been defined with their related dashboards for standard needs. These common analyses can be used by the different teams to monitor different business metrics. It helps to standardize the way A/B tests are analyzed and makes results more valuable as metrics are defined before A/B tests start.
- Analyses are more accessible. The previous tools (Jupyter notebooks and BigQuery) were not that easy to use and required some knowledge that made analyses not runnable by anyone. In our new framework, intuitive charts and tables are available in company-wide BI dashboards, making it easier to communicate to everyone, including non-technical team members. It also simplifies the onboarding of new team members on these subjects.
- Analyses are faster to run. Analysis data are pre-computed hourly and stored to be queried when they are needed. The time spent to set up and run an analysis is drastically lowered. More time can be spent on the results themselves and the decision-making process is shortened.
- Analyses are more reliable. Jobs that define the hourly aggregations and dashboards that display final charts can be reviewed by several team members, reducing the risk of errors. The important decisions we have to take are based on more reliable data.
Appendix: confidence intervals with bootstrapping
The need for confidence intervals
When analyzing an A/B test, one should keep in mind the random nature of the data that has been observed during the experiment. If the results of an A/B test show that a given metric is higher (or lower) for the test population compared to the reference population, it is legitimate to wonder if the difference we observe comes from the different settings we applied to the two populations or simply from the random nature of the observed samples. In other words, the question is: would we draw the same conclusions with other samples?
Our analysis framework has been designed to tackle efficiently this question of results significance by using confidence intervals. Confidence intervals are statistical tools that make it possible to quantify the uncertainty related to a metric derived from the data. More precisely, a confidence interval at 90% for a given metric is such that in 90% of the cases it will actually contain the true value of the metric we are estimating.
The bootstrapping method
When dealing with metrics that are additive and independent line by line, confidence intervals can be computed using some very standard statistical tools that derive from the central limit theorem (Student’s t-statistic, Welch’s t-statistic). These tools are really powerful thanks to their strong statistical foundations, but they require hypotheses (independence and additivity) that are not met in our analyses. For example, the click-through rate (CTR) is computed by dividing the count of clicks by the count of impressions. Despite looking pretty simple, the ratio in the formula and the fact that some lines can be correlated (coming from the same user, for example) make the statistical approaches derived from the central limit theorem inapplicable.
In practice, the bootstrapping method is widely used to compute confidence intervals. As we mentioned, the whole problem comes from the fact that the population samples observed at the time of the A/B test are random and results could have been different with some other samples. Ideally, we would like to have several of these samples for both reference and test populations in order to compute confidence intervals based on it. Unfortunately, we can’t observe any other samples than the ones that actually happened.
The idea of bootstrapping is to generate artificial samples (called bootstrap samples) of the same size as the original one to compensate for the fact that only one sample can actually be observed. For each population, these bootstrap samples are generated by sampling with replacement from the original sample. The assumption made here is that, for each population, the empirical distributions of the observed samples are close to the true underlying distributions. So, sampling (with replacement) from these observed samples is a good approximation to sampling from the distributions themselves.
Once the bootstrap samples have been generated, bootstrapped values of the metric of interest can be computed just as we compute the metric value from the original sample, and a confidence interval can then be derived. To do so, several methods can be found in the literature, with different degrees of mathematical justification and different empirical results. Among the most common methods, we can mention:

- the basic bootstrap (or reverse percentile interval)
- the t-interval bootstrap (or studentized bootstrap)

Let us denote $X$ and $X^*_1, \dots, X^*_B$ respectively the original sample and the $B$ bootstrap samples, and $\hat{\theta} = \hat{\theta}(X)$ and $\hat{\theta}^*_b = \hat{\theta}(X^*_b)$ respectively the estimator of the metric of interest (e.g. the total revenue) for the original sample and for the bootstrap samples. We can estimate, for any value of $\alpha$, the $\alpha$-percentile of the bootstrapped estimator distribution, denoted $\hat{\theta}^*_\alpha$.

Then, for a confidence level of 90%, the basic bootstrap confidence interval is given by

$$\left[\, 2\hat{\theta} - \hat{\theta}^*_{0.95},\; 2\hat{\theta} - \hat{\theta}^*_{0.05} \,\right]$$
Although it is a very convenient method to compute confidence intervals, bootstrapping has one major drawback: under its basic form, it is not compatible with streaming or pre-aggregated data. Bootstrapping requires indeed to know the entire samples of populations to generate the bootstrap samples, which is not possible when processing each hour separately. Poisson bootstrapping brings a practical workaround to generate bootstrap samples when dealing with streaming data.
Let’s assume an original population sample of size $n$. The vanilla bootstrap sample generation consists in sampling with replacement $n$ times from this original sample. This is mathematically equivalent to drawing $n$ times from a multinomial distribution where each individual has probability $1/n$ of being drawn. It also means that the number of times a given individual of the original population sample appears in a bootstrap sample follows a binomial distribution $B(n, 1/n)$.

Based on the limit

$$B(n, 1/n) \xrightarrow[n \to \infty]{} \mathrm{Poisson}(1)$$

the Poisson bootstrap approach relies on the approximation $B(n, 1/n) \approx \mathrm{Poisson}(1)$ (which holds for large values of $n$) to generate bootstrap samples as follows: each individual encountered is assigned a random weight drawn from a Poisson distribution of parameter $\lambda = 1$, which corresponds to the number of times the individual appears in the generated bootstrap sample.
At this point, it is interesting to notice that for large values of $n$, the Poisson bootstrap method approximates random sampling with replacement pretty well. In particular, the expected size of a bootstrap sample generated this way is the same as the size of the original sample, thanks to the Poisson distribution property $\mathbb{E}[\mathrm{Poisson}(1)] = 1$. But it is even more interesting to point out that $n$ has disappeared from the bootstrap generation process, making it possible to deal with bootstrap samples in an online fashion.
Let’s for example consider streaming data and assume that we want to compute, for one metric, both the raw sum and three bootstrapped values of this sum. To do so, the four sums can be initialized to 0 and updated as follows: each new value observed is simply added to the current raw sum, and added to each bootstrapped sum with a weight sampled from a $\mathrm{Poisson}(1)$ distribution.
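The streaming update described above can be sketched as follows (standard library only, with Knuth’s algorithm for the Poisson(1) draws; class and function names are ours):

```python
import math
import random

def poisson_1(rng: random.Random) -> int:
    """Sample from Poisson(1) using Knuth's multiplication algorithm."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

class OnlineBootstrappedSum:
    """Maintain a raw sum and a few Poisson-bootstrapped sums online."""

    def __init__(self, n_bootstraps: int = 3, seed: int = 0):
        self.rng = random.Random(seed)
        self.raw = 0.0
        self.bootstraps = [0.0] * n_bootstraps

    def observe(self, value: float) -> None:
        # Add the value once to the raw sum, and Poisson(1)-many times
        # to each bootstrapped sum.
        self.raw += value
        for b in range(len(self.bootstraps)):
            self.bootstraps[b] += poisson_1(self.rng) * value
```

Each incoming line is processed exactly once, so this fits naturally into an hourly aggregation job that never sees the full sample.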