Identifying Control Pairings and Calculating Lift for Ad Campaigns

Ainesh Pandey
IBM Data Science in Practice
10 min read · Nov 3, 2021


Written by Ainesh Pandey, Maleeha Koul, Bobby Kelsey, and Gabriel Gilling from IBM’s Data Science and AI Elite Team

[Cartoon: "I used to think correlation implied causation. Then I took a statistics class; now I don't." "Sounds like the class helped." "Well, maybe."]

When running an ad campaign in a test population, the very first step of selecting the control population can make or break the efficacy of your lift analysis. Unfortunately, the realities and costs of collecting data from several control populations discourage organizations from systematically identifying the right pairings. Regardless, the fact remains that an unsuitable pairing single-handedly actualizes the popular computer science adage of "garbage in, garbage out," muddling any attempt to isolate the exact impact of the ad campaign.

In this blog post, we will identify statistical tests that evaluate the suitability of test and control population pairings and calculate lift resulting from an ad campaign run in the test population, within the context of a large media client we will call MediaCorp. We will use some domain-specific language, which can be identified as follows:

  • test: the population that was subjected to the ad campaign
  • main control: a population whose trends are as similar to the test population's as possible, but which was not exposed to the ad campaign
  • auxiliary controls: additional influences on the output variable that need to be controlled for (e.g., for measuring revenue from Home Depot appliances, controlling for interest in Lowe's appliances as a measure of general interest in appliances)
  • pre-period: data observed before the ad campaign, from both the test and control populations
  • post-period: data observed with or after the ad campaign, from both the test and control populations
  • factual: data actually seen in the test population in the post-period, given the ad campaign has run
  • counter-factual: data a forecasting method expects to see in the test population in the post-period if the ad campaign had not run, extrapolated from the pre-period
  • lift: the calculated change in the output variable (usually revenue or traffic) as a direct result of the ad campaign

Statistically Identifying Control Populations

MediaCorp has run many ad campaigns in many American cities; therefore, they have an intuitive understanding of which cities behave similarly and would be ideal markets for comparison. However, their intuition was not backed by statistical rigor. Here, we offer concrete evidence for the suitability of markets as controls.

MarketMatching

A common method for finding optimal control and test pairs is to calculate the Euclidean distance between the test and control time series. While valid, this approach does not account for any temporal shifts or lags that may be present in the data. This raises the question: how do we account for lags within possible control and test market pairings? This is where the MarketMatching package in R comes into play.

[Figure: two alignments of a pair of time series, with straight point-to-point lines for Euclidean matching and warped alignments for dynamic time warping]
Comparison of Euclidean Matching vs. DTW Matching

MarketMatching provides a more robust approach by using dynamic time warping (DTW) to compare two time series. DTW analyzes the potential shifts and lags between the control and test series and calculates the distance along the warping curve. The package also incorporates the CausalImpact library, which will be discussed later as an effective method for determining lift.
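As a sketch of what a call might look like in practice (the campaign_data data frame, its column names, and the date window below are all illustrative):

```r
# Sketch: DTW-based control matching with MarketMatching.
library(MarketMatching)

mm <- best_matches(
  data               = campaign_data,  # hypothetical long-format data: one row per city per day
  id_variable        = "city",
  date_variable      = "date",
  matching_variable  = "revenue",
  warping_limit      = 1,              # allow at most a one-period warp between series
  dtw_emphasis       = 1,              # rank candidate controls purely by DTW distance
  matches            = 5,              # return the top five candidates per market
  start_match_period = "2021-01-01",
  end_match_period   = "2021-06-30"
)

head(mm$BestMatches)                   # RelativeDistance and Correlation per pairing
```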

Once you run the best_matches function from the MarketMatching package on the data, you can access the results in the following format:

MarketMatching results
  • The Ind and BestControl values refer to the different cities tested by the package in the order they appear in the dataset
  • Lower RelativeDistance and higher absolute Correlation are indicative of better matches
  • Based on the RelativeDistance column, Indicator 2 is the best control city for Indicator 1. However, it does not necessarily follow that Indicator 1 is the best control city for Indicator 2 (that would be Indicator 4). Note that when RelativeDistance values are similar, you can refer to Correlation as a secondary confirmation.

In our experience with MediaCorp, we found that the MarketMatching technique works particularly well for identifying the main control, not the auxiliary controls.

Stability Metric

The parallel trends assumption states that "although treatment and comparison groups may have different levels of the outcome prior to the start of treatment, their trends in pre-treatment outcomes should be the same." Under this assumption, both the direction and the scale of the trends should be consistent between the test and control populations. You can perform a ratio analysis, which reports a metric of stability, to intuitively test the consistency of the relationship between your test and control populations.

Sample Data

Let’s take this sample data as an example, where:

  • Y0 represents daily traffic from the test group
  • Y1 represents daily traffic from a first potential control group
  • Y2 represents daily traffic from a second potential control group

The ratio analysis involves taking the ratio of the output variables from the test group and each potential control group at the recorded interval (daily) and calculating the standard deviation in relation to the mean.

Ratio Analysis

The final value, calculated by dividing the standard deviation of the daily ratios by their mean (their coefficient of variation), offers a metric of stability. Essentially, the lower the value, the more stable the relationship between the two groups. Based on the analysis, we find that Y1 serves as the better control group for Y0.
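A minimal sketch of this calculation in R, using made-up daily traffic numbers, might look like the following:

```r
# Sketch of the ratio analysis; all traffic numbers are made up.
set.seed(42)
Y0 <- 1000 + rnorm(30, sd = 50)       # test group
Y1 <- 0.8 * Y0 + rnorm(30, sd = 20)   # candidate control that tracks the test group
Y2 <- 900 + rnorm(30, sd = 150)       # candidate control that moves independently

stability <- function(test, control) {
  ratio <- test / control             # ratio at each recorded interval (daily)
  sd(ratio) / mean(ratio)             # standard deviation relative to the mean
}

stability(Y0, Y1)                     # lower value: more stable relationship
stability(Y0, Y2)
```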

Unlike the MarketMatching package, this approach does not use DTW to better identify parallel trends. However, we find that this stability metric is useful for selecting auxiliary controls.

Calculating Lift

To calculate lift, we can proceed with a couple of different approaches.

The Forecasting Approach

Photo by Mark König on Unsplash

This first approach entails training forecasting models on the pre-period data to project the counter-factual in the post-period. Essentially, we compare what we expect to see, had the ad campaign NOT run, to what we do see after running the ad campaign. In our analysis, we used the following forecasting tools:

  • FBProphet Forecasting
  • ARIMA modeling

FBProphet Forecasting
Facebook Prophet is a powerful open-source library for forecasting time series data. The library can handle a fair amount of data and make accurate predictions using simple parameters, such as seasonality and federal holidays. After modeling on the pre-period, we forecast the revenue in the post-period as the counter-factual and compare it to the factual. The average difference between the counter-factual and the factual defines the lift.

The library has support for both Python and R.
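A minimal sketch in R, assuming hypothetical data frames pre_df and post_df with Prophet's expected ds (date) and y (revenue) columns:

```r
# Sketch: projecting the counter-factual with Prophet, trained on the pre-period only.
library(prophet)

m <- prophet(yearly.seasonality = TRUE, weekly.seasonality = TRUE)
m <- add_country_holidays(m, country_name = "US")  # US federal holidays
m <- fit.prophet(m, pre_df)

future   <- make_future_dataframe(m, periods = nrow(post_df))
forecast <- predict(m, future)

# Lift: the average gap between the factual and the forecasted counter-factual.
counterfactual <- tail(forecast$yhat, nrow(post_df))
mean(post_df$y - counterfactual)
```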

[Figure: FBProphet forecasting the counter-factual in the post-period using factual data from the pre-period]

ARIMA Modeling
ARIMA modeling is a very common frequentist approach to time-series forecasting, using historical information in the pre-period to estimate the values in the post-period. The ARIMA (auto-regressive integrated moving average) model comprises three order parameters, (p, d, q), described below; a fitting sketch follows the list.

  • The autoregressive parameter, p, derives the forecasted variable as a linear combination of past values of the variable.
  • The integrated parameter, d, applies a degree of differencing that removes trend from the time series, thus making it stationary.
  • Finally, the moving average component, q, uses previous error terms to predict future observations.
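Here is a minimal sketch using the forecast package's auto.arima, which selects (p, d, q) automatically; pre_period and post_period are hypothetical vectors of daily revenue:

```r
# Sketch: projecting the ARIMA counter-factual with the forecast package.
library(forecast)

fit <- auto.arima(ts(pre_period, frequency = 7))  # daily data with weekly seasonality
fc  <- forecast(fit, h = length(post_period))     # forecast the post-period counter-factual

mean(post_period - as.numeric(fc$mean))           # average lift: factual minus counter-factual
```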

The Regression Approach

Photo by Nadir sYzYgY on Unsplash

The second approach involves programmatically decomposing the output variable into various effects. For example, if we see revenue of $10,000 in our test city in the post-period, we might establish that:

  • $6,000 of it is the baseline value, the revenue observed in the pre-period for the control city
  • $1,500 of it is because of the test city
  • $500 of it is because of the post-period
  • $2,000 of it is because of the lift, which is the value that we want to calculate

In our analysis, we used the following statistical tools:

  • Causal Impact
  • Difference in Differences (DiD) analysis
  • bartCause (Bayesian Additive Regression Trees)

For DiD, the underlying mechanism will be a regression analysis, which poses some concerns. Notably, some modifications must be made to our analyses if there is any non-stationarity or autocorrelation in our data, two common features of time-series data.

Causal Impact

CausalImpact is a state-of-the-art algorithm developed by Google that infers the counterfactual by using the test city’s pre-period trend, as well as covariates in the same timeframe. It uses a Bayesian framework, with priors on the estimated parameters and a Markov Chain Monte Carlo algorithm to infer the posterior distributions of the parameters and the target variable.
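A minimal sketch in R, with hypothetical daily revenue vectors for the test city and one control covariate, and arbitrary pre/post dates:

```r
# Sketch: estimating lift with CausalImpact.
library(CausalImpact)
library(zoo)

# y is the test city's revenue; x1 is a control covariate unaffected by the campaign.
series <- zoo(cbind(y = test_city, x1 = control_city), dates)

pre.period  <- as.Date(c("2021-01-01", "2021-06-30"))
post.period <- as.Date(c("2021-07-01", "2021-09-30"))

impact <- CausalImpact(series, pre.period, post.period)
summary(impact)  # absolute and relative lift with credible intervals
plot(impact)     # original / pointwise / cumulative panels
```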

[Author's image: CausalImpact results showing the "original" (actual vs. counterfactual), "pointwise" (actual minus counterfactual), and "cumulative" (running sum of lift) panels]

CausalImpact has the advantage of being very easy to use, providing robust lift estimates and great visualizations. In order to maximize its effectiveness, it is important to pass in as many auxiliary control variables as possible, as long as they were NOT affected by the treatment.

Difference in Differences (DiD)

DiD is used to estimate the effect of a specific treatment, like running an ad campaign on the radio, by comparing changes in outcomes over time between a treatment group and a control group. This technique has its roots in econometrics and political science, where it was used to measure the impact of new government policies. Although DiD has primarily been a tool for causal inference problems, many techniques have built on the underlying concept.

[Figure: outcome over time for two groups; their trends are parallel pre-intervention, and the treatment group's trend rises faster post-intervention (Source)]

For DiD to be a good fit for your causal inference problem, the parallel trends assumption must be met. As shown in the diagram, in the pre-period, both groups follow parallel trends with a constant difference. In the post-period, the treatment group exhibits a change from the usual trend, which is attributed to the ad campaign. The DiD variable captures the measure of this difference, otherwise identified as the lift.

This model is implemented using a simple regression technique that contains a Boolean treatment indicator (1 for test, 0 for main control) and a time period indicator (1 for post-period, 0 for pre-period). DiD is encapsulated as an interaction term between the treatment and time variable. The regression equation would be

Y = β₀ + β₁·Time + β₂·Intervention + β₃·DiD + β₄·Covariates + ε, where DiD = Time × Intervention

A minimal sketch of this model in R, with illustrative column names, might look like the following:
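```r
# Sketch: DiD as an interaction term in a plain linear regression.
# treatment: 1 = test city, 0 = main control; post: 1 = post-period, 0 = pre-period.
df$did <- df$treatment * df$post

fit <- lm(revenue ~ treatment + post + did, data = df)
summary(fit)$coefficients["did", ]  # the DiD coefficient and its significance
```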

DiD results

A positive coefficient for the DiD feature signifies lift. More auxiliary control variables can be added to the regression model as covariates in the regression equation.

bartCause

You might run into situations where you do not have access to a pre-period — perhaps due to resource constraints — but you would still like to estimate lift. Without a pre-period, you cannot perform counterfactual inference, since there is no data to train on, and you cannot perform DiD, since it compares the means of the test and control cities before and after the marketing campaign.

A solution would be to run a simple regression, where revenue is regressed on the treatment variable in both the test and control cities. However, this would only yield an average difference between the two cities and falls well short of properly assessing the marketing campaign's causal effect on revenue.

Instead, you might want to use an algorithm like BART (Bayesian Additive Regression Trees). It performs similarly to the gradient-boosted tree algorithms used in machine learning frameworks, except that it incorporates a Bayesian prior when fitting trees. Not only does it have the advantage of not requiring the assumption checking that is needed in DiD modeling, but it also offers strong inference with low standard errors/uncertainty, all while being easy to implement.
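A minimal sketch using the bartCause package in R (the outcome, treatment, and confounder column names are all illustrative):

```r
# Sketch: average treatment effect with BART via the bartCause package.
library(bartCause)

fit <- bartc(
  response    = revenue,                   # outcome: daily revenue
  treatment   = treated,                   # 1 = test city, 0 = control city
  confounders = day_of_week + is_holiday,  # illustrative covariates
  data        = df,
  estimand    = "ate"                      # average treatment effect
)

summary(fit)  # posterior mean of the effect with a credible interval
```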

[Figure: a scatter of points with a single regression tree above it, splitting on x < 80 and x < 90 into stepwise predictions μ = 26.2, 36.3, and 42.8 (Source)]

You can still run BART if you have a pre-period. In fact, when we tested it on the same data used with DiD, we obtained similar results, with the added benefit of obtaining smaller standard errors.

Results

For some context, MediaCorp carried out ad campaigns for a client of theirs, which we'll call XYZ, in two test cities: Raleigh and Minneapolis. They were asked by XYZ to use Nashville and Houston as the respective controls. However, MediaCorp was confident that Houston would not serve as a proper control for Minneapolis; they carried out the lift analysis anyway and reported significant lift for Raleigh and inconclusive lift for Minneapolis. Due to a lack of statistical rigor in the control selection phase, MediaCorp was unable to demonstrate the true effectiveness of their ad campaign in Minneapolis to XYZ.

Using the MarketMatching package and ratio analyses, our team was able to provide concrete evidence for the claim that Houston was not a suitable control population for Minneapolis. In fact, Nashville emerged as the best control candidate for both test cities. We then carried out several different lift analyses:

[Author's image: estimation of absolute sales lift in $ (±1 standard error, 50% confidence intervals) from FBProphet, ARIMA, CausalImpact, and DiD for the Raleigh/Nashville, Minneapolis/Houston, and Minneapolis/Nashville pairings]

As evidenced by the plot, we can see consistent lift for both Raleigh and Minneapolis when compared to Nashville. Consistency is defined by the following traits:

  • All analyses report lift in the same direction (above or below 0)
  • No analysis includes 0 in its confidence interval

Conclusion

Photo by Roberto Sorin on Unsplash

The world of pre-packaged open-source generalized solutions has simplified statistical analyses, but many practitioners erroneously believe that these packages serve as a panacea for their analytics problems. The fact remains that analysts cannot just throw a package at their data and expect accurate results. In machine learning problems, practitioners undergo a stringent exploratory data analysis phase to ensure the data aligns with the problem and the proposed approach. Similarly, the existence of packages like CausalImpact and ARIMA implementations does not eliminate the need for statistical rigor guiding your approach. These packages are only as effective as the quality of the data provided to them.

Introducing statistical rigor to the control selection process pays calculable dividends in the lift analysis stages through one clear outcome: consistency. Consistent results across several different lift calculation methods offer confidence in the integrity and reliability of your analyses. XYZ erroneously believed that their ad campaign through MediaCorp in Minneapolis had been ineffective, due to a flawed methodology for selecting control populations. However, by using statistical rigor to back up control pairings, MediaCorp was able to accurately demonstrate the true impact of its marketing campaign.

We hope this post has illuminated the importance of rigor in calculating lift and offered some pointers for bringing more statistical rigor to your own work. Stay tuned for more posts from IBM's Data Science and AI Elite team and tutorials on common problems in data science.
