**Effective Control Selection along Time Series**

Understanding causality is critical to making better decisions. When we need to test the hypothesis that ‘A causes B’, the most rigorous way of inferring this causality is through a randomised controlled experiment:

*Treatment A is assigned to experimental subjects at random, producing a test group (those treated with A) and a control group (those not treated with A) with probabilistically similar distributions on all other features. If outcome B is observed to a significantly greater degree within the test group, then the causality of A to B can be established in principle.*

However, using randomised controlled experiments for causal inferences is often not an option in an industrial setting, due to cost, time or practicalities. Thus, instead of having a pre-assigned control group, we are often forced to select controls from a pool of known non-treated candidates. This control selection process is a crucial step for causal impact analysis.

In this post, I will review control selection specifically in the context of marketing analytics. Evaluation of marketing performance is an essential, but challenging task for many businesses because:

1. People want to go beyond ‘whether A leads to B’ and quantify the impact of A on B:

- How much sales uplift has an advertising campaign brought to the brand supplier?
- How many new customers has a retailer gained because of a new product release?
- How much does the conversion rate of an e-commerce platform change with a new user interface?

2. The observations cannot simply be compared as a static before-and-after snapshot; they must be treated as dynamic time series. In both cases, good controls should be as similar to the test group as possible, i.e. have fewer confounding variables (factors that potentially influence performance and differ between the control and test conditions). In the dynamic case, on top of controlling for confounding factors, a high-quality control should also:

- behave in line with the test population across the whole pre-intervention period
- be a good estimator for counterfactual modelling, the process of inferring what would have happened under an alternative treatment

Let’s walk through an example of time series control selection with a specific marketing intervention:

Suppose a brand installs shelf banners in a set of stores for a fixed activity period. In order to evaluate the impact of the shelf banners, we should first find a good control group among a pool of candidates — stores selling the same products, but without shelf banners installed during the activity period. How do we find the best control stores from this pool to most accurately evaluate the impact of this in-store marketing campaign?

**Stratified Sampling**

Instead of randomly selecting stores from the control pool, stratified sampling is an effective approach to limit confounding variables. ‘Stratified’ means that test and control candidate stores are put into buckets based on similarity of potential confounding factors, such as store format, product sales or customer segments. The proportion of control stores in each bucket should then follow the same distribution as the test stores. This is illustrated below.
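The bucketing logic can be sketched as follows. This is a minimal illustration, not a library implementation: the store ids, the `stratum_of` mapping and the function name are all invented for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_control_sample(test_stores, control_pool, stratum_of, n_controls, seed=0):
    """Sample controls so that stratum proportions mirror the test group.

    `stratum_of` maps a store id to its bucket (e.g. store format).
    """
    rng = random.Random(seed)
    test_counts = Counter(stratum_of[s] for s in test_stores)
    total = sum(test_counts.values())

    # Group candidate controls by stratum.
    pool_by_stratum = defaultdict(list)
    for s in control_pool:
        pool_by_stratum[stratum_of[s]].append(s)

    selected = []
    for stratum, count in test_counts.items():
        # Allocate controls proportionally to the test distribution.
        k = round(n_controls * count / total)
        candidates = pool_by_stratum.get(stratum, [])
        selected += rng.sample(candidates, min(k, len(candidates)))
    return selected
```

If the test group is half large-format and half small-format stores, the sampled controls follow the same split, as far as the candidate pool allows.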

However, this approach relies heavily on domain expertise to pick the right features and is thus hard to automate across different scenarios. More importantly, the selected controls are not necessarily good linear estimators for counterfactual modelling, as their performance hasn’t been measured against the test group during the pre-period.

**Time Series Matching**

A more flexible and robust alternative is time series similarity measurement: compare the historical time series (e.g. sales of the products promoted in the campaign) of each test store prior to the campaign against each candidate control store. (Time-series matching should only happen in the pre-period, without having seen data for the post-period.) The best-matched control stores are then selected for each test store to construct the final control group. When dealing with a large pool of candidate control stores, using stratified sampling as a pre-screening step to trim the list in advance will reduce computational cost.
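A minimal sketch of the matching step, with hypothetical store ids and a `series` mapping from store id to its pre-period sales array; Euclidean distance is used here, but any similarity measure could be swapped in.

```python
import numpy as np

def match_controls(test_ids, candidate_ids, series, k=2):
    """For each test store, keep the k candidate controls whose
    pre-period series are closest by Euclidean distance."""
    matched = {}
    for t in test_ids:
        dists = {c: float(np.linalg.norm(series[t] - series[c]))
                 for c in candidate_ids}
        # Sort candidates by distance and keep the k best matches.
        matched[t] = sorted(dists, key=dists.get)[:k]
    return matched
```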

Similarity can be defined from two perspectives:

**1. Trend similarity** shows whether two markets move up and down together; it helps eliminate shared factors like seasonality or promotion strategies and select effective estimators for a linear model.

**Pearson correlation coefficient** is the most popular measure of correlation: the covariance of the two series divided by the product of their standard deviations.
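As a quick illustration with made-up weekly sales figures for a test store and one candidate control:

```python
import numpy as np

# Hypothetical pre-period weekly sales for a test store and one candidate.
test = np.array([120.0, 135, 150, 160, 155, 170, 180, 175])
control = np.array([80.0, 90, 101, 108, 104, 115, 121, 118])

# Pearson r: covariance divided by the product of the standard deviations.
r = np.cov(test, control)[0, 1] / (test.std(ddof=1) * control.std(ddof=1))
# Equivalent shortcut: np.corrcoef(test, control)[0, 1]
```

Here `r` is close to 1, since the two series move almost perfectly together.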

**2. Distance measurements** calculate absolute differences, adding more control for factors like sales of target products or store size. There are many variations of distance metrics, each with their own strengths. Popular measures include:

**Euclidean Distance**, simply the straight-line distance between two points; for time series, the square root of the sum of squared pointwise differences,

**Time Warping Distance** (also known as Dynamic Time Warping, or DTW), which aligns the two series by stretching or compressing their time axes and sums the pointwise distances along the best alignment, making it robust to series that are shifted or locally out of phase,

**Levenshtein Distance** (Edit Distance) is frequently applied in automatic spelling correction; it measures the number of edit operations (insert, delete, replace) needed to transform one series into another. This family of methods permits gaps, thus allowing more flexible alignment of two series with different sampling rates. There are many variations, such as EDR (Edit Distance on Real sequence), ERP (Edit distance with Real Penalty) and TWED (Time-Warped Edit Distance).
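To make the warping idea concrete, here is a minimal, unoptimised dynamic time warping sketch; in practice one would typically reach for an existing library implementation.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping: the minimal cumulative pointwise
    distance over all monotonic alignments of the two series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessors:
            # repeat a point of a, repeat a point of b, or advance both.
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return float(cost[n, m])
```

Note that `dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 3])` is zero: the extra flat point is absorbed by the warping, whereas a pointwise measure could not even compare the two different-length series.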

Below is an example of the matching results for one test store based on three types of time series similarity measurement: Pearson correlation, Euclidean distance and dynamic time warping distance.

**Evaluation of Distance Measures**

So, the inevitable question is: which distance metric is best? Unfortunately, there is no panacea, and complicated methods do not necessarily outperform simpler ones. If your series have the same length, no missing data points and no timeline shifts, then time warping or edit distances can give worse results than simpler measures like Pearson correlation. The choice of distance metric should also be contextualised within the model used for the measurement.

In the context of our marketing example above, counterfactual modelling uses the selected controls as features to estimate what would have happened after the campaign if the test market had been under the same conditions as the control markets. If you fit a regression model on the pre-period series of the selected controls, with each predictor standardised, you obtain a feature-importance coefficient for each control. This coefficient indicates how effective each control market is in the prediction model. We can then measure how well this importance aligns with different distance measures: the correlation between the feature-importance coefficients of the control markets and their distance ranks under each measure. Higher correlation means the distance measure selects better estimators for modelling.
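A sketch of this evaluation on synthetic data (all series invented for illustration): fit ordinary least squares on standardised control series, take absolute coefficients as importance, and compare them with Euclidean-distance ranks.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic pre-period series: one test store and five candidate controls.
# Control 0 genuinely tracks the test store; the rest are unrelated noise.
t = np.arange(52, dtype=float)
test = 100 + 2 * t + 10 * np.sin(t / 4) + rng.normal(0, 2, 52)
controls = np.column_stack(
    [0.9 * test + rng.normal(0, 5, 52)]
    + [rng.normal(150, 30, 52) for _ in range(4)]
)

# Euclidean distance of each control to the test series (matching step).
dist = np.linalg.norm(controls - test[:, None], axis=0)

# Feature importance: absolute OLS coefficients on standardised predictors.
X = (controls - controls.mean(0)) / controls.std(0)
y = (test - test.mean()) / test.std()
coef, *_ = np.linalg.lstsq(np.c_[np.ones(len(y)), X], y, rcond=None)
importance = np.abs(coef[1:])

def _ranks(x):
    return np.argsort(np.argsort(x)).astype(float)

# Spearman-style correlation between distance rank and importance rank:
# a useful distance measure gives a clearly negative value here
# (small distance should go with high importance).
rho = np.corrcoef(_ranks(dist), _ranks(importance))[0, 1]
```

In this toy setup the genuinely tracking control gets both the smallest distance and the largest coefficient, which is exactly the alignment the evaluation looks for.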

Here is an example of this analysis performed across 26 marketing campaigns in my experiments, where Pearson correlation and Euclidean distance beat the remaining measures, as assessed by higher coefficient values.

#### Conclusions

In summary, the key steps in identifying an adequate control set are:

- Starting with a control pool of potential candidates unaffected by the campaign under evaluation.
- Using stratified sampling as a pre-screening process to narrow down the candidate pool, bearing in mind that it alone cannot select good estimators for counterfactual modelling.
- A wide range of time series distance measures, like Euclidean distance, dynamic time warping and edit distances, can then be employed for more effective control selection.
- The performance of different distance measures can be evaluated via the feature importance of the controls each measure selects.

Control selection for marketing campaigns is an important but challenging process. However, with careful steps in place, many of the external factors that would otherwise lead to an incorrect evaluation can be isolated and accounted for, leading to a more robust assessment and, ultimately, better decisions.