Attribution analysis: How to measure impact? (Part 1 of 2)

Lisa Cohen
Data Science at Microsoft
Aug 13, 2020 · 9 min read

By Lisa Cohen, Saptarshi Chaudhuri, Daniel Yehdego, Ryan Bouchard, Shijing Fang, and Siddharth Kumar

A common question we hear from business stakeholders is “What is the impact of x?”, where x may be a marketing campaign, a customer support initiative, a new service launch, and so on. It makes sense — as a business, we continuously make investments to nurture Azure customers. In each instance, we want to understand the impact or effectiveness to help guide our future direction.

Controlled experiment

The most effective approach to quantify the impact of a change (or “treatment”) is to run a randomized controlled trial (RCT). We identify the set of customers for the experiment and randomly divide them into treatment and control groups, making sure confounding factors like customer size, monthly usage, growth rate, license type, and geography are equally represented in each group. Then we compare the outcomes of the two groups. This approach allows us to evaluate causality and determine the “lift” from the program.
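As an illustration of the assignment step, here is a minimal sketch of a stratified random split, not our production tooling; the customer table, column names, and stratification key below are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical customer attributes; column names are illustrative only.
customers = pd.DataFrame({
    "customer_id":  range(8),
    "size_band":    ["S", "S", "M", "M", "L", "L", "S", "S"],
    "license_type": ["EA", "EA", "PAYG", "PAYG", "EA", "EA", "PAYG", "PAYG"],
})

# Combine confounders into a single stratification key so both groups
# receive a comparable mix of customer sizes and license types.
strata = customers["size_band"] + "|" + customers["license_type"]

control, treatment = train_test_split(
    customers, test_size=0.5, stratify=strata, random_state=42
)
```

In practice the stratification key would include the other confounders listed above (monthly usage, growth rate, geography), bucketed into a manageable number of strata.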

Defining “success” is a key step in this process and is a good way to clarify goals. In Azure, our overall goal is for customers to be successful in their adoption of the cloud. Below is an example comparing Azure usage between two populations as an indicator of engagement. Usage is an indirect measure of a customer’s success, because it demonstrates that they’re able to leverage and find value in the service versus experiencing blockers. Therefore, in this post, we refer to the customer’s usage as a proxy for their success. (Note: For features like Azure Cost Management, focused on helping drive efficiencies, we can actually measure success as a decrease in overall usage.) Additional success metrics that we generally consider include retention, net promoter score, and satisfaction. There are also scenarios where we consider success metrics that are more specific to the particular focus of a program, for example, to help facilitate activations or deployments.

Next, we choose the appropriate statistical test to evaluate the treatment’s performance. In this case, we use Student’s t-test to check that the p-value is below a predetermined alpha level. We most often use an alpha level of 0.05 and assess the range of potential impact with a 95 percent confidence interval. However, we also keep in mind the potential shortcomings of frequentist statistics that could warrant the use of multiple-comparison corrections or Bayesian techniques. (Note: In cases of non-Gaussian distributions, we also leverage nonparametric tests such as the Mann-Whitney U test.)
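As a minimal, self-contained sketch of these tests, with simulated usage numbers standing in for real customer data:

```python
import numpy as np
from scipy import stats

ALPHA = 0.05

# Simulated per-customer usage for the two groups (stand-ins for real data).
rng = np.random.default_rng(0)
usage_treatment = rng.gamma(shape=2.0, scale=120.0, size=500)
usage_control = rng.gamma(shape=2.0, scale=100.0, size=500)

# Welch's t-test (does not assume equal variances between the groups).
t_stat, p_value = stats.ttest_ind(usage_treatment, usage_control, equal_var=False)
print(f"p = {p_value:.4f}; significant at alpha={ALPHA}: {p_value < ALPHA}")

# Nonparametric alternative for clearly non-Gaussian distributions.
u_stat, p_mwu = stats.mannwhitneyu(usage_treatment, usage_control, alternative="two-sided")
print(f"Mann-Whitney U p = {p_mwu:.4f}")
```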

To determine the amount of impact the treatment had, we measure the area between these two curves after the treatment was applied. Finally, if we want to understand the return on investment (ROI) of the treatment, we can divide the incremental consumption that it produced by the cost:
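In other words, ROI ≈ (incremental consumption attributable to the treatment) / (cost of the treatment). A toy calculation, with made-up usage curves and a made-up program cost:

```python
import numpy as np

# Hypothetical mean daily usage per group for the 90 days after the treatment.
days = np.arange(90)
control_curve = 100 + 0.5 * days      # control group keeps its prior trajectory
treatment_curve = 100 + 0.8 * days    # treated group grows faster

# Incremental consumption = area between the curves (daily spacing, so a sum suffices).
incremental_usage = float(np.sum(treatment_curve - control_curve))

# ROI of the program, given a hypothetical cost expressed in the same units as usage.
program_cost = 500.0
roi = incremental_usage / program_cost
print(f"Incremental usage: {incremental_usage:,.0f}; ROI: {roi:.2f}x")
```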

Retrospective cohort study

In some cases, however, an experiment might not be possible. For example, we may be working with similar customers, and we don’t want to unfairly exclude any of them from a core customer benefit. But when we have rich observational data, we can measure impact through various observational studies. For example, we use a retrospective cohort study to measure the association between a customer nurture initiative (i.e., the “treatment”) and the output variable by comparing trends before and after the treatment in a single-population analysis. Since the treatment might have been applied to different customers in different calendar months (depending on each customer’s lifecycle), we start by normalizing each customer’s usage by the date of the treatment.

We construct the chart below to confirm our intuition that a correlation actually exists between the treatment and Azure usage. In addition to the mean, we also check the median and the twenty-fifth and seventy-fifth percentile (“box plot”) versions of this chart to understand how the treatment affects different parts of the customer population. While in the end this is still a correlation, normalizing by the treatment date rather than a calendar date (when the treatment occurred at different points in time) helps exclude other time-based confounding variables. A curve like the one below, where the trajectory shift starts at the time of the treatment, helps us be more confident that the change is related to the treatment rather than to other confounders.
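A minimal sketch of this treatment-date alignment, using a toy long-format usage table with hypothetical column names:

```python
import pandas as pd

# Toy long-format usage: one row per customer per month (column names are illustrative).
usage = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b", "b"],
    "month": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                             "2020-03-01", "2020-04-01", "2020-05-01"]),
    "usage": [10, 12, 20, 5, 6, 11],
})
treatment_month = pd.Series(
    pd.to_datetime(["2020-02-01", "2020-04-01"]),
    index=["a", "b"], name="treatment_month",
)

# Re-index each customer's timeline relative to their own treatment month.
usage = usage.join(treatment_month, on="customer_id")
usage["months_from_treatment"] = (
    (usage["month"].dt.year - usage["treatment_month"].dt.year) * 12
    + (usage["month"].dt.month - usage["treatment_month"].dt.month)
)

# Mean and median usage by relative month give the treatment-aligned curves.
print(usage.groupby("months_from_treatment")["usage"].agg(["mean", "median"]))
```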

The other value of this chart is that it helps establish whether we’ve considered the correct “event” from the initiative as part of the treatment’s impact. For example, when evaluating the impact of support, do we check the support plan entitlement date, the ticket open date, or the ticket close date? Similarly, when a technical professional engages with a customer to consult on a project, do we consider the engagement start date or the project deployment date? Finally, is there a typical “delay” or “ramp-up time” from the treatment event to the point of measurable impact? Using the perspective from these charts, when we see an increased growth rate at the time of the treatment event, we can conclude that the program correlates with helping the customer use Azure.

Now (just as in the experiment case), beyond determining whether the treatment correlates with a statistically significant difference, we also want to know how much of a difference exists. To measure the amount of impact the treatment had overall, we construct a view as follows. First, we forecast how much the population would have continued to grow on its own. We test multiple forecast techniques and choose the one that performs best on this data set, splitting the pre-treatment timeline into 70 percent training and 30 percent test data and using SMAPE to evaluate forecast accuracy. Once we have this forecast “baseline” defined, we compare it with the actual growth of the population. In the end, we attribute the shaded area between the two curves to the treatment.
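A simplified sketch of the baseline-scoring step (SMAPE on a 70/30 split of the pre-treatment window), using a toy series and a simple linear-trend forecaster as a stand-in for the multiple techniques we actually compare:

```python
import numpy as np
import pandas as pd

def smape(actual, forecast):
    """Symmetric mean absolute percentage error."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * np.mean(2 * np.abs(forecast - actual) / (np.abs(actual) + np.abs(forecast)))

# Toy monthly usage series for the cohort, ending at the treatment month.
pre_treatment = pd.Series([100, 104, 109, 113, 118, 124, 129, 135, 140, 147])

# 70/30 split of the pre-treatment window to score candidate forecasters.
cut = int(len(pre_treatment) * 0.7)
train, test = pre_treatment[:cut], pre_treatment[cut:]

# One candidate: a linear trend fit on the training window.
coeffs = np.polyfit(np.arange(cut), train, deg=1)
test_forecast = np.polyval(coeffs, np.arange(cut, len(pre_treatment)))
print(f"SMAPE on holdout: {smape(test, test_forecast):.1f}%")

# The winning model is refit on the full pre-treatment series and extrapolated forward
# as the baseline; actuals minus baseline is the usage attributed to the treatment.
```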

In employing these approaches, we avoid a couple of common “gotchas”:

  1. Comparing the growth rate of a non-randomly selected group with that of the treatment group. Sometimes when there is no planned control, there’s a temptation to compare the growth rates of customers who received the treatment with those who did not. However, there are often specific program criteria for receiving the treatment, so this approach carries an inherent sampling bias.
  2. Attributing all of a customer’s growth to the treatment (without considering that the customers might still have grown without the treatment). We use the forecast baseline method, as outlined above, to avoid this over-attribution.

In the retrospective cohort approach above, we caution the reader that this is still a correlation (versus a causal impact). Causal inference is an additional technique to determine causation, which we will explore in our next post.

Multi-attribution analysis

An additional complexity that we encounter in these analyses is when multiple treatments are applied to the same customers at the same time. In “real world” enterprise scenarios, this is often the case, since there are multiple teams, programs, and initiatives all working to help customers in different ways. For example, here is an illustration of a customer who engages with multiple programs over time:

To determine the impact of each individual program, we need to apply multi-attribution analysis.

One way to check whether this is required is by analyzing the overlap among programs. For example, what percentage of customers who experienced treatment A also had other treatments?
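As a small sketch (with a made-up treatment table), the overlap check can be as simple as:

```python
import pandas as pd

# Hypothetical table of (customer_id, treatment) pairs.
treatments = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4],
    "treatment":   ["A", "B", "A", "A", "B", "C", "B"],
})

a_customers = set(treatments.loc[treatments["treatment"] == "A", "customer_id"])
other = set(treatments.loc[treatments["treatment"] != "A", "customer_id"])

overlap = len(a_customers & other) / len(a_customers)
print(f"{overlap:.0%} of treatment-A customers also received another treatment")
```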

Note that this scenario reinforces the importance for data science organizations of bringing together broad data sets that represent the end-to-end customer experience, as well as the need for a customer model that connects them through common identifiers, in order to produce these kinds of insights.

If the overlap among programs is small enough, and the action we plan to take doesn’t require extreme precision, we may choose to proceed with the single attribution analysis, knowing that the results are still directionally relevant. If the overlap is material, however, as in the example above, we apply multi-attribution approaches as follows.

First, we forecast the dynamic baseline, starting at the point of the first treatment. Then, we use proprietary machine learning models (based on Markov chain and Shapley value approaches) in order to divide the area between the baseline and actuals during the period where multiple treatments are present:

Fig. Usage attributed to two treatments, beyond the dynamic baseline. (Visual by Elizabeth Kronoff.)

Investment programs can be considered a stochastic process in which the sequence of events is treated as a Markov chain: the probability of each event depends only on the state attained in the previous event (Gagniuc, 2017). The Shapley value method is a general credit-allocation approach from cooperative game theory, based on evaluating the marginal contribution of each investment in a customer’s journey. The credit assigned to each investment, i.e., its Shapley value, is the expected value of this marginal contribution over all possible permutations of the investments.
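To make the Shapley idea concrete, here is a tiny sketch with two hypothetical investments, A and B, and made-up coalition values standing in for the incremental usage our models would estimate:

```python
from itertools import permutations

# Hypothetical "value" of each coalition of investments: incremental usage observed
# when that set of investments is present. The numbers here are made up.
value = {
    frozenset(): 0,
    frozenset({"A"}): 40,
    frozenset({"B"}): 30,
    frozenset({"A", "B"}): 100,
}

def shapley(players, value):
    """Average marginal contribution of each player over all arrival orders."""
    credits = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        seen = set()
        for p in order:
            credits[p] += value[frozenset(seen | {p})] - value[frozenset(seen)]
            seen.add(p)
    return {p: c / len(orders) for p, c in credits.items()}

print(shapley(["A", "B"], value))  # {'A': 55.0, 'B': 45.0}
```

Averaging the marginal contributions over both arrival orders splits the joint lift of 100 into 55 for A and 45 for B, so the credit always sums to the total lift above the baseline.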

Using this approach, we can estimate the usage correlated with each respective program.

Additional attribution scenarios

Web page attribution: In addition to the customer nurture activities described above, another scenario involving attribution analysis is in web analytics. Tools like Google Analytics and Adobe Analytics use heuristic (rule-based) multi-channel funnel (MCF) attribution models, which include the following methods (a code sketch follows the list):

  • First-touch attribution: Attributes all credit to the first touch point of the customer’s journey.
  • Last-touch attribution: Attributes all credit to the last touch point of the customer’s journey.
  • Linear attribution: Attributes the credit linearly across all touch points of the customer’s journey.
  • U-Shaped attribution: Attributes a fifty-fifty split of the credit to the first and last touch points of the customer’s journey.
  • Simple decay attribution: Attributes a weighted percentage of the credit to the most recent touch point in the customer’s journey.
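A compact sketch of these rule-based weighting schemes; the function and journey below are illustrative only, not any vendor’s implementation:

```python
def attribution_weights(touch_points, model="linear", decay=0.5):
    """Split conversion credit across an ordered list of touch points
    using common rule-based models (a simplified sketch)."""
    n = len(touch_points)
    if model == "first_touch":
        weights = [1.0] + [0.0] * (n - 1)
    elif model == "last_touch":
        weights = [0.0] * (n - 1) + [1.0]
    elif model == "linear":
        weights = [1.0 / n] * n
    elif model == "u_shaped":  # 50/50 split between first and last, per the list above
        weights = [1.0] if n == 1 else [0.5] + [0.0] * (n - 2) + [0.5]
    elif model == "simple_decay":  # more recent touches weighted more heavily
        raw = [decay ** (n - 1 - i) for i in range(n)]
        weights = [w / sum(raw) for w in raw]
    else:
        raise ValueError(f"unknown model: {model}")
    return dict(zip(touch_points, weights))

journey = ["search_ad", "docs_page", "pricing_page", "trial_signup"]
print(attribution_weights(journey, "u_shaped"))
```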

However, the challenge with rule-based models is that you must know the correct rules to choose. Therefore, we’ve researched data-driven models for these scenarios as well. In our scenario, we want to understand the impact of our websites and documentation in helping users adopt and engage with our products. Using a Markov chain approach we’re able to observe the difference in conversion rates between those users who do and don’t visit our web pages, as well as determine which pages correlate with the strongest outcomes.
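A heavily simplified sketch of the underlying comparison (conversion rates for users who did versus didn’t visit a page, using toy journeys); the Markov chain model we use additionally accounts for the order and sequence of pages, which this snippet does not:

```python
import pandas as pd

# Toy user journeys: pages visited and whether the user later adopted the product.
journeys = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 3, 4],
    "page":      ["quickstart", "pricing", "pricing", "quickstart", "api_ref", "api_ref"],
    "converted": [1, 1, 0, 1, 1, 0],
})

# One conversion flag per user.
per_user = journeys.groupby("user_id")["converted"].max()

for page in journeys["page"].unique():
    visitors = set(journeys.loc[journeys["page"] == page, "user_id"])
    visited = per_user[per_user.index.isin(visitors)].mean()
    not_visited = per_user[~per_user.index.isin(visitors)].mean()
    print(f"{page}: {visited:.0%} vs {not_visited:.0%} conversion (visited vs. not)")
```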

Customer satisfaction attribution: Another application of attribution analysis comes up in the case of customer satisfaction (CSAT). We typically learn about our customers’ satisfaction through survey data. By asking customers about their levels of engagement with our product and communities, we can then correlate those experiences with their overall satisfaction. Here is some sample data to illustrate this scenario:

Data visualization by Nancy Organ
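A minimal sketch of this kind of correlation view, with made-up survey responses and illustrative column names:

```python
import pandas as pd

# Made-up survey responses: engagement signals plus overall CSAT on a 1-5 scale.
survey = pd.DataFrame({
    "attends_community_events": [1, 0, 1, 1, 0, 0, 1, 0],
    "uses_feature_x":           [1, 1, 1, 0, 0, 1, 1, 0],
    "csat":                     [5, 4, 5, 4, 2, 3, 5, 2],
})

# Mean CSAT with and without each engagement, plus simple correlations with CSAT.
for col in ["attends_community_events", "uses_feature_x"]:
    print(col, survey.groupby(col)["csat"].mean().round(2).to_dict())
print(survey.corr()["csat"].round(2))
```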

While both this and the previous web example are correlation analyses, they do tell us about the prominence of particular web pages among users who do versus don’t convert, and about product and community engagement among users with high versus low CSAT. Given this, even if we don’t prove that Document A caused a customer’s conversion, the fact that it is frequently visited by users who convert means that we should likely invest in it.

Conclusion

In this article, we walked through methods of attribution analysis for both single- and multi-attribution scenarios. We explored an example in the context of a customer engagement program, and also covered web page attribution and customer satisfaction surveys as additional use cases. Beyond the initial RCT example, this article primarily focused on correlation. In the next article in this series, we’ll dive into causal inference approaches to determine causality.

We’d like to thank the Marketing, Finance, and Customer Program teams for being great partners in the adoption of this work.
