How to deal with multiple geos and multiple KPIs in Marketing Mix Modeling

Romain
fifty-five | Data Science
6 min read · Aug 31, 2022

State-of-the-art models developed for Marketing Mix Modeling (MMM) rely on regressions (Meta’s Robyn, Google’s LightweightMMM, Uber’s Orbit, …). The idea is to model a KPI, for instance national sales, as a function of the media investments and some external factors like trend, seasonality, and holidays. Within this regression, it is commonly assumed that the relationship between the KPI and a media investment is not linear, but linear up to some transformation of that investment. A classical transformation is to apply a Hill function to an adstocked value of the GRP / reach / spend.
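As an illustration, here is a minimal sketch of that transformation in Python: a geometric adstock with a single decay parameter, followed by a two-parameter Hill curve. The function names and the parameter values below are arbitrary choices for the example, not taken from any of the libraries mentioned above.

```python
import numpy as np

def adstock(spend, decay=0.5):
    """Geometric adstock: each week carries over a fraction of past spend."""
    out = np.zeros_like(spend, dtype=float)
    carry = 0.0
    for i, s in enumerate(spend):
        carry = s + decay * carry
        out[i] = carry
    return out

def hill(x, half_sat=100.0, shape=2.0):
    """Hill saturation curve: maps adstocked spend into [0, 1)."""
    x = np.asarray(x, dtype=float)
    return x**shape / (x**shape + half_sat**shape)

# Four weeks of media spend -> transformed regressor for the MMM.
weekly_spend = np.array([120.0, 80.0, 0.0, 50.0])
transformed = hill(adstock(weekly_spend, decay=0.5), half_sat=100.0, shape=2.0)
```

The resulting `transformed` series is what enters the regression in place of the raw spend: it saturates for large investments and carries memory of past weeks.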

Up to this stage, everything works fine and everyone is happy. However, MMM projects often come with more objectives than just measuring the media impact at the national level. We would also like the impact per region (or geo). Turnover is nice, but what about the impact on the number of customers, or the number of sales? One could run as many independent models as there are objectives. Let’s assume there are 10 geos and 3 KPIs. That makes 33 models: 30 for each KPI on each geo, plus 3 for each KPI at the national level. Several problems arise:

  1. MMMs are complex models where result significance and stability can be hard to achieve. Achieving both for 33 models at the same time, while still ensuring global coherence, can be very tricky.
  2. MMMs have little data. In a world of big data and deep learning, MMM is quite far from it. Usually we rely on a few years of historical data at a weekly granularity. Assuming 4 years of data, that is 4×52 = 208 data points per model.

So, what can we do? By removing the independence between the models, we help them converge more easily. To do so, we rely on multi-task learning.

Multi-Task Regression

In 2004/2005, Evgeniou et al. wrote two papers ([1], [2]) on multi-task regression.

Let us introduce some notation:

  • T the number of tasks.
  • n the number of data points per task (all tasks have the same number of data points).
  • p the number of features (same for each task).
  • w_t the weight vector of task t.
  • x_t,i the feature vector associated with observation i of task t.
  • X_t the input matrix with all the feature vectors of task t. We assume that the features of each task live in the same space.
  • y_t,i the target associated with observation i of task t.
  • Y_t the target vector with all the targets of task t.

Considering each model independently from the others, we would have to solve, for every t:

  min_{w_t}  Σ_{i=1..n} ( y_t,i − w_t · x_t,i )² + λ ‖w_t‖²

Writing all those problems together, we end up with the following loss to minimize:

  min_{w_1, …, w_T}  Σ_{t=1..T} Σ_{i=1..n} ( y_t,i − w_t · x_t,i )² + λ Σ_{t=1..T} ‖w_t‖²

OK, that’s great, but the tasks are still independent and both formulations are equivalent. Recall, however, that we assumed the features of each task live in the same space. In other words, if the first feature of task 1 measures the spend on TV, then the first feature of every task measures the spend on TV (up to a transformation using a Hill function and adstock when talking about MMM). We also assume that the tasks are not independent. This can be captured through a correlation graph where two tasks are linked if they are correlated, and each edge has a weight proportional to the correlation. This can be modeled with a (positive definite) correlation matrix. Let G be this graph.

What can we do with that? If I told you that two tasks encoded by the same variables are correlated, what could you assume about their weights? They should be close. So let us use this information in the loss:

  min_{w_1, …, w_T}  Σ_{t=1..T} Σ_{i=1..n} ( y_t,i − w_t · x_t,i )² + λ Σ_{j=1..T} Σ_{q=1..T} G_j,q ‖w_j − w_q‖²

The larger the value of G_j,q, the closer w_j and w_q will be. This prevents the model from exploring too vast a space and increases the coherence between tasks.
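This graph-regularized loss can be minimized with plain gradient descent. Below is a minimal NumPy sketch; the function name, learning rate, and iteration count are illustrative choices of mine, not taken from the papers.

```python
import numpy as np

def fit_multitask(Xs, Ys, G, lam=1.0, lr=1e-3, n_iter=2000):
    """Gradient descent on
        sum_t ||Y_t - X_t w_t||^2 + lam * sum_{j,q} G[j,q] ||w_j - w_q||^2
    Xs: list of T arrays of shape (n, p); Ys: list of T arrays of shape (n,);
    G: symmetric (T, T) task-correlation matrix."""
    T, p = len(Xs), Xs[0].shape[1]
    W = np.zeros((T, p))
    for _ in range(n_iter):
        grad = np.zeros_like(W)
        for t in range(T):
            # Gradient of the squared-error data term for task t.
            grad[t] = -2.0 * Xs[t].T @ (Ys[t] - Xs[t] @ W[t])
            # Gradient of the graph penalty (G symmetric: each pair
            # appears twice in the double sum, hence the factor 4).
            grad[t] += 4.0 * lam * (G[t].sum() * W[t] - G[t] @ W)
        W -= lr * grad
    return W
```

With `lam=0` each task reduces to an independent least-squares fit; increasing `lam` pulls the weight vectors of correlated tasks toward each other.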

In MMM, we can consider each geo as a task to apply this modeling technique. But we have not solved everything yet: we also have multiple KPIs!

Multiple KPIs

Let us introduce new notation to generalize the problem to multiple tasks and multiple KPIs.

  • T the number of tasks.
  • K the number of KPIs.
  • n the number of data points per task per KPI (all task × KPI pairs have the same number of data points).
  • p the number of features (same for each task x KPI).
  • w_t,k the weight vector of task t for KPI k.
  • W_t is the weight matrix for all KPIs of task t.
  • W is the weight tensor (3D array in our case) for all KPIs of all tasks.
  • x_t,i the feature vector associated with observation i of task t.
  • The feature vector is assumed to be the same for all KPIs of a given task. In MMM, if you want to predict sales and turnover, the features will be the same (same spend/GRP per media, same time of the year).
  • X_t the input matrix with all the feature vectors of task t. We assume that the features of each task live in the same space.
  • y_t,i,k the target associated with observation i of task t and KPI k.
  • Y_t,k is the target vector with all targets of task t for KPI k.
  • Y_t is the target matrix with all targets of task t for all KPIs.
  • Y is the target tensor for all KPIs of all tasks.

Need a break? Basically, we just added a dimension to every parameter except the features.

Let us first write the loss function as if everything were independent:

  min_W  Σ_{t=1..T} Σ_{k=1..K} Σ_{i=1..n} ( y_t,i,k − w_t,k · x_t,i )² + λ Σ_{t=1..T} Σ_{k=1..K} ‖w_t,k‖²

Now let us introduce the graph dependency between the tasks (G_T):

  min_W  Σ_{t=1..T} Σ_{k=1..K} Σ_{i=1..n} ( y_t,i,k − w_t,k · x_t,i )² + λ_T Σ_{k=1..K} Σ_{j=1..T} Σ_{q=1..T} (G_T)_j,q ‖w_j,k − w_q,k‖²

Cool! But are all those KPIs independent? If the number of sales increases, the turnover should increase, right? So we can add a second graph that captures the correlation between the KPIs (G_K):

  min_W  Σ_{t=1..T} Σ_{k=1..K} Σ_{i=1..n} ( y_t,i,k − w_t,k · x_t,i )² + λ_T Σ_{k=1..K} Σ_{j=1..T} Σ_{q=1..T} (G_T)_j,q ‖w_j,k − w_q,k‖² + λ_K Σ_{t=1..T} Σ_{l=1..K} Σ_{m=1..K} (G_K)_l,m ‖w_t,l − w_t,m‖²

We choose to write the regularization this way to ensure that coupling happens either between different KPIs of the same task, or between different tasks of the same KPI. One could cross everything using a quadruple sum weighted by the product of the two graphs, but that may introduce coupling between problems that are too far apart.

This loss function can then be optimized using classic gradient descent to obtain the weights W.
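A NumPy sketch of that gradient descent for the full loss with both graphs follows; the shapes, function name, and hyperparameters are illustrative assumptions, and both graphs are assumed symmetric.

```python
import numpy as np

def fit_multitask_multikpi(Xs, Y, G_T, G_K, lam_T=1.0, lam_K=1.0,
                           lr=1e-3, n_iter=2000):
    """Gradient descent on
        sum_{t,k} ||Y[t,:,k] - X_t w_{t,k}||^2
        + lam_T * sum_k sum_{j,q} G_T[j,q] ||w_{j,k} - w_{q,k}||^2
        + lam_K * sum_t sum_{l,m} G_K[l,m] ||w_{t,l} - w_{t,m}||^2
    Xs: list of T arrays (n, p); Y: (T, n, K); G_T: (T, T); G_K: (K, K)."""
    T, K, p = len(Xs), Y.shape[2], Xs[0].shape[1]
    W = np.zeros((T, K, p))
    for _ in range(n_iter):
        grad = np.zeros_like(W)
        for t in range(T):
            # Data term for all K KPIs of task t at once.
            resid = Y[t] - Xs[t] @ W[t].T              # (n, K)
            grad[t] = -2.0 * (Xs[t].T @ resid).T       # (K, p)
            # Task-graph penalty (couples w_{t,k} with w_{q,k}).
            grad[t] += 4.0 * lam_T * (G_T[t].sum() * W[t]
                                      - np.tensordot(G_T[t], W, axes=(0, 0)))
            # KPI-graph penalty (couples w_{t,l} with w_{t,m}).
            grad[t] += 4.0 * lam_K * (G_K.sum(axis=1)[:, None] * W[t]
                                      - G_K @ W[t])
        W -= lr * grad
    return W
```

Setting `lam_T = lam_K = 0` recovers T×K independent regressions, which is a convenient sanity check.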

To sum up, we showed how to reduce the search space when training multiple MMMs for different geos and KPIs, in order to converge more easily and improve the coherence between the results of each sub-model. To construct the graphs, one can for instance compute the correlations between the observed target values of each geo and KPI.
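One simple recipe for building those graphs from historical targets is sketched below. The averaging over the other dimension, the clipping of negative correlations, and the zeroed diagonal are my assumptions for the example, not prescribed by the method; the data is toy random noise.

```python
import numpy as np

# Toy observed targets: n weeks, T geos, K KPIs.
rng = np.random.default_rng(0)
n, T, K = 104, 3, 2
targets = rng.random((n, T, K))

# Task graph: correlation between geos, averaged over the KPIs.
G_T = np.mean([np.corrcoef(targets[:, :, k], rowvar=False)
               for k in range(K)], axis=0)
# KPI graph: correlation between KPIs, averaged over the geos.
G_K = np.mean([np.corrcoef(targets[:, t, :], rowvar=False)
               for t in range(T)], axis=0)

# Keep only positive correlations and drop self-edges.
G_T = np.clip(G_T, 0.0, None); np.fill_diagonal(G_T, 0.0)
G_K = np.clip(G_K, 0.0, None); np.fill_diagonal(G_K, 0.0)
```

The resulting `G_T` and `G_K` can be plugged directly into the regularized loss above as the task and KPI graphs.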
