The persistence of memory: stabilizing predictive models for better interpretability

Gino Knodel
Bluecore Engineering
12 min read · Jun 1, 2018

Bluecore’s predictive models allow retail marketers to better understand their customers’ interests and engagement behavior. In addition to using the outputs of our predictive models to directly generate audiences and email campaigns, there is also a lot of value in monitoring these metrics over time. Studying trends can help marketers answer questions such as: How effective are my email campaigns? How is my business doing overall, and are there any new trends or patterns I should be aware of? This approach even allows for intermediate goal setting that takes business context into account (e.g. “decrease the number of at-risk customers in the next quarter”).

The main tool for carrying out such complex analyses using historical data is Bluecore’s Audience Insights. In addition to “hard” metrics such as spend/engagement/conversions, Audience Insights also tracks the outputs of our predictive models over time. In particular, we can look at the number of active customers over several time periods. A hypothetical graph may look something like this:

figure 1: percentage of active customers in a 30-day period.

The above graph is the result of running the model that predicts active customers (Bluecore’s lifecycle stage model) for 30 consecutive days. The curve shows a general upward trend, so from the perspective of a retail marketer we might read this as a sign that we are doing a good job, and that more and more customers are engaging with our site. On the other hand, there are also a lot of short-term fluctuations: for example, there is a sharp jump upwards on 03/07, and we may ask what caused this sudden change. If we had just launched a new email campaign on that day, this could be useful feedback that the campaign was hitting the mark in terms of messaging, product recommendations, or segmentation, so let’s send more of those emails in the future!

However, there is a problem with this argument: in our case study, we had to re-train the model on each of the 30 days. We will discuss in more detail what “training” a model means, but for now you can think of it as the model changing a little bit every day, to ensure that it keeps up with changes in the data and remains accurate. So if the model changes every day, how can we know that the ups and downs in our graph reflect real changes in our customers’ behavior, and are not just the result of the model itself changing each time, thereby adding some fictitious noise to our graph? If we were more pessimistic, we could even ask whether we should trust the general upward trend and draw any conclusions from it at all. This is the kind of thought that keeps data scientists up at night, so when we encountered these questions, it was worth investigating more carefully what was going on.

The Lifecycle Stage Model

Bluecore’s definition of an active customer is any customer who is within their typical buying cycle. To illustrate what this means, let’s look at a concrete example: assume there is a customer (Alice) who buys a beauty product roughly every four weeks. If Alice hasn’t made a purchase in two weeks, this should certainly not be a cause for concern, given her buying cadence. On the other hand, consider Bob, who makes a purchase once a week. If it has been two weeks since we last heard from Bob, we might start getting worried, and consider Bob an “at-risk” customer, meaning he has a high likelihood of becoming inactive and being lured away to another brand. If we still haven’t heard from Bob after two months, it’s probably accurate to call him “lost”, i.e. it’s unlikely that he’ll purchase again.

In practice, the decision of whether a customer is active/at-risk/lost is not made via a static rule like the above, but using a sophisticated Bayesian model, which predicts the probability of a customer making a purchase in the future. If the probability is above a certain threshold, we label a customer as “active”:
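Schematically, writing τ for the activity threshold (a symbol introduced here just for illustration):

    \text{customer is active} \;\Longleftrightarrow\; P(\text{future purchase} \mid \text{purchase history}) \geq \tau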

One of the defining features of machine learning models is that they are never expected to be perfectly accurate, but only to get things right in a statistical sense. For example, a model with 99% accuracy will still make the wrong prediction 1 out of 100 times. The non-deterministic nature of the model then leads us back to the same question we asked above: if we observe the number of active customers over time, how can we know that the ups and downs we see in figure 1 are real, and not just due to the inherent randomness of machine learning models?

Following this argument, one may think that time series data for predictive scores is simply not useful at all. The good news is that as we will see, there is no reason to panic, and we will be able to save our model with some extra work; we will just have to do some math to understand the problem better.

Cost functions

Most of the machine learning models we use at Bluecore fall within the category of supervised learning algorithms. These types of algorithms take some input data on user behavior, and output a prediction for whatever outcome we are interested in (here: is a customer active/at-risk/lost?). In our example, the model takes some input data h describing a customer’s historical purchase behavior, and predicts the probability that they will make a purchase in the future. The prediction itself is made by applying a Bayesian model, which uses a small number of parameters (let’s call them a and b):
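Schematically, the prediction looks like this (the function f simply stands for whatever the Bayesian model computes; its exact form is not important for the rest of the argument):

    \hat{p} = P(\text{future purchase} \mid h) = f(h;\, a, b)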

Because our goal is to make a good prediction, we need to train our algorithm first. “Training” more or less means what you think it means: we show the algorithm a lot of examples of correct predictions, so it can “learn” and improve its ability to make good predictions. For our model, this means finding the best values for a and b.

To find these best values, the model attempts to minimize a cost function F(a,b). This is a function of all the parameters that quantifies how accurately our model predicts the training data. More loosely speaking, it tells us how “costly” it is to make a wrong prediction, so this function is certainly something we would like to keep small. An example is shown in figure 2: during the training stage, the algorithm finds the values for a and b that minimize the cost function, and therefore give us the best predictions. You can think of this process as literally rolling a ball down a curved surface, and waiting for it to settle at the lowest point:

figure 2: minimizing the cost function
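As a toy illustration of this training step, here is a minimal sketch in Python. The cost function below is a made-up stand-in, not Bluecore’s actual lifecycle model; the point is only the mechanics of “rolling the ball” to the minimum:

    import numpy as np
    from scipy.optimize import minimize

    def cost(params, training_data):
        """Toy cost function F(a, b): how badly the model with parameters
        (a, b) fits the training data. Lower is better."""
        a, b = params
        target = training_data.mean()  # purely illustrative
        return (a - target) ** 2 + (b - 0.5 * target) ** 2

    training_data = np.array([1.8, 2.1, 2.0, 1.9, 2.2])

    # "Roll the ball down the surface": find the (a, b) that minimize the cost.
    result = minimize(cost, x0=[0.0, 0.0], args=(training_data,))
    a_best, b_best = result.x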

The actual shape of the cost function depends on the training data we use, so when we minimize the cost function we are finding the model that performs best on our training data. In the case of the lifecycle stage model however, we would like to re-train the model on a regular basis, to take into account changes in the underlying data. This means that every day (or whenever we decide to re-train the model), the shape of the cost function is different, and therefore the choice of best parameters is different. In fact, if we are not careful, even a small change in the dataset can cause the parameters to fluctuate wildly:

figure 3: even though the cost function only changes slightly, parameters can jump dramatically.

Recall that the parameters will determine if we predict Alice/Bob to be active or not. Therefore, if a and b fluctuate wildly, our percentage of active customers in figure 1 may also have wild fluctuations, even if the data changes only a little! This problem is generally referred to as a “high variance problem” and is a common challenge in machine learning applications.

The Solution

To solve the high variance problem, we modified the cost function by adding a “memory term”:
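Schematically, with λ denoting a weight that controls the strength of the memory effect and M the memory term itself (both introduced here as notation; the precise form of M is discussed in the Mathematical Background section):

    \tilde{F}(a, b) = F(a, b) + \lambda\, M\big((a, b),\, (a', b')\big)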

The job of the memory term is to “remember” the values a’, b’ we got from our previous training run, and add an additional cost whenever the new values a and b are significantly different from those previous values. In this way, we are keeping the model from fluctuating too wildly from training to training.

NOTE: of course, the crucial part of making this approach work is to design the memory term well. It turns out that trying to find a good memory function actually led us to discover some interesting math along the way, so if you are interested in learning more, check out the section ‘Mathematical Background’ at the end.

Armed with this new modified cost function, let’s look again at the example of wildly fluctuating model parameters that gave us a headache. As we can see, now the effective cost function actually limits how much our parameters can change (or: how far the ball can roll):

figure 4: modified cost function (red). The parameters are kept closer to their previous value by the memory term.
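To make this concrete, here is a simplified sketch of a single re-training step with a quadratic memory term. It is only meant to illustrate the structure of the approach; our production model uses the Bhattacharyya-distance term described in the Mathematical Background section, and the cost function below is again a made-up stand-in:

    import numpy as np
    from scipy.optimize import minimize

    def cost(params, data):
        """Toy stand-in for the model's cost function on today's data."""
        a, b = params
        return (a - data.mean()) ** 2 + (b - data.std()) ** 2

    def retrain_with_memory(data, prev_params, lam=1.0):
        """Minimize cost(params) + lam * ||params - prev_params||^2,
        so the new parameters cannot drift too far from yesterday's."""
        prev = np.asarray(prev_params, dtype=float)

        def total_cost(params):
            memory = np.sum((np.asarray(params) - prev) ** 2)
            return cost(params, data) + lam * memory

        return minimize(total_cost, x0=prev).x

    yesterdays_params = np.array([2.0, 2.0])
    todays_data = np.array([2.3, 1.9, 2.1, 2.4])
    new_params = retrain_with_memory(todays_data, yesterdays_params, lam=1.0)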

With this modification, changes over time will be more gradual, and we avoid large jumps of our model predictions. We found that the new model achieves statistically equivalent performance, while making the outputs much smoother over time. Tying things back to our original problem of tracking the number of active customers over time, let’s see how this new model performs:

figure 5: percentage of active customers (comparison)

We see that the memory-term model gives us a much smoother graph, while still capturing the main trends. It turns out that our original suspicion was justified: although there are still some ups and downs, many of the more extreme jumps seem to have been just artifacts of re-training the model, and not real changes in customer behavior.

Of course, there is no free lunch, so technically the original model will be slightly more accurate, but in our case the difference in accuracy was not statistically significant. We have effectively traded a little bit of bias for a large reduction in variance! As a result, we are providing our partners with a clearer and more actionable view of their customers.

Conclusion

The memory-term method we developed makes our models more stable, and the output more interpretable. Armed with this knowledge, we now have even more confidence in the temporal stability of Bluecore’s predictive models, and can use them to monitor the health of our partners’ customers as they evolve over time.

As a side effect, this exploration also taught us a valuable general lesson: if someone shows you a graph or curve from which they derive some conclusions, always be skeptical and ask lots of questions! In our case, questioning the interpretability of a graph is what led us to a much deeper understanding of our Bayesian models, and ultimately gave us more confidence that we can trust the output of our models over time.

figure 6: The Persistence of Memory (Salvador Dalí)

Mathematical Background: Temporal Regularization of Bayesian Models

This section contains some additional mathematical background, for the interested reader.

By adding a “memory term” to the cost function, we have effectively performed a type of regularization of our model. To adapt this technique to our use case of re-training models at different points in time, we had to borrow and combine a few concepts from the mathematical literature to achieve the temporal regularization needed here.

Typically, regularization techniques reduce the variance of a model by keeping the parameters themselves small. One common way to achieve this effect is to add a “penalty” term to the cost function, for example a quadratic term:
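Schematically, with λ a tunable weight:

    F_{\text{reg}}(a, b) = F(a, b) + \lambda\,(a^2 + b^2)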

Because it now costs extra to make the parameters large, they tend to be confined to a small interval, preventing them from jumping too much. While this technique works well in some standard classification problems, in our case it is not quite what we want: in general, there is nothing wrong with simply having large parameters, so we shouldn’t punish the model for choosing large a and b. What we really want to do is tell the model to choose values a, b that are not very far from the values a’, b’ that we obtained in the previous training. To accomplish this, we should really add a penalty term that penalizes (a, b) being different from (a’, b’):
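In other words, something of the form below, where d is a measure of distance between parameter values that we still have to choose:

    F_{\text{reg}}(a, b) = F(a, b) + \lambda\, d\big((a, b),\, (a', b')\big)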

But how exactly should we design this term? In our specific application we are working with a Bayesian model, so the standard choices of L2 or L1 regularization do not make as much sense in our case. This is because in Bayesian models, the parameters themselves can be thought of as defining a prior distribution for some observable random variable, for example the probability to make a purchase.

To understand how priors are used in Bayesian models, let’s consider an illustrative example: assume we have a coin with bias 0<p<1. Given a number of observations, e.g. 100 coin flips and their outcomes, we would like to predict the probability of getting “tails” in the future. A simple Bayesian model might use a beta distribution as a prior for the bias p:
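In the standard parametrization (with B(a, b) denoting the beta function that normalizes the density), this prior reads:

    \mathrm{Beta}(p;\, a, b) = \frac{p^{\,a-1}\,(1-p)^{\,b-1}}{B(a, b)}, \qquad 0 < p < 1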

In this case the parameters a and b can be thought of as setting the shape of the distribution. Now, imagine we train our model using maximum likelihood estimation (or similar), and initially find that the best parameter choices are a’=b’=2. If we re-train the model and those parameters change to a=4, b=2, how “different” is the resulting new distribution, and how much should we penalize this difference in our cost function via the memory term? Let’s say we fix the penalty to be the quadratic difference
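between the old and new parameter values, i.e. (writing M for the memory term):

    M = (a - a')^2 + (b - b')^2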

With this choice, the regularization penalty would be 4.0. Now, consider the case where a=b=4 instead. According to our formula above, the penalty should now be 8.0, since both parameters increased by 2. A quadratic memory term would therefore deem a=4, b=2 to be closer to the original distribution than a=b=4. However, if we look at the shape of the distributions in figure 7, we see that the choice a=b=4 only narrows the distribution, while choosing a=4, b=2 leads to both narrowing and an additional shift. Therefore, one could argue that the red curve is in a sense qualitatively “more similar” to the blue curve than the green curve is, because it still has the same mean. So maybe we should actually penalize this choice of parameters less, or at least not quite twice as much?

figure 7: beta distribution

Clearly these arguments are all very subjective, so we need a more rigorous definition of what it means for our prior to “change significantly”, since this is what we want to punish with our memory term. In particular, we would like to quantify how different one statistical distribution is from another. Thankfully, there exist a number of measures that do exactly this. One of the simplest among them is the Bhattacharyya distance between two distributions g and h:
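In the notation used below (BC(g, h) standing for the distance between two probability densities g and h), the standard definition is:

    BC(g, h) = -\ln \int \sqrt{g(x)\, h(x)}\;\mathrm{d}x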

We can gain some intuition about this formula by looking at extreme cases: if g=h, the integral is equal to 1, so BC = −ln(1) = 0. In the other extreme, if g and h don’t have any overlap (i.e. the intersection of their supports is empty), the integral is zero, so BC = ∞. For any case in between those extremes, we can convince ourselves that the BC distance will be smaller the more overlap g and h have (i.e. the smaller their difference is on any small interval).
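As a quick numerical sanity check, here is a small sketch that computes the BC distance between the beta priors from the example above (using scipy; the numbers are for illustration only):

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import beta

    def bhattacharyya_distance(g, h):
        """BC distance between two densities on (0, 1)."""
        overlap, _ = quad(lambda x: np.sqrt(g.pdf(x) * h.pdf(x)), 0.0, 1.0)
        return -np.log(overlap)

    previous = beta(2, 2)   # yesterday's fit: a' = b' = 2
    narrower = beta(4, 4)   # same mean, only narrower
    shifted = beta(4, 2)    # narrower and shifted

    # Unlike the quadratic penalty, the BC distance rates the same-mean
    # distribution as closer to the previous prior than the shifted one.
    print(bhattacharyya_distance(previous, narrower))
    print(bhattacharyya_distance(previous, shifted))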

Equipped with this new measure of similarity for probability densities, we can now solve the problem at hand. To force our prior distribution to be “not too different” from the previous distribution, we simply choose the memory term to be proportional to the BC distance between today’s prior distribution and the previous distribution:
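In the notation from above (λ again being a tunable weight, and Beta(·; a, b) the prior density):

    \tilde{F}(a, b) = F(a, b) + \lambda \cdot BC\big(\mathrm{Beta}(\cdot\,;\, a, b),\; \mathrm{Beta}(\cdot\,;\, a', b')\big)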

The extra term introduces the desired memory effect to our model, and forces (a,b) to be close to the previous values (a’,b’).
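For the beta priors used here, this distance even has a closed form: under the standard parametrization given above, the overlap integral of two beta densities is itself a ratio of beta functions, so that

    BC\big(\mathrm{Beta}(a, b),\, \mathrm{Beta}(a', b')\big) = -\ln \frac{B\!\left(\tfrac{a+a'}{2},\, \tfrac{b+b'}{2}\right)}{\sqrt{B(a, b)\, B(a', b')}}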

We should note that there exist many alternative ways to measure similarity between statistical distributions. In general, any measure that can capture the difference between distributions is a good candidate for achieving the desired regularizing effect. For example, in our exploration we also considered the Kullback-Leibler divergence, which is an asymmetric measure of divergence. While the measure can easily be symmetrized, the bigger problem was that it did not admit a closed form solution for our model, which complicates the computation. We therefore decided to discard this approach in favor of the BC distance, which has a closed form solution for our model, is simpler, and is easily interpretable. This is in line with one of Bluecore’s core values: finding solutions that are as simple as possible, and as powerful as necessary.
