Rethinking retargeting with machine learning: a technical quest

Floor Eigenhuis
incentro
Sep 20, 2019 · 10 min read


Now, who wouldn’t want to be retargeted for a vacation like this? Photo by Fezbot2000 on Unsplash.

Retargeting. The practice of persuading potential customers to come back to your website after they have failed to convert. Everyone who shops online has been subject to retargeting at one point or another: you just go about your business, and suddenly you see banners everywhere for a product you just looked at. By and large, retargeting works. The problem is that often you did not intend to buy that product, or you have, in the meantime, already purchased a similar one. That not only makes retargeting annoying for you; it also means the company retargeting you wasted marketing spend.

There has to be a smarter way to tackle this problem, we thought at Incentro. And there is. In this blog, I will take you on our quest to build a custom machine learning model that helps a Dutch start-up retarget customers more intelligently. If you’re not technical, don’t be scared! I’ll keep it interesting, but it does help if you have at least some basic machine learning knowledge.

tl;dr: machine learning-powered retargeting caused conversions to rise by 100%!

first things first: the case.

This was a real problem for a real customer of Incentro: vakanties.nl. Vakanties.nl is a Dutch start-up that sells dynamically composed package vacations tailored to a customer’s needs. They were struggling with how exactly to spend their marketing budget; they wanted to make sure to retarget the customers with the highest likelihood of converting. Their strategy up until that moment was to retarget customers who had clicked on certain buttons on the website. Even though human intuition says that clicking on a button probably means you are more interested in the product, this is not a given. There might be other signals that are far more indicative of a person’s willingness to buy, for example, the time someone has spent on a page or the time of day. The problem is that these patterns are hard for humans to find, because humans are not good at analyzing millions of website log lines to find causal relationships. Enter the wonderful world of machine learning 🌞

Our task: for every person who visited the website and did not end up buying a vacation, estimate the probability that this customer would buy a vacation if they came back to the website. This is not a trivial task (or I would not be writing a blog about it right now). How did we do this? It all boils down to taking into account all the steps a website visitor has taken on the vakanties.nl website. I will explain precisely what we did, but first a little intermezzo about recurrent neural networks.

intermezzo: recurrent neural nets

If you’re not a machine learning engineer, chances are you have never heard of recurrent neural networks (RNNs). Recurrent neural networks are not your typical vanilla feed-forward neural network. To illustrate this, take the following picture depicting a ‘regular’ deep neural network. If you have any experience with machine learning, this image will probably look familiar.

A regular neural network. Image source

What the image depicts is a neural network with three input nodes, one hidden layer with four nodes, and one output layer with two nodes. This means that we have some input, the input goes to the hidden nodes, which apply weights to it, and the result, in turn, goes on to the output nodes. Information that you feed into the model will never touch the same node twice. If we try to solve the vakanties.nl case with a neural network like this, we run into a problem.
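To make that concrete, here is a minimal numerical sketch (plain NumPy, purely illustrative, not our production code) of the forward pass through exactly this 3-4-2 network:

import numpy as np

x = np.random.rand(3)                          # three input nodes
W1, b1 = np.random.rand(4, 3), np.zeros(4)     # hidden layer: four nodes
W2, b2 = np.random.rand(2, 4), np.zeros(2)     # output layer: two nodes

h = np.tanh(W1 @ x + b1)                       # input -> hidden (weights + activation)
logits = W2 @ h + b2                           # hidden -> output
p = np.exp(logits) / np.exp(logits).sum()      # softmax: two probabilities that sum to 1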

Let’s start with the output layer. There are two nodes, representing the probability that a website visitor does not convert and the probability that the visitor does convert. These probabilities, of course, add up to 1, as there is no third option in this world of website conversions. So far, so good. But what is our input? The number of clicks? The total amount of time spent on the website? All valid features for a machine learning model, but one thing a regular neural network can’t do is capture a notion of time. For example, there is no straightforward way to feed into this example network that a website visitor first filtered the search results on vakanties.nl by date, then by country, and then filtered on specific hotels. Yet this might be valuable information that you do want the neural network to know about. Another approach would be feeding each step of a website visitor’s journey into the model separately, but a regular neural network has no way to represent these separate pieces of information as one customer journey.

Luckily for us, in the ’80s, some brilliant people figured out a solution to this problem in the form of recurrent neural networks.

What makes recurrent neural networks so unique? As I explained above, regular neural nets have no notion of time and cannot represent data that involves sequences: information goes through the model without touching a node twice.

With recurrent neural networks, the nodes have regular inputs just like the neural network depicted above, but they also have a second input: a representation of everything they have seen before in the same sequence. Let’s illustrate this with another image.

Visualization of a recurrent neural network. Image source

Here, x represents an input node, the blue block denoted by h is a hidden node, and the red circle denoted by o is an output node. As you can see, instead of information passing straight from input to hidden nodes to output, the hidden layer feeds a representation of everything it has seen so far back into itself. This means the RNN can take the entire history of the customer journey into account when making a prediction! 👏👏👏👏👏
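In code, the difference is surprisingly small. Here is a minimal sketch of a vanilla recurrent layer (again plain NumPy, not our production code): the only addition is the hidden state h that is fed back in at every step:

import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    # xs is a sequence of input vectors, e.g. one vector per click in a journey
    h = np.zeros(W_hh.shape[0])                    # the hidden state starts empty
    for x in xs:                                   # walk through the sequence step by step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)     # new state depends on input AND old state
    return h                                       # a summary of the entire sequence

The final h plays the same role as the hidden layer of a regular network: it can be passed to an output layer to produce a prediction for the whole journey.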

(note: there are many different ‘flavors’ of RNNs. As much as I would love to tell you all about them, this is probably not the time and place 😢. If you would like to learn more, this is an excellent place to start!)

let the programming begin.

So, now that we had figured out that recurrent neural networks were probably the way to go, the easiest part of our job was done (deciding what we had to do). Now, we actually had to do it.

This is a strange case in the sense that we didn’t really have any data on the type of customers that should be retargeted: customers that were on the website, left, but then came back and converted. So, we decided to go about this a little differently. In the end, the last step of a customer that we might want to retarget (let’s call this customer A) is always leaving the website before converting (because if they had converted, we wouldn’t have to retarget them). The last step of a visitor that has converted (let’s call this customer B) is always the conversion itself. We don’t include this final step of customer B when training the model, because if we want to predict the probability of conversion for customer A, the journey of customer A never includes the last step of customer B (the conversion step).
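In pseudo-Python, that labeling scheme looks roughly like this (journey is a hypothetical object; the field names are illustrative, not our real schema):

def to_training_example(journey):
    if journey.converted:
        # drop the conversion step itself: the journey of a retargeting
        # candidate will never contain that step
        return journey.steps[:-1], 1    # label 1: converted
    return journey.steps, 0             # label 0: did not convert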

The underlying assumption we make is as follows: if customers X and Y both leave the website, and customer X had a higher probability of converting at the moment right before leaving than customer Y, then customer X still has a higher probability of converting after leaving the website.

input features and model architecture.

It seems logical that there is probably a positive correlation between the number of clicks a customer has made on the website and whether that customer converts. To make sure the model does not blindly predict that customers with longer sessions convert based on journey length alone, we replicate the journeys of customers that have converted four times. So, if a customer that converts has made forty clicks, the model might see the same journey four times during training: once with only the first ten clicks, once with the first twenty, once with the first thirty, and once with all forty (all with the label ‘converted’ attached). This also came in handy because we had many more journeys from customers that had not converted than from customers that had. Using this technique, we solved some of the imbalance between the ‘converted’ and ‘did not convert’ labels.
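A sketch of this oversampling trick (again with illustrative names):

def truncated_copies(steps, n_copies=4):
    # turn one converted journey into n_copies training examples of
    # increasing length (25%, 50%, 75%, 100%), all labeled 'converted'
    examples = []
    for i in range(1, n_copies + 1):
        cutoff = max(1, len(steps) * i // n_copies)
        examples.append((steps[:cutoff], 1))
    return examples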

So, here is what we did: feed lots of customer journeys, of customers that did and did not convert, to the model, and let the model predict whether a given customer will convert at any given point in the customer journey.

We had two types of features: features specific to one step in the customer journey (such as the button the customer clicked on, or the time of the click) and context features that applied to the entire customer journey (for example, the device type the customer used and the channel through which the customer entered the website). The problem with RNNs is that they are less proficient at representing those context features, so we went with the best of both worlds: a hybrid RNN and DNN model 😎 See our model architecture below:

Our model architecture.

building the model.

When we were developing the model (early 2019), there were no out-of-the-box models suitable for this use case, so we decided to build a custom model in TensorFlow 1.14 (in Python), using the tf.estimator.Estimator class. This was easier said than done, and we spent many, many days staring at our computer screens in utter desperation. Lesson learned: it will save you a lot of time and frustration if you can find an out-of-the-box machine learning model that is suitable for your use case.

I’m going to show you a little bit of code now (DON’T BE SCARED) to demonstrate how (relatively) simple the basic building blocks of the model are. The following code snippet defines our model, consisting of a single GRU layer (a particular type of RNN cell) combined with a DNN layer:

import tensorflow as tf  # TensorFlow 1.x

RNN1_SIZE = 64  # GRU units (a tuned hyperparameter; the value here is illustrative)

def rnn1_model(features, mode):
    # seq_features / ctx_features: lists of per-click and per-journey tensors,
    # built from `features` elsewhere in our input pipeline
    cell = tf.nn.rnn_cell.GRUCell(RNN1_SIZE)
    # stack the per-step features into a [batch, time, n_features] tensor
    outputs, state = tf.nn.dynamic_rnn(
        cell, tf.stack(seq_features, axis=2), dtype=tf.float32)
    if len(ctx_features) == 0:
        x_dnn = state
    else:
        # concatenate the final RNN state with the journey-wide context features
        listed = [state, tf.stack(ctx_features, axis=1)]
        x_dnn = tf.concat(values=listed, axis=1)
    # a single sigmoid unit turns this into a conversion probability
    probabilities = tf.layers.dense(x_dnn, 1, activation=tf.sigmoid)
    return probabilities

Looks simple enough, right? Well, it is, but constructing the model architecture is just a small part of making a machine learning model successful.
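For context, this is roughly how a function like rnn1_model plugs into the tf.estimator framework. A simplified sketch: the loss, the optimizer, and the bucket path are my assumptions here, not necessarily what we shipped:

def model_fn(features, labels, mode):
    probabilities = rnn1_model(features, mode)
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions={'p_convert': probabilities})
    labels = tf.reshape(tf.cast(labels, tf.float32), [-1, 1])
    loss = tf.losses.log_loss(labels=labels, predictions=probabilities)  # matches the sigmoid output
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir='gs://some-bucket/model')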

the model learns!!! or does it?

After all those days of despairing over why our model would not work, imagine our joy when the model finally started training. However, the mere fact that a model trains does not mean it also learns. See below a graph of the model’s loss during a training round:

The loss during the first training rounds of our model.

This is typical of a model that cannot find a connection between the input and the output: there are lots of spikes, and there is no consistent downward trend in the loss. In other words: the model did not converge. We did solve this problem (🎉), but it took multiple changes to the model (a sketch of the resulting training setup follows this list):

  • we did not have the right set of hyperparameters for the RNN. We temporarily discarded the DNN and first did hyperparameter tuning for the RNN alone. After that, we added the DNN back to the model.
  • we increased the batch size: we did this because we assumed the gradient of the loss function was so spiky due to the relatively low number of ‘converted’ labels. Boosting the batch size to 2048 or 4096 got the model converging.
  • we had to be patient. A single epoch was not enough to escape the vanishing gradients; instead, we had to run training for multiple epochs.
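Put together, the training setup that finally converged looked roughly like this. make_dataset is a hypothetical helper that parses our TF-records; the epoch count is illustrative, the batch size is the one mentioned above:

def train_input_fn():
    # large batches smooth out the spiky gradients caused by the label imbalance
    return make_dataset('gs://some-bucket/train/*.tfrecord',
                        batch_size=4096, num_epochs=10)

estimator.train(input_fn=train_input_fn)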

This all caused the model to converge, and we FINALLY had a model that was ready to work!

total architecture

As said before, the actual model is just a part of the whole application. I know I skipped over a lot of the details, but I still wanted to give you a high-level overview of the application. Below is a diagram of the entire architecture:

Our application architecture. Note: Cloud ML Engine has now been renamed to AI Platform.

So, the entire architecture is as follows: when a customer visits the vakanties.nl website, the events are sent to BigQuery with the help of Cloud Pub/Sub and Dataflow. Then, again through Dataflow, we take the data out of BigQuery, convert it into the input format for the machine learning model (TF-records), and store it in Cloud Storage. The model then reads the data from Cloud Storage to train, the trained model is exported back to Cloud Storage, and from there it is deployed to Cloud ML Engine (now renamed to AI Platform). Every time a customer leaves the website, their journey is sent to the deployed model, and the model outputs the likelihood that the customer will convert!
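To give you a feel for the middle step, a Dataflow (Apache Beam) job that turns BigQuery rows into TF-records could look something like the sketch below. The table, bucket, and field names are made up, and I’m using beam.io.ReadFromBigQuery from recent Beam releases:

import apache_beam as beam
import tensorflow as tf

def row_to_example(row):
    # map one BigQuery row (a dict) to a tf.train.Example
    feature = {
        'event_name': tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[row['event_name'].encode('utf-8')])),
        'seconds_on_page': tf.train.Feature(float_list=tf.train.FloatList(
            value=[float(row['seconds_on_page'])])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with beam.Pipeline() as p:
    (p
     | 'read' >> beam.io.ReadFromBigQuery(
           query='SELECT * FROM `project.dataset.events`', use_standard_sql=True)
     | 'to_example' >> beam.Map(row_to_example)
     | 'write' >> beam.io.WriteToTFRecord(
           'gs://some-bucket/train/data',
           coder=beam.coders.ProtoCoder(tf.train.Example)))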

results

While I could write down the theoretical results here, we also have real-world results since the model has been live for a while now. Some facts and figures:

  • the number of conversions rose by 100% thanks to this machine learning-powered retargeting 😎
  • costs per transaction dropped by 50–100% compared to the previous situation, without the machine learning model and using smart remarketing lists
  • acquisition costs decreased by 50%

Pretty cool, right? Many machine learning blogs out there concern toy examples; this is a relatively rare example of a custom machine learning model that is not only used in production but has had a significant impact on day-to-day operations. Keep in mind: this time it was for a website that sells vacations, but the same algorithm can be applied to any website that sells anything. Shoes, skateboards, books: you name it. The model just needs to be retrained.

Well, that’s all for today! I hope you found this interesting and if you have any questions, you know where to find me. 🙋


Floor Eigenhuis
incentro

Machine learning engineer and enthusiast. Interested in the technical and ethical side of things.