AI Applied: Inner Workings of Audience Expansion — Part I
Audience Expansion is a great example of AI/ML being applied in industry to create value. In this blog post series, we want to explore how Audience Expansion came to be and, in a second instalment, discuss the technical details of the algorithm.
Our customers like to be able to define segments with fine-grained selection criteria, but they also need to meet their reach requirements. In many scenarios, our predictive capabilities allow them to trade off precision and reach, for example by adjusting the confidence level of the chosen socio-demographic filters. In other cases, this process is, unfortunately, inherently difficult. Imagine an audience based on interactions with web pages containing specific keywords, i.e. what we call a keyword audience. How could you extend its reach without sacrificing the accuracy and representativeness of the chosen set of keywords?
If you are familiar with Data Management Platforms (DMPs) and digital advertising, then a lot of analogous use cases and corresponding challenges are probably coming to mind as you read this. It is exactly to overcome such difficulties and to enable such use cases that we augmented our audience-building process with the ability to automatically expand an audience to the desired reach by including similar users. The key to doing this is defining a generic measure of similarity between users with different traits and consumption patterns. Our representation learning scheme does exactly this in an attribute-agnostic manner.
Lights, camera, action!
Imagine you are about to run a campaign for winter sports equipment. To this end, you define a topic that includes keywords such as ski, snowboard, and ice skating, and then create an audience made up of those users who recently interacted with web pages containing these keywords, as shown in the screenshot above*. We call this the “seed” audience. Your audience includes 244K users, but you feel it would be better to have a slightly larger reach, for example 300K, so you move your mouse further down and type 56K into the Additional profiles input box. As a result, the corresponding slider moves, the counter at the top of the page shows 300K, and a coffee cup appears, indicating that your audience is now being computed. What happens behind the scenes is very powerful: our algorithms will now seek additional profiles based on how similar they are to the “seed” audience.
A little while later, you check the status and find that the coffee cup has disappeared: the audience is ready! Behind the scenes, a fair amount of computation has been carried out; in particular, a model encoding the distinctive traits of your “seed” audience has been learned and used to rank all your users by similarity. The top 56K users were then selected and added to the audience.
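To make the ranking step concrete, here is a minimal sketch in Python. It assumes users are represented by embedding vectors learned in an attribute-agnostic way; the array sizes, the random embeddings, and the centroid-plus-cosine-similarity scoring are illustrative stand-ins, not our production model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: every user in the universe has an embedding vector
# (sizes here are kept small for illustration).
n_universe, dim = 10_000, 16
universe = rng.normal(size=(n_universe, dim))
seed_ids = rng.choice(n_universe, size=500, replace=False)

# Encode the "seed" audience as the centroid of its members' embeddings.
seed_centroid = universe[seed_ids].mean(axis=0)

# Score every user by cosine similarity to the seed centroid.
norms = np.linalg.norm(universe, axis=1) * np.linalg.norm(seed_centroid)
scores = universe @ seed_centroid / norms

# Exclude seed members, then take the top-K most similar users as the expansion.
scores[seed_ids] = -np.inf
k = 100
expansion_ids = np.argsort(scores)[::-1][:k]
```

In the product, the same idea runs at a much larger scale: the model scores the whole user universe and the top 56K users (in this walkthrough) are appended to the audience.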
At this point, you may be wondering whether 56K was a good choice and whether there is room for further expansion without significantly sacrificing quality. To help you with this decision, we added a metric that can be seen to the left of the expansion slider; it tells you that the current uplift over random is 692%, i.e. that the model is approximately 7.9 times better than a random model. But better at what? Better at retrieving a hold-out set of users from the “seed” audience…
The uplift-over-random metric
The concept is a bit difficult to grasp, isn’t it? Let’s be more concrete and dive a little into the math behind it. We sample a small number of users, say 500, from the “seed” audience and set them aside. Then, we pick 56K users from the universe (excluding the “seed” audience) in two different ways. The first group of 56K users is selected at random, while the second group is picked based on the scores assigned by the model we learned (the model that ranked users by similarity)**. Now, let’s count how many of the initial 500 users actually appear in each of these two groups. In the random set, we find 30 of them; in the model-based set, 237. Thus, our model is 7.9 times better than random (30 × 7.9 = 237)***.
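The arithmetic behind the metric is straightforward. The sketch below plugs in the counts from the walkthrough; the function name is our own. Note that with these rounded counts the uplift comes out at 690%, while the 692% shown in the UI reflects the product’s exact counts.

```python
def uplift_over_random(model_hits: int, random_hits: int) -> float:
    """Percentage uplift of the model-based pick over a random pick of
    the same size, measured on the held-out seed users each retrieves."""
    return 100.0 * (model_hits / random_hits - 1.0)

# Counts from the walkthrough: of the 500 held-out seed users, the
# model-based pick of 56K contains 237, the random pick contains 30.
uplift = uplift_over_random(model_hits=237, random_hits=30)
print(round(uplift))  # 690
```

An uplift of 0% would mean the model retrieves no more held-out users than chance, i.e. the expansion is effectively random.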
An uplift of 692% sounds promising and therefore you feel it makes sense to further expand the audience. You check the metrics at an expansion size of 100K; still good. You feel adventurous and check at 500K; too much, the uplift is only 38%, it is almost a random expansion. You opt for the 100K expansion.
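The decision process above amounts to picking the largest expansion whose uplift is still acceptable. Here is a toy sketch of that rule; the uplift value at 100K is a made-up placeholder (the text only says it is “still good”), and the threshold is an arbitrary illustrative choice.

```python
# Uplift readings (%) at candidate expansion sizes; the 100K value is
# hypothetical, the others are taken from the walkthrough.
uplift_at_size = {56_000: 692, 100_000: 350, 500_000: 38}

MIN_UPLIFT = 100  # accept only expansions at least 2x better than random

# Pick the largest expansion size whose uplift still clears the bar.
chosen = max(size for size, uplift in uplift_at_size.items()
             if uplift >= MIN_UPLIFT)
print(chosen)  # 100000
```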
Finally, you open the Audience Explorer to check the demographics and interests of your expanded audience. Nice!
In the next instalment, we’re going to dive into the mathematical background of the algorithm. How can we theoretically frame the ‘expansion’ problem, and how does our algorithm approach it? Stay tuned for part II!
* Screenshots, and numbers therein, are based on a demo account.
**Excluding those 500 users from the training and validation sets.
***We would like to mention that the definition and computation of this metric also involve some assumptions and limitations that we could not cover in this post. We encourage the interested reader to contact us for further information.