Location Data and Lookalike Modelling

Feb 4, 2019

In advertising, lookalike modelling is used to increase your audience size.

Let’s suppose you have a set of users that make up your audience, but you know it is not enough to deliver the campaign your client is so desperately asking for. What do you do? You take your audience and you “extend” it with “similar” users: the positive impact on delivery is obvious, and there is the additional advantage of reaching users who might not know about your client yet (user acquisition).

While lookalike modelling is mentioned in hundreds of places ([1], [2] for instance), it is very rarely explained from a technical point of view, as there are several different ways to solve this problem.

Location data

The data I am looking for in this PoC should contain an identifier alongside a timestamp and a GPS location. After a short search on Google, I stumbled on the Microsoft Research T-Drive Trajectory Data Sample, a dataset containing taxi locations, which is perfect for this project (I suggest you look at the associated publications too, as it is really interesting stuff).

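Location data is a weird beast: latitude and longitude are continuous, but the distances they describe are not linear in the coordinates (a degree of longitude covers less ground the further you move from the equator), which makes them very hard to model directly. For this reason, I need to convert them into a discrete space (i.e. make them categorical), and the obvious (albeit not the best) solution is to use geohash, which maps each latitude/longitude pair to a short string identifying a grid cell.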
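As a rough illustration of what the conversion does, here is a minimal sketch of geohash encoding in plain Python. The function name geohash_encode and the default precision of 6 characters are assumptions made for this example, not something taken from the original code; in practice a dedicated geohash library would normally do this job.

```python
# Geohash alphabet: digits plus lowercase letters, without a, i, l and o.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash string.

    The ranges are halved repeatedly, alternating between longitude and
    latitude (longitude first); every 5 bits become one base-32 character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1  # upper half -> bit 1
                lon_lo = mid
            else:
                bits <<= 1              # lower half -> bit 0
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


# A point in central Beijing, where the T-Drive taxis operate.
cell = geohash_encode(39.9, 116.4, precision=6)
print(cell)
```

Longer geohashes mean smaller cells, so the precision controls how close two taxis must be to count as being in the same place. Nearby points usually share a prefix, but not always: two points on either side of a cell boundary can have very different hashes, which is one reason geohash is not the best choice.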

After reading and concatenating all the data, we transform it into a taxi/geohash matrix, with one row per taxi and one column per geohash cell.
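To make the shape of that matrix concrete, here is a small sketch using only the standard library. The sample pings, the function name build_matrix and the choice to store ping counts in each cell (rather than a 0/1 flag) are all assumptions for illustration; the original code may well have used a pandas pivot instead.

```python
from collections import Counter

# Hypothetical sample: one (taxi_id, geohash_cell) pair per GPS ping.
pings = [
    ("taxi_1", "wx4g0b"), ("taxi_1", "wx4g0b"), ("taxi_1", "wx4g0c"),
    ("taxi_2", "wx4g0b"), ("taxi_2", "wx4g0c"),
    ("taxi_3", "wx4ey1"),
]


def build_matrix(pings):
    """Return (taxi_ids, cells, matrix).

    matrix[i][j] is the number of pings taxi_ids[i] produced in cells[j].
    """
    counts = Counter(pings)
    taxi_ids = sorted({taxi for taxi, _ in pings})
    cells = sorted({cell for _, cell in pings})
    matrix = [[counts[(taxi, cell)] for cell in cells] for taxi in taxi_ids]
    return taxi_ids, cells, matrix


taxi_ids, cells, matrix = build_matrix(pings)
for taxi, row in zip(taxi_ids, matrix):
    print(taxi, row)
```

On real data this matrix is very sparse, since each taxi only visits a small fraction of all the cells, so a sparse representation would be the more realistic choice.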

Modelling

Now that we have a taxi/geohash matrix, we can model the relationship between taxis in many different ways: for this article, I’ll choose the simplest by far, NearestNeighbors from scikit-learn.
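For readers who want to see what a nearest-neighbour search computes on such a matrix, here is a dependency-free sketch of a brute-force search. It is a stand-in for illustration, not the scikit-learn code itself: the cosine distance, the helper names and the toy matrix are all assumptions, as the distance metric used in the original is not stated.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the two rows point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: treat an all-zero row as dissimilar to everything
    return 1.0 - dot / (norm_u * norm_v)


def nearest_taxis(matrix, taxi_ids, query_id, k=2):
    """Return the k taxis closest to query_id as (taxi_id, distance) pairs."""
    query_row = matrix[taxi_ids.index(query_id)]
    scored = [
        (cosine_distance(query_row, row), taxi)
        for taxi, row in zip(taxi_ids, matrix)
        if taxi != query_id  # a taxi is trivially its own nearest neighbour
    ]
    scored.sort()
    return [(taxi, dist) for dist, taxi in scored[:k]]


# Hypothetical taxi/geohash matrix: rows are taxis, columns are cells.
taxi_ids = ["taxi_1", "taxi_2", "taxi_3", "taxi_4"]
matrix = [
    [0, 2, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 3],
    [0, 3, 1, 0],
]

neighbours = nearest_taxis(matrix, taxi_ids, "taxi_1", k=2)
for taxi, dist in neighbours:
    print(taxi, round(dist, 4))
```

Cosine distance compares the mix of cells a taxi visits rather than the raw number of pings, so a taxi with twice as many pings in the same places still counts as similar.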

Testing

Here comes the fun part: how do we test it? As I want to keep this article nice and short, I will just plot a few charts and prove to you it works. :-)

The code below does the following:

select a random taxi
find the 2 most similar taxis
plot them on a map
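The first two steps, plus the data preparation for the third, can be sketched as below. Everything here is hypothetical: the tracks, the precomputed ranking of similar taxis (most similar first, as a neighbour search would produce) and the function name pick_and_collect are made up for illustration, and the plotting itself is left out.

```python
import random

# Hypothetical raw GPS tracks per taxi: lists of (latitude, longitude).
tracks = {
    "taxi_1": [(39.90, 116.40), (39.91, 116.41)],
    "taxi_2": [(39.90, 116.41), (39.92, 116.41)],
    "taxi_3": [(40.05, 116.60), (40.06, 116.62)],
    "taxi_4": [(39.91, 116.40), (39.91, 116.42)],
}

# Hypothetical output of a neighbour search: other taxis, most similar first.
ranking = {
    "taxi_1": ["taxi_4", "taxi_2", "taxi_3"],
    "taxi_2": ["taxi_1", "taxi_4", "taxi_3"],
    "taxi_3": ["taxi_2", "taxi_1", "taxi_4"],
    "taxi_4": ["taxi_1", "taxi_2", "taxi_3"],
}


def pick_and_collect(tracks, ranking, k=2, seed=None):
    """Pick a random taxi and gather its track plus those of its k neighbours."""
    rng = random.Random(seed)          # seeded so the pick is reproducible
    query = rng.choice(sorted(tracks))
    selected = [query] + ranking[query][:k]
    return {taxi: tracks[taxi] for taxi in selected}


to_plot = pick_and_collect(tracks, ranking, k=2, seed=42)
for taxi, points in to_plot.items():
    lats, lons = zip(*points)          # split into the series a plot needs
    print(taxi, lats, lons)
```

Each taxi ends up with one latitude series and one longitude series, which is the shape most plotting or mapping libraries expect; drawing each taxi in a different colour makes the overlap easy to judge by eye.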

This is what you get:

How far are we from a production-grade system?

In short: very, very far!

But that’s not the aim of this article. Production-grade systems take time to develop and require extensive testing and constant monitoring, which is much more than I wanted to cover here.

References and resources

[1] https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe
[2] https://github.com/nikhitmago/lookalike-modelling
[3] https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/
[4] https://towardsdatascience.com/how-did-we-build-book-recommender-systems-in-an-hour-part-2-k-nearest-neighbors-and-matrix-c04b3c2ef55c (actually, most of the code for this article is taken from here… must read!)