An Approach to Device Cross-Linking

Data Reply
May 1, 2018

Cross-device tracking, a form of record linkage (RL), is one of the biggest challenges for digital marketers today. Correctly identifying the same users behind millions of distinct devices is not only a matter of choosing and applying the right Machine Learning (ML) technique but also of doing so feasibly: the volume of data involved can strain even technologies like Apache Spark.

SO WHAT IS IT REALLY ABOUT?

In the context I want to write about today, record linkage is the task of finding records that refer to the same user across different data sources (e.g., data files, books, websites, databases). It is a vital task when joining data sets based on users that may or may not share common identifiers (e.g., database key, URI, national identification number).

A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record linkage is known as data linkage in many jurisdictions, but it is the same process.

GETTING PRACTICAL

Suppose you are given a dataset characterised by the following specification:

  1. It is composed of the following features (columns):
     - Unique Device Id: e.g. MAC addresses.
     - Type of Device: e.g. mobile, tablet, PC, laptop.
     - IP addresses.
     - A timestamp associated with some activity, e.g. accessing a sports App on a mobile device or a tablet, or a certain website (url) through another device.
     - The make of the device, e.g. HTC, Samsung, Apple.
     - The model of the device, e.g. iPhone 6S.
     - The name of the App or of the url accessed.
  2. Assume millions of records.
  3. Assume an average of 100 records per device.
  4. You have no cookie data on the devices or additional third-party (Google, Facebook) identification information that would make your task less complex.
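
To make the specification concrete, one record of such a dataset could be modelled roughly as follows. This is only an illustrative sketch: the class name `DeviceEvent`, the field names, and the sample values are assumptions, not part of any real schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DeviceEvent:
    """One activity record; all names here are hypothetical."""
    device_id: str       # unique device id, e.g. a MAC address
    device_type: str     # e.g. "mobile", "tablet", "PC", "laptop"
    ip_address: str      # IP address observed for this activity
    timestamp: datetime  # when the activity took place
    make: str            # e.g. "HTC", "Samsung", "Apple"
    model: str           # e.g. "iPhone 6S"
    target: str          # name of the App or the url accessed


# A made-up example record (values are invented for illustration).
example = DeviceEvent(
    device_id="00:1A:2B:3C:4D:5E",
    device_type="mobile",
    ip_address="203.0.113.7",
    timestamp=datetime(2018, 5, 1, 8, 30),
    make="Apple",
    model="iPhone 6S",
    target="sports-app",
)
```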

The question is: what kind of techniques would you use in order to identify the same user behind different devices, i.e. to correctly group devices that belong to the same user?

This is a multifaceted problem with no single optimal solution, because of the complexity of the available information and the many ways that information can be used to identify the same users behind distinct devices. Arguably, the dataset above contains enough information to reach a highly accurate solution by brute force: comparing and analysing every pair of records (in effect, the Cartesian product of the dataset with itself). However, if we are talking about processing millions of records with numerous features, and doing so on a daily basis through a batch process, such a solution quickly becomes infeasible. In reality, as we will cover later in this article, we would typically wish to enrich this data set with additional data and apply a number of feature engineering techniques, compounding the computational challenge.
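
A quick back-of-the-envelope calculation shows why the brute force approach does not scale. The figure of exactly one million records is an assumption chosen for illustration (the specification only says "millions"); the 100 records per device comes from the specification above.

```python
from math import comb


def unordered_pairs(n: int) -> int:
    """Number of distinct pairs among n items, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


n_records = 1_000_000       # assumed: the low end of "millions of records"
records_per_device = 100    # from the dataset specification
n_devices = n_records // records_per_device

record_pairs = unordered_pairs(n_records)
device_pairs = unordered_pairs(n_devices)

print(f"record pairs: {record_pairs:,}")   # record pairs: 499,999,500,000
print(f"device pairs: {device_pairs:,}")   # device pairs: 49,995,000
```

Even at this modest size, a record-level comparison means roughly half a trillion pairs per batch run, and the count grows quadratically with the number of records.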

The real challenge here is finding a shortcut: how to avoid unnecessarily comparing records that are very unlikely to be linked in any way.
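
One common shortcut of this kind in record linkage is blocking: only records that share some cheap key are ever compared. The sketch below uses a shared IP address as the blocking key, which is an assumption made for illustration and not necessarily the approach this article goes on to take; the function name `candidate_pairs` and the sample data are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(events):
    """Return device pairs that were seen on at least one common IP.

    `events` is an iterable of (device_id, ip_address) tuples.
    """
    # Group devices into blocks keyed by IP address.
    devices_by_ip = defaultdict(set)
    for device_id, ip_address in events:
        devices_by_ip[ip_address].add(device_id)

    # Only generate pairs within each block, never across blocks.
    pairs = set()
    for devices in devices_by_ip.values():
        for a, b in combinations(sorted(devices), 2):
            pairs.add((a, b))
    return pairs


# Invented sample data: four devices, three IP addresses.
events = [
    ("phone-1", "203.0.113.7"),
    ("tablet-1", "203.0.113.7"),
    ("laptop-1", "198.51.100.4"),
    ("phone-1", "198.51.100.4"),
    ("pc-9", "192.0.2.55"),
]

pairs = candidate_pairs(events)
print(sorted(pairs))
```

Here four devices would give six pairs under brute force, but only two candidate pairs survive blocking. The trade-off is that any truly linked devices that never share the blocking key are missed, so the choice of key matters.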

Generally, one needs to come up with a tractable solution that would optimally employ available resources and technologies to tackle the problem, and this is what this article is about.

Originally published at www.datareply.co.uk.
