Identity resolution for travel retailing (part 1)

--

In this article, we share how we create a single-customer-view at OpenJaw using a new identity resolution algorithm we have developed that is optimised for the travel domain. This is the first part of a three-part article on this topic.

By Auren Ferguson, John Carney, Vignesh Mohan, Gavin Kiernan, Yuxiao Wang and Beibei Flynn.

Introduction

In the travel sector, customer data is stored and managed across a multitude of different systems, spanning Passenger Servicing Systems (PSS), e-commerce systems, Customer Relationship Management (CRM) systems, Loyalty systems, digital marketing systems and, indirectly, via external systems such as social media platforms. At OpenJaw we can consume data from all of these sources, integrate it and use machine learning to generate insights on customer behaviour that can be used for personalisation.

The first step in this journey is to create a single-customer-view. The problem with combining customer data from a disparate set of sources is that there is rarely a common, unique key linking all the records that belong to a single entity. This is where identity resolution comes in: it provides a probabilistic / machine learning approach to linking entities across disparate data sources. It is also common to perform identity resolution within a single data source, to link together transactions made in one system.

A common example of where identity resolution is required at OpenJaw arises when customers book flights, hotels or vehicles on our e-commerce platform using different combinations of personal information. For example, a person may present a passport as ID when flying internationally but a driver’s licence or national ID when flying domestically. A passport or national ID is usually as close to a unique identifier as a person can have, but in this case conventional strict matching will not consider the two flights to have been purchased by the same person. However, shared information such as name, email and phone number(s) can be used to match them with a high degree of probability using fuzzy matching, machine learning and graph algorithms.
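To make the passport-versus-driver’s-licence scenario concrete, here is a minimal sketch of fuzzy matching on two hypothetical booking records. The names, IDs and the equal field weighting are illustrative assumptions, not OpenJaw’s actual matching logic; the string similarity uses Python’s standard-library `difflib` rather than any particular production algorithm.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalised string similarity in [0, 1] (difflib's ratio)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Two hypothetical bookings, made with different forms of ID.
booking_a = {"name": "Siobhan O'Connor", "email": "s.oconnor@example.com", "id": "P1234567"}   # passport
booking_b = {"name": "Siobhán OConnor",  "email": "s.oconnor@example.com", "id": "DL-99-881"}  # driver's licence

# Strict matching on the ID field fails outright...
strict_match = booking_a["id"] == booking_b["id"]

# ...but fuzzy matching on the remaining fields gives a strong signal.
name_score = similarity(booking_a["name"], booking_b["name"])
email_score = similarity(booking_a["email"], booking_b["email"])
match_score = 0.5 * name_score + 0.5 * email_score  # illustrative equal weighting

print(strict_match)       # False
print(match_score > 0.9)  # True: very likely the same person
```

In practice a production matcher would weight fields by how discriminating they are (an exact email match is far stronger evidence than a similar first name), but the principle is the same: combine per-field similarity scores into an overall match probability.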

Since our end goal is to deliver highly personalised customer experiences, it is imperative to link customer records together reliably. An analogy we like to use is that identity resolution is to personalisation what foundations are to a house: it ensures that everything built on top of it performs as expected (e.g. predictive models of customer behaviour), and that customers receive relevant personalised offers aligned with the marketing consents they have provided under data protection laws such as GDPR.

Key challenges in implementing identity resolution in travel

Conceptually, identity resolution is very simple to understand, but it is notoriously difficult to implement, especially in travel, for a myriad of reasons. For example, many of the airlines we work with at OpenJaw transport over 50 million passengers a year. When you collect this data over several years, it may resolve to over 100 million unique customer profiles. When you also consider the multitude of source systems that this data can originate from, beyond core systems like the PSS, volumes are very large. The velocity at which this data is generated is also a challenge, as are data quality issues (missing values, inconsistent data input) and schema variations across systems in the travel ecosystem.

Of course, this challenge of volume, velocity and variability in the data leads to a very heavy computational load. The load scales quadratically with data volume, since each record needs to be compared with every other record to check for a possible match: an input dataset of n records requires n(n−1)/2 comparisons. For instance, 1,000 records require 499,500 (roughly 5x10⁵) comparisons. This doesn’t sound too bad, but an input of 100 million records corresponds to roughly 5x10¹⁵ comparisons! To compound this issue, many fields (first name, surname, date of birth, phone, email, etc.) for each potential match need to be analysed using fuzzy matching algorithms, which are themselves computationally expensive operations.

As an example, the OpenJaw data science team recently implemented our identity resolution algorithm for a large international airline. The ‘cleansed’ input data had approximately 100 million records and we were searching for matches across 7 fields / identifiers. Comparing all records for potential matches would have required roughly 3.5x10¹⁶ field comparisons. Obviously, it is not feasible to compute this, as 10¹⁶ is an unimaginably large number: for comparison, the Milky Way is estimated to contain 250 billion (2.5x10¹¹) stars, and this number is over 100,000 times larger. To help alleviate the data volume issues, we used techniques to reduce the number of potential matches and distributed computing frameworks.
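One standard family of techniques for reducing the number of potential matches is blocking: records are grouped by a cheap, coarse key, and fuzzy comparisons are only run within each group. The sketch below illustrates the idea on a handful of hypothetical records; the specific blocking key (first letter of surname plus birth year) is an illustrative assumption, not necessarily the scheme used in the deployment described above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (record_id, surname, birth_year)
records = [
    (1, "ferguson", 1985),
    (2, "fergusen", 1985),   # likely the same person, surname misspelt
    (3, "carney",   1990),
    (4, "mohan",    1985),
    (5, "carney",   1990),
]

def blocking_key(rec) -> str:
    """Coarse key: first letter of surname + birth year.
    Only records sharing a key are ever compared in detail."""
    _, surname, year = rec
    return f"{surname[0]}{year}"

# Group records into blocks by key.
blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

# Generate candidate pairs only within each block.
candidate_pairs = [
    (a[0], b[0])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]

print(len(candidate_pairs))  # 2, versus 10 for the full all-pairs comparison
```

A good blocking key rarely separates true matches while cutting the candidate set by orders of magnitude; in practice several complementary keys are often combined so that a typo in one field does not hide a match.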

Technology used at OpenJaw for identity resolution

Given the Big Data workloads that real-world identity resolution in travel involves — spanning volume, velocity and variety — we built our solution at OpenJaw using Apache Spark.

This platform is particularly well suited to identity resolution because it has capabilities that span ETL (extract, transform, load), machine learning and graph algorithms all in a highly scalable, parallel processing framework. It also allows us to easily deploy on both Cloud and on-premise infrastructure.

In our Cloud implementation we use Amazon Web Services (AWS), spinning up an Amazon EMR (Elastic MapReduce) cluster that consumes input datasets stored in AWS S3.

For on-premise solutions, we use a Cloudera cluster running Apache Hadoop YARN as the resource manager and the Hadoop Distributed File System (HDFS) for storage.

In part 2 of this article…

In part 2 we will describe in detail the identity resolution algorithm we have developed at OpenJaw that runs at scale on the infrastructure described above. This is a sophisticated algorithm that uses a combination of probabilistic (‘fuzzy’) matching algorithms, graph theory and machine learning to resolve identities across very large (100M+ records) customer data sets.

--

The OpenJaw Data Science Team
The OpenJaw Data Science Blog

The data science team at OpenJaw share their approach, opinions and methodologies for data science in travel.