An identity graph, similar to those created by OpenJaw’s data science team for identity resolution.

Identity resolution for travel retailing (part 2)

--

In this article, we share how we create a single-customer-view at OpenJaw using a new identity resolution algorithm we have developed which is optimised for the travel domain. This is the second part of a three-part article on this topic.

By Auren Ferguson, John Carney, Vignesh Mohan, Gavin Kiernan, Yuxiao Wang and Beibei Flynn.

Introduction

In part 1 of this article, we described the key challenges of identity resolution in the travel domain. In this part, we will describe some of the details of the algorithm we have developed at OpenJaw to solve this problem.

Data cleansing

It’s often said that about 80% of a data scientist’s time is spent on data cleansing. Identity resolution is no different in this regard. Cleansing and standardising the data is a fundamental and, unfortunately, very time-consuming part of the process.

The first step is to get the data from all sources into a common schema. At OpenJaw, we have devised a flexible template schema that standardises field names and data types, for example:

  • given_name: string
  • surname: string
  • date_of_birth: integer
  • booking_reference: string
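
As a minimal sketch of what this standardisation looks like in practice, the PySpark snippet below maps one hypothetical source (the column names firstName, famName, dob and pnr, the file path, and the yyyymmdd date encoding are all illustrative assumptions) onto the template schema; every source gets its own mapping of this kind.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source with its own field names; each source needs a mapping
    # like this onto the common template schema.
    bookings = spark.read.parquet("bookings/")

    standardised = bookings.select(
        F.col("firstName").cast("string").alias("given_name"),
        F.col("famName").cast("string").alias("surname"),
        F.col("dob").cast("integer").alias("date_of_birth"),   # assumed yyyymmdd integer
        F.col("pnr").cast("string").alias("booking_reference"),
    )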

We then combine the data from all sources into a single dataset and proceed to cleanse it. The cleansing process can vary massively from dataset to dataset, but it generally involves steps such as:

  • Convert all string fields to lower case
  • Remove accents from names
  • Remove/add phone number country codes
  • Remove non-numerics from phone numbers
  • Extract information such as gender and date of birth from identification documents
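
A rough sketch of a few of these steps in PySpark, applied to the standardised dataframe from the previous snippet (the phone field and the exact rules are illustrative; real datasets each need their own variants):

    import unicodedata
    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType

    @F.udf(StringType())
    def strip_accents(s):
        # e.g. 'josé' -> 'jose', 'müller' -> 'muller'
        if s is None:
            return None
        return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

    cleansed = (
        standardised
        # lower-case and de-accent the name fields
        .withColumn("given_name", strip_accents(F.lower(F.col("given_name"))))
        .withColumn("surname", strip_accents(F.lower(F.col("surname"))))
        # keep digits only in phone numbers (assumes the schema also carries a
        # 'phone' field; country-code handling is omitted here)
        .withColumn("phone", F.regexp_replace(F.col("phone"), r"[^0-9]", ""))
    )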

While this stage is often tedious and labour-intensive, the goal is to get the data into good shape, ready to pass through the rest of the process seamlessly.

Reducing computational overhead with ‘blocking’

As mentioned in part 1 of this article, the computational expense of identity resolution is extremely high. The next steps are to reduce the number of potential matches by data reduction/de-duplication and a method called ‘blocking’.

Data from three sources are standardised, combined into one format and cleansed.

For data reduction, we assign each vector of personal identifier information a unique reference number. Depending on the degree of duplication, this can greatly cut down the number of records that need to be compared. For example, if a frequent customer (> 100 bookings) books using the same details each time, we can reduce those bookings down to a single record, or, as we call it, a ‘persona’. A person/entity could have tens or hundreds of these personas, and the job of identity resolution is to link them together.
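
As a minimal sketch (the list of identifier fields is illustrative), this reduction step boils down to collapsing identical personal-identifier vectors into a single row and keying each one with a persona ID:

    import pyspark.sql.functions as F

    # the personal-identifier fields that define a persona (illustrative list)
    pii_fields = ["given_name", "surname", "date_of_birth", "email", "phone"]

    personas = (
        cleansed
        .select(*pii_fields)
        .dropDuplicates(pii_fields)
        # persona_id = hash of the identifying fields, used as the unique reference
        .withColumn(
            "persona_id",
            F.sha2(
                F.concat_ws(
                    "|",
                    *[F.coalesce(F.col(c).cast("string"), F.lit("")) for c in pii_fields]
                ),
                256,
            ),
        )
    )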

Blocking is a method used to reduce the overall number of comparisons between records during the identity resolution process. It restricts the number of comparisons by creating subsets of the data according to certain criteria and only performing full matching between records within each subset. Blocking methods have been studied extensively in academia (see e.g. (1)), with approaches ranging from very simple blocks, such as postcode, to more complicated methods such as the phonetics of the surname (grouping similar-sounding surnames). Which blocking techniques to use depends on the given data, and there is no single ‘correct’ block that will capture all matches in the same group.

To frame it from a machine learning perspective, choosing a block is a precision-recall problem: we want to get as many potential matches as possible into the same block while also dramatically reducing the overall number of comparisons, and with it the computational load. A simple block we have used before is the first letter of the first name plus the first two letters of the surname. The rationale is that when filling out online forms, people are far less likely to make mistakes/typos at the beginning of their names.

Blocking will always exclude some matches. Using the previous example, what if someone gets married and chooses to change their surname? The name block described above won’t work. As a result, it’s common to use multiple blocks to reduce this error, e.g. the name block plus the year from the date of birth, as shown in the figure below. There are two entities in this example, William (Will/William) and Mary. Using the name block, we capture all of William’s records (block ‘ws’) but not Mary’s (blocks ‘mt’ and ‘mp’), since she recently got married and changed her name. The second block captures her records, since the year of birth is the same. Because row 3 has no date of birth, the relationship between it and rows 1 and 2 is missed. The candidate matches from the different blocks are then combined and de-duplicated to create the final set of candidate matches; these are the records that will be ‘fuzzy matched’ later. The number of comparisons has been reduced from 10, in the case of no blocking, to 4 using blocking, yet this still captures all the information required to resolve the identities.

Blocking process using two methods, the first method uses the first letter of given name and surname and the second block is the year of birth.
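
A sketch of this two-block scheme on the persona records (persona_id and the field names come from the earlier sketches; the year is extracted assuming date_of_birth is stored as a yyyymmdd integer):

    import pyspark.sql.functions as F

    blocked = (
        personas
        # block 1: first letter of given name + first two letters of surname
        .withColumn("name_block",
                    F.concat(F.substring("given_name", 1, 1),
                             F.substring("surname", 1, 2)))
        # block 2: year of birth (assuming a yyyymmdd integer)
        .withColumn("year_block",
                    F.substring(F.col("date_of_birth").cast("string"), 1, 4))
    )

    def candidate_pairs(df, block_col):
        # self-join within a block; id_l < id_r keeps each unordered pair once
        left = df.select(F.col("persona_id").alias("id_l"), F.col(block_col).alias("block"))
        right = df.select(F.col("persona_id").alias("id_r"), F.col(block_col).alias("block"))
        return (left.join(right, "block")
                    .where(F.col("id_l") < F.col("id_r"))
                    .select("id_l", "id_r"))

    # union the candidates from both blocks and de-duplicate
    candidates = (candidate_pairs(blocked, "name_block")
                  .union(candidate_pairs(blocked, "year_block"))
                  .dropDuplicates(["id_l", "id_r"]))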

The next stage of the process is the most technically and computationally challenging, as it involves creating the candidate matches on which fuzzy matching will be performed. This requires a very large self-join on the blocking field(s) of the input data, which balloons the data volume dramatically. In a typical implementation at OpenJaw, we may have as many as 100M records to resolve, which corresponds to approximately 250 billion candidate matches. This is compounded by the fact that many of the datasets we work with at OpenJaw are highly skewed, with some blocks having over 100,000 records but most having only tens to hundreds of records.

To handle this complexity and make the problem computationally tractable, we developed the following method (a simplified code sketch follows the steps below):

1) We ‘salt’ the left version of the dataset so it’s evenly distributed between nodes/executors (2). This involves adding a degree of randomness to the join key: a number between 1 and 1000 is appended to the blocking field, e.g. if the block is birth year (1980) and the salt is 123, the salted key becomes 1980123. We can then repartition the dataset on this key to make it evenly distributed.

‘Salting’ identity data is an important part of identity resolution at OpenJaw.

2) We break the right dataset (i.e. a non-salted copy of the left dataset) into two parts, one with skewed blocks (defined by a minimum block size chosen according to data volume) and one without. The skewed part is broken up into a number of chunks, with each chunk containing all records of a block and approximately the same number of comparisons to calculate.

3) We perform iterative broadcast joins of each chunk against the left dataset. This removes the skew, so each executor takes around the same time to complete.

4) We perform a regular sort merge join on the un-skewed data and combine them back into one dataset that is ready for fuzzy matching.
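
The PySpark sketch below illustrates these four steps on a single blocking field. It is a simplified sketch, not our production implementation: the threshold, chunk count, hash-based chunk assignment and reuse of the 'blocked'/'name_block' names from the earlier snippets are all illustrative assumptions, and the production job balances chunks by estimated comparison counts rather than hashing.

    import pyspark.sql.functions as F
    from pyspark.sql.functions import broadcast

    SALT_BUCKETS = 1000       # step 1: random salt between 1 and 1000
    SKEW_THRESHOLD = 100_000  # step 2: blocks larger than this count as skewed
    N_CHUNKS = 10             # step 3: number of skewed chunks to broadcast

    df = blocked.select("persona_id", F.col("name_block").alias("block"))

    # 1) salt the left copy and repartition on the salted key so that records
    #    from very large blocks are spread evenly across executors
    left = (df
            .withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int") + 1)
            .withColumn("salted_block", F.concat(F.col("block"), F.col("salt").cast("string")))
            .repartition("salted_block")
            .select(F.col("persona_id").alias("id_l"), "block"))

    # 2) split the right copy (non-salted) into skewed and non-skewed blocks,
    #    assigning each skewed block wholly to one chunk
    right = df.select(F.col("persona_id").alias("id_r"), "block")
    skewed_blocks = (df.groupBy("block").count()
                       .where(F.col("count") > SKEW_THRESHOLD)
                       .withColumn("chunk", F.abs(F.hash("block")) % N_CHUNKS)
                       .select("block", "chunk"))
    right_skewed = right.join(broadcast(skewed_blocks), "block")
    right_normal = right.join(broadcast(skewed_blocks), "block", "left_anti")

    # 3) iteratively broadcast-join each skewed chunk against the salted left side
    skewed_pairs = None
    for i in range(N_CHUNKS):
        chunk = right_skewed.where(F.col("chunk") == i).drop("chunk")
        pairs = left.join(broadcast(chunk), "block")
        skewed_pairs = pairs if skewed_pairs is None else skewed_pairs.union(pairs)

    # 4) a regular sort-merge join covers the non-skewed blocks, then combine
    normal_pairs = left.join(right_normal, "block")
    candidates = (normal_pairs.union(skewed_pairs)
                  .where(F.col("id_l") < F.col("id_r")))  # each unordered pair once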

Using this method, as well as creating persona profiles as mentioned above, reduced the run time from over a weekend to just under an hour!

Measuring string similarities with fuzzy matching

At this stage, the data is ready for fuzzy matching algorithms to measure string similarities between fields of candidate matches. There is a whole host of string similarity metrics; for identity resolution we are primarily concerned with edit distance metrics, that is, metrics that compute the number of operations needed to transform one string into another.

In our solution, we use a combination of exact matching for fields such as passport and national ID numbers, ‘Jaro-Winkler’ (3) distance for names (since it penalises differences at the beginning of strings more heavily, on the assumption that typos tend to occur towards the end of a name, so early differences are more likely to be genuine), and ‘Levenshtein’ (4) distance for phone numbers, emails, loyalty numbers, etc.

Comparison of a name and phone number using different similarity metrics.

The aim of these algorithms is to quantify how close strings are to each other; if the similarity score is 0, they are considered completely different, whereas a score of 1 is equivalent to an exact match. The table below provides some examples of typical outputs from these matching algorithms. As is evident in the results, Jaro-Winkler is generally preferred with names since it weights the beginnings of strings more, whereas Levenshtein distance shows good performance on strings of numbers.
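
To illustrate how these scores behave in practice, the snippet below uses the open-source jellyfish library (chosen here purely as an example implementation of the two metrics, not necessarily what runs in our pipeline) and normalises the Levenshtein distance into a 0–1 similarity by dividing by the longer string length:

    import jellyfish

    def levenshtein_similarity(a, b):
        # convert the raw edit distance into a 0-1 similarity score
        if not a and not b:
            return 1.0
        return 1.0 - jellyfish.levenshtein_distance(a, b) / max(len(a), len(b))

    pairs = [
        ("catherine", "katherine"),    # differs at the first character
        ("william", "willian"),        # likely a typo at the end of the name
        ("0871234567", "0861234567"),  # phone numbers differing by one digit
    ]

    for a, b in pairs:
        print(f"{a} vs {b}: "
              f"jaro-winkler={jellyfish.jaro_winkler_similarity(a, b):.3f}, "
              f"levenshtein={levenshtein_similarity(a, b):.3f}")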

To recap, we have measured the similarities between fields for all blocks in the data, where it’s typical to fuzzy match between 5 and 10 fields for each candidate match. The next step is to decide whether the records in each candidate match do or don’t belong to the same entity.

In part 3 of the article…

In the next part of this article, we will describe how we use a combination of empirical matching rules, machine learning and graph theory to complete the identity resolution process and create a single-customer-view.

References

(1) Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. 2009. Entity resolution with iterative blocking. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data (SIGMOD ‘09), Carsten Binnig and Benoit Dageville (Eds.). ACM, New York, NY, USA, 219–232. DOI: https://doi.org/10.1145/1559845.1559870

(2) (a) https://bigdatacraziness.wordpress.com/2018/01/05/oh-my-god-is-my-data-skewed/. (b) https://dzone.com/articles/why-your-spark-apps-are-slow-or-failing-part-ii-da. (c) https://medium.com/appsflyer/salting-your-spark-to-scale-e6f1c87dd18

(3) https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

(4) https://en.wikipedia.org/wiki/Levenshtein_distance

--

The OpenJaw Data Science Team
The OpenJaw Data Science Blog

The data science team at OpenJaw share their approach, opinions and methodologies for data science in travel.