Identity resolution for travel retailing (part 3)

In this article, we share how we create a single-customer-view at OpenJaw using a new identity resolution algorithm we have developed that is optimised for the travel domain. This is the third and final part of a three-part article on this topic.

By Auren Ferguson, John Carney, Vignesh Mohan, Gavin Kiernan, Yuxiao Wang and Beibei Flynn.

Introduction

In part 2 of this article, we described some of the details of the algorithm we have developed at OpenJaw to solve this problem, spanning data cleansing, reducing computational overhead with ‘blocking’ and measuring string similarities with ‘fuzzy matching’. In this, the final part of the article, we describe how we use a combination of empirical matching rules, machine learning and graph theory to complete the identity resolution process and create a single-customer-view. We also describe the results of experiments that compare the performance of the algorithm to the traditional ‘exact matching’ method.

Matching entities

Once we have measured the similarities between fields for all blocks in the data, the next step is to decide whether the records in each candidate match do or don’t belong to the same entity. Note that it is typical to fuzzy match 5–10 fields for each candidate match. There are two commonly used methods to achieve this:

  1. Create a set of empirical matching rules; or
  2. Use machine learning to automatically decide if records match.

Empirical matching rules

The rules-based method involves making a series of decisions about the data. Firstly, for each fuzzy matched field, a threshold for a positive match must be picked, e.g. for surname, if the fuzzy match score is greater than 0.85, the two values are considered the same surname. The threshold to pick depends on the data sources and the regions the data comes from. If the data is from a country where surnames are generally long, a matching threshold of 0.8 may suffice, but in a country like China, where names are typically 2 or 3 characters, an exact match (a score of 1) may be required.
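
As an illustration, even Python’s built-in difflib can produce this kind of similarity score (the fuzzy matcher described in part 2 may use different string metrics; SequenceMatcher is simply a stand-in here):

    from difflib import SequenceMatcher

    # Illustrative similarity score in [0, 1]; a stand-in for the
    # fuzzy matching metrics described in part 2.
    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    print(similarity("Fitzgerald", "Fitzgeral"))  # ~0.95, passes a 0.85 threshold
    print(similarity("Li", "Lu"))                 # 0.5, short names need exact matches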

From here, a decision has to be made as to whether candidate matches belong to the same entity or not. To do this we use a hierarchy of rules, starting with the strongest rule (the one with the least likelihood of being incorrect) and getting less strict until a minimum acceptable criterion is reached. While this sounds somewhat complicated, the pseudocode below gives an example of rules that could be implemented.
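
In this sketch, each candidate match is represented as a dictionary of fuzzy match scores in [0, 1] (None where a field is missing); the field names and threshold values are illustrative rather than our production configuration:

    def is_match(cand):
        """Decide whether a candidate match belongs to the same entity.

        cand maps field names to fuzzy match scores in [0, 1],
        with None when a field is missing from either record.
        """
        def ge(field, threshold):
            score = cand.get(field)
            return score is not None and score >= threshold

        # Strongest rule: a shared passport / national ID number.
        if ge("id_number", 1.0):
            return True
        # Name, date of birth and both contact fields agree.
        if (ge("given_name", 0.85) and ge("surname", 0.85) and ge("dob", 1.0)
                and ge("phone", 0.9) and ge("email", 0.9)):
            return True
        # As above, but only one of phone or email needs to agree.
        if (ge("given_name", 0.85) and ge("surname", 0.85) and ge("dob", 1.0)
                and (ge("phone", 0.9) or ge("email", 0.9))):
            return True
        # Minimum acceptable criterion: name and date of birth agree.
        if ge("given_name", 0.85) and ge("surname", 0.85) and ge("dob", 1.0):
            return True
        return False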

In the above code, the rules commence with the strongest one. The motivation is that if both records in a candidate match have a passport or ID number present, and those numbers match, we have strong confidence that the records belong to the same entity.

At the next level of the hierarchy, if given name, surname, date of birth, phone and email all (fuzzy) match, the records are considered the same entity. The logic here is that if two records have similar values for all of these fields, then on the balance of probability they are more likely a match than not.

The next rule is similar but only requires one of phone or email to match. While it’s less strict, it provides more flexibility and allows for variability in how data is collected (sometimes not all information will be collected). The minimum criterion for a positive match is that given name, surname and date of birth match. It is up to the user where to draw the line on which rules are acceptable, because if they are too relaxed, the system will start matching different people’s records together and incorrectly merge different people into one entity.

Machine learning

The other method of deciding if candidate matches belong to the same entity is to use machine learning. This has some major advantages over the empirical rules but also some disadvantages.

To use machine learning, a labelled dataset must be created: a sample of the full ‘candidate matches’ dataset in which each candidate match is classified as positive or negative. This can be very labour intensive, as it requires a lot of manual (human) work to make these decisions, although methods such as active learning can reduce the effort. Modelling the outcome can also be difficult, since the majority of candidate matches in a block won’t match, making this a class imbalance problem.

The major advantage of using machine learning is that it lets the data and the model make the decisions, i.e. there is no need to choose thresholds for each field, as the rules-based method requires, since the raw fuzzy matching scores are input into the model. The model can also make much more complicated and flexible decisions than any user could hardcode as rules, meaning more positive matches should be captured, thus improving the process.

Our general view at OpenJaw is to use rules-based methods first, and then iteratively improve the identity resolution process later on with machine learning.

Creating the single-customer-view

At this stage, we have established which records match at an individual level, but these matches are in isolation. Our aim is to create a single-customer-view, and at this point only matches between disparate pairs have been established, i.e. we know that record 1 and record 2 belong to the same entity, but what about other records that also belong to that entity? How can these records be linked together to create a customer profile? This step of the process is often referred to as record linkage. There are two commonly used methods to link records together:

  1. Use recursion to iteratively find links that correspond to entities;
  2. Use graph algorithms.

We don’t recommend the first method for the travel domain, as it’s fraught with danger: it’s complicated to code and susceptible to bugs. The only cases where we would recommend it are when there are no or very few complicated interactions between matches (not true in travel) or when computing resources are scarce.

The natural choice for a problem like this is to use a graph to link matches together. Graphs are mathematical structures used to model pairwise relations between objects. They are made up of vertices (or nodes) connected by edges (links). There are two main types of graph: directed and undirected. In an undirected graph, edges link two vertices symmetrically; Facebook friendship is an example, as you are both friends with each other. In a directed graph, edges link nodes asymmetrically, i.e. there is a directionality component; Twitter is a good example, as you may follow someone who doesn’t necessarily follow you back.

Figure 1: Examples of directed and undirected graphs. Facebook ‘friends’ are considered undirected while Twitter ‘follows’ are directional.

The example shows three people: Mary, Jack and Tom. On Facebook, Mary is friends with Jack and Tom, and since the relationship is undirected, Jack and Tom are also friends with Mary. However, on Twitter, Mary follows Jack and Tom but only Jack follows her back; the relationship is directional.
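
As a small illustration in Python, the example above can be expressed with the networkx library (used here purely for illustration):

    import networkx as nx

    # Undirected "Facebook" graph: friendship is symmetric.
    facebook = nx.Graph()
    facebook.add_edges_from([("Mary", "Jack"), ("Mary", "Tom")])

    # Directed "Twitter" graph: a follow has direction.
    twitter = nx.DiGraph()
    twitter.add_edges_from([("Mary", "Jack"), ("Mary", "Tom"), ("Jack", "Mary")])

    print(facebook.has_edge("Jack", "Mary"))  # True: undirected edges are symmetric
    print(twitter.has_edge("Tom", "Mary"))    # False: Tom doesn't follow Mary back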

For identity resolution, we use undirected graphs, since there is no directionality between nodes: if record 1 and record 2 belong to the same entity, then record 2 and record 1 must also belong to the same entity, by symmetry. The nodes of our identity graph are the records of the standardised and cleaned dataset, and the edges are the matches calculated previously.

There are a variety of graph algorithms that could be used to find and link records together, but we used the connected components algorithm. This algorithm finds sets of connected nodes in which each node is reachable from every other node in the same set. A connected component is a maximal connected subgraph of the main graph; that is, the algorithm uses the matches to link together all nodes belonging to an entity. Each such subgraph is called a component, and a vertex with no edges is also considered a component. Each component is therefore a collection of every record belonging to one entity. A unique number, called the ‘OpenJawID’, is assigned to each component. The identities in the data have now been successfully resolved.

Figure 2: Identity graph showing connected subgraphs circled in red, where all nodes inside each red circle are considered to belong to the same entity.
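
A minimal sketch of this step, using the networkx implementation of connected components purely for illustration (the record IDs and matches are made up):

    import networkx as nx

    records = [1, 2, 3, 4, 5, 6]        # cleaned, standardised records
    matches = [(1, 2), (2, 3), (4, 5)]  # positive candidate matches

    graph = nx.Graph()                  # undirected identity graph
    graph.add_nodes_from(records)       # isolated records are still components
    graph.add_edges_from(matches)

    # Each connected component is one entity; assign it an OpenJawID.
    openjaw_ids = {}
    for entity_id, component in enumerate(nx.connected_components(graph)):
        for record in component:
            openjaw_ids[record] = entity_id

    # Records 1-3 share one OpenJawID, 4 and 5 another, and 6 gets its own.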

So far, we have described how to resolve identities in data from a variety of sources. The obvious problem is that this data is both static and historic in nature. The whole point of implementing a solution such as this is to capture real-time, or close to real-time, customer behaviour. Therefore, we need a method to ingest new data and link it to the historical data, as well as to assign new OpenJawIDs to previously unseen entities.

In our identity resolution system, we have implemented an automated delta process. Currently, it’s scheduled to run every day at a specified time. It ingests new data from all sources, cleans and standardises it, and compares it to the historically resolved data using the same process: blocking, fuzzy matching and rules/machine learning. If matches are found between the new and resolved data, the new data receives the same OpenJawID, effectively expanding that customer’s profile.

Records that couldn’t be matched are assumed to belong to new customers. For these records, the original identity resolution process is run, because there can be cases where new customers have made multiple transactions in the delta window, and these should be linked. Each new customer is assigned an OpenJawID that increments from the previous maximum ID number. The delta data (now resolved) is then appended to the historical data to expand the resolved customer database, ready for the next delta. This process is fully automated and uses Apache Airflow to manage scheduling and data pipelining.
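
As a rough sketch of how such a daily delta could be wired up in Airflow (the task names and callables below are hypothetical placeholders, not our actual pipeline code):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical stand-ins for the pipeline stages described above.
    def ingest_and_clean(): ...
    def match_against_history(): ...
    def resolve_new_customers(): ...
    def append_to_resolved(): ...

    with DAG(
        dag_id="identity_resolution_delta",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",  # run once a day at a specified time
        catchup=False,
    ) as dag:
        ingest = PythonOperator(task_id="ingest_and_clean",
                                python_callable=ingest_and_clean)
        match = PythonOperator(task_id="match_against_history",
                               python_callable=match_against_history)
        resolve = PythonOperator(task_id="resolve_new_customers",
                                 python_callable=resolve_new_customers)
        append = PythonOperator(task_id="append_to_resolved",
                                python_callable=append_to_resolved)

        ingest >> match >> resolve >> append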

How well does the algorithm perform?

Up until now, this blog series has been conceptual in nature, dealing with the problem of identity resolution, how it works and some of the challenges associated with implementing it. In this section, we will describe a practical example of the algorithm in action using an example dataset.

Real personal data can’t be used in this article due to its sensitive nature. However, it is possible to use publicly available simulated data. While the simulated data has some shortcomings, it is more than sufficient to demonstrate the power of an identity resolution system like the one we have developed at OpenJaw, since it includes typos, missing fields and other data quality challenges commonly associated with this kind of data.

The simulated data comes from the FEBRL (Freely Extensible Biomedical Record Linkage) program. For identity resolution we use OpenJaw’s Blueprint product, available on AWS Marketplace in Jan 2020. This is a generic implementation of our identity resolution algorithm that can be configured by the user through a GUI, as shown in figure 3.

Figure 3. Screenshot of the Blueprint: Identity Resolution user interface.

A sample of the dataset used for these experiments is shown in table 1. The full dataset has 5,000 records, corresponding to 6,538 positive matches out of a possible 12.5 million candidate pairs. Some modifications were made to the data to make it closer to the type of data we typically see in the travel industry. Usually, only about 70% of records have some sort of ID number, such as a passport, national ID or driver’s license; this could be due to agency or hotel bookings where they aren’t required. To mimic this behaviour, the id_number field was removed from approximately 30% of records at random.

Table 1: Sample of records from example dataset. Null values indicate missing data.

To establish a baseline against which to compare the identity resolution algorithm, we used deterministic matching, which is typical of naive identity resolution systems. This involved using id_number as a unique identifier, something we have commonly seen in the travel industry. This method yields an accuracy of 49%, with a precision of 100%, recall of 49% and f1 score of 66%.

Results are also shown as experiment 1 in table 2. They can be summarised as follows: deterministic matching won’t incorrectly match records that don’t belong to the same person (precision), but it is poor at finding all the records that do correspond to a person (recall). Using this method would result in a poor single-customer-view, as it only captures half of the correct matches, meaning customer profiles would be sparsely populated and any analysis or modelling built on them would be suboptimal.
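
To make the baseline concrete, deterministic matching amounts to treating id_number as a join key, as in the pandas sketch below (the records are made up); any record with a missing or mistyped ID can never be matched, which is exactly what hurts recall:

    import pandas as pd

    df = pd.DataFrame({
        "record_id": [1, 2, 3, 4],
        "id_number": ["P123", "P123", None, "P789"],
    })

    # Records sharing an id_number collapse to one entity; factorize
    # assigns -1 to missing IDs, which therefore remain unmatched.
    df["entity_id"], _ = pd.factorize(df["id_number"])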

Next, we passed this data through the Blueprint: Identity Resolution algorithm on AWS and used name blocking. As previously discussed in part 2 of the blog, blocking is a method used to reduce the computational load of the identity resolution process, which arises from the non-linear scaling of the algorithm. While blocking isn’t really required here due to the low input data volume, we wanted to simulate how the algorithm would be used in reality.
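
For reference, blocking of this kind can be sketched in a few lines (the record data is made up; the key here is a surname prefix):

    from itertools import combinations

    records = [("r1", "Ferguson"), ("r2", "Fergusson"), ("r3", "Carney")]

    # Group records by blocking key: the first two letters of the surname.
    blocks = {}
    for record_id, surname in records:
        blocks.setdefault(surname[:2].lower(), []).append(record_id)

    # Only records within the same block are ever compared.
    candidate_pairs = [pair for block in blocks.values()
                       for pair in combinations(block, 2)]
    print(candidate_pairs)  # [('r1', 'r2')]; 'r3' is never compared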

The exact blocking key used was the first 2 letters of the surname. The default fuzzy matching thresholds from Blueprint: Identity Resolution were used, as well as the built-in matching rules:

  1. If id numbers fuzzy match, records belong to same person;
  2. If given name, surname and date of birth fuzzy match, records belong to the same person;
  3. If given name and date of birth fuzzy match, records belong to the same person.

This resulted in an accuracy of 70%, with precision of 100% (note: this is a rounded value; the actual value could be 99.5% or greater), recall of 70% and f1 score of 82% (experiment 2 in table 2).

Another experiment (experiment 3 in table 2) used year of birth as the blocking key, keeping all other parameters the same as experiment 2; this gave an accuracy of 76%, with precision of 99%, recall of 77% and f1 score of 86%. From these results, it’s apparent that the identity resolution algorithm is a significant improvement over the deterministic method, meaning customer profiles have been enriched, with positive effects on downstream analysis.

We note that we used all default values and built-in rules for these experiments. Better results could be achieved by tuning the fuzzy matching thresholds and adding rules more suited to this specific dataset, but for convenience we used the out-of-the-box solution.

Table 2. Results of experiments on simulated data using a variety of techniques.

To test how blocking affects performance metrics, an experiment using no blocking, meaning all records were compared to each other, was conducted. We wouldn’t normally recommend this unless the data is small or you have considerable computing resources, but since this is a small dataset, it wasn’t much of a problem. The exact same parameters as the previous experiments were used, and the algorithm achieved an accuracy of 81%, with precision of 98%, recall of 82% and f1 score of 89% (experiment 4 in table 2). This shows a reasonable increase on the blocking methods; whether it should be done in practice very much depends on the conditions previously mentioned.

The results of the no-blocking experiment got us thinking about how to improve the algorithm going forward. We decided to combine the results of the exact matching, name blocking and year of birth blocking experiments (experiment 7 in table 2) to see how much lift is gained and how it compares to not blocking at all. The accuracy was 83%, with precision of 99%, recall of 83% and f1 score of 91%. This result shows some promise as an avenue for future development, but when dealing with anything related to identity resolution, algorithm performance always has to be balanced against resources and acceptable run times. For instance, using 6 or 7 blocks would be a very heavy process with diminishing returns, so an acceptable balance must be found. Other, similar experiments were run, and their results are shown as experiments 5 and 6 in table 2.
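
The idea can be sketched as taking the union of the edge sets produced by each strategy and re-running connected components over the combined graph (the edge sets below are made up):

    import networkx as nx

    # Matches found by each strategy (illustrative record pairs).
    exact_id_matches = [(1, 2)]
    name_block_matches = [(2, 3)]
    yob_block_matches = [(3, 4), (5, 6)]

    graph = nx.Graph()
    for edges in (exact_id_matches, name_block_matches, yob_block_matches):
        graph.add_edges_from(edges)

    print(list(nx.connected_components(graph)))
    # [{1, 2, 3, 4}, {5, 6}]: pairs missed by one strategy are recovered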

Matching rules versus machine learning

The final experiment of interest compared how well matching-rule techniques perform against a machine learning approach. As previously explained, there are pros and cons to using machine learning to find matches, the main drawback being the lack of labelled data; in our case, since the data is synthetic, it comes with labels. The modelling dataset consisted of 12.5M candidate pairs (all possible combinations of the 5,000 input records), since no blocking was used. The positive match rate was about 0.05%, making this a severe class imbalance problem. The data was split into 3 sets: a training set (50%), a validation set (20%) and a test set (30%). The validation set was used to find the optimal threshold for deciding whether 2 records belong to the same entity, and the test set was used for model evaluation.

Since there was a severe class imbalance, the training set was down-sampled so that the positive class made up about 5% of the data rather than 0.05%. We used the logistic regression implementation from the scikit-learn library with default settings (no hyper-parameter tuning). Logistic regression is a more than suitable modelling technique for this problem, since the problem should be linear in nature, i.e. the more similar the fields of a candidate pair, the higher the probability that the records belong to the same entity. As a result, we don’t expect any non-linear behaviour that this model couldn’t capture. It is also less data hungry, and its results are more explainable, than other methods such as boosting and neural networks.

The threshold for a positive match was chosen based on precision-recall curves on the validation set, and a value of 84% was found to be optimal; this means the model had to be at least 84% confident that two records belong to the same entity. Typically, machine learning classifiers use a default threshold of 50%, but here we want to be very sure that records match, to minimise the risk of linking together records that belong to separate entities.
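
A condensed sketch of the modelling step with scikit-learn is shown below. The features and labels are randomly generated stand-ins for the real candidate-match scores, and the precision target used to pick the threshold is illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve

    # Stand-in data: one row of fuzzy match scores per candidate pair,
    # label 1 when the pair belongs to the same entity (rare class).
    rng = np.random.default_rng(42)
    n = 200_000
    X = rng.random((n, 6))
    y = (X.mean(axis=1) + 0.1 * rng.standard_normal(n) > 0.95).astype(int)

    split = int(0.7 * n)
    X_train, y_train = X[:split], y[:split]
    X_val, y_val = X[split:], y[split:]

    # Down-sample negatives so positives are ~5% of the training data.
    pos = np.flatnonzero(y_train == 1)
    neg = np.flatnonzero(y_train == 0)
    neg_sample = rng.choice(neg, size=len(pos) * 19, replace=False)
    idx = np.concatenate([pos, neg_sample])

    model = LogisticRegression()  # default settings, no tuning
    model.fit(X_train[idx], y_train[idx])

    # Choose the decision threshold from validation precision-recall
    # curves, favouring precision so different people are rarely merged.
    probs = model.predict_proba(X_val)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_val, probs)
    candidates = precision[:-1] >= 0.99          # illustrative target
    threshold = thresholds[np.argmax(candidates)] if candidates.any() else 0.5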

The results on the test set show the model was 97% accurate, with precision of 99%, recall of 97% and f1 score of 98% (experiment 8 in table 2). This clearly outperforms the rules-based methods used previously, but in some ways it’s an unfair comparison: the rules weren’t specific to this data in the way the model was; rather, they were general rules that have worked for us in previous client implementations, whereas the model could capture more complicated, dataset-specific trends. It would be possible to craft rules specific to this dataset that would increase performance. Nonetheless, this shows that if you have labelled data, a classifier may well be the best option, as it would probably outperform any user-made rules. It also requires fewer decisions from users: there is no need to choose a fuzzy matching threshold for each compared field or to craft rules, since the raw scores are inputs into the model.

Conclusion

The purpose of this three-part article has been to share with the data engineering and data science community in the travel industry how we perform identity resolution at OpenJaw to create a single-customer-view.

In part 1 of the article, we described why identity resolution is needed as well as some of the technical challenges associated with it, such as the nonlinear properties of the algorithm brought about by the large search space associated with finding matches. In part 2, we described how we use data cleansing and normalisation, ‘blocking’ and ‘salting’ to prepare the data for fuzzy matching.

And finally, in this part of the article, we continued our description of how the algorithm works, discussing how we use empirical matching rules and machine learning to match records, and graph theory to link records that belong to the same entity. We also described the results of experiments conducted on publicly available data that compare the performance of our identity resolution algorithm to traditional deterministic matching.

These results showed that probabilistic matching greatly outperforms deterministic methods.

Potential future improvements to the algorithm were also explored, such as combining deterministic and probabilistic matching into a hybrid solution, and using machine learning to outperform matching rules once a labelled dataset is available.


The OpenJaw Data Science Team
The OpenJaw Data Science Blog

The data science team at OpenJaw share their approach, opinions and methodologies for data science in travel.