Master Data Management: how to match and merge records to unify your data

At the heart of most data governance projects lies the concept of MDM (Master Data Management), which aims at creating a single source of truth (SSOT) for the different entities.

Julien Kervizic
Dec 2, 2019

In most organizations, it is particularly impactful around entities such as products and customers. In order to achieve clean results and data, records need to be associated and merged together to form this unified view of the entities.

There are multiple challenges that data professionals face when dealing with the creation and maintenance of master data entities, from the definition of a matching strategy to the setup of a consolidation strategy and a field authority strategy.

Matching Strategy

Matching Method

The matching methods used can be set up to handle different matching certainties, with the most certain matches going through an automated process and the more uncertain ones going through a review process. This type of approach can also be useful to collect the appropriate samples to train a prediction model.

Matching types

Exact Match: An exact match relies on being able to link two different sources of information based on the existence of a specific key that allows the information to be matched. This can be, for instance, an EAN code for products, a national identity number, or the concatenation of a name and an address.
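As a minimal sketch, an exact match can be expressed as a strict key-based join; the data and column names below are illustrative assumptions:

import pandas as pd

# Two sources sharing an EAN key; records unify only on strict key equality.
crm = pd.DataFrame({"ean": ["4006381333931"], "name": ["Stabilo pen"]})
pim = pd.DataFrame({"ean": ["4006381333931"], "category": ["Stationery"]})

unified = crm.merge(pim, on="ean", how="inner")
print(unified)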

Phonetic Match: There can be cases where an exact match is not sufficient to handle the merging of information; this usually happens when data needs to be input by hand based on vocal input, for instance when recording a name or address at a point of sale (POS) system. When this happens, we need to relax the matching algorithm to handle cases that are phonetically similar.

In Python, the phonetics library incorporates a few of the most popular phonetic matching algorithms. Generally, these algorithms work well for English but often do not support other languages.
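A minimal sketch of how this could look, assuming the phonetics package's soundex function:

import phonetics

# Two spellings of the same spoken name should share a phonetic code.
print(phonetics.soundex("Julien"))  # e.g. 'J450'
print(phonetics.soundex("Julian"))  # same code as above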

In the above example, we can see how Julien and Julian have the same phonetic representation using the Soundex algorithm, which explains why Starbucks baristas so often misspell my name as Julian.

Phonetic algorithms such as Soundex are also incorporated in some databases, such as Postgres or SQL Server.

Fuzzy match: A fuzzy match is a match that is not exact. Using fuzzy matching comes with a trade-off between exactness and coverage of the match. Some of its applications relate to allowing for different types of spelling or input misspellings, or matching users when other criteria match and their addresses are within a certain radius of each other. A tutorial from DataCamp shows how we can compare different strings and generate a score enabling this inexact match.
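A minimal sketch of such a similarity score using only the standard library; the threshold value is an arbitrary assumption:

from difflib import SequenceMatcher

def similarity(s1, s2):
    # Ratio of matching characters over both strings' lengths (0.0 to 1.0).
    return SequenceMatcher(None, s1.lower(), s2.lower()).ratio()

THRESHOLD = 0.8  # trade-off knob: lower means more coverage, less exactness
score = similarity("Julien Kervizic", "Julian Kervizic")
print(score, score >= THRESHOLD)  # ~0.93 True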

Propensity match: Propensity matching provides a different type of inexact matching, this time relying on a prediction model to estimate the likelihood that a match is actually valid.

The model provides a level of confidence that we should consider the match, with a threshold for assignment that can be configured.

Provided we are able to obtain some training data, we can leverage supervised learning methods such as logistic regression in order to generate this propensity score.

In order to compare two records containing text inputs for matches, we need to generate features, either from their raw, cleansed or phonetic representations, and be able to compare the two sets of strings. We can also add additional features to the mix.
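A minimal sketch of such feature generation, assuming record pairs sit side by side in a DataFrame with columns suffixed _1 and _2, and using a normalized Hamming distance (both assumptions for illustration):

import pandas as pd

def hamming_norm(s1, s2):
    # Fraction of differing positions, padding to the longer string's length.
    length = max(len(s1), len(s2))
    if length == 0:
        return 0.0
    return sum(a != b for a, b in zip(s1.ljust(length), s2.ljust(length))) / length

def build_features(df, columns):
    # For each base column, compare <col>_1 against <col>_2 on the raw and
    # lowercased representations.
    features = pd.DataFrame(index=df.index)
    for col in columns:
        left, right = df[f"{col}_1"].fillna(""), df[f"{col}_2"].fillna("")
        features[f"{col}_raw"] = [
            hamming_norm(a, b) for a, b in zip(left, right)]
        features[f"{col}_lower"] = [
            hamming_norm(a.lower(), b.lower()) for a, b in zip(left, right)]
    return features

pairs = pd.DataFrame({"name_1": ["Julien", "John"],
                      "name_2": ["Julian", "Jon"]})
print(build_features(pairs, ["name"]))
#    name_raw  name_lower
# 0  0.166667    0.166667
# 1  0.500000    0.500000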

The above creates features based on the Hamming edit distance: it looks up pairs of column values in a DataFrame, assuming the same column names suffixed by _1 and _2, and compares their raw and lowercased representations using a normalized Hamming distance. A sample output is shown in the comments of the sketch.

Based on that data, we can train a prediction model; this is easily done using sklearn:
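A minimal sketch with scikit-learn, assuming we already have pairwise distance features like the ones above and human-confirmed match labels (the values below are made up):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows of pairwise distance features; 1 marks a confirmed match, 0 a non-match.
X = np.array([[0.05, 0.05], [0.10, 0.08], [0.60, 0.55], [0.80, 0.75]])
y = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(X, y)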

The probability score can then be computed and compared to the threshold value.
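Continuing the sketch, with an arbitrary example threshold of 0.9:

# P(match) for a new candidate pair of records.
candidate = np.array([[0.15, 0.12]])
propensity = model.predict_proba(candidate)[:, 1]
print(propensity, propensity >= 0.9)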

Negative Match: Negative matching provides additional conditions / exclusion rules about when not to match two records together. They can be used to exclude unlikely events.
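A hypothetical sketch of such an exclusion rule, blocking an automated merge when two records' addresses are too far apart; the threshold and field names are assumptions:

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometers.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

MAX_DISTANCE_KM = 50  # arbitrary example threshold

def decide(rec1, rec2, base_match):
    distance = haversine_km(rec1["lat"], rec1["lon"], rec2["lat"], rec2["lon"])
    if base_match and distance > MAX_DISTANCE_KM:
        return "manual_review"  # negative rule fired: don't auto-merge
    return "auto_merge" if base_match else "no_match"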

An example of when negative matches could be used is shown above: when comparing two sets of user records and finding that the distance between the two records' addresses is above a certain threshold. This type of negative match, when implemented within a workflow, could make it so that an automated match is not applied, but the pair is still flagged for manual review.

Considerations

Name frequency: When used for matching, one of the factors to take into consideration is the frequency of names. Different names have different frequencies of occurrence, and common names should require a higher burden of evidence before being matched together.
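A minimal sketch of this idea; the frequency values are made up, and a real system would derive them from census or customer-base statistics:

# Hypothetical name frequencies within the customer base.
name_frequency = {"john smith": 4.2e-4, "julien kervizic": 1e-8}

def corroborations_needed(full_name):
    # Common names require more corroborating fields (phone, address, ...).
    freq = name_frequency.get(full_name.lower(), 0.0)
    return 3 if freq > 1e-4 else 1

print(corroborations_needed("John Smith"))       # 3
print(corroborations_needed("Julien Kervizic"))  # 1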

The above example shows how unlikely my name is to be found compared to a John Smith. Given that it is so unlikely to be found, it should require much less external evidence. If, for instance, we were to try to match records of a John Smith, we might need to add an additional phone number or address match to the mix.

Pseudonyms and Alternative naming: It is worth noting that different entities can have alternative namings, be they official names, stage names or pseudonyms.

Take the example of the city of Astana, originally called Akmolinsk, later renamed Tselinograd, then Akmola, then Astana, and now called Nur-Sultan; or the city of Madras, nowadays named Chennai. Or think of historical characters such as Alexander the Great and Pliny the Younger, singers such as Eminem and 50 Cent, or actors such as Ben Kingsley, whose true name is Krishna Pandit Bhanji.

Languages: Language tends to be an important consideration when matching different entities. Take, for example, the matching of different sources in English and Chinese. The entity might be the same, but there are different representations of the same name across languages.

Take the Nur-Sultan example mentioned before: the same entity has different name representations, and just one of these representations, Nur-Sultan, has different language representations in French, English and Chinese.

Merging records across languages is heavily reliant on an appropriate multi-lingual data model, coupled with linking data across languages.

Abbreviations: Abbreviations impact how different text records might need to be matched. From first-name shortenings such as Bill for William or Pete for Peter, to title abbreviations such as Dr. for Doctor, they make it harder to find matches without additional processing.

Data Cleansing & Data normalization: Trying to match different text fields also requires some degree of data cleansing. From trimming spaces, special characters and punctuation to correcting misspellings, there are numerous steps to take in order to standardize the input and allow it to find an accurate match.

For example, phone numbers can be provided in quite different formats for the same true phone number. They can be provided with country code information, in + format (+44) or 00 format (0044), or without a country code; they can be provided as a pure numerical chain (0601020304) or with separators (060–10–20–30–4). In order to perform a complete match, it is important to normalize the data into the same format.

REGEXP is a particularly useful tool in this respect, allowing strings and substrings to be matched based on different patterns.
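A minimal sketch in Python's re module of the kind of pattern described below:

import re

# Match sequences of digits, optionally preceded by a +.
pattern = re.compile(r"\+?\d+")

print(pattern.findall("(+44) 060-10-20-30-4"))
# ['+44', '060', '10', '20', '30', '4']
print("".join(pattern.findall("0044 0601020304")))
# '00440601020304'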

In the example above, the REGEXP matches sequences of digits and sequences of digits preceded by a +. This provides a decent first step towards cleaning and normalizing phone numbers. There are better solutions for this particular use case than relying directly on REGEXP, but it proves useful as soon as you need to do something a little custom.

Authority Strategy

Hierarchy

Let’s look at the information contained in a government database of national identities and names; we can consider this a highly trusted source of information. Compare this to an input form on a website, where typos or bad input can occur for both the identity and the names.
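A minimal sketch of how a source hierarchy could drive field selection; the source names, priorities and records below are illustrative assumptions:

# Most trusted source first.
SOURCE_PRIORITY = ["government_db", "crm", "web_form"]

records = {
    "government_db": {"name": "Julien Kervizic", "email": None},
    "web_form": {"name": "Julian Kervizik", "email": "julien@example.com"},
}

def consolidate(records, fields):
    profile = {}
    for field in fields:
        for source in SOURCE_PRIORITY:
            value = records.get(source, {}).get(field)
            if value:  # fall through to less trusted sources when missing
                profile[field] = value
                break
    return profile

print(consolidate(records, ["name", "email"]))
# {'name': 'Julien Kervizic', 'email': 'julien@example.com'}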

The example above shows how an overall hierarchy strategy could work in selecting profile attributes for a consolidated record.

Having an authority hierarchy of information helps in this case to prioritize the information coming from the more trusted sources. In the example explained above, we would consider the government database the more authoritative source of information and use that information where provided. Where the information from this source is not available, we would rely on the input form, but only as a complement to the authoritative source.

Time bound

Fields such as names are fairly immutable. They can technically change, for instance when getting married or requesting a name change with the government, but in the majority of cases they wouldn't change frequently.

Other fields, such as an address or a phone number, have a higher tendency to be mutable: you can easily change address or phone numbers. Fields with a higher mutability tendency benefit more from applying a time-based authority strategy, where you want the freshest source of information.
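A minimal sketch of a time-bound selection for such a field; the observations are illustrative assumptions:

from datetime import date

observations = [
    {"source": "government_db", "field": "mobile",
     "value": "+44601020304", "updated": date(2017, 3, 1)},
    {"source": "web_form", "field": "mobile",
     "value": "+44601020999", "updated": date(2019, 11, 5)},
]

def freshest(observations, field):
    # Prefer the most recently updated non-empty value, regardless of source.
    candidates = [o for o in observations if o["field"] == field and o["value"]]
    return max(candidates, key=lambda o: o["updated"])["value"] if candidates else None

print(freshest(observations, "mobile"))  # '+44601020999', the most recent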

The example above shows what would happen to the attributes selected for the consolidated profile if we extended the previous example with a time-bound authority strategy for the email and mobile fields.

Voting rule

A voting rule authority strategy can provide a high level of signal when trusted sources of information are nonexistent but there exists a high number of available sources that can be correlated.
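A minimal sketch of such a vote over an address field; the sources and values are illustrative assumptions:

from collections import Counter

address_by_source = {
    "government_db": "1 Main Street",
    "web_form": "12 Rodeo Drive",
    "crm": "12 Rodeo Drive",
    "loyalty_app": "12 Rodeo Drive",
}

# The value reported by the most sources wins.
votes = Counter(address_by_source.values())
print(votes.most_common(1)[0][0])  # '12 Rodeo Drive' outvotes the government source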

Extending the example discussed above with a voting strategy for the address field, besides the fact that more misspellings of my name happen, we can see that the address provided by the government data source is no longer selected for the unified profile. This is due to more than one data source providing an alternate address.

Considerations

Multilevel strategy: Depending on the number of data sources available, it might be useful to group the data sources into classes and apply a multi-level authority strategy on the different fields.

Time Dependent: No matter what strategy you use, time will always be a factor to some extent. It is important to place a time boundary on most fields when applying the strategy. In the voting strategy example, it wouldn't have been very smart to apply the vote if all the records voting for "Rodeo Drive" predated an authoritative record (in terms of source hierarchy). Likewise, if the last update we got from an "authoritative" source is a few years old, it might be worth reconsidering whether to use a "fresher" source of information.

Consolidation Strategy

Hard Merge

A hard merge consolidates matched records into a single physical record, with field values selected based on, for instance, a mixed hierarchy / time-bound authority strategy.

There are benefits to performing a hard merge, such as a reduction in database size, more efficient queries, and ease of extraction of the authoritative fields. There are, however, a few drawbacks, the three main ones being: 1) data loss, 2) irreversible merging, and 3) not all authority strategies play well with hard merging.

The forward-only merging strategy will tie the records together after a matching condition has been met. It will not impact the historical records.

(Figure: forward-only merge; in yellow, the events considered in the merged profile.)

Let’s take an example where we have two profiles coming from different sources:

  • We are receiving a stream of events from each of these sources associated with the profiles
  • At some point, information is added to the first profile that allows matching its identity to Profile 2
  • Profile 2 is hard merged onto Profile 1 at this point; the historical data is still leveraged from Profile 1, but not from Profile 2
  • As new events come in, they are directly associated with Profile 1

One method to implement the forward merging of associated data is by modifying a redirection table. For the example mentioned above, this can be done in the following way in Python:
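The following is a minimal sketch, not the article's original code: a simple Profile class, a redirection table mapping each incoming identity to the profile that currently owns its events, and a first batch of events:

class Profile:
    def __init__(self, profile_id):
        self.profile_id = profile_id
        self.events = []

    def add_event(self, event_id):
        self.events.append(event_id)

profiles = {"p1": Profile("p1"), "p2": Profile("p2")}
# Redirection table: identity -> profile currently receiving its events.
redirection = {"p1": "p1", "p2": "p2"}

def push_event(identity, event_id):
    profiles[redirection[identity]].add_event(event_id)

# Stream of events before the merge.
for identity, event in [("p1", "1"), ("p2", "2"), ("p1", "3"),
                        ("p1", "4"), ("p2", "5")]:
    push_event(identity, event)

print("profile 1 before merge:", profiles["p1"].events)
print("profile 2 before merge:", profiles["p2"].events)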

profile 1 before merge: ['1', '3', '4']
profile 2 before merge: ['2', '5']

At some point, we merge the profiles together and update the redirection table, i.e. both ids now refer to the same first profile:
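Continuing the sketch above:

# Forward-only merge: redirect Profile 2's identity onto Profile 1.
# Historical events stay where they are.
redirection["p2"] = "p1"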

Pushing additional events after the forward merging of identities:
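Continuing the sketch, the post-merge events flow through the same redirection:

for identity, event in [("p1", "6"), ("p1", "7"), ("p2", "8"),
                        ("p2", "9"), ("p1", "10")]:
    push_event(identity, event)

print("profile_1 after forward merge:", profiles["p1"].events)
print("profile_2 after forward merge:", profiles["p2"].events)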

This gives us the following events in each profile after the forward only merge happened:

profile_1 after forward merge: ['1', '3', '4', '6', '7', '8', '9', '10']
profile_2 after forward merge: ['2', '5']

Besides the id redirection and the merging of attributes, the merging of the profile might incorporate further steps, such as the deactivation of the previous profile.

The backward merging strategy will tie records back historically and merge together the different records that have been identified with matching conditions.

(Figure: backward merge; in yellow, the events considered in the merged profile.)

In the case of backward merging, the data related to both profiles is merged onto one, both historically and as time goes on. In code, this type of strategy can be implemented as an initial merge of the events related to both profiles plus an id redirection.
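Continuing the sketch and replaying the same event stream, this time pulling Profile 2's history onto Profile 1 before redirecting:

# Reset to the pre-merge state and replay the first five events.
profiles = {"p1": Profile("p1"), "p2": Profile("p2")}
redirection = {"p1": "p1", "p2": "p2"}
for identity, event in [("p1", "1"), ("p2", "2"), ("p1", "3"),
                        ("p1", "4"), ("p2", "5")]:
    push_event(identity, event)

# Backward merge: move the historical events across, then redirect.
profiles["p1"].events.extend(profiles["p2"].events)
profiles["p2"].events = []
redirection["p2"] = "p1"

for identity, event in [("p1", "6"), ("p1", "7"), ("p2", "8"),
                        ("p2", "9"), ("p1", "10")]:
    push_event(identity, event)

print("profile_1 after backward merge:", profiles["p1"].events)
print("profile_2 after backward merge:", profiles["p2"].events)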

This gives us the following events in each profile after the backward merge happened:

profile_1 after backward merge: ['1', '3', '4', '2', '5', '6', '7', '8', '9', '10']
profile_2 after backward merge: []

Soft Merge (Association)

One of the main advantages of using a soft-merge strategy is that the association can always be undone. It does, however, have some space and performance disadvantages, in that all the records need to be maintained, and queries need to be created that look up the associated records and then apply the authority strategy to them.

Full Association

A full association provides an association record for the different entities provided. Think of a table [id1, id2] that allows any record to be associated with another one; through this table, any record can fetch information from any associated user record, and an authority strategy can be applied to the different values obtained through it.

An application for this is, for instance, the ability to match an anonymous session to a logged-in user identity. The identity key provided prior to login is a specific cookie value, for instance a Google Analytics client id, while once logged in, a new identity is provided that needs to be associated with it.

The full association makes it possible both to extract ongoing activities across the different identities and to leverage historical data through them.

In order to make an association work, we need to be able to tie a given profile to one or more profiles. Going back to the previous code example, we can make a few alterations to the Profile class to support this.

We need to incorporate a way to 1) store associations, 2) add associations, and 3) query the relevant information obtained by association (in this case, events).
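A minimal sketch extending the earlier Profile class; the method names are assumptions:

class Profile:
    def __init__(self, profile_id):
        self.profile_id = profile_id
        self.events = []
        self.associations = []  # 1) stored associations

    def add_event(self, event_id):
        self.events.append(event_id)

    def associate(self, other):
        # 2) record the association on both sides, so each profile can
        # take the other's viewpoint.
        self.associations.append(other)
        other.associations.append(self)

    def all_events(self):
        # 3) associated profiles' events first, then the profile's own.
        merged = []
        for other in self.associations:
            merged.extend(other.events)
        merged.extend(self.events)
        return merged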

Following the same pattern as in the previous example and performing the association, we can see how the full set of events is now captured post-association.
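Replaying the same event stream, with an association instead of a merge:

p1, p2 = Profile("p1"), Profile("p2")
profiles = {"p1": p1, "p2": p2}

for identity, event in [("p1", "1"), ("p2", "2"), ("p1", "3"),
                        ("p1", "4"), ("p2", "5")]:
    profiles[identity].add_event(event)

print("Pre Assoc Profile 1: ", p1.events)
print("Pre Assoc Profile 2:", p2.events)

p1.associate(p2)  # soft merge: link the profiles, move nothing

for identity, event in [("p1", "6"), ("p1", "7"), ("p2", "8"),
                        ("p2", "9"), ("p1", "10")]:
    profiles[identity].add_event(event)

print("Post Assoc Profile 1:", p1.all_events())
print("Post Assoc Profile 2:", p2.all_events())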

Pre Assoc Profile 1:  ['1', '3', '4']
Pre Assoc Profile 2: ['2', '5']
Post Assoc Profile 1: ['2', '5', '8', '9', '1', '3', '4', '6', '7', '10']
Post Assoc Profile 2: ['1', '3', '4', '6', '7', '10', '2', '5', '8', '9']

But contrary to the profile merging approach, it is still possible to retrieve the events directly associated with each profile:
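Continuing the sketch:

print("Post Assoc Profile 1: ", p1.events)  # own events only
print("Post Assoc Profile 2:", p2.events)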

Post Assoc Profile 1:  ['1', '3', '4', '6', '7', '10']
Post Assoc Profile 2: ['2', '5', '8', '9']

Association with filtering

An association with filtering provides a little more control as to how the different data points will be consolidated together. It allows implementing a forward-only kind of merge as an association, which is useful when, for instance, you are only able to leverage the joint record if a user has accepted new terms of service.

Implementing association with filtering is possible but requires some additional code changes in order to be able to exclude the unwanted records.
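A minimal sketch, reworking the Profile class so that only an associated profile's events received after the association are leveraged; the global sequence counter is a stand-in for event timestamps:

import itertools

clock = itertools.count()  # global sequence, a stand-in for timestamps

class Profile:
    def __init__(self, profile_id):
        self.profile_id = profile_id
        self.events = []        # list of (sequence, event_id)
        self.associations = []  # list of (other_profile, cutoff_sequence)

    def add_event(self, event_id):
        self.events.append((next(clock), event_id))

    def associate(self, other):
        cutoff = next(clock)
        self.associations.append((other, cutoff))
        other.associations.append((self, cutoff))

    def all_events(self):
        merged = []
        for other, cutoff in self.associations:
            # Filter: exclude the associated profile's pre-association events.
            merged.extend(e for seq, e in other.events if seq > cutoff)
        merged.extend(e for _, e in self.events)
        return merged

p1, p2 = Profile("p1"), Profile("p2")
profiles = {"p1": p1, "p2": p2}
for identity, event in [("p1", "1"), ("p2", "2"), ("p1", "3"),
                        ("p1", "4"), ("p2", "5")]:
    profiles[identity].add_event(event)
print("Pre Assoc Profile 1: ", [e for _, e in p1.events])
print("Pre Assoc Profile 2:", [e for _, e in p2.events])

p1.associate(p2)
for identity, event in [("p1", "6"), ("p1", "7"), ("p2", "8"),
                        ("p2", "9"), ("p1", "10")]:
    profiles[identity].add_event(event)
print("Post Assoc Profile 1:", p1.all_events())
print("Post Assoc Profile 2:", p2.all_events())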

This yields the following results when ingesting the different events:

Pre Assoc Profile 1:  ['1', '3', '4']
Pre Assoc Profile 2: ['2', '5']
Post Assoc Profile 1: ['8', '9', '1', '3', '4', '6', '7', '10']
Post Assoc Profile 2: ['6', '7', '10', '2', '5', '8', '9']

Unlike with the forward merging strategy, it is possible to take the viewpoint of both Profile 1 and Profile 2.

Considerations

Volume of data: The overall data volume is a factor in the decision of which strategy to apply. Consider transient identifiers such as website session ids, each able to create a temporary profile. A given user can have hundreds of temporary profiles and potentially one logged-in profile. In order to fetch the authoritative profile, the application would have to go through all these temporary profiles and apply the authority strategy. Another example where volume of data can be impactful is when relying on user input. Think of a website letting its users input information related to their favourite holiday destination. This could result in thousands of duplicate records for each destination, due to spelling errors, different names, … Keeping only a soft-merge strategy there would be quite difficult.

Uniqueness of matches: It can happen that the matching strategy has to deal with non-unique matches. In the case of a soft-merge strategy, only an association key would need to be added to properly deal with that information, while in the case of a hard merge, this may lead to information being duplicated across key records. Additional matching rules may be added, such as only merging the first record found, but uniqueness of match is a consideration to keep in mind when setting up the matching and merging strategies.

Performance and complexity: Using a soft-merge strategy typically offers lower read performance than a hard-merge strategy, and as we saw in the simple code examples above, implementing a soft merge usually requires some additional complexity.

Uncertain matches: Different types of matching strategies and identifiers lead to different risks of matching records that do not belong together. When dealing with strategies that lead to risky matches, it is often better to apply a soft-merge strategy: it makes it easier to accept that more uncertain matches are consolidated together, as the association can always be undone.

Regulation: Regulation sometimes plays a role in how records should be merged together. It can, for instance, dictate what data should be available for use in a consolidated profile, or what data can be used for which purpose. Association with filtering is the consolidation strategy that would most easily satisfy different regulations, but it is also the most complex to integrate.

Summary

There often needs to be a granular setup, not only at the entity level but at the field level, in order to apply the right strategy to the right piece of information. The question then becomes one of balancing the pros and cons of the different (sub)strategies that need to be applied.

It is often a safe bet to start with a risk-aware approach to record consolidation, using a soft-merge strategy with review to ensure the best quality, and to handle the more advanced aspects and more automated consolidation merging at a later stage.

