Less is More and Identity Resolution

Warwick Matthews
4 min read · Nov 16, 2023


This article is also available in Japanese at: https://c-datalab.com/ja/blog/idr-matching_20231117

In Master Data Management (MDM) there is a tendency to chase data completeness, and in fact this is often used as a KPI of an MDM project or system.

This comes essentially from a “positivistic” view of the World, where we are attempting to discover all that we can about a person or other entity. In other words, The Truth is Out There — we just have to find it.

There are alternatives to this traditional approach; we will cover some non-positivistic approaches to MDM in a future blog post.

One of the key components of an MDM system is often the Golden Record, which represents the most complete set of data points from the most-trusted sources available.

The “Golden Record” is the centrepiece of many MDM systems.

In a positivistic world we are chasing Completeness and the Golden Record represents the fruits of our best attempts. “MDM in a post Golden Record world” will be a topic for another forthcoming blog article but in the meantime let’s consider the question: when do we know that our best view is good enough? That is, how much data is enough?

Let’s approach this indirectly, by looking at what can be done with very small amounts of data plus a little bit of creative thinking.

A match on just a name between two records is not generally considered strong enough for an Identity Resolution (IDR) system to bring those records together in an MDM context. So what are our options? The usual answer is “get more data”.

We want all the data... but do we really *need* it?

But let’s think outside the box a little. Sometimes discrete records “travel together” — an example would be a flight booking for a person and their colleague. We might only have the names of each person in this example, but we have both of them together. This can allow us to make extrapolations.

Consider below, where we are interested in a Japanese contact named Mayako Tanaka:

Similar names but no other information… insufficient for high-confidence matches.

We have records for a rural resident, a sporting event spectator, a passenger on a flight, an employee and a visitor to another business. With just names in common we certainly do not have enough information to confidently connect (or exclude) any of these.

But what if we add in some other names, data which came in with the above?

Still name-only data but the additional records make a major difference…

Now let’s consider probabilities. How likely is it that the “Tanaka Mayako” & “Chris Jones” who met yesterday at Kasai-san’s office (orange dot) are the same people who were at the sporting event? Or, to put it more logically, what is the probability that these two names appearing together in two different contexts do not belong to the same two people?

The same goes for all the variants: Mayako Tanaka appearing at her workplace, on a flight and then at a meeting in London (red dot), together with “Hiro” Hiroyuki Watanabe. And at the meeting (red dot) we also see Chris Jones appearing again. It is highly likely that these are the same people — even though we only have their names.
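The intuition that a repeated pair of names is far stronger evidence than a repeated single name can be made concrete with a rough calculation. The sketch below is only illustrative: the name frequencies and the number of candidates are made-up values, the function name is hypothetical, and it assumes that unrelated people's names are statistically independent, which real data will not fully satisfy.

```python
def chance_of_coincidence(p_match: float, n_candidates: int) -> float:
    """Probability that at least one of n_candidates unrelated
    candidates matches purely by chance, each with probability p_match."""
    return 1.0 - (1.0 - p_match) ** n_candidates


# Hypothetical inputs, not taken from any real name-frequency table.
FREQ_TANAKA_MAYAKO = 1e-4   # assumed share of people with this name
FREQ_CHRIS_JONES = 1e-4     # assumed share of people with this name
N_CANDIDATES = 10_000       # assumed people (or pairs) in the second context

# A lone name reappearing somewhere else by coincidence:
p_name_only = chance_of_coincidence(FREQ_TANAKA_MAYAKO, N_CANDIDATES)

# The same two names reappearing *together* by coincidence
# (independence assumption: multiply the two frequencies):
p_pair = chance_of_coincidence(FREQ_TANAKA_MAYAKO * FREQ_CHRIS_JONES, N_CANDIDATES)

print(f"name only: {p_name_only:.4f}")
print(f"name pair: {p_pair:.6f}")
```

With these assumed inputs, a coincidental repeat of a single name is more likely than not, while a coincidental repeat of the pair is rare. The exact figures matter much less than the gap between them.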

Conversely, it is unlikely that “Maya Tanaka” from a wasabi farm in Shizuoka-ken is the same person as M. TANAKA who flew to London with H. WATANABE, or at least we have no reason to connect them apart from casual name similarity. In addition, the wasabi farmer seems to have a partner whose name (Keinosuke Tanaka) appears nowhere else.
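The reasoning in the last few paragraphs can be sketched as a small co-occurrence check: two records that both contain a name resembling our target are linked only if they also share a companion name. The record contents below are reconstructed loosely from the examples in the text, and the context labels, helper names and the crude prefix-based name comparison are all assumptions made for illustration, not part of any particular IDR product.

```python
from itertools import combinations

# Illustrative name-only records, keyed by a hypothetical context label.
records = {
    "farm":    ["Maya Tanaka", "Keinosuke Tanaka"],
    "stadium": ["Tanaka Mayako", "Chris Jones"],
    "flight":  ["M. TANAKA", "H. WATANABE"],
    "office":  ["Mayako Tanaka", "Hiro Watanabe"],
    "visit":   ["Tanaka Mayako", "Chris Jones"],
    "meeting": ["Mayako Tanaka", "Hiroyuki Watanabe", "Chris Jones"],
}
TARGET = "Mayako Tanaka"


def tokens(name):
    """Lower-case the name parts and drop trailing dots from initials."""
    return [t.strip(".").lower() for t in name.split()]


def same_token(x, y):
    """Treat initials and short forms as matches (one is a prefix of the other)."""
    return x.startswith(y) or y.startswith(x)


def compatible(a, b):
    """Loose two-part name match that ignores given/family name order."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) != 2 or len(tb) != 2:
        return False
    straight = same_token(ta[0], tb[0]) and same_token(ta[1], tb[1])
    swapped = same_token(ta[0], tb[1]) and same_token(ta[1], tb[0])
    return straight or swapped


def companions(names, target):
    """Everyone in the record who is not a candidate for the target."""
    return [n for n in names if not compatible(n, target)]


def link_records(records, target):
    """Link pairs of records that mention the target AND share a companion."""
    links = {}
    for (ctx_a, names_a), (ctx_b, names_b) in combinations(records.items(), 2):
        if not any(compatible(n, target) for n in names_a):
            continue
        if not any(compatible(n, target) for n in names_b):
            continue
        shared = [(a, b)
                  for a in companions(names_a, target)
                  for b in companions(names_b, target)
                  if compatible(a, b)]
        if shared:
            links[(ctx_a, ctx_b)] = shared
    return links


links = link_records(records, TARGET)
for pair, shared in links.items():
    print(pair, "share", shared)
```

On this sample the farm record is left unlinked, because its only companion name matches nothing elsewhere, while the other five records connect through Chris Jones or one of the Watanabe variants. The prefix match is deliberately crude and would over-match on real data; a production system would use proper name-matching logic at that step.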

The records we connect using only Name will of course bring with them other data to our MDM equation, such as dates, events, addresses and other information, as well as hooks into the other individuals we have mentioned earlier (Chris Jones & Hiro Watanabe). We can progressively collect and connect these names and move towards a sophisticated knowledge graph binding all the data together.
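As a minimal sketch of that progression, the connected records can be stored as a small graph in which people and records are both nodes. The example assumes the name matching has already been done; the person identifiers, record labels and attached facts are hypothetical values loosely based on the examples in this article.

```python
from collections import defaultdict, deque

# Hypothetical resolved appearances: (person_id, record_id).
appearances = [
    ("mayako_tanaka", "office"), ("hiro_watanabe", "office"),
    ("mayako_tanaka", "flight"), ("hiro_watanabe", "flight"),
    ("mayako_tanaka", "meeting"), ("hiro_watanabe", "meeting"),
    ("chris_jones", "meeting"),
    ("mayako_tanaka", "visit"), ("chris_jones", "visit"),
    ("maya_tanaka_farm", "farm"), ("keinosuke_tanaka", "farm"),
]

# Illustrative facts that each record carries along with the names.
record_facts = {
    "flight": {"destination": "London"},
    "meeting": {"city": "London"},
    "visit": {"host": "Kasai"},
    "farm": {"prefecture": "Shizuoka"},
}

# Undirected bipartite graph: person nodes <-> record nodes.
graph = defaultdict(set)
for person, record in appearances:
    graph[("person", person)].add(("record", record))
    graph[("record", record)].add(("person", person))


def component(start):
    """Breadth-first search: every node reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def facts_for(person):
    """Facts inherited from the records a person appears in directly."""
    return {rec: record_facts.get(rec, {})
            for _, rec in graph[("person", person)]}


cluster = component(("person", "mayako_tanaka"))
print(sorted(name for kind, name in cluster if kind == "person"))
print(facts_for("mayako_tanaka"))
```

Each new record that gets linked adds both its facts and its other participants to the graph, so every later match has more context to work with than the one before.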

Creative approaches with multiple sparse data sets can lead all the way to sophisticated, multi-layered progressive knowledge graphs.

It must be stated that we have glossed over much of the “devil in the detail” in the processing steps that follow from an initial name-based exercise, but this constructivistic approach to MDM can overlay all kinds of data and connections to produce very compelling, and useful, results in furtherance of our MDM goals.
