What is “Truth” in Data?

Mar 22, 2024 · 9 min read

by Warwick Matthews & John Nicodemo

This article is also available in Japanese here.

Master Data Management (MDM) could be said to be an exercise in Truth.

No, seriously!

Truth, Justice and the MDM way…

MDM aims to make available across the organisation a dataset that is comprehensive, accurate and timely. The degree to which data possesses these qualities will greatly influence how valuable (that is, useful) it is. One could even say that “truth” in business data is all about accuracy and timeliness — was the data “true” when it was collected (and then distilled into a Golden Record) and was that recently enough that this is still the case?

That all being said, there is no guarantee that timely data is accurate data. Data can be years old — but still be accurate. Data can be as recent as an hour ago — and be inaccurate. The opposite of both could also be true! What is an “accurate” record? What makes data “true”?

Classical MDM and Golden Records reflect a positivistic view of the Universe, which is carried through to data entities (people, companies, places, things etc). In this view there is an objective Truth about every entity — and our job is to discover that truth and pass it along to our stakeholders.

The “positivistic” data worldview was touched on in a previous article.

We can think of this positivistic approach as akin to using a telescope to find a known object in order to understand it better. That is, we know our target is out there; we just need more information about it.

Credit: Freepik

But is this really how our data works?

Our classical MDM system (sketched in code after this list):

  • Receives large amounts of data of multifarious formats, provenance and quality;
  • Attempts to classify and cleanse that data;
  • Transforms that data to a standard format and loads it to a dedicated environment;
  • Matches and clusters the data into coherent records (hopefully);
  • Analyses those records and selects the best parts to use in a Golden Record;
  • Uses that Golden Record to fulfil data requests for constituents.
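
To make that concrete, here is a toy sketch of the pipeline in Python. Everything in it is invented for illustration (the records, the match key on email, the “most common non-null value” survivorship rule); a real MDM platform does each step with far more sophistication.

```python
from collections import defaultdict

# Toy inbound records: multifarious formats and quality (all values invented).
inbound = [
    {"name": "Lorelai Gilmore", "email": "LG@EXAMPLE.COM ", "phone": None},
    {"name": "lorelai gilmore", "email": "lg@example.com", "phone": "555-0100"},
    {"name": "L. Gilmore", "email": "lg@example.com", "phone": "555-0199"},
]

def cleanse(rec):
    """Classify/cleanse: normalise casing and strip stray whitespace."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in rec.items()}

def match_key(rec):
    """Match/cluster: naively cluster on email alone."""
    return rec["email"]

def survivorship(cluster):
    """Select the 'best parts' for the Golden Record: the most common
    non-null value per attribute (a deliberately crude rule)."""
    golden = {}
    for field in ("name", "email", "phone"):
        values = [r[field] for r in cluster if r[field]]
        golden[field] = max(set(values), key=values.count) if values else None
    return golden

# Transform to a standard format and "load", then match and cluster.
clusters = defaultdict(list)
for rec in map(cleanse, inbound):
    clusters[match_key(rec)].append(rec)

# Analyse each cluster and emit a Golden Record to serve to constituents.
golden_records = [survivorship(c) for c in clusters.values()]
print(golden_records)
```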

Traditional MDM wisdom is to add more data points to improve our resolution, i.e. the more we know, the more we know about a person or other entity.

As we match new data to an entity record we can understand more about it.

Question: at what point in this process do we actually know the entities (people, places, things) we are dealing with?

(Hint: we don’t).

MDM systems generally function on the premise that clustering and filtering records based on shared (or at least similar) attributes will yield the truth about an entity. That is, if you have a bunch of records and you accurately match them to your customer file, then you will likely end up knowing more (or at least having more data) about those customers.

When does data become a “real” person?

As a general rule this produces good records as outputs… but it is not actually how the data works at all. In reality our system receives and matches a handful of attributes in records that arrive via disparate vectors, and then we make assumptions about the person or company behind them. If Record #1 contains what appears to be a personal name and an email address, and we match it to Record #2 containing the same email and a postal address, we conclude that we now know three pieces of information about that person.
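
Here is that inference in miniature, with invented records: the two fragments share only an email address, yet after matching we behave as if we now hold three facts about one person.

```python
record_1 = {"name": "Rory Gilmore", "email": "rory@example.com"}
record_2 = {"email": "rory@example.com", "address": "37 Maple St, Stars Hollow"}

# The match: one shared attribute.
if record_1["email"] == record_2["email"]:
    # The constructed "person": three claimed facts, though no single
    # source ever asserted all three together.
    person = {**record_1, **record_2}
    print(person)
```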

In reality that process is not positivistic (discover the truth) at all, but constructivistic (build the truth); only our subjective conclusion is positivistic, i.e. we believe we are learning more about a real, live person.

As humans we need to believe there is something behind the data.

In reality the MDM system is merely connecting data points and we humans are concluding that the connected dataset tells a story of an objectively real entity.

Put differently, the principle of constructivistic truth says that the “Truth” about an entity is contextual. In other words, it doesn’t matter what might be objectively true in theory if we can do what we need to do with the data we have.

Remember the scene in the film “A Few Good Men” where Tom Cruise’s character Daniel Kaffee exclaims in exasperation to Joanne Galloway (Demi Moore)?

“It doesn’t matter what I believe — it only matters what I can prove!”

© 1992 Columbia Pictures

This is like constructivistic identity — we assess the records as received, not measured against some theoretical definition of the “perfect” record.

“That’s all well and good” you might say, “but what about truth in the real world of enterprise data? Surely it’s not that complicated!”

Your business probably receives data directly or indirectly from customers, and why wouldn’t those records be accurate — why can’t they be a ground truth?

Firstly, customers show us only what they want to show us — and it is trite to say that today’s consumers (especially online) increasingly have an instinctive sense of the fungibility of their personas — they can adapt them to suit the needs of the moment. So yes, customers do not necessarily tell the truth all the time — a real-life example is that I routinely lie about my date of birth on websites that casually ask for it. One’s DOB is invaluable information to cybercriminals, and so I do not share mine lightly.

But even more than this, for all our technological achievements the collection, collation and use of data is still a fundamentally human endeavour. People mishear and/or mistype names, conflate records, or just load the wrong file.

Having an Australian accent & a name like “Warwick” in a NYC coffee shop is a major MDM challenge!

Consider the classic TV series Gilmore Girls. The two main characters in the show are Lorelai Gilmore and her teenage daughter Rory. Except that Rory’s real name is… Lorelai Gilmore. And for those who know the show well, there is a third Lorelai Gilmore, Rory’s great-grandmother. If our business builds a single record for “Lorelai Gilmore” at 37 Maple Street, Stars Hollow, and we do so by conflating data originating from both Rory and her mother Lorelai, is that data inaccurate and therefore “untrue”?

The Three Lorelais of Gilmore Girls

Objectively (positivistically), the answer is “Yes!”… but constructivistically the answer is probably “Who cares?!”.
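
For the record-matching point underneath the joke, here is a hypothetical sketch (dates of birth invented): a naive match key built from name plus address cannot tell mother and daughter apart.

```python
# Both records are real people, but a naive match key cannot separate them.
records = [
    {"name": "Lorelai Gilmore", "address": "37 maple st, stars hollow",
     "dob": "1968-01-01"},   # the mother (DOB invented)
    {"name": "Lorelai Gilmore", "address": "37 maple st, stars hollow",
     "dob": "1984-01-01"},   # Rory (DOB invented)
]

# Naive match key: name + address.
keys = {(r["name"].lower(), r["address"]) for r in records}
print(len(keys))  # 1: mother and daughter conflate into a single "entity"
```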

We will discuss this further when we delve into “fit for purpose Identity” in our fifth & final piece.

In an earlier article we mentioned the concept of Corroboration (sometimes also referred to as “pluralization”), via count and volume of sources e.g. 3 out of 4 sources have the same residential address for a person we have matched via name. Corroboration/Pluralization does have merit here in establishing “truth” by improving the odds of data being measurably accurate.
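
A minimal sketch of corroboration by source count, with invented source names and values: three of four sources agree on an address, so that value wins with a measurable confidence.

```python
from collections import Counter

# Four sources' view of one matched person's residential address (invented).
source_values = {
    "crm":        "12 elm st",
    "billing":    "12 elm st",
    "web_signup": "12 elm st",
    "legacy_erp": "90 oak ave",
}

counts = Counter(source_values.values())
value, votes = counts.most_common(1)[0]
confidence = votes / len(source_values)
print(value, confidence)  # '12 elm st' 0.75, corroborated by 3 of 4 sources
```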

However, another shady neighbour of the “more is better” school of thought is what we call the “Triangulation Fallacy”. MDM systems use their Identity Resolution (IDR) capabilities to attach extra data to a golden record from multiple different sources, and we divine the probable truth of an entity by “triangulating” the same facts across those sources. A simple example is mailing address: if we see the same address for a contact corroborated 3 or 4 times, there’s a good chance it’s correct, right?

Well… maybe not. Receiving the same data over a period of time from a source might mean it has been confirmed repeatedly by that source… but more likely it is the same record sent to us many times, with the “update date” re-stamped each time a data transfer happened.
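
One hedge against the Triangulation Fallacy is to corroborate over distinct observations rather than raw deliveries: fingerprint the substantive payload, ignore the ever-fresh update date, and let the same record resent three times count once. A sketch, with invented fields:

```python
import hashlib

deliveries = [
    {"source": "partner_feed", "address": "12 elm st", "update_date": "2024-01-05"},
    {"source": "partner_feed", "address": "12 elm st", "update_date": "2024-02-05"},
    {"source": "partner_feed", "address": "12 elm st", "update_date": "2024-03-05"},
    {"source": "crm",          "address": "12 elm st", "update_date": "2024-03-01"},
]

def fingerprint(rec):
    """Hash the substantive payload only; the update_date is noise here."""
    return hashlib.sha256(f"{rec['source']}|{rec['address']}".encode()).hexdigest()

distinct = {fingerprint(r) for r in deliveries}
print(len(deliveries), "deliveries, but only", len(distinct), "distinct observations")
# 4 deliveries, but only 2 distinct observations: far weaker corroboration.
```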

Is more data better…or just the same truth, repeated?

The more insidious danger is data being self-referential: data that appears to come from multiple sources but is in fact from a single origin. Good for coffee, not so good for data. This can happen for many reasons, a common one being that our company and our data partner share the same MDM technology, which is in turn plugged into the same data provider (e.g. Experian, Dun & Bradstreet, Acxiom, Equifax etc). If both parties have paid for data enrichment from the same source there is a non-zero risk that records will start to converge without either side being aware of this fact.

In an even worse scenario, both authors of this article have been in situations where the data being acquired by our enterprise was not only self-referential, but was revealed upon investigation (of suspiciously high match rates… TANSTAAFL) to have actually originated with us… in other words, we were being sold back our own data! In this age of ubiquitous real-time data capture this is a growing risk for MDM (there is a lot more “inbred” data in the industry than is generally acknowledged).
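
There is no foolproof defence, but one cheap smoke test (our suggestion, not a standard practice) is to fingerprint an inbound enrichment feed against your own outbound extracts and measure the overlap; a suspiciously high match rate warrants investigation:

```python
import hashlib

def fingerprints(records):
    """Fingerprint records on substantive content (fields invented)."""
    return {hashlib.sha256(f"{r['name']}|{r['email']}".encode()).hexdigest()
            for r in records}

# What we previously sold/shared, vs. the "new" enrichment feed we are buying.
our_extract = [
    {"name": "warwick matthews", "email": "wm@example.com"},
    {"name": "john nicodemo",    "email": "jn@example.com"},
]
inbound_feed = [
    {"name": "warwick matthews", "email": "wm@example.com"},
    {"name": "john nicodemo",    "email": "jn@example.com"},
    {"name": "lorelai gilmore",  "email": "lg@example.com"},
]

overlap = fingerprints(our_extract) & fingerprints(inbound_feed)
rate = len(overlap) / len(inbound_feed)
print(f"{rate:.0%} of the 'new' feed matches data we already exported")
if rate > 0.5:  # threshold is illustrative, not an industry standard
    print("Suspiciously high match rate: are we buying back our own data?")
```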

One additional thought on establishing a “working truth”. A good, robust approach is the “Dipstick method”: randomly sampling data, then investigating it for accuracy against formal definitions of what “accurate” looks like. Sampling does not have to be burdensome or super-sophisticated, so long as we adhere to a few core principles (a toy sketch follows the list):

  1. Ensure that your frame of reference is truly representative of the overall population,
  2. Have a consistent, practical operational definition of “accuracy” (e.g. the phone ultimately connects to the right person),
  3. Ensure the adjudication process is homogenous and consistent, and
  4. Don’t stop whenever you find something; complete the entire sample.
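
Here is a toy sketch of the Dipstick method under those principles; the population, sample size and adjudication function are all placeholders:

```python
import random

random.seed(42)  # reproducible sampling

# Placeholder population: in practice, the full customer file.
population = [{"id": i, "phone_connects": random.random() < 0.8}
              for i in range(100_000)]

def adjudicate(rec):
    """One consistent operational definition of 'accurate':
    the phone ultimately connects to the right person."""
    return rec["phone_connects"]

sample = random.sample(population, 400)    # representative random draw
results = [adjudicate(r) for r in sample]  # complete the ENTIRE sample
accuracy = sum(results) / len(results)
print(f"Estimated accuracy: {accuracy:.1%} (n={len(results)})")
```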

Once you have your sample, create some analytics around it, identifying predictors that perform best against your sample (e.g. recency; frequency; even demographics).

From there you can potentially use analogues (lookalikes) to identify more accurate/true sections of your data — as well as sections that are not accurate or of unknown quality.

Sampling & testing can also be hypothesis-driven. For example: when we have an accurate name and address, what is the likelihood of also having a correct phone number? We can then use that answer to build an overall operational quality plan for the data.
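
That hypothesis is just a conditional probability estimated from an adjudicated sample: P(phone correct | name and address correct). A sketch with invented adjudication results:

```python
# Adjudicated sample records (invented): flags for which fields proved correct.
sample = [
    {"name_ok": True,  "address_ok": True,  "phone_ok": True},
    {"name_ok": True,  "address_ok": True,  "phone_ok": False},
    {"name_ok": True,  "address_ok": True,  "phone_ok": True},
    {"name_ok": True,  "address_ok": False, "phone_ok": False},
    {"name_ok": False, "address_ok": True,  "phone_ok": True},
]

given = [r for r in sample if r["name_ok"] and r["address_ok"]]
p = sum(r["phone_ok"] for r in given) / len(given)
print(f"P(phone correct | name & address correct) = {p:.0%}")  # 67%
```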

One final reminder: quality is in the eye of the beholder, just as what is considered “golden” is in the eye of the beholder. For example, a contact at a business may be “accurate” to our salesperson (i.e. it connects to the person who can say “yes” to a contract), but if it is not the operational contact at the company it will not be “accurate” from the perspective of the support team actually managing the ongoing relationship.

In our next article we will delve into the ultimate limitations of the positivistic Golden Record regime, before we posit in our final piece a new view of what “good” means in MDM, and how we might build systems and processes to give effect to that new paradigm.

John Nicodemo is one of America’s preeminent data leaders, with a career that has included management of data & content teams in the US, Canada and globally. He has led data management organisations in major businesses including Dun & Bradstreet and Loblaw Companies Limited (Canada’s largest retail group), and has been called upon to work with some of the world’s top companies on global data strategy and solutions. He is presently advising the U.S. National Football League as they completely reinvent their fan intelligence and data sharing ecosystem.

Warwick Matthews is a speaker, entrepreneur and experienced CDO with over 15 years of expertise in designing, building and managing complex global data, multilingual MDM, identity resolution and data supply chain systems, building new best-in-class solutions and integrating third-party platforms for major corporations in North America and Asia-Pacific. He is currently Chief Data Geek (aka CDO & CTO) of Compliance Data Lab (コンプライアンス・データラボ株式会社), headquartered in Tokyo, Japan.
