Applying Machine Learning to Data Stewardship

Madhu Kochar
Inside Machine learning
6 min read · Aug 12, 2019

Madhu Kochar and Martin Oberhofer

photo: Herbert Ponting

Master Data Management (MDM) solutions are designed to manage your master data effectively. Suppose, for example, that bank A acquires bank B in a merger and acquisition. The IT departments then need to reconcile the systems that manage master data such as customers, products, employees, and suppliers. An MDM system helps with that process by using a matching algorithm to decide whether, for example, a person record for John Doe from bank A refers to the same person as a person record for Jim Doe from bank B. This is shown in Figure 1 below. To answer this question, the matching algorithm compares attributes such as name, nationality, date of birth, and social security number and, depending on how similar or distant the values in these fields are, assigns weights to them (these are the numbers like 1.2, 7.4, etc.).

In IBM MDM, the matching algorithm uses many different individual functions, such as edit distance, phonetics, nickname resolution, geocode resolution, and transliteration, to allow a deep, fuzzy inspection in real time at big data scale; our largest customers run on billions of records. The total score of 16.2 in the example in the figure is the sum of the individual weights on the attributes. The total score is then compared against the lower threshold (10.0 in the example) and the upper threshold (20 in the example).

Depending on that comparison, the match result is classified as a no link, clerical, or autolink case (autolink cases are also known as merge cases). The results in the clerical category are also known as duplicate suspect tasks; they form the task list for the data steward.

Figure 1
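
The scoring and thresholding described above can be sketched in a few lines of Python. This is only an illustration, not IBM MDM's implementation: the function name `classify_match`, the attribute-to-weight assignment, and the weights 3.5 and 4.1 are invented so that the total comes out at the 16.2 from the example. How a score that lands exactly on a threshold is treated is also an assumption.

```python
from typing import Dict

LOWER_THRESHOLD = 10.0  # lower threshold from the example in the post
UPPER_THRESHOLD = 20.0  # upper threshold from the example in the post


def classify_match(weights: Dict[str, float],
                   lower: float = LOWER_THRESHOLD,
                   upper: float = UPPER_THRESHOLD) -> str:
    """Sum the per-attribute weights and classify the record pair.

    Assumption: a score exactly on the lower threshold counts as clerical,
    and a score exactly on the upper threshold counts as autolink.
    """
    total = sum(weights.values())
    if total < lower:
        return "no link"
    if total >= upper:
        return "autolink"
    return "clerical"


# Hypothetical weights: 1.2 and 7.4 appear in the post; the other two values
# and the attribute each weight belongs to are made up for illustration.
example_weights = {
    "name": 1.2,
    "date_of_birth": 7.4,
    "nationality": 3.5,
    "social_security_number": 4.1,
}

total_score = round(sum(example_weights.values()), 1)
result = classify_match(example_weights)
print(total_score, result)
```

With these weights the pair falls between the two thresholds, so it would be routed to a data steward as a duplicate suspect task.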

Looking at this task list, a day in the life of a data steward resolving duplicate suspect tasks might feel like being Phil Connors, played by Bill Murray in Groundhog Day, or Charlie Chaplin working on the assembly line in Modern Times: doing the same thing over and over and over again. As a result, job satisfaction is low, and we heard from our clients that there is a significant amount of turnover in their data stewardship teams. Digging into that problem, we found:

  • When a master data management (MDM) system is deployed, it is a net new technology providing a sophisticated function known as matching, which in essence is the capability to identify whether two or more data records from the same or different sources are duplicates and should be reconciled.
  • No matter which matching technology is used, there is a risk of false negatives and false positives. Too many false negatives mean you might miss cases that are relevant for Anti-Money Laundering (AML) compliance. False positives might cost you reputation if customers get upset because their data has been erroneously merged with somebody else's. For fear of such cases, the thresholds for the boundary between no link and clerical task and for the boundary between clerical task and autolink are initially set in such a way that many more decisions than necessary are routed to data stewards. In other words, a lot of cases that could have been decided automatically end up in the clerical task bucket, which bores data stewards once they realize it's more of the same every day.
  • Making matters worse, consider the following: a medium-size MDM project might load 20 million records into MDM on initial deployment. If only 1% of these records produce a match result in the clerical range for review by data stewards, that's 200,000 tasks. If a data steward is able to resolve, let's say, 200 tasks a day, a single data steward would need 1,000 working days, roughly three years with no weekends off, to clean up that task list. Keep in mind that during these three years new tasks might be created as additional sources are onboarded or a growing business creates new customer records, so it's very hard to ever catch up.
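
The backlog estimate in the last bullet is simple arithmetic, and it can be checked with a short script. The variable names are illustrative; the input numbers are the ones from the text.

```python
# Inputs taken from the example in the text.
records_loaded = 20_000_000   # records loaded on initial deployment
clerical_percent = 1          # share of records landing in the clerical range
tasks_per_day = 200           # tasks one data steward resolves per day

# Integer arithmetic avoids floating-point noise in the task count.
clerical_tasks = records_loaded * clerical_percent // 100
days_needed = clerical_tasks / tasks_per_day
years_needed = days_needed / 365  # working every calendar day, no weekends

print(f"{clerical_tasks:,} tasks, {days_needed:,.0f} days, "
      f"{years_needed:.1f} years for one steward")
```

The result is a little under three years for a single steward, before counting any tasks that arrive in the meantime.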

To address these problems, we applied machine learning in IBM MDM to significantly reduce the amount of work for data stewards. As input data for training the machine learning algorithms, we used the resolution history of tasks that data stewards had already completed. For each duplicate suspect task, we knew from the data steward's resolution whether it ultimately was a no link or a merge case, which makes it labeled data. The key insight here is that machine learning lets us capture the wisdom of the data stewards in a prediction model that can predict, with a high degree of certainty, how a human data steward would resolve a new task.
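
To make the idea of "resolution history as labeled data" concrete, here is a minimal sketch of how resolved tasks could be turned into feature rows and labels. The post does not say which features were used; using the per-attribute match weights as features is an assumption, and `ResolvedTask`, `FEATURE_ORDER`, and `to_training_rows` are hypothetical names, not part of IBM MDM.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed feature set: one match weight per compared attribute.
FEATURE_ORDER = ["name", "date_of_birth", "nationality", "social_security_number"]


@dataclass
class ResolvedTask:
    """One duplicate suspect task that a data steward has already resolved."""
    task_id: str
    attribute_weights: Dict[str, float]
    resolution: str  # "merge" or "no link", as decided by the steward


def to_training_rows(history: List[ResolvedTask]) -> Tuple[List[List[float]], List[int]]:
    """Convert resolved tasks into (features, labels); 1 = merge, 0 = no link."""
    features: List[List[float]] = []
    labels: List[int] = []
    for task in history:
        if task.resolution not in ("merge", "no link"):
            raise ValueError(f"unexpected resolution: {task.resolution!r}")
        # Missing attributes default to a weight of 0.0 (an assumption).
        features.append([task.attribute_weights.get(name, 0.0) for name in FEATURE_ORDER])
        labels.append(1 if task.resolution == "merge" else 0)
    return features, labels


# Made-up history entries for illustration only.
history = [
    ResolvedTask("t1", {"name": 1.2, "date_of_birth": 7.4, "nationality": 3.5,
                        "social_security_number": 4.1}, "merge"),
    ResolvedTask("t2", {"name": 4.0, "date_of_birth": 0.5, "nationality": 3.5}, "no link"),
]
X, y = to_training_rows(history)
print(X, y)
```

Rows in this shape can be passed to most classification libraries, since they are just a list of numeric feature vectors with one label each.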

This is insight you can derive by applying machine learning to what the data stewards have actually done. With that insight, you can then assist data stewards in two ways. First, while introducing the new functionality, you can show the data stewards a recommendation, based on the machine learning prediction, on how to resolve the clerical task. This already reduces the work per task significantly, to mostly review and approve.

Second, once they trust the system, the data stewards can enable a flag so that the machine learning backend takes the action automatically if the prediction is above a certain confidence threshold, let's say more than 95% certainty in the recommendation. Once this flag is enabled, such tasks no longer even show up in the data stewards' task list.
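
The two assistance modes can be summarized as a small routing rule. The sketch below is a simplified illustration under stated assumptions: `Prediction` and `route_task` are hypothetical names, and the 0.95 default simply mirrors the "more than 95%" example in the text.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Model output for one clerical task (hypothetical structure)."""
    label: str         # predicted resolution: "merge" or "no link"
    confidence: float  # model certainty in that label, between 0 and 1


def route_task(prediction: Prediction,
               auto_resolve_enabled: bool,
               threshold: float = 0.95) -> str:
    """Decide whether a task is resolved automatically or shown to a steward."""
    # Mode 2: the flag is on and the model is more than `threshold` certain,
    # so the task never reaches the steward's task list.
    if auto_resolve_enabled and prediction.confidence > threshold:
        return f"auto-resolve as {prediction.label}"
    # Mode 1 (and fallback): the steward sees the task with a recommendation.
    return f"show to steward with recommendation: {prediction.label}"


print(route_task(Prediction("merge", 0.98), auto_resolve_enabled=False))
print(route_task(Prediction("merge", 0.98), auto_resolve_enabled=True))
print(route_task(Prediction("no link", 0.80), auto_resolve_enabled=True))
```

Keeping the recommendation path as the fallback means that low-confidence predictions still reach a human even after the flag is enabled.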

We compared over a dozen different machine learning algorithms in terms of prediction quality for this particular problem and found that a Random Forest model produced the best results. However, to reach very good prediction quality, such a model required approximately 5,000 tasks resolved by data stewards as training data. By exploring clustering and active learning approaches, we were able to reduce this requirement to approximately 250 tasks to train the Random Forest model to the same level of accuracy.
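
The post does not say which active learning strategy was used. As an illustration of the general idea, the sketch below shows uncertainty sampling, one common strategy: instead of labeling tasks at random, the stewards are asked to resolve the tasks the current model is least sure about. The function name `select_tasks_to_label` and the probabilities are made up, and this is not a description of the IBM MDM implementation.

```python
from typing import Dict, List


def select_tasks_to_label(merge_probabilities: Dict[str, float], budget: int) -> List[str]:
    """Pick the `budget` tasks whose predicted merge probability is closest to 0.5.

    A probability near 0.5 means the model cannot tell merge from no link,
    so a steward's decision on that task is likely to be most informative.
    """
    ranked = sorted(
        merge_probabilities,
        key=lambda task_id: abs(merge_probabilities[task_id] - 0.5),
    )
    return ranked[:budget]


# Hypothetical model outputs for four unresolved tasks.
unlabeled = {"t1": 0.97, "t2": 0.52, "t3": 0.08, "t4": 0.41}
to_label = select_tasks_to_label(unlabeled, budget=2)
print(to_label)
```

In a full loop, the model would be retrained after each batch of steward decisions and the remaining tasks rescored, which is how such approaches can reach a given accuracy with far fewer labeled tasks.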

Figure 2 shows conceptually the flow of the solution which we released earlier in 2019.

Figure 2

Here’s a figure that combines the figure above with the figure at the start of the post:

Figure 3

In summary, by bringing machine learning to MDM, we could reduce the number of tasks created for data stewards by up to two thirds, which yields a significant reduction in work for the data stewardship team and hence also a significant reduction in labor costs. At the same time, the remaining tasks are less repetitive and more complex, which lets the data stewards focus on more interesting work. If you want to learn more about this IBM MDM innovation, you can find more here: IBM MDM Machine Learning

Madhu Kochar
Inside Machine learning

VP @IBM, Analytics and Data- Public and Private Cloud. DevOps, Hybrid, Enterprise clients. Opinions are mine.