Using Machine Learning to enhance legacy Data Management

Published in shipzero · 5 min read · May 13, 2019

When implementing Artificial Intelligence (AI) driven solutions, the demands on the underlying data management system are challenging. On the one hand, we are talking about the typical 4 V’s of big data: volume, variety, velocity, and veracity. On the other hand, data quality and data governance play a crucial role as well. This is where Master Data Management (MDM) comes into play as an enabler for innovative and service-oriented data management architectures.

But can this work the other way around? Can AI improve the effectiveness or the efficiency of Master Data Management? Yes and no. This article covers the how, the why, and the why not.

A New Mindset in Data Management

When deriving information from data, business analysts and data scientists reportedly spend about 80% of their time finding, cleaning, and reorganizing relevant data sets, which leaves only about 20% for work that actually generates value. On top of that, data-savvy analysts and data scientists are becoming an increasingly scarce and therefore expensive resource.

Therefore, corporate budgets are shifting from hardware and infrastructure investments towards making the best use of existing resources and data assets.

“What you’ve seen over the last couple of years is the resurgence about MDM and looking at MDM as being a force of disruption for the digital transformation.”

This is how Suresh Menon, VP and GM for Information Quality Solutions at Informatica, describes the growing importance of Master Data Management. So it is only natural that emerging technologies such as Artificial Intelligence are thrown into the mix to support the cause.

Specifying the AI Use Case for MDM

Most experts agree that AI will have an impact on pretty much every aspect of our lives, including our jobs. But this says nothing about how many of our current tasks will be replaced or augmented by algorithms, or to what degree. So, in the case of MDM, we need to split the setup process of a typical project into its core pieces:

Definition of Business Goals
Identification of Master Data & Data Sources
Analysis of Metadata & Data Lifecycles
Involvement of Stakeholders
Evaluation of Infrastructure
Validation of Outputs

To derive actual value from AI, or in this context from its subset Machine Learning (ML), the evaluation must focus on the data-driven subprocesses:

Setting up an MDM project: data-driven subprocesses

Identification of Master Data & Data Sources

Many organizations struggle with legacy systems and historically grown data structures. In this case, algorithms are a handy way to identify data of interest, because usually no single person in the organization knows all business-relevant attributes of an entity. Machine Learning approaches can help identify frequently used data and classify it.

Typical techniques for these kinds of classification problems are Support Vector Classifiers (SVCs) and k-nearest neighbors (KNN) classification. And where a lot of text documents serve as the source of descriptive information (as opposed to structured metadata), Natural Language Processing (NLP) can significantly reduce the time needed to analyze convoluted technical documentation.
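To make the KNN idea a bit more tangible, here is a minimal sketch in plain Python. It is not taken from any MDM product: the two features per attribute (share of systems using it, relative read frequency), the labels, and all values are made up for illustration, and a real project would more likely rely on an established ML library.

```python
import math
from collections import Counter

# Hypothetical training data: each attribute is described by two invented,
# already normalized features (share of systems using it, relative read frequency).
LABELED = [
    ((0.9, 0.8), "frequently_used"),
    ((0.8, 0.9), "frequently_used"),
    ((0.7, 0.7), "frequently_used"),
    ((0.2, 0.1), "rarely_used"),
    ((0.1, 0.3), "rarely_used"),
    ((0.3, 0.2), "rarely_used"),
]


def knn_classify(point, labeled=LABELED, k=3):
    """Return the majority label among the k labeled points closest to `point`."""
    nearest = sorted(labeled, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify((0.85, 0.75)))  # lies close to the frequently used examples
print(knn_classify((0.15, 0.20)))  # lies close to the rarely used examples
```

The classifier only says which known group a new attribute resembles most. Whether that group deserves attention at all remains a business decision.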

ML cannot, though, identify which data is Master Data. Because intangible criteria are involved, for example the business value of a piece of information, the algorithms can only provide hints and, once we have identified our Master Data, find all of its sources. Those sources are critical for the further steps of in-depth analysis. The resulting, hopefully complete, data dictionary supports entity resolution by providing a feature-rich set against which ML algorithms can be run.

Analysis of Metadata and Data Lifecycles

Machine Learning techniques may be used to identify and resolve link candidates and specify link type as well as link strength. As there might be various types of information and linkages involved, an iterative multi-tier approach is needed.

One option would be to start with automated data labeling through Active Learning. It addresses the challenge of labeling data by modeling the process of obtaining labels for structured data that, in the context of its metadata and lifecycle attribution, has not yet been unambiguously categorized.

This approach needs to be combined with some manual assessment to get a handle on the functional relationships between the labels identified before. This linkage to the internal reference system is what describes the data element from a semantic perspective. But it can also link the data with external, publicly available sets of definitions to validate attributes or gather additional information.
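The Active Learning loop described above can be sketched roughly as follows. This is an illustrative assumption, not a reference implementation: it uses a simple nearest-centroid model with uncertainty sampling, the features (update frequency, share of consuming systems) and the labels "static" and "volatile" are hypothetical, and the `oracle` function merely stands in for the human expert who would provide labels in practice.

```python
import math


def centroids(labeled):
    """Compute the mean feature vector for every label seen so far."""
    sums = {}
    for features, label in labeled:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], count + 1)
    return {label: [t / n for t in total] for label, (total, n) in sums.items()}


def margin(features, cents):
    """Gap between the two closest centroids; a small gap means an uncertain prediction."""
    distances = sorted(math.dist(features, c) for c in cents.values())
    return distances[1] - distances[0]


def active_learning(labeled, pool, oracle, budget):
    """Repeatedly ask the oracle to label the most uncertain item in the pool."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(min(budget, len(pool))):
        cents = centroids(labeled)
        query = min(pool, key=lambda x: margin(x, cents))
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return labeled, pool


# Hypothetical data: (update frequency, share of consuming systems), both normalized.
seed = [((0.1, 0.9), "static"), ((0.9, 0.2), "volatile")]
pool = [(0.2, 0.8), (0.5, 0.55), (0.8, 0.3), (0.45, 0.5)]


def oracle(x):
    """Stand-in for the manual assessment by a data steward."""
    return "static" if x[1] >= x[0] else "volatile"


result, remaining = active_learning(seed, pool, oracle, budget=2)
print(result)
print(remaining)
```

With a budget of two questions, the loop spends the expert's time on the two borderline items and leaves the clear-cut ones in the pool, which is the point of the approach: manual effort goes where the model is least sure.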

Operational Master Data Management

Finally, there’s the process of actually deciding which data ‘wins’ in the daily processes. Obviously, this shouldn’t be a manual task. But nowadays it shouldn’t be based on hard-coded rules either, as those rules become outdated quickly.

The currently common and quite sophisticated way of property matching is to use a probabilistic method within a closed system: an approach that calculates the probability that two records match from their features. Speaking in statistical terms, this is closest to Naive Bayes algorithms, which are probabilistic classifiers with strong independence assumptions between the features.
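As a rough illustration of such a probabilistic method, the sketch below combines per-field agreement evidence under the Naive Bayes independence assumption. The field names, the `m` and `u` probabilities, and the prior are invented for this example; in practice they would have to be estimated from the organization's own data.

```python
# Hypothetical per-field probabilities:
#   m = P(field agrees | records describe the same entity)
#   u = P(field agrees | records describe different entities)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "postal_code": (0.90, 0.05),
    "phone": (0.80, 0.001),
}


def match_probability(rec_a, rec_b, prior=0.01, params=FIELD_PARAMS):
    """Posterior probability that two records match, treating fields as independent."""
    odds = prior / (1 - prior)
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value gives no evidence either way
        if a == b:
            odds *= m / u  # agreement raises the odds of a match
        else:
            odds *= (1 - m) / (1 - u)  # disagreement lowers them
    return odds / (1 + odds)


a = {"name": "acme gmbh", "postal_code": "80331", "phone": "089123"}
b = {"name": "acme gmbh", "postal_code": "80331", "phone": "089999"}
c = {"name": "globex corp", "postal_code": "10115", "phone": "030555"}

print(round(match_probability(a, a), 4))  # all fields agree
print(round(match_probability(a, b), 4))  # phone differs
print(round(match_probability(a, c), 6))  # nothing agrees
```

The ratio `m / u` expresses how much more likely an agreement is among true matches than among random pairs, so a rarely shared value such as a phone number carries far more weight than a postal code.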

In the future, this may shift towards using various external data sets to allow referential matching. Those data sets can then either enrich or be enriched by internal, proprietary data. Think of it as a comprehensive and continuously updated reference database containing, for example, demographic data, market data, customer and supplier data, or any other data that enhances an internal data pool. All of this serves to develop better data matching.
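One way to picture referential matching: instead of comparing two internal records directly, each record is resolved against the reference database, and two records match if they point to the same reference entry. The sketch below assumes a tiny, entirely hypothetical reference set with made-up IDs and name variants. A real reference database would be far larger and would use fuzzier lookups than exact normalized names.

```python
def normalize(text):
    """Lowercase the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


# Hypothetical external reference set: one stable ID per real-world entity,
# together with the name variants known for it.
REFERENCE = {
    "REF-001": ["Acme GmbH", "ACME Gesellschaft mbH", "Acme"],
    "REF-002": ["Globex Corporation", "Globex Corp."],
}

# Lookup table from normalized name variant to reference ID.
ALIAS_INDEX = {
    normalize(alias): ref_id
    for ref_id, aliases in REFERENCE.items()
    for alias in aliases
}


def resolve(record):
    """Return the reference ID for a record, or None if the name is unknown."""
    return ALIAS_INDEX.get(normalize(record["name"]))


def referential_match(rec_a, rec_b):
    """Two records match if both resolve to the same reference entry."""
    ref_a, ref_b = resolve(rec_a), resolve(rec_b)
    return ref_a is not None and ref_a == ref_b


print(referential_match({"name": "ACME gesellschaft mbh"}, {"name": "acme"}))
print(referential_match({"name": "Acme GmbH"}, {"name": "globex corp"}))
print(referential_match({"name": "Unknown Ltd"}, {"name": "Unknown Ltd"}))
```

Note that two identical but unknown records are not reported as a match here. In this sketch, a match always needs backing from the reference set, which is the design choice that distinguishes it from direct record comparison.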

Conclusion

The combination of MDM and Machine Learning could be the only way to manage the data explosion in legacy data management systems and provide users with easy access to the data they need, breaking the 80/20 dilemma when creating value from data. For that to be achieved, MDM must become ‘self-configuring’, learn from previous interactions, and become aware of ongoing changes.

But just as AI is not the silver bullet for every business problem, Master Data Management is not the silver bullet for enterprise data quality. It manages just one area of an organization’s data universe, even if this might be the most critical one. MDM has to be combined with an overarching data stewardship program to get a handle on the growing demands on data management.

Sources:
https://blogs.gartner.com/andrew_white/2017/02/22/the-role-of-machine-learning-on-master-data-management-mdm/
http://tdan.com/automating-data-management-and-governance-through-machine-learning/23972
http://tdan.com/mdm-needs-data-governance/20748
https://siliconangle.com/2017/05/23/master-data-management-ai-helps-make-old-data-technology-new-infa17/
https://www.infoworld.com/article/3228245/the-80-20-data-science-dilemma.html
https://medium.com/appanion/a-five-minute-guide-to-artificial-intelligence-c4262be85fd3