Master Data: the noun in Big Data sentences

Rohin Bhargava
Information Management
4 min read · Jan 14, 2014

You might be thinking, “No, not another post on Big Data.” Well, Big Data is the buzzword in the industry right now. All you need to do is whisper it and you will see a “Mockingjay” effect all around you: social networking sites like Twitter and LinkedIn, and every event in the technology space, have discussions going on around Big Data. So why not here?

Given my expertise in Data Governance, Data Quality and Master Data Management, I am often asked for my views on the Big Data phenomenon and how I see the Data Management space changing in response to it. My idea is very simple: if you want to make Big Data useful to your organisation, you need to link it to Master Data.

Let me begin by clarifying the two terms we are discussing here — Big Data and Master Data.

Big Data, as I understand it, is a collection of data that is large and complex, that may be structured, semi-structured or unstructured, and that is beyond the reach of conventional Information Management tools. The kind of data we are talking about includes unstructured data generated through the likes of Twitter, Facebook, LinkedIn and YouTube; data generated by devices, sensors and systems; and semi-structured data such as documents.

Master Data is the consistent and uniform set of identifiers, extended attributes and relationships that describe the core entities of the enterprise and are used across multiple business processes and supporting systems. Examples include customer, product, employee, territory, supplier and vendor data objects.

So how do you translate my idea into practice? Big Data is complex and huge; Master Data is structured and complex. It is not easy to combine two very complex structures. If we are to establish a link between these two complex data sets, we need to do so through a simple, loosely coupled and easy-to-manage solution.

My solution to the above problem is based on a theory: we need to think of Big Data as “sentences” and Master Data as “nouns”. If we can identify the “nouns” in these “sentences”, we should then be able to identify the “verbs” and in turn derive meaningful information. The verbs should tell us how each “noun” is interacting or behaving, and that analysis can help us gain useful insights.

The problem occurs when a “noun” like ‘John’ occurs multiple times and we have no way of discerning the right ‘John’. So how do we tackle this? The Master Data solution that your organisation may already have could be the answer.

A traditional Master Data solution allows you to identify the various ‘Johns’ and their roles (customer, supplier, employee and so on). It also provides basic information or ids that let you identify them beyond just their name. We now need to extend the traditional Master Data information to include certain BDIs (Big Data Identifiers): attributes that allow you to link the Master Data to the Big Data. Examples are social ids (Twitter handles, Facebook logins, email ids), brand names and aliases for product data, and machine ids for Dark Data.
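To make the idea of BDIs concrete, here is a minimal sketch of a master record extended with a few Big Data Identifiers, plus a lookup index from identifier back to the record. The `MasterRecord` class, the `build_bdi_index` function, the ids and the sample handles are all hypothetical names invented for illustration; a real Master Data product will have its own model.

```python
from dataclasses import dataclass, field


@dataclass
class MasterRecord:
    """A hypothetical master data record extended with BDIs."""
    master_id: str
    name: str
    role: str  # e.g. "customer", "supplier", "employee"
    # BDIs: channel name -> identifier used on that channel
    bdis: dict[str, str] = field(default_factory=dict)


def build_bdi_index(records):
    """Map (channel, identifier) pairs back to their master record."""
    index = {}
    for record in records:
        for channel, identifier in record.bdis.items():
            # Lower-case so that "@JSmith" and "@jsmith" resolve the same way.
            index[(channel, identifier.lower())] = record
    return index


# Two different 'Johns' that a name alone cannot tell apart.
records = [
    MasterRecord("C-001", "John Smith", "customer",
                 {"twitter": "@jsmith", "email": "john.smith@example.com"}),
    MasterRecord("E-042", "John Smith", "employee",
                 {"twitter": "@john_at_work"}),
]

index = build_bdi_index(records)
match = index.get(("twitter", "@jsmith"))
print(match.master_id, match.role)
```

The point of the sketch is that the Twitter handle, not the name, picks out the right ‘John’.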

While loading your Big Data into the repository, it is useful to identify these BDIs so that they can be correlated back to the Master Data later for analysis. The BDIs then serve as dimensions of your analysis and make it actionable. For example, if you can identify the product that is generating the most noise in the market, you should be able to filter that noise into positive and negative feedback and create an action plan accordingly. If you can add the customer dimension to the same analysis, you may be able to filter the noise further and get a better, deeper perspective on what your ‘real’ customers think about your product.
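A rough sketch of tagging at load time might look like the following. The alias table, the handle table, the `tag_post` function and the sample posts are all assumptions made up for this example, and the substring matching is deliberately naive; a real pipeline would need proper entity matching.

```python
from collections import Counter

# Hypothetical BDI lookups exported from the master data repository.
PRODUCT_ALIASES = {"acme phone": "P-100", "acmephone": "P-100", "acme tab": "P-200"}
CUSTOMER_HANDLES = {"@jsmith": "C-001"}


def tag_post(post):
    """Attach master data ids to a raw post while it is being loaded."""
    text = post["text"].lower()
    product_ids = sorted({pid for alias, pid in PRODUCT_ALIASES.items() if alias in text})
    customer_id = CUSTOMER_HANDLES.get(post["author"].lower())  # None if unknown
    return {**post, "product_ids": product_ids, "customer_id": customer_id}


posts = [
    {"author": "@jsmith", "text": "My Acme Phone battery died again"},
    {"author": "@random", "text": "AcmePhone looks great"},
    {"author": "@other", "text": "Thinking about the Acme Tab"},
]
tagged = [tag_post(p) for p in posts]

# Product dimension: which product generates the most noise?
mentions = Counter(pid for p in tagged for pid in p["product_ids"])

# Customer dimension: the same count, restricted to known customers.
customer_mentions = Counter(
    pid for p in tagged if p["customer_id"] for pid in p["product_ids"]
)

print(mentions.most_common(1))
print(customer_mentions)
```

Because the ids are attached once during loading, later analysis can filter on them without re-parsing the raw text.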

Another way to handle the challenge is to create a “universal id” for all our ‘nouns’ across data domains and to tag our Big Data with these ids while loading it into the Big Data repository. We can then use the ids for further analysis and aggregation of our results, and to slice and dice the Big Data and extract meaningful analysis.
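One possible way to sketch a universal id is to derive it deterministically from the data domain and the local master id, so that every system computes the same value without a central lookup. The namespace string, the `universal_id` function and the sample events below are assumptions for illustration, not a prescribed scheme.

```python
import uuid
from collections import defaultdict

# Hypothetical namespace for this organisation's master data.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "masterdata.example.com")


def universal_id(domain, local_id):
    """Derive a stable id from the data domain and the domain-local id."""
    return str(uuid.uuid5(NAMESPACE, f"{domain}:{local_id}"))


# Events from different sources, tagged with the universal id at load time.
events = [
    {"domain": "customer", "local_id": "C-001", "event": "tweet"},
    {"domain": "product", "local_id": "P-100", "event": "sensor_reading"},
    {"domain": "customer", "local_id": "C-001", "event": "support_call"},
]

by_uid = defaultdict(list)
for e in events:
    by_uid[universal_id(e["domain"], e["local_id"])].append(e["event"])

for uid, names in by_uid.items():
    print(uid, names)
```

A name-based UUID is used here only because it is repeatable; a sequence or a registry-issued key would serve the same purpose.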

Sceptics of the above approach may raise issues around privacy and the identification of people. In such scenarios we can fall back on group-based analysis. A Master Data repository allows us to create groups based on various parameters, which could be demographic, behavioural or based on preferences. We can group the master data by whichever parameters help our decision-making: highest-selling products, geography, gender, profitability and so on. The same group definitions can then be used to slice and dice the Big Data set and make useful inferences.

The methods above are a starting point; as we analyse, assimilate and mature, we should be able to build upon these links.
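As a small illustration of group-based analysis, the sketch below aggregates product mentions by customer group rather than by individual. The group mapping, the `mentions_by_group` function and the sample data are hypothetical; the assumption is that the posts were already tagged with master data ids during loading.

```python
from collections import Counter

# Hypothetical group definition taken from the master data repository.
CUSTOMER_GROUP = {"C-001": "EMEA", "C-002": "EMEA", "C-003": "APAC"}


def mentions_by_group(tagged_posts, group_of):
    """Count product mentions per (group, product), dropping individual ids."""
    counts = Counter()
    for post in tagged_posts:
        group = group_of.get(post["customer_id"])
        if group is not None:  # skip posts with no known group
            counts[(group, post["product_id"])] += 1
    return counts


tagged_posts = [
    {"customer_id": "C-001", "product_id": "P-100"},
    {"customer_id": "C-002", "product_id": "P-100"},
    {"customer_id": "C-003", "product_id": "P-200"},
    {"customer_id": None, "product_id": "P-100"},
]

group_counts = mentions_by_group(tagged_posts, CUSTOMER_GROUP)
print(group_counts)
```

The output carries only group and product, so the inference is made at group level and no individual appears in the result.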

