Discovery vs. Production with Big Data

by Diego Klabjan

Opex Analytics
The Opex Analytics Blog
Mar 18, 2016

It has been noted that big data technologies are used predominantly for ETL and data discovery. ETL has been around for decades and is well understood, with a mature market. Data discovery is much newer and less well understood. Wikipedia’s definition reads:

“Data discovery is a business intelligence architecture aimed at interactive reports and explorable data from multiple sources.”

Hadoop-based data lakes are springing up at many companies, mainly for the purpose of data discovery across multiple explorable sources. It is easy to dump files from all over the place into a data lake, so the “multiple sources” requirement in the definition is met. But what about the part on “interactive reports”? The verb “discover,” as defined by a dictionary, means “to learn of, to gain sight or knowledge of,” which sounds quite disconnected from interactive reports.

Indeed, in business, data discovery is much more closely aligned with the dictionary definition than with Wikipedia’s. As used with big data and data lakes, data discovery really means “to gain knowledge of data, in order to ultimately derive business value, by using explorable data from multiple sources.”

The vast majority of big data applications conduct data discovery in the sense of learning from the data. The knowledge gained does not by itself provide business value, so such insights are operationalized separately in more established architectures (e.g. EDW, RDBMS, BI). A good example is customer behavior derived from many data sources (e.g. transactional data, social media data, credit performance). This clearly calls for data discovery in a data lake, with the resulting insights written into a relational database and put into production through other systems used in marketing or pricing.
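The customer behavior example can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the in-memory lists stand in for files in a data lake, SQLite stands in for the relational database, and every record, field name, function name, and table name below is hypothetical rather than taken from any real system.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw records, one list per source, standing in for data-lake files.
transactions = [
    {"customer_id": "c1", "amount": 120.0},
    {"customer_id": "c1", "amount": 80.0},
    {"customer_id": "c2", "amount": 15.0},
]
social_mentions = [
    {"customer_id": "c1", "sentiment": 0.5},
    {"customer_id": "c2", "sentiment": -0.5},
    {"customer_id": "c2", "sentiment": -1.0},
]
credit_records = [
    {"customer_id": "c1", "late_payments": 0},
    {"customer_id": "c2", "late_payments": 3},
]


def discover_customer_insights(transactions, social_mentions, credit_records):
    """Discovery step: combine the sources into one summary row per customer."""
    spend = defaultdict(float)
    for row in transactions:
        spend[row["customer_id"]] += row["amount"]

    sentiments = defaultdict(list)
    for row in social_mentions:
        sentiments[row["customer_id"]].append(row["sentiment"])

    late = {row["customer_id"]: row["late_payments"] for row in credit_records}

    insights = []
    for customer_id in sorted(spend):
        scores = sentiments.get(customer_id, [])
        avg_sentiment = sum(scores) / len(scores) if scores else 0.0
        insights.append(
            (customer_id, spend[customer_id], avg_sentiment, late.get(customer_id, 0))
        )
    return insights


def publish_insights(conn, insights):
    """Operationalization step: write the summaries to a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_insights ("
        "customer_id TEXT PRIMARY KEY, total_spend REAL, "
        "avg_sentiment REAL, late_payments INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO customer_insights VALUES (?, ?, ?, ?)", insights
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # SQLite is an assumed stand-in for the RDBMS
insights = discover_customer_insights(transactions, social_mentions, credit_records)
publish_insights(conn, insights)

# A downstream marketing or pricing system would read only this table.
rows = conn.execute(
    "SELECT customer_id, total_spend FROM customer_insights ORDER BY customer_id"
).fetchall()
print(rows)
```

The split into two functions mirrors the separation described above: the discovery step touches the raw sources, while the production systems see only the small relational table of derived insights.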

Outside of ETL, very few big data solutions are actually used in production. Large companies directly connected with the web do successfully deploy big data technologies in production (e.g. Google for page ranking, Facebook for friend recommendations), but outside of this industry, big data solutions in production are rarely observed.

It is evident that today big data is used predominantly for data discovery and not in production. I suspect that as the technologies mature further and become more self-service, the boundary will gradually shift towards production (assuming that business value can be derived from such opportunities). For now, the Wikipedia definition’s emphasis on interactive reports is mostly an illusion, and it is better to stick with the plain English definition.
