Become the Data Superhero

Al Martin
IBM Data Science in Practice
6 min read · Oct 12, 2018


What lies beneath…

Data is probably the world’s most valuable natural resource. Oil comes in two forms: surface oil (think tar pits) and hidden oil, deep within the earth, that takes effort to find, extract, and refine. Like oil, 20% of data sits on the surface and is easily searchable, while 80% is hidden and difficult to access.¹

However, unlike oil:

  1. the same data can be refined, merged and reused many times over — and
  2. data is growing exponentially whereas oil is a finite resource

What if you could easily find and access all the discovered and undiscovered oil in the world? What if you could easily find and access all the discovered and undiscovered data within your company? In either case, you would become something of a superhero in your organization.

Today, old and new data are generated and commingled from an ever-expanding array of sources: public and private cloud, IoT, web, social, mobile, and big data. The traditional means of finding and operationalizing (making valuable) that data — extracting, transforming, then loading it (ETL) — is analogous to trying to extract oil with a paper map, a toy sand shovel, and a straw. For businesses, identifying and accessing hidden wells of data with these outdated or manual methods is expensive, error prone, not scalable, and ultimately impractical. Worse, there is no single version of truth on which to perform analytics.

Think of the business advantage if you could identify all the sources of data in your company, provide a single view of that data, and continually refine it to reveal insights faster, more accurately, and more often to give the business a competitive edge.

The Right Tools for the Job

Today, finding, leveraging, and managing data across so many disparate sources is difficult, complex, and costly. Organizations handle so much data that it can often take longer to move large data sets than to analyze them. Data virtualization provides a logical, accessible view of data where it already exists, without physically moving or altering it, abstracting away many of the physical complexities of the infrastructure on which it resides. Such a solution should be capable of spanning heterogeneous stores and identifying metadata with data attribute “fidelity”. The right solution should include several key ingredients (a minimal sketch follows the list):

  • Data Management: Configure, administer and monitor data virtualization in all areas
  • Data Access Methods: Structured, semi-structured, and unstructured data, with no limits on where it resides, and faster access than ever before
  • Data Governance: Trusted metadata with discovery, design, modeling, definition and testing
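
To make the pattern concrete, here is a toy sketch in Python of what a single logical view over two heterogeneous stores might look like. Everything in it is hypothetical illustration (the VirtualCatalog class, the table and source names); a real product exposes this through a SQL engine rather than a hand-rolled class.

```python
# A minimal sketch of the data virtualization idea, in plain Python.
# Hypothetical names throughout; not any vendor's actual API.
import csv
import io
import sqlite3

# Source 1: a relational store (SQLite stands in for a warehouse).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
warehouse.executemany("INSERT INTO customers VALUES (?, ?)",
                      [(1, "EMEA"), (2, "APAC")])

# Source 2: a flat file (a CSV string stands in for a data lake object).
orders_csv = "order_id,customer_id,amount\n10,1,250\n11,2,75\n"

class VirtualCatalog:
    """Presents one logical view over sources that are never copied or moved."""
    def __init__(self):
        self.sources = {}

    def register(self, name, reader):
        # 'reader' is a zero-argument callable that fetches rows on demand,
        # so the data stays at its source until a query actually needs it.
        self.sources[name] = reader

    def scan(self, name):
        return self.sources[name]()

catalog = VirtualCatalog()
catalog.register(
    "customers",
    lambda: warehouse.execute("SELECT id, region FROM customers").fetchall())
catalog.register(
    "orders",
    lambda: [(int(r["order_id"]), int(r["customer_id"]), int(r["amount"]))
             for r in csv.DictReader(io.StringIO(orders_csv))])

# A "federated" query: join the two sources at query time, in place.
regions = {cid: region for cid, region in catalog.scan("customers")}
for order_id, cid, amount in catalog.scan("orders"):
    print(order_id, regions[cid], amount)
```

The point is the shape of the access path: the data stays where it lives, and the logical view resolves it on demand.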

ETL can be a time-consuming and rigid process. A highly paid, experienced team has to write the code and rules, which takes time, and the resulting jobs typically run as nightly or weekly batches. ETL pipelines are also not designed to handle every kind of data; they tend to focus on relational data stores.
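
For contrast, here is a toy sketch of the classic nightly batch pattern that paragraph describes, assuming hypothetical table names and a hand-written transformation rule; real pipelines layer scheduling, error handling, and change tracking on top of this.

```python
# A minimal sketch of a nightly ETL batch; tables and rules are hypothetical.
import sqlite3

source = sqlite3.connect(":memory:")   # stands in for an operational system
target = sqlite3.connect(":memory:")   # stands in for the analytics warehouse

source.execute("CREATE TABLE sales (id INTEGER, amount_cents INTEGER)")
source.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 1999), (2, 4500)])
target.execute("CREATE TABLE sales_clean (id INTEGER, amount_usd REAL)")

def nightly_etl():
    # Extract: pull every row out of the source system.
    rows = source.execute("SELECT id, amount_cents FROM sales").fetchall()
    # Transform: hand-written rules, maintained by the ETL team.
    cleaned = [(row_id, cents / 100.0) for row_id, cents in rows]
    # Load: copy the result into the warehouse, duplicating the data.
    target.executemany("INSERT INTO sales_clean VALUES (?, ?)", cleaned)

nightly_etl()  # in production this would be a scheduled (e.g. cron) batch job
print(target.execute("SELECT * FROM sales_clean").fetchall())
```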

Modern businesses change fast and seek real-time analytics to reveal insights on demand. As mentioned previously, data comes from numerous devices and sources — social media, mobile — structured and unstructured, and from real-time streams that are never stored.

The world as we perceive it can change rapidly and often, due to financial, socio-economic, and political events, natural or man-made disasters, and more. As such, a new, flexible, intelligent model for analytics is needed — one that can identify patterns and trends in existing and new data faster and more accurately than humans alone.

Data Gravity

Today, the generally accepted practice is to move data to the analytics engine or a data lake. However, the value and volume of data that many organizations are trying to manage creates a gravitational pull. Why risk moving or copying it unnecessarily, exposing it to security risks, jeopardizing service level agreements, and struggling to keep the copies current? Instead, it can be more advantageous to move the analytics to the data. No more copying, rearranging, shuffling, or repositioning, no ETL — not to mention the resource savings of not having to do all this. Visualize a powerful system that takes advantage of all that commingled data, regardless of where it is located, and performs analytics on it where it resides today. No more paper maps, toy sand shovels, and straws. This is the high-tech, autonomous refinery using GPS to find all the oil wells — the data stores — in the world at once.
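
A toy illustration of moving the analytics to the data, in Python: rather than shipping every row to the analytics side, the aggregation is pushed down to the source so only the small result crosses the wire. The readings table and its contents are hypothetical.

```python
# "Analytics to the data" in miniature: push the computation down to the
# source instead of copying rows out. The table here is hypothetical.
import sqlite3

source = sqlite3.connect(":memory:")  # stands in for a remote data store
source.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
source.executemany("INSERT INTO readings VALUES (?, ?)",
                   [("a", 1.0), ("a", 3.0), ("b", 2.0)])

# Data-to-analytics (the old way): ship every row, then compute locally.
rows = source.execute("SELECT sensor, value FROM readings").fetchall()
local_avg = sum(v for _, v in rows) / len(rows)

# Analytics-to-data (the data gravity way): the source computes the
# aggregate, and only the tiny result set travels back.
(pushed_avg,) = source.execute("SELECT AVG(value) FROM readings").fetchone()

assert local_avg == pushed_avg  # same answer, a fraction of the movement
print(pushed_avg)
```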

Giving Time Back

Yes, ETL is trusted and well understood, but it is probably not sustainable given ever-increasing data volumes and the demand for faster, more accurate, and more frequent insights. All of this affects the time, resources, and money you need to invest in your business.

Reducing or eliminating some of the scripts, policies, and people working ETL tasks every night and weekend could liberate those experts for higher-value work.

Again, you might be perceived as a hero by at least giving them their weekends back.

Rethink Your Data Platform and Next Steps

Let me put all of the above in perspective. It’s time to rethink the physical platform (see figure 1) and reframe it as a virtual environment spanning on-premises systems, cloud, Hadoop, appliances, and more. Data virtualization can help provide “one source of the truth” and a single logical view across the entire enterprise. And because data virtualization technology from IBM can leverage many data sources, you can keep your investment in your local repositories. Don’t misunderstand: while everything could be virtualized, my preferred approach is to keep a local corpus of critical business data and virtualize subsets of information alongside it.

So, if you’ve read this far, consider IBM’s capabilities to help provide the following (a sketch of what this looks like from application code follows the figure):

  1. A single data view with data virtualization across a wide data landscape
  2. A common SQL engine for a consistent metastore and common mapping
  3. Near real-time AI-based analytics to speed analysis and decision making

Figure 1: Rethinking your Data Platform
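
As a flavor of what that single view looks like from application code, here is a hedged sketch using ibm_db, IBM’s Python driver for the Db2 family. The hostname, credentials, and the VIRTUAL.SALES_VIEW name are placeholders, and the snippet will not run without a reachable endpoint; the point is that one SQL statement addresses one logical view, wherever the rows actually live.

```python
# A sketch of querying one logical view through a common SQL engine,
# assuming IBM's ibm_db Python driver and a Db2-family endpoint.
# Hostname, credentials, and VIRTUAL.SALES_VIEW are placeholders.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=db2.example.com;PORT=50000;"
    "PROTOCOL=TCPIP;UID=user;PWD=secret", "", "")

# One SQL statement against the virtualized view; the engine resolves
# which underlying stores actually hold the rows.
stmt = ibm_db.exec_immediate(
    conn,
    "SELECT region, SUM(amount) AS total "
    "FROM VIRTUAL.SALES_VIEW GROUP BY region")

row = ibm_db.fetch_assoc(stmt)   # rows come back as dicts keyed by column
while row:
    print(row["REGION"], row["TOTAL"])
    row = ibm_db.fetch_assoc(stmt)

ibm_db.close(conn)
```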

IBM is ready to help you start your enterprise data virtualization journey at no charge with Db2 Developer Community Edition². IBM also offers a Hadoop-based query engine, Big SQL, with data virtualization compatibility across Oracle, Netezza, PostgreSQL, and NoSQL sources. Finally, for those of you blazing the IoT trail — that’s covered too. IBM’s bring-analytics-to-the-data approach helps enable a massively distributed computational constellation of endpoint devices, leveraging their collective compute and storage in parallel and bringing back only the required results, at speed.

It just might be the solution that helps you take the first step toward becoming a superhero in your business.


Al Martin, Data Man Extraordinaire

Follow me on Twitter at @amartin_v, or listen to Making Data Simple, the podcast series I host on Analytics Insights, available on iTunes or at the IBM Machine Learning Hub.

  1. “Twenty percent of the world’s data is searchable. Anybody can get to that 20,” Rometty told “Mad Money” host Jim Cramer on Tuesday. “But 80 percent of the world’s data, which is where I think the real gold is, whether it’s decades of underwriting, pricing, customer experience, risk in loans — that is all with our clients. You don’t want to share it. That is gold.” https://www.cnbc.com/2017/06/20/ibm-ceo-says-80-percent-of-the-worlds-data-is-where-the-real-gold-is.html
  2. https://www.ibm.com/us-en/marketplace/ibm-db2-direct-and-developer-editions/purchase
