Don’t throw more data at the problem! Here’s how to unlock true value from your data lake

August 15, 2017 | Written by: Jay Limburn

Just recently in the UK, we’ve seen the dangers of making decisions based on incomplete or poor data play out on the world stage. The Prime Minister called a general election three years earlier than she needed to, basing her decision on data that showed that it would allow her to win a bigger majority in parliament. Evidently, the data her team used was lacking: her party lost its overall majority, and the UK ended up with a hung parliament.

So, what had the Prime Minister’s team missed? The election saw a higher turnout of voters under age 35 than previous elections[1] — a demographic that her policies had failed to win over. The result was a bad decision based on incomplete data.

We may not all have the fates of nations in our hands, but the lesson is one from which we can all learn. Companies grapple with a version of this same challenge every day, when they try to make important strategic decisions based on data that may be incomplete, inconsistent, inaccurate, or out-of-date.

To lessen the likelihood of bad decisions, many companies have invested in extending their data lakes: the idea is that the more data you have, the less likely you are to miss something important. But throwing more data at the problem isn’t always enough to protect you from poor choices. Having too much information can prevent you from seeing the forest for the trees — particularly if that information is poorly organized or difficult to find.

Disillusioned with big data? You’re not the only one

It’s a familiar story: companies respond to the hype around big data by building huge data lakes, but then find they don’t deliver the expected value. The data is there, but knowledge workers can’t easily access it, and therefore can’t work effectively. Moreover, the company is now paying for new systems to house all this data, and needs to find highly skilled data scientists and engineers to maintain them. What’s gone wrong?

One common issue is cultural: despite having the technical infrastructure in place, different departments are often reluctant to share their data. I discussed this challenge in my recent blog post, Data governance — You could be looking at it all wrong. In essence, data owners need confidence that the data they share will be accessed, used and protected appropriately.

A lack of effective data governance within data lakes prevents users from trusting the system, so they hoard their data instead. As a result, its value is lost to the rest of the company. Even if users are persuaded to share their data, it can be difficult to decide (a) how to share it and (b) what kind of data cleansing needs to happen before it is safe for others to use. Answering these questions may require yet another large IT investment.

The other major challenge is findability of data. This issue is often exacerbated when companies treat their data lake as a dumping ground for assets, rather than a well-organized and actively managed archive. In these circumstances, it is difficult for users to find or understand assets within the data lake, and when they do find them, the assets are often of questionable quality and unknown provenance. Again, this discourages data sharing and reuse.

The problem is widespread: it was recently reported that data scientists, business analysts and other knowledge workers estimate that they spend 80 percent of their time searching for, cleaning and organizing data, and only 20 percent actually analyzing it.[2] But what if there were a way to resolve the challenges around both data governance and findability of data in a single move?

Enter IBM Data Catalog

Built on Watson Data Platform, IBM Data Catalog is IBM’s next-generation, cloud-based enterprise data catalog. It promises to provide a central solution where users can catalog, govern and discover information assets. It is designed to slash the time spent searching for data and hesitating over whether to share it, so that you can focus on extracting business value from your data assets.

With Data Catalog, you will be able to index the assets already in your data lake, and then extend your strategy to include data from other sources too. For example, you can take advantage of the built-in governance and control functions to safely ingest enterprise assets that you were previously unable to move to the data lake due to complexity or ownership issues. Data hosted by shadow IT teams or SaaS providers, open datasets, data from social media or sensor feeds, local spreadsheets and other dark data, and so on — Data Catalog will help you liberate the value from all of these sources.

Beyond the advantages of uniting all your assets in a single, governed catalog, Data Catalog will also offer:

Self-service capabilities: With its intelligent catalog capabilities, Data Catalog will provide users with true self-service access to all the assets they are authorized to see. Its advanced search features will also help users home in on the data that is most relevant to them, improving productivity.

Driving culture change: With Data Catalog, every user becomes a data custodian. By making the process of cataloging simple, and automating the enforcement of governance policies, it will encourage users to share data. They can also curate and comment on assets, which makes the data easier for other users to find in the future. These factors drive a culture change towards data-centricity, creating a virtuous circle that continuously improves data governance over time.

Uncovering insights: By providing a space where users can bring different datasets together and work with them in new ways, Data Catalog will help knowledge workers get deeper, more accurate and more nuanced answers to their questions, sooner.

Integration with other solutions: Data Catalog will integrate with IBM Data Connect through the fabric of Watson Data Platform, making it easy for users to access physical data and move it into shared sandboxes or other workspaces for further manipulation or analysis. It is also integrated with IBM Data Science Experience, giving users access to a set of powerful data science tools they can use to explore new datasets and enhance their analysis.
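
To make the self-service and curation ideas above more concrete, here is a minimal sketch of a governed catalog in Python. It is purely illustrative and is not the Data Catalog API: the Asset and ToyCatalog classes, the allowed_groups policy field and the sample asset names are all assumptions made up for this example. It shows two behaviors described above: search results are limited to assets a user is authorized to see, and tags added by users make assets easier for others to find.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Asset:
    """A catalog entry: metadata about a dataset, not the data itself."""
    name: str
    owner: str
    allowed_groups: set[str]  # hypothetical governance policy
    tags: set[str] = field(default_factory=set)


class ToyCatalog:
    """Hypothetical in-memory catalog; not a real product interface."""

    def __init__(self) -> None:
        self._assets: dict[str, Asset] = {}

    def register(self, asset: Asset) -> None:
        self._assets[asset.name] = asset

    def tag(self, name: str, tag: str) -> None:
        # Curation: any user-supplied tag makes the asset easier to find.
        self._assets[name].tags.add(tag.lower())

    def search(self, term: str, user_groups: set[str]) -> list[str]:
        term = term.lower()
        hits = []
        for asset in self._assets.values():
            # Governance is enforced automatically on every search.
            if not (asset.allowed_groups & user_groups):
                continue
            if term in asset.name.lower() or term in asset.tags:
                hits.append(asset.name)
        return sorted(hits)


catalog = ToyCatalog()
catalog.register(Asset("sales_2017", "finance", {"finance", "analysts"}))
catalog.register(Asset("payroll", "hr", {"hr"}))
catalog.tag("sales_2017", "revenue")
catalog.tag("payroll", "revenue")

analyst_view = catalog.search("revenue", {"analysts"})
hr_view = catalog.search("revenue", {"hr"})
```

In this sketch, an analyst and an HR user searching for the same tag see different results, because the policy check runs before any matching. A real catalog would of course need persistent storage, richer policies and auditing.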

The lure of the cloud

A few years ago, it was common to hear people say they would never move data outside their company’s firewalls. However, times are changing. Recent high-profile cyber attacks have demonstrated that keeping data on-premises may be no safer than storing it in the cloud. In fact, there’s even an argument that specialized cloud service providers may be able to take advantage of economies of scale to invest in better security capabilities than most traditional companies can afford in-house. As a result, many organizations are now considering moving at least some of their data into the cloud.

If your organization is among them, creating a metadata index of your data with Data Catalog will be an ideal starting point. You won’t actually have to move your data to the cloud, only your metadata. In the process, you can get comfortable with cloud solutions, and start to foster support within your organization. As you gain confidence, Data Catalog will also help you assess which of your data assets naturally gravitate towards cloud platforms, and how best to prioritize the next steps in your cloud strategy.
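
As a rough illustration of what a metadata-only index can look like, the Python sketch below scans a folder of CSV files and records only descriptive information about each one: file name, location, size and column names. It is a simplified sketch under stated assumptions, not how Data Catalog works internally; the function names describe_csv and build_metadata_index and the sample file are hypothetical. The point it makes is that the serialized index, which is the only thing that would leave your premises, contains no row-level data.

```python
import csv
import json
import tempfile
from pathlib import Path


def describe_csv(path: Path) -> dict:
    """Collect metadata for one CSV file; only the header row is read."""
    with path.open(newline="") as handle:
        header = next(csv.reader(handle), [])
    return {
        "name": path.name,
        "location": str(path),  # the data itself stays where it is
        "size_bytes": path.stat().st_size,
        "columns": header,
    }


def build_metadata_index(root: Path) -> list[dict]:
    """Index every CSV file under root (hypothetical helper)."""
    return [describe_csv(p) for p in sorted(root.rglob("*.csv"))]


# Demonstration with a throwaway file standing in for on-premises data.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "customers.csv"
    sample.write_text("id,name,country\n1,Ada Lovelace,UK\n")
    index = build_metadata_index(Path(tmp))
    payload = json.dumps(index)  # what would be sent to a cloud catalog
```

Here payload describes the customers.csv file and its columns, but the customer record in the file never appears in it.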

If we’ve piqued your interest, learn more about Data Catalog today.

[1] Source: How Britain voted at the 2017 general election (YouGov)
[2] Source: 2016 Data Science Report (CrowdFlower)

Originally published on August 15, 2017.