The Future of Data Lakes

Al Martin
IBM Data Science in Practice
5 min read · Mar 13, 2020

As organizations wrestle with the storage, management and processing of ever-growing data volumes, many have embraced the concept of the "data lake". My mental image of a lake conjures up a tranquil reservoir of pristine water, surrounded by beautiful scenery under blue skies, delivering clean, well-regulated, uncontaminated water to satisfied consumers on demand. Despite all the initial good intentions, the reality can sometimes look more like a stagnant, abandoned stretch of polluted water surrounded by chimney stacks and grey skies, one that not even the most desperate forms of wildlife would drink from.

Data Lakes: The Logical and Physical Views

The logical view of a data lake is a secure, well-managed, controlled single point of access to all of an organization's data. Physically, though, it may consist of many geographically dispersed sources of data, inside and beyond an organization's firewalls, with varying levels of data completeness and quality. This sprawl happens by osmosis (ironic, given the water analogy) through the natural evolution of an organization as it grows, shrinks, acquires or is acquired. The many sources of data, and the applications that consume them, were probably never designed to be shared. Some sources of data are simply too sensitive to be shared or too large to be moved. The days of centralizing all data in one place are long gone; it is just not practical, given time, cost, security, service level agreements and other factors.

Data might be considered the world's most powerful asset. It could become a key factor in separating the "haves" from the "have nots", impacting both society and business. If organizations don't know how to leverage their data, they will simply be outsmarted by those that do.

The AI Revolution: Learning More About Your Data

Machine learning and artificial intelligence are very prevalent in the media. Implemented correctly, they can help organizations better understand their data: from its discovery, completeness and quality, through patterns in usage and deeper insights, to predicting outcomes with degrees of confidence and taking prescriptive actions faster and more accurately than humans alone. All of this can significantly augment the consumers of that data, whether they are end users of an application or service or advanced data scientists and data architects. The sum of the parts (human and AI together) is greater than the individual parts on their own.
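
To make two of these ideas concrete, here is a minimal, hedged sketch using pandas and scikit-learn (generic open-source libraries, not any specific IBM tooling). The customer records and column names are invented for illustration: the snippet profiles a dataset for completeness and then predicts an outcome with an associated degree of confidence.

```python
# A hedged illustration: profile data completeness, then predict with confidence.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical customer records; in practice these would be read from the lake.
df = pd.DataFrame({
    "tenure_months": [3, 24, 12, 48, 6, 36, 1, 60],
    "monthly_spend": [20.0, 75.5, None, 99.9, 15.0, 60.0, 10.0, 120.0],
    "churned":       [1, 0, 1, 0, 1, 0, 1, 0],
})

# 1. Completeness and quality: what fraction of each column is populated?
completeness = 1.0 - df.isna().mean()
print(completeness)  # monthly_spend comes back below 1.0, flagging the missing value

# 2. Predicting an outcome with a degree of confidence.
features = df[["tenure_months", "monthly_spend"]].fillna(df["monthly_spend"].median())
model = LogisticRegression().fit(features, df["churned"])
print(model.predict_proba(features))  # class probabilities, i.e. a confidence per prediction
```

None of this replaces human judgement; the point is that such checks and predictions can run continuously and at scale, augmenting the people who consume the data.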

Skills, Expertise, Leaders

It's not just about the technology, though. I believe businesses are looking for a trusted advisor to guide them through their own personalized journeys across data lake, cloud and AI projects. One size does not fit all. For example, an organization may need a strategy that supports public, private, hybrid and multi-cloud deployments, as well as ML, AI, deep learning, big data, data lakes, governance, integration and more. An experienced advisor with a proven track record of implementing these kinds of solutions is also invaluable.

Ingredients for Successful Data Lake Implementations

From my own client interactions, I see three fundamental challenges that organizations face as they seek to leverage and embed data as a critical asset:

  1. Efficiently housing an endless corpus of data across its various stages of maturity
  2. Secure, authorized, instant retrieval of trusted data assets
  3. Achieving smarter business outcomes by leveraging AI

For me, this translates into the following capabilities:

· Hybrid cloud capabilities (compute, storage and location agnostic)

· Data virtualization across all data types (bring analytics to the data; see the sketch after this list)

· Knowledge catalog (trusted and secure data with one source of the truth)

· Data replication for 24x7 availability (with all-in-one data movement)

· ML / AI tools and runtime (infusing AI)

· Data and AI assistants (automate complex data and AI related tasks)

· Built-in governance, top to bottom

· A console for visualizing all of these elements and making them simpler to manage and administer
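
As a rough, generic illustration of the data virtualization bullet above, the sketch below uses plain PySpark rather than any particular product. The bucket, JDBC URL, credentials and table names are invented, and in practice the relevant S3 and JDBC connectors would have to be available to Spark. The point is that one query spans data left in object storage and data left in a relational database, without copying either into a central store first.

```python
# A hedged sketch of "bring analytics to the data": query two sources in place.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("virtualization-sketch").getOrCreate()

# Hypothetical order data already sitting as Parquet in object storage.
orders = spark.read.parquet("s3a://example-bucket/sales/orders/")
orders.createOrReplaceTempView("orders")

# Hypothetical customer table left in an existing relational database.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db.example.com:5432/crm")
             .option("dbtable", "public.customers")
             .option("user", "report_user")
             .option("password", "*****")
             .load())
customers.createOrReplaceTempView("customers")

# One query spans both sources; neither dataset is moved into a central warehouse.
spark.sql("""
    SELECT c.region, SUM(o.amount) AS total_sales
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
""").show()
```

A real virtualization layer adds governance, credential management and query pushdown on top of this basic pattern; the sketch only shows the shape of the idea.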

A Vision for Delivering the Next Generation of Data Lake

As described above, a fine line exists between having a pristine data lake and ending up with a contaminated data swamp. For many clients, lakes have become far too complex and far too expensive due to the entanglement of compute and storage. A new data lake story is required.

Information architecture needs to be simple, must support any kind of data, be infused with AI and emerging technology, and be within reach of every company. For most people, this is easier said than done.

However, advances in data discovery, data catalogs, data virtualization, controlled replication, integration, all aspects of governance, pervasive security, AI tools and runtimes, cheap storage and compute, and open source can all help deliver on the concept and vision of the data lake described above. So, with all of these capabilities and technologies available, what is preventing organizations from doing this successfully?

Speed to market and agility are key factors.

Assembling all the pieces necessary for the next generation of data lake is non-trivial. It can be time-consuming, and therefore risky and expensive. IBM Cloud Pak for Data is a data and AI platform, delivered as a microservices offering, that pre-integrates many of the capabilities needed to deliver data lake projects supporting many forms of structured and unstructured data and their processing. Cloud Pak for Data is infused with AI, delivering trusted data assets and an information architecture for building a new generation of data lake.

Cloud Pak for Data is a platform designed for customers looking to implement petabyte-scale data lake solutions that can be deployed across public, private, hybrid or multi-cloud environments. It can be used either as a complete data lake (storage, compute, runtime) or as a side-car to an existing one (leaving data where it is, e.g. in HDFS). See figure #1 and the sketch that follows it.

Figure #1: Cloud Pak for Data — establishing a data lake.
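
To show the side-car idea in generic terms, here is a minimal PySpark sketch against a hypothetical HDFS path (this is not the Cloud Pak for Data API): the analytics attach to data that stays in the existing Hadoop-based lake, and nothing is migrated or re-platformed.

```python
# A hedged sketch of the "leave data where it is" (side-car) pattern.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sidecar-sketch").getOrCreate()

# Hypothetical event data that already lives in an existing HDFS-based lake.
events = spark.read.parquet("hdfs://namenode.example.com:8020/lake/events/")
events.createOrReplaceTempView("events")

# The analysis runs against the data in place.
spark.sql("""
    SELECT event_type, COUNT(*) AS occurrences
    FROM events
    GROUP BY event_type
    ORDER BY occurrences DESC
""").show()
```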

IBM hybrid cloud can help provide organizations with a seamless user experience across platforms, abstracting away the complexities of the physical location of data and assets. IBM can also share its wealth of expertise from big data, cloud and AI engagements to help instill confidence as organizations embark on their data lake journey.

Next Steps

In summary, organizations are looking for a simpler, proven approach to implementing a data lake and achieving its benefits. What is needed is an approach that can leverage all of an organization's data without ripping and replacing or rearchitecting existing investments, and without reskilling or retraining staff. It should leverage the latest ML technologies and infuse AI capabilities that result in smarter business outcomes: a platform that can be plugged in and be up and running in hours or days, as opposed to weeks or months.

IBM Cloud Pak for Data could be a smart first step towards a next-generation data lake solution. Click this link for more information.
