What Is a Data Lakehouse?

And do I need it for my organization?

Susan Coleman
Slalom Data & AI
5 min read · Jan 24, 2023



Multiple technology companies — including Microsoft, Databricks, AWS, Snowflake, and others — now provide platforms that enable organizations to build data lakehouses. But what is a data lakehouse and how do you know if it’s right for you? And what are the practical benefits?

To answer these questions, it helps to have some background on the concept of the data lakehouse and how it can strengthen your data management strategy. To get there, though, we first have to dive into the data lake.

Take the plunge into the data lake

The term data lake was coined about a dozen years ago by James Dixon, who at the time was the CTO of a business intelligence company called Pentaho. It was used to illustrate the difference between the way data was commonly being brought into organizations and a new option that could accommodate a broader range of data types.

Data warehouses — which contribute the “house” part of the data lakehouse — are repositories for structured data, which is generally stored in tables with rows and columns. Information you provide in an online survey is an example of structured data. Once you hit the submit button, your information is indexed and added to a data warehouse along with the information from everyone else who responds to the survey. Before it lands in the warehouse, the data goes through extract, transform, and load (ETL) — or extract, load, and transform (ELT) — a process that cleanses and structures it so it can be properly stored and accessed later.
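As a rough illustration — the survey fields and cleansing rules below are invented for the example, not part of any real pipeline — the transform step of an ETL flow might look something like this in Python:

```python
# Minimal ETL sketch: extract raw survey rows, transform them into a
# clean, fixed schema, and load them into a warehouse-style table.
# Field names and cleansing rules here are illustrative assumptions.

raw_responses = [  # extract: rows as they arrive from the survey form
    {"name": "  Ada ", "age": "36", "subscribed": "yes"},
    {"name": "Grace", "age": "unknown", "subscribed": "no"},
]

def transform(row):
    """Cleanse one response so it fits the warehouse schema."""
    return {
        "name": row["name"].strip(),                            # trim whitespace
        "age": int(row["age"]) if row["age"].isdigit() else None,  # coerce or null
        "subscribed": row["subscribed"] == "yes",               # normalize to bool
    }

# load: every row now shares the same columns and types
warehouse_table = [transform(r) for r in raw_responses]
```

In a real warehouse the load step would write to a database table rather than a Python list, but the shape of the work — enforcing one schema on every incoming row — is the same.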

Storing structured data in data warehouses has its advantages, as it ensures the data is in a format that can be easily queried by reporting, data visualization, and other business intelligence tools. But many data types can’t easily be transformed into tables with columns, rows, and a clear schema.

This is where the data lake comes in.

Just as an actual lake doesn’t put any restrictions on how water flows into it during the spring thaw, a data lake allows for a freer flow of data into an organization, without the need for cleansing and transforming. This is especially useful for semi-structured and unstructured data types and sources such as video, audio, sensor readings, text, and social media feeds. Data enters the lake in its natural form, without being subjected to any manipulation to force it into a tabular format. From the data lake, it can be copied or moved to other locations to be cleansed and transformed as necessary for use by analysts and data scientists.
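A toy sketch of the contrast: the “lake” below is just a directory that accepts any payload as-is, and a schema is applied later, on read, only when a consumer needs one. All paths and payloads here are hypothetical:

```python
import json
import pathlib
import tempfile

# Toy sketch: a "lake" is cheap storage that accepts any payload
# in its natural form -- no schema, no transformation at write time.
lake = pathlib.Path(tempfile.mkdtemp()) / "lake"
lake.mkdir()

# Heterogeneous data lands exactly as it arrives.
(lake / "tweet.json").write_text(json.dumps({"text": "hello", "likes": 3}))
(lake / "sensor.csv").write_text("ts,temp\n2023-01-24T09:00,21.5\n")
(lake / "note.txt").write_text("free-form text, no schema at all")

# Schema-on-read: structure is imposed only when someone consumes the data.
tweet = json.loads((lake / "tweet.json").read_text())
```

This write-anything, interpret-later pattern is what makes lakes flexible — and also what makes the downstream copying and cleansing described next necessary.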

But this can lead to its own set of problems, as explained by Databricks Co-founder and CEO, Ali Ghodsi:

“… a lot of organizations end up actually having a coexistence of a data lake where they have all the data for data science, and then subsets of that data get moved into a data warehouse where … it can actually be consumed by BI and reporting.”

It’s also a common practice to move data in the other direction — from warehouses into lakes — which means that there is even more of a chance of data duplication and the potential for inconsistencies in your data.

Structure and flexibility: The data lakehouse

The data lakehouse merges the best aspects of a data warehouse and a data lake into a single architecture, giving you the flexibility to work with any type of data without the risk and complexity involved in copying and moving data around. You can work with the data, regardless of type (structured, semi-structured, or unstructured), directly in the data lake. This is done through a transaction layer that operates between the data in your lake and the tools you use for business intelligence, reporting, data science, machine learning, and other types of analysis.
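Real platforms implement this transaction layer in different ways (Delta Lake, Apache Iceberg, and Apache Hudi are well-known open source examples), but the core idea can be sketched in a few lines: immutable data files plus a log that records which files make up each version of a table. The sketch below is a toy illustration of that idea, not any platform’s actual design:

```python
# Toy sketch of the idea behind a lakehouse transaction layer:
# data files are immutable, and an append-only log records which
# files form the current version of the table. Readers always see
# a consistent snapshot; a failed write is simply never committed.

class ToyTable:
    def __init__(self):
        self.files = {}  # filename -> rows (stands in for files in the lake)
        self.log = []    # ordered commits; each commit lists the live files

    def commit(self, filename, rows):
        """Atomically add a data file: write it, then append to the log."""
        self.files[filename] = rows
        live = (self.log[-1] if self.log else []) + [filename]
        self.log.append(live)

    def snapshot(self, version=-1):
        """Read a consistent view of the table at a given log version."""
        live = self.log[version] if self.log else []
        return [row for f in live for row in self.files[f]]

table = ToyTable()
table.commit("part-000", [{"id": 1}])
table.commit("part-001", [{"id": 2}])
```

Because the log is versioned, older snapshots remain readable (`table.snapshot(0)` returns the table as of the first commit) — the same mechanism that gives real lakehouse formats features like time travel.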

Data lakehouse architecture

What does this mean on a practical level? A data lakehouse can help you rein in the infrastructure costs that get out of hand when you operate separate data silos. Keeping your data in a lakehouse configuration not only allows users to access more data and run multiple workloads in a single place but also expands the ways you can work with the data beyond dashboards and reports, for example with machine learning and data science. This opens up whole new areas of value, whether you’re trying to meet the demands and expectations of your customers or you’re saving lives.

With a data lakehouse, you can keep your data — all of it, whether it was ingested through a form or pulled from a social media feed — in a single platform and work with it where it resides, so you can get to valuable insights faster and more accurately. The sooner you have access to an insight, the sooner you can act on it in ways that benefit your organization.

All organizations rely on data to run their operations, but every organization’s data landscape is unique. So how do you know if a data lakehouse is right for you? Do you:

  • ingest multiple types of data, such as text, images, video, audio, clicks, sensor data, or log files?
  • need to ingest data in real time but are currently limited to batch loads?
  • acquire data from a wide variety of sources, including both internal (first-party) and external (third-party) providers?
  • lack the ability to deploy data science and machine learning in production to support new use cases?
  • maintain separate data lakes and data warehouses, or other siloed configurations that necessitate frequently copying or moving data?
  • require days, weeks, or longer to transform, analyze, and report on the data you collect?
  • organize and cleanse data within spreadsheets or using other manual methods?
  • have issues with duplicate or outdated data negatively impacting your ability to make informed decisions?

If your organization is experiencing these challenges, then you could see significant benefits from a data lakehouse. Mark Kobe, a global data and AI leader at Slalom Consulting, sums up the value of the data lakehouse as follows:

“Organizations are seeking to use data science and machine learning to tackle their most important and urgent challenges. However, they’re often limited by their legacy data platforms and architectures. As they move to the cloud, it’s vital to adopt the lakehouse architecture to enable them to reduce costs, move faster, and unlock new use cases.”

To make adopting lakehouse technology easier, the Slalom data lakehouse accelerator can help you get started with a framework that will set you up for success with an agile, flexible, and future-ready data architecture.

Want to learn more? Read our white paper on Azure Databricks to learn how Slalom and Microsoft can help you bring a modern culture of data to your organization.

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.
