Drowning in Data? Come Up for Air at the Data Lakehouse

A singular architecture for all your data needs

Susan Coleman
Slalom Data & AI
7 min read · Feb 21, 2023


Photo by Kindel Media from Pexels

We’ve all heard the term “big data.” It’s been used for roughly 30 years to describe data sets so large that they can’t be captured and processed quickly and efficiently with conventional tools. Though the practice of gathering and analyzing data to solve problems has been around for centuries, the advent of digitized data in the 20th century set the stage for the vast volumes of data organizations work with today.

But “vast” doesn’t even begin to cover it. Though predictions vary, it’s generally assumed that the total amount of data created and consumed worldwide will reach 160 to 180 zettabytes by 2025. Compare that to 2018, when the global total was roughly 33 zettabytes, and it’s clear that the amount of data we consume will continue to grow exponentially.
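To put that in perspective, here’s a quick back-of-the-envelope calculation in Python (the 175-zettabyte figure is an assumed midpoint of those predictions, used purely for illustration):

```python
# Back-of-the-envelope: the compound annual growth rate (CAGR) implied
# by going from 33 ZB in 2018 to an assumed ~175 ZB in 2025.
data_2018 = 33.0   # zettabytes, global total in 2018
data_2025 = 175.0  # zettabytes, assumed midpoint of the 160-180 ZB range
years = 2025 - 2018

cagr = (data_2025 / data_2018) ** (1 / years) - 1
print(f"Implied annual growth: {cagr:.1%}")  # roughly 27% per year
```

In other words, global data volume has been growing by roughly a quarter every single year.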

For organizations wanting to draw insights from the data that’s constantly flowing in and out, this volume can be overwhelming. And it’s not just the volume, but the variety of sources and types of data we work with today — including sensor, surveillance, survey, and log data; consumer sentiment data from social media feeds and website comments; PDFs and other text-based data; data from point-of-sale, customer relationship, and marketing software; images; and so much more — that’s causing many organizations to feel like they’re drowning in data.

So, how can you keep afloat in this data deluge? What’s needed is a way to ingest, store, and work with as much data — and as many different types and sources of data — as possible without introducing unnecessary complexity into your organization. This is exactly what a data lakehouse delivers. To understand how this works, let’s first look at why managing your organization’s data has become so challenging.

From columns and rows to tweets and feeds

If all the data an organization had to manage could be put into spreadsheets and data warehouses, managing ever-growing volumes of data wouldn’t be so difficult. This type of data, which is generally stored in a tabular format with rows and columns creating a schema, is referred to as structured data. There are clear relationships between the different data points within structured data entries, such as a customer number that’s connected to a billing address and payments, or an expense code in a general ledger that can be used to track all spending against an expense type. Because of its strict schema, querying and analyzing structured data is a largely straightforward process that is vital to organizations looking for current and historical insights into their operations.

Example of structured data in a tabular format
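To make this concrete, here’s a minimal sketch of structured data at work, using Python’s built-in sqlite3 module and two hypothetical tables (the names and values are illustrative). The strict schema is exactly what makes the relationship between a customer number and its orders easy to query:

```python
import sqlite3

# Two hypothetical tables with a strict schema: every row has the same
# columns, and the customer number links the tables together.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, city TEXT)")
con.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, item TEXT, qty INTEGER)")
con.execute("INSERT INTO customers VALUES (1001, 'Jane Smith', 'Seattle')")
con.execute("INSERT INTO orders VALUES (1, 1001, 'pencils', 5)")

# Because the relationships are explicit, querying is straightforward:
query = """
    SELECT c.name, o.item, o.qty
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
"""
for row in con.execute(query):
    print(row)  # ('Jane Smith', 'pencils', 5)
```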

But what about data that doesn’t fit this schema? And what if you need the data to go beyond descriptive and diagnostic analyses to support your predictive and prescriptive needs? For example, say Jane Smith in the table above found that three of the five pencils she ordered had arrived broken. If she posts a negative review of the seller or the product online — whether on the retailer’s website, a personal blog, social media, YouTube, or another outlet — both the retailer and the manufacturer would want to capture this to monitor for trends and possibly make changes to improve the products or services being offered.

More and more organizations are recognizing the benefits of utilizing unstructured data. The potential insights go way beyond retail and customer sentiment to industries and use cases such as manufacturing, healthcare, policing, recruitment, and more. According to MIT Sloan:

“Landlords could better monitor and manage their properties and improve the quality of life for tenants by using information from social media, video cameras and police reports, for example. And governments could use a combination of structured and unstructured data to improve cities and better engage with citizens.” — MIT Sloan, “Tapping the power of unstructured data”

From hindsight to foresight: the value and difficulty of data analytics grow in parallel as you move from descriptive to diagnostic to predictive to prescriptive analytics (based on the Gartner Analytics Ascendancy Model)

The information contained in emails, legal documents, resumes, sensor data, closed-circuit TV feeds, MRI scans, and other forms of unstructured data is essential for developing a holistic view of your project, business, campaign, or operations and better serving your customers, patients, or constituents.

But unstructured data can’t simply be brought into the organization and saved directly to the data warehouse. To be used by business intelligence (BI) tools, such as for reporting, dashboards, or other visualizations, it would first have to be transformed to fit the tabular structure, which often involves cleansing, deduplicating, and formatting the data.
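As a sketch of what that transformation step might look like, the snippet below uses pandas to cleanse, deduplicate, and type-coerce some hypothetical raw review records into a tabular shape a BI tool could consume (the field names and values are illustrative):

```python
import pandas as pd

# Hypothetical raw review records gathered from several channels; note
# the duplicate row, the missing rating, and the inconsistent text.
raw_reviews = [
    {"source": "website", "user": "jane.smith", "text": "  3 of 5 pencils arrived BROKEN ", "rating": "1"},
    {"source": "website", "user": "jane.smith", "text": "  3 of 5 pencils arrived BROKEN ", "rating": "1"},
    {"source": "twitter", "user": "@janes", "text": "Broken pencils again!", "rating": None},
]

df = pd.DataFrame(raw_reviews)
df = df.drop_duplicates()                        # deduplicate
df["rating"] = pd.to_numeric(df["rating"])       # enforce a numeric type
df["text"] = df["text"].str.strip().str.lower()  # normalize formatting
print(df)  # now tabular: one cleansed row per review
```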

But data scientists also need the data in its raw, native format for use with predictive technologies such as machine learning (ML) and artificial intelligence (AI). For many organizations, this has led to storing data in multiple architectures, which only adds to complexity and the risk of duplicate and inaccurate data.

As organizations have begun ingesting more and more unstructured data to gain deeper insights, it’s become clear that the strict structure of the data warehouse can be a hindrance. A different architecture was therefore developed to allow for a freer flow of all kinds of data.

Testing the waters in the data lake

An option for organizations wanting to maintain their unstructured data in its native format is an architecture known as the data lake. Data lakes don’t have a set schema, so there are no restrictions on what types of data you ingest, or what sources your data comes from. Your unstructured data can be stored right alongside your structured data. So, in that one centralized location, you can house your tabular data that can be accessed by BI tools, and your unstructured data that can be used by ML, AI, and other predictive technologies.
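Under the hood, a data lake is usually just object storage with an agreed-upon folder layout. The sketch below uses the local filesystem as a stand-in for cloud storage such as Amazon S3 or Azure Data Lake Storage (the paths and layout are illustrative) to show structured and unstructured data living side by side:

```python
import json
from pathlib import Path

# A local directory standing in for cloud object storage.
lake = Path("/tmp/data-lake")
(lake / "structured/orders").mkdir(parents=True, exist_ok=True)
(lake / "unstructured/reviews").mkdir(parents=True, exist_ok=True)

# Tabular data lands in one zone...
(lake / "structured/orders/2023-02-21.json").write_text(
    json.dumps({"order_id": 1, "customer": "Jane Smith", "qty": 5})
)

# ...while raw, unstructured text lands right alongside it, unmodified.
(lake / "unstructured/reviews/review-001.txt").write_text(
    "3 of 5 pencils arrived broken"
)

print(sorted(str(p.relative_to(lake)) for p in lake.rglob("*") if p.is_file()))
```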

But because data lakes are meant to be open and non-restrictive, they’ve proven difficult to govern. Without limitations on the types and sources of data that can be ingested, some organizations find that they’re stuck with more of a “data swamp,” where data pours into the lake without proper attention to value or reliability. This then necessitates other structures to house data once it’s been vetted and transformed, which again leads to a siloed data landscape and more complexity. These downsides have led some organizations to look for yet another solution for their data management needs.

The data lakehouse: Throwing data management a life preserver

In a recent blog, we discussed the data lakehouse architecture and how it captures the best aspects of data lakes and data warehouses for a more streamlined approach to data management. In brief, lakehouses make use of a transactional layer between your analytics tools and the data residing in your lake. This layer provides the ability to maintain stricter governance over your data than you could achieve when just using a data lake. Your data analysts, engineers, and scientists can conduct queries, enrich specific data sets for ML, and build automated pipelines to extract, transform, and load (ETL) data for downstream analysis — all at the same time, without lifting and shifting data, and without pausing the flow of data coming into your organization.
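For a flavor of what that transactional layer looks like in practice, here’s a minimal sketch using the open-source delta-spark package, with a local path standing in for cloud storage (the table and data are illustrative). Every write below is an ACID transaction recorded in the table’s log, which is what lets readers and writers operate on the same data concurrently:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Start a local Spark session with the open-source Delta Lake extensions.
builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Each append is an ACID transaction on plain files in the lake; readers
# always see a consistent snapshot, even while new data is arriving.
orders = spark.createDataFrame(
    [(1, "Jane Smith", "pencils", 5)],
    ["order_id", "customer", "item", "qty"],
)
orders.write.format("delta").mode("append").save("/tmp/lake/orders")

spark.read.format("delta").load("/tmp/lake/orders").show()
```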

One of the pioneers of the data lakehouse architecture is the company Databricks. Its open-source Delta Lake technology forms the transactional layer that’s the foundation of the lakehouse. Delta Lake can operate on top of any cloud data lake, including Azure Data Lake Storage, and is the default format for tables created on Azure Databricks.

With Azure Databricks, you’re not limited in the types of data you can work with, who is able to work with the data, or what purposes your data will serve. With comprehensive capabilities for storing, cleansing, sharing, processing, analyzing, and modeling your data, for everything from reports and dashboards to machine learning and data science, the potential to extract truly meaningful insights from your data is far greater than with a fragmented, siloed approach.

The Slalom data lakehouse accelerator can help you get started with a framework that will set you up for success with an agile, flexible, and future-ready data architecture.

When it comes to your organization’s efforts to modernize, innovate, and grow, the importance of a solid data strategy can’t be overstated. A recent study conducted by MIT Technology Review Insights and sponsored by Databricks highlighted this point when it noted:

“… respondents emphasize the data challenges they face in the endeavor to embed AI more firmly in their business: 72% say that problems with data are more likely than other factors to jeopardize the achievement of their AI goals between now and 2025.” — MIT Technology Review Insights, “CIO vision 2025: Bridging the gap between BI and AI”

Don’t let data be your downfall! You can get the most out of your data — all your data — with the right strategy and the right technology. If you want to learn more about Azure Databricks and how Slalom can help deliver lakehouse technology that brings calm waters to your data deluge, check out our whitepaper.

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.

Susan Coleman is a content creator and storyteller focusing on tech topics, and Manager of Content for Google & Microsoft at Slalom Consulting.