Decoding the Data Landscape: A Comprehensive Journey from Data Warehouses to the Lakehouse Era

Dains
Dec 11, 2023

Encountering buzzwords like “data lakehouse,” “data lake,” and “Delta Lake” might leave you wondering what actually sets them apart. In this article, we’ll demystify these technologies, exploring their distinctions and how they contrast with the traditional data warehouse.

First Generation Data Analytics Platforms

The genesis of data warehousing aimed to empower business users with analytical insights. This involved collecting data from operational stores and storing it in centralised warehouses using a schema-on-write approach.
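To make schema-on-write concrete, here is a toy sketch using Python’s built-in sqlite3 module as a stand-in for a warehouse (the table name and values are invented for illustration): the schema is declared before any data arrives, and a write that violates it is rejected up front.

```python
import sqlite3

# Schema-on-write: the schema is declared before any data arrives,
# and every write must conform to it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        order_id   INTEGER PRIMARY KEY,
        amount     REAL NOT NULL,
        order_date TEXT NOT NULL
    )
""")

# A conforming row is accepted.
conn.execute("INSERT INTO sales VALUES (1, 99.90, '2023-12-11')")

# A row that violates the schema (missing NOT NULL columns)
# is rejected at write time, before it ever lands in storage.
try:
    conn.execute("INSERT INTO sales (order_id) VALUES (2)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

This write-time enforcement is what kept warehouse data clean and query-ready, but it is also why ingesting loosely structured data was so painful.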

However, this era was not without its limitations. The coupling of compute and storage in on-premises appliances posed challenges, and the systems struggled to keep up with the exponential growth of data. Moreover, they lacked the flexibility to support unstructured data, such as video, audio, and text documents.

Data Lake + Warehouse Architecture

To address these challenges, the second generation took a different approach: offloading raw data into data lakes, cost-effective storage systems equipped with a file API and open file formats such as Apache Parquet. This shift, kickstarted by the Apache Hadoop movement, introduced a schema-on-read architecture, providing agility at low cost.
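As a small illustration of schema-on-read — using in-memory JSON lines rather than real Parquet files, with made-up records — raw data lands in the lake with no upfront schema, and structure is imposed only when a reader interprets it:

```python
import io
import json

# Schema-on-read: raw records land in the "lake" as-is, with no upfront schema.
raw_lake = io.StringIO()
for rec in [
    '{"user": "a", "amount": "12.5", "note": "gift"}',
    '{"user": "b", "amount": "7"}',   # fields can vary per record
    '{"user": "c", "clicks": 3}',     # or have an entirely different shape
]:
    raw_lake.write(rec + "\n")

def read_with_schema(lake_text):
    """The schema exists only on the reader's side: it picks the fields it
    cares about and coerces their types at read time."""
    for line in lake_text.splitlines():
        rec = json.loads(line)
        yield {"user": rec.get("user"), "amount": float(rec.get("amount", 0))}

rows = list(read_with_schema(raw_lake.getvalue()))
```

The agility is obvious — anything can be written — but so is the risk: malformed or drifting data is only discovered when somebody tries to read it.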

In this architecture, a strategic subset of the data in the lake undergoes ETL to feed a downstream data warehouse, such as Teradata. Because data lakes do not support ACID transactions, it is this warehouse that serves the critical decision-support and BI applications.
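A minimal sketch of that ETL step, with sqlite3 playing the warehouse and a Python list playing the lake (all names and figures here are illustrative): only the subset of data that BI needs is transformed and loaded downstream.

```python
import sqlite3

# Toy lake: raw records, some of which the warehouse never needs.
lake = [
    {"user": "a", "amount": 12.5, "region": "EU"},
    {"user": "b", "amount": 7.0,  "region": "US"},
    {"user": "a", "amount": 3.0,  "region": "EU"},
    {"user": "c", "amount": 9.9,  "region": "APAC", "debug": True},
]

# Toy warehouse: a narrow, BI-ready schema.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE revenue_by_region (region TEXT PRIMARY KEY, total REAL)")

# Transform: aggregate only the columns BI needs, dropping the rest.
totals = {}
for rec in lake:
    totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["amount"]

# Load: the aggregated subset lands in the warehouse.
wh.executemany("INSERT INTO revenue_by_region VALUES (?, ?)", totals.items())
```

Every such pipeline is one more copy of the data to keep consistent — which is exactly where the problems below come from.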

Later, cloud object stores such as S3, ADLS, and GCS gradually replaced HDFS as the data lake layer, offering better durability, geo-replication, and remarkably low costs. In the cloud, the architecture remained largely the same, except that cloud data warehouses like Redshift and Snowflake took over the warehouse tier. The result was a blend of cost-effectiveness, scalability, and accessibility.

While the cloud data lake and warehouse architecture, with its separate storage and compute, seemed ostensibly cost-effective, the two-tier design introduced some pressing problems, such as:

  1. Reliability: Sustaining consistency between the data lake and warehouse proved challenging and costly.
  2. Data Staleness: The lag between new data arriving in the data lake and its integration into the warehouse resulted in outdated information, hampering real-time decision-making.
  3. Limited Support for Advanced Analytics: Machine learning systems, which need fast, direct access to large volumes of data, struggled to integrate with warehouses built around SQL interfaces.
  4. Total Cost of Ownership: Continuous ETL processes and data duplication in warehouses contributed to increased costs.

Data Lakehouse

To address the constraints inherent in the second-generation two-tiered cloud architecture, researchers from Databricks, UC Berkeley, and Stanford University collaborated on a study. They sought solutions by posing the question: Can data lakes, relying on standard open data formats like Parquet, be transformed into high-performance systems capable of delivering both the performance and management features of data warehouses, while also facilitating fast, direct I/O for advanced analytics workloads? This inquiry led to the inception of the Lakehouse architecture with Delta Lake, an open-source storage framework at its core.

The design of a Lakehouse rests on a strategic approach to data storage and management. The first key idea is storing data in a low-cost object store, like Amazon S3, using standard file formats such as Apache Parquet. On top of this object store sits a transactional metadata layer, which adds management features such as ACID transactions and versioning. The Delta Lake storage format is a prime example of such a system: as open-source software, it augments Parquet data files with a file-based transaction log, enabling ACID transactions and scalable metadata handling that overcome the limitations of a plain data lake.
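To give a feel for how a file-based transaction log can provide these guarantees, here is a miniature sketch in the spirit of Delta Lake’s _delta_log directory — greatly simplified and not the actual Delta protocol. Each commit is a numbered file of add/remove actions, a version number is claimed atomically via exclusive file creation, and the table state is rebuilt by replaying the commits in order.

```python
import json
import os
import tempfile

# Toy transaction log: one numbered JSON file per commit.
log_dir = tempfile.mkdtemp()

def commit(version, actions):
    """Write a commit file for `version`. Open mode 'x' fails if the file
    already exists, so two concurrent writers cannot both claim the same
    version number (a simple form of optimistic concurrency control)."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def snapshot():
    """Reconstruct the set of live data files by replaying commits in order."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if action["op"] == "add":
                    files.add(action["file"])
                elif action["op"] == "remove":
                    files.discard(action["file"])
    return files

commit(0, [{"op": "add", "file": "part-000.parquet"}])
commit(1, [{"op": "add", "file": "part-001.parquet"},
           {"op": "remove", "file": "part-000.parquet"}])  # e.g. a compaction
```

Because readers only ever see fully written, numbered commit files, every read observes a consistent table version — the essence of how a log layered over an object store yields ACID semantics.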

While a metadata layer supplies the management capabilities, matching data-warehouse SQL performance requires additional techniques. Because a Lakehouse keeps its data in existing open file formats, it cannot change the format itself; instead, optimisations such as caching, auxiliary data structures (for example, indexes and min/max statistics), and data-layout improvements are applied without altering the underlying data files. A notable example is the Databricks Delta Engine, an optimised query engine for Spark SQL.
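One such auxiliary structure is a zone map: per-file min/max statistics kept in the metadata layer, so a query can skip files whose value range cannot possibly match its predicate. A toy sketch, with invented file names and statistics:

```python
# Per-file min/max statistics, as a metadata layer might maintain them.
file_stats = {
    "part-000.parquet": {"min_amount": 1.0,   "max_amount": 50.0},
    "part-001.parquet": {"min_amount": 60.0,  "max_amount": 120.0},
    "part-002.parquet": {"min_amount": 130.0, "max_amount": 500.0},
}

def files_to_scan(lo, hi):
    """For the predicate `amount BETWEEN lo AND hi`, return only the files
    whose [min, max] range overlaps it; the rest are skipped without
    ever being opened or downloaded."""
    return [f for f, s in file_stats.items()
            if s["min_amount"] <= hi and s["max_amount"] >= lo]
```

Skipping whole files this way cuts I/O against the object store, which is where most of the query cost lives — all without rewriting a single Parquet file.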

Decoding the Data Landscape: In Summary

In conclusion, the evolution from first-generation data analytics platforms to the contemporary Data Lakehouse marks a transformative journey in data management. The initial data warehousing phase, while impactful, grappled with scalability and flexibility. The shift to the data lake + warehouse architecture, leveraging cost-effective data lakes and open file formats, addressed those issues but introduced complexities of its own: data staleness, limited support for advanced analytics, and higher total cost of ownership.

Recognising these challenges, the collaborative research effort by Databricks, UC Berkeley, and Stanford University gave rise to the Lakehouse architecture, with Delta Lake at its core. By pairing low-cost object storage with a transactional metadata layer, the Lakehouse combines the management features of a data warehouse with the openness and scalability of a data lake, ushering in a new era of efficiency and capability in data management.

For further details and in-depth insights, interested readers can refer to the research paper authored by the Databricks team and their collaborators. The research paper is available at https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
