From Data Warehouse to Data Lakehouse — Part 3

Seckin Dinc
6 min read · May 14, 2024


Photo by Henry & Co. on Unsplash

There are different rumours about why we ended up with Data Swamps over the years. Some people say that Software Engineers became Data Engineers overnight without any data administration background, some say that companies simply wanted to store data as it arrived without validating any use case, and some say that Data Modellers killed Agile Software Engineering, so companies opened all the gates of data acquisition. Whichever angle you look from, you will find different assumptions. Whether you agree or disagree with any of them, the fact is that we ended up with Data Swamps!

When the industry started to shift from the Data Warehouse (DWH) to Hadoop, we didn’t only change the technology but also many of the methodologies we had been using for decades. Data governance became the common enemy of agile development, schema-on-read displaced well-designed data models and respected data modellers, and data quality was no longer a first-class citizen.

In this article, I will walk you through the transition of Data Lakes into Data Swamps, where many companies, projects and people have drowned over the years. It all started with the very first on-premises Data Lake adoption: Hadoop.

What is a Data Lake?

Data storage was a major problem twenty years ago. Database administrators had to calculate monthly storage usage and estimate requirements for the upcoming months and years, because hardware had to be ordered upfront. The storage systems were also not flexible enough to hold every kind of information we wanted. So it was hard to be creative.

Today we can easily talk about image processing, chatbots, text generators, etc. All of these creative initiatives require diverse and massive amounts of data. It is no surprise that they were developed in recent years. How? Thanks to cheap, flexible and scalable data storage! This is where Data Lakes come in.

A data lake is a centralized repository that allows organizations to store their data in structured and unstructured formats as-is, without having to structure the data upfront. Data analysts, data scientists and other practitioners can directly access the raw data and run any type of analysis they want.

Data lakes contain different components that make them work efficiently. Let’s take a look at the most critical ones:

Storage Tools

Data lake storage is the storage layer within a data lake architecture where vast amounts of raw data are stored in their native format until they are needed for analysis or processing.

Data lake storage solutions include cloud-based object storage services like Amazon S3, Microsoft Azure Blob Storage and Google Cloud Storage, as well as on-premises solutions like the Hadoop Distributed File System (HDFS).

Let’s check their key characteristics:

  1. Scalability: Data lake storage solutions are typically designed to scale horizontally, allowing organizations to store petabytes or even exabytes of data cost-effectively.
  2. Flexibility: Data lakes accommodate various types of data, including structured data from relational databases, semi-structured data like JSON or XML, and unstructured data like text documents, images, or videos.
  3. Cost-effectiveness: Many data lake storage solutions are built on cloud-based object storage services, which offer pay-as-you-go pricing models and eliminate the need for upfront infrastructure investments.
  4. Schema-on-read: Unlike traditional data warehouses where data is structured and organized upfront (schema-on-write), data lake storage adopts a schema-on-read approach, in which structure is applied only when the data is read (see the sketch after this list).
  5. Integration: Data lake storage solutions often integrate with a variety of data processing and analytics tools, allowing organizations to analyze and derive insights from their data using familiar tools and frameworks.
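To make the schema-on-read idea concrete, here is a minimal sketch in Python using only pandas, with a local file standing in for an object-store key; the file name and event fields are hypothetical. Raw, heterogeneous JSON events are landed exactly as they arrive, and structure is imposed only at read time.

```python
import json
import pandas as pd

# Heterogeneous events: a warehouse with schema-on-write would force one shape up front;
# a data lake lands them exactly as they arrive.
raw_events = [
    {"user_id": 1, "event": "click", "ts": "2024-05-01T10:00:00"},
    {"user_id": "2", "event": "purchase", "amount": 19.9},      # extra field, string id
    {"user_id": 3, "event": "click", "device": {"os": "ios"}},  # nested field
]

# Land the raw data untouched (a local file stands in for an object-store key here).
with open("events.jsonl", "w") as f:
    for event in raw_events:
        f.write(json.dumps(event) + "\n")

# Schema-on-read: structure and types are applied only now, by the consumer.
df = pd.read_json("events.jsonl", lines=True)
df["user_id"] = pd.to_numeric(df["user_id"])  # coerce types at read time
clicks = df[df["event"] == "click"]
print(clicks[["user_id", "ts"]])
```

The flexibility is obvious, and so is the risk: nothing stops the producer from changing the shape of the events, and every consumer has to rediscover the structure on their own.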

File Formats

As we learned above, there are various solutions available as data lake storage. These solutions are flexible enough to store the most common file formats, and different file formats, compression schemes and partitioning strategies exist to support different use cases. The most common formats are CSV, JSON, Avro and Parquet. Let’s take a look at their capabilities (a short sketch after the list shows a few of them side by side):

  • CSV: Suitable for compatibility, spreadsheet processing, and human-readable data sets. It cannot handle nested data and is prone to data quality problems. Mainly used for exploratory analysis and Proofs of Concept.
  • JSON: The default data format used by APIs. It supports nested data structures. Simple files are easy to read, but deeply nested fields can become challenging. Mainly used for landing data or API integration; it is advisable to prepare and restructure the data before analysis.
  • Avro: Suitable for storing row-oriented data efficiently. It carries a schema and supports schema evolution, which makes it a natural fit alongside Kafka. Mainly used for row-level operations and data ingestion.
  • Parquet: Unlike Avro, Parquet is a columnar storage format. It works well with Hive and Spark for SQL-based querying, offers schema support and efficient storage, and serves as an excellent reporting layer for data lakes.
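As a rough illustration, the sketch below writes the same small, made-up dataset as CSV, JSON Lines, and Parquet with pandas (Parquet needs pyarrow or fastparquet installed; Avro is skipped here because it needs an extra library such as fastavro).

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": range(1, 6),
    "country": ["DE", "NL", "DE", "FR", "NL"],
    "amount": [10.5, 23.0, 7.9, 99.0, 15.25],
})

# CSV: row-oriented and human-readable, but types and schema live only in your head.
df.to_csv("orders.csv", index=False)

# JSON Lines: one record per line, the typical landing format for API payloads.
df.to_json("orders.jsonl", orient="records", lines=True)

# Parquet: columnar, compressed, and carries an embedded schema.
df.to_parquet("orders.parquet", index=False)

# The columnar payoff: read only the columns a query actually needs.
subset = pd.read_parquet("orders.parquet", columns=["country", "amount"])
print(subset.groupby("country")["amount"].sum())
```

The last two lines hint at why Parquet dominates analytical workloads: a reporting query touches a handful of columns, not every field of every record.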

Table Formats

Table formats enable seamless interaction with data lakes, similar to how we interact with databases or data warehouses. They are metadata constructs designed to simplify working with collections of files as if they were tables. Let’s take a look at the open-source table formats:

  • Apache Hive: Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop for querying and analyzing large datasets stored in Hadoop’s distributed file system (HDFS) or other compatible file systems. It provides a high-level interface and query language, called HiveQL (Hive Query Language), which is similar to SQL (Structured Query Language), allowing users to write SQL-like queries to process and analyze data.
  • Apache Iceberg: Iceberg is an open-source data lake table format originally developed at Netflix that provides fast and scalable analytics on large datasets stored in cloud object stores such as Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage. It stores data in open file formats such as Apache Parquet (Avro and ORC are also supported), and provides schema evolution and rich metadata management to optimize queries and improve performance (a minimal PySpark sketch follows this list).
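To get a feel for how a table format turns files into something database-like, here is a minimal PySpark sketch using an Iceberg catalog. It assumes Spark is installed and the matching iceberg-spark-runtime package is on the classpath; the catalog, namespace, table names and warehouse path are all placeholders for the example.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime jar matching your Spark version is available,
# e.g. via spark-submit --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:<version>
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "/tmp/lake-warehouse")  # placeholder path
    .getOrCreate()
)

# Work with files in the lake as if they were database tables.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id BIGINT, country STRING, amount DOUBLE
    ) USING iceberg
""")
spark.sql("INSERT INTO lake.sales.orders VALUES (1, 'DE', 10.5), (2, 'NL', 23.0)")
spark.sql("SELECT country, sum(amount) AS total FROM lake.sales.orders GROUP BY country").show()
```

Under the hood the data still lives as Parquet files in object storage; the table format contributes the metadata layer that makes inserts, schema changes and queries behave like they do in a warehouse.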

From Data Lake to Data Swamp

From a purely technological point of view, the Data Lake is an amazing concept. Open-source communities and commercial organizations constantly invest in the data lake ecosystem. And yet most data lake projects fail to deliver value. Why?

Access to Raw Data Became a Bottleneck

Large volumes of data assets need metadata to provide information on their context, format, structure, and usage. This metadata is essential for data analysts and scientists to effectively perform their analyses. Without proper governance and metadata, the time required to retrieve the necessary data assets increases exponentially.
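As an illustration of the kind of metadata that is missing in a swamp, here is a hand-rolled sketch of a catalog entry published next to a dataset. In practice this job is done by tools such as the Hive Metastore, AWS Glue, or DataHub; all names, paths and fields below are hypothetical.

```python
import json
import time

# A minimal, hand-rolled catalog entry; real lakes use a metastore or data catalog,
# but the information a consumer needs is roughly the same.
dataset_metadata = {
    "name": "orders",
    "path": "s3://example-bucket/raw/orders/",  # hypothetical location
    "format": "parquet",
    "owner": "sales-data-team",
    "schema": {"order_id": "bigint", "country": "string", "amount": "double"},
    "description": "One row per customer order, landed daily from the shop API.",
    "updated_at": time.strftime("%Y-%m-%d"),
}

# Publish the entry alongside the data so analysts can find, understand, and trust it.
with open("orders.metadata.json", "w") as f:
    json.dump(dataset_metadata, f, indent=2)
```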

No Valid Use Case for Data Acquisition

With Data Lakes, companies started to store as much data as possible in the hope of using it in future projects. They envisioned data analysts diving into these gold mines and collecting valuable insights. They thought data scientists would combine all the 3rd-party data with internal data sets to develop hyper-personalised AI products. None of that happened. Neither did Data Analysts find their way through the dark corners of the Data Lakes, nor were Data Scientists able to use 3rd-party data sets riddled with missing or duplicate data.

Lack of Data Validation and Quality Enforcements

As data practitioners started to focus on easier and faster data acquisition, they deprioritised data validation and quality enforcement. These are the fundamentals of data reliability. Without proper data reliability measures in place, it is impossible to detect the underlying issues in the data, which leads to trust issues towards the data stored.
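To show what even basic enforcement could look like, here is a minimal sketch of post-ingestion quality checks in pandas, reusing the hypothetical orders dataset from the earlier file-format sketch. Dedicated tools such as Great Expectations, dbt tests, or Soda formalise the same idea.

```python
import pandas as pd

df = pd.read_parquet("orders.parquet")  # hypothetical dataset from the earlier sketch

# Lightweight reliability checks run right after ingestion, before anyone builds on the data.
checks = {
    "no_duplicate_keys": df["order_id"].is_unique,
    "no_null_amounts": df["amount"].notna().all(),
    "amounts_positive": (df["amount"] > 0).all(),
    "known_countries": df["country"].isin(["DE", "NL", "FR"]).all(),
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    raise ValueError(f"Data quality checks failed: {failed}")
print("All data quality checks passed.")
```

A few assertions like these, run on every load, are often the difference between a lake people trust and a swamp people avoid.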

Conclusion

Data Lake as a concept and technology is a massive development in the data domain. I believe that without the data lake implementations the AI revolution that we are observing today wouldn’t be possible.

On the other hand, organizations failed to pay enough attention to data management and governance while they were mesmerised by the scalability of the data lakes. Today, most Data Lakes have turned into Data Swamps in which more than 80% of the data goes untouched.

Now that organisations have solved the data storage and scalability problems, I think it is crucial to go back to the good old days of data warehousing, with a twist: Data Lakehouses!
