Data warehouse vs Data Lake vs Data Lakehouse

M Haseeb Asif
Big Data Processing
5 min readJan 2, 2022

With the ever-increasing amount of data produced, Cloud provides many benefits for data processing and analytics, such as scalability, reliability, and availability. In addition, there are various tools & technologies for data processing and analytics in the cloud ecosystem.

In any analytics platform design, compute, and storage are fundamental to the performance of the data platforms. There are three major categories of analytics platforms — data warehouses, data lakes, and data lakehouses. Let’s have a brief look at them.

Data Warehouse

A data warehouse is a central repository of business data stored in structured format to help organizations gain insights. Schema needs to be known before writing data into a warehouse. It pulls the data regularly from different sources and formats them to the schema already in the warehouse. It is designed for business analysts or less tech-savvy users. Some use cases are batch reports, BI dashboards, and visualizations.

A data warehouse is optimized for processing and storage such that queries are performant. It shouldn’t be confused with the database systems designed for transactional data. Instead, it is the aggregated data from the different sources, including the database. As a result, a data warehouse offers a higher quality of data and faster query results. Cloud offerings for data warehouses are Google Big Query, Amazon Redshift, Azure SQL Data warehouse, and Snowflake.

Enterprise Data warehouse(EDW) architecture (source)

A data warehouse cannot handle raw or unstructured data, and it is expensive to maintain a warehouse with the ever-growing amount of data. Also, it is not the best solution for complex data processing such as machine learning or predictive analytics.

Data lakes

A data lake has a separate storage and processing layer compared to a legacy data warehouse, where a single tool is responsible for both storage and processing. A data lake stores data on an object store such as Amazon S3, Google Cloud Storage, or Azure Data Lake Storage. It is generally processed with tools such as Spark or Athena. Azure Data lake is such an offering from Microsoft.

So, A data lake is an ample storage that can store structured, semi-structured, and raw data. The schema of the data is not known as it is a schema-on-read. Therefore, it is unknown how the data will be used compared to a data warehouse where data is already structured and schema is known beforehand. In addition to that, the data lake is suitable for a data scientist who can process the raw data. In contrast, a data warehouse is more business user-friendly. It is ideal for machine learning, predictive analytics, user profiling, etc.

Data Lake architecture (source)

Data lakes solve many data warehouse challenges, but it has poor data quality, and query performance isn’t efficient enough. Also, it requires additional tools to run SQL queries for business users. If a data lake isn’t organized well, it can lead to a data stagnation problem

Data lakehouses

Data lakehouse is a realtively new architecture and it is combining the best of the both worlds — data warehouses and data lakes. It serves as a single platform for data warehousing and data lakes. It has data management features such as ACID transcation coming from a warehouse perspective and low cost storage like a data lake. It provide direct access to the source data, allow concurrent read & write operations on the data, and schema support for data governance.

All the data — strucutured, semi-strucuted, unstrucuted, is stored in the data lake without doing any processing and it will be used by other processes for specific use-cases. Later different processing tools is used to build the specific use-case on the data. Futhermore, performance optimizations such as indexing, data compactions helps to achieve faster query results similar to a data warehouse. It also supports the streaming data so it can update the reporting dashboards in real-time.

Lakehouse architecture (source)

A data lakehouse offers improved data reliability by reducing the ETL data transfers but offering raw data storage. Data will not be duplicated in multiple systems. Due to reduced ETL processes and de-duplication cost is also reduced. Futhermore, it also offers better data management and opens up the data for multiple use cases.

Some of the key features of the data lakehouse are as follows.

  • ACID transaction support — ACID (Atomicity, concurrency, isolation, and durability) ensures transaction consistency and data integrity. It helps to maintain the data integrity when different components are doing concurrent operations or in case of failures. It’s a fundamental property of a data warehouse and is inherited into a data lakehouse.
  • Raw or unstructured data format — A data warehouse only supported structured, but now we have support for raw data types, including audio, video, etc.
  • Streaming support — Now, data is generated like an unbounded stream, so the data lake has support for streaming the data and generating insights in real-time.
  • Decoupled storage and compute — Storage and compute are decoupled, making them independently scalable as per the use case needs. Also, it allows you to run queries using different computing nodes and while others are accessing the storage directly.

In conclusion, data warehouses have existed for a while and matured, but they aren’t designed for modern data processing needs. On the other hand, data lakes solve most of the challenges but take away some of the best features of the data warehouses. Therefore, data lakehouse came into the picture and brought the best of both worlds. However, Data lakehouse architecture is still relatively new, and it’s going to take some time to get it mature and best practices being shared by the early adopters. In the meantime, Data warehouses and Data lakes have still been implemented for specific use cases, and in most cases, they co-exist and complement each other quite well to solve the problem at hand.

References

  1. https://www.altexsoft.com/blog/data-lakehouse/
  2. https://www.datagrom.com/data-science-machine-learning-ai-blog/snowflake-vs-databricks
  3. https://dataaiengine.com/

--

--

M Haseeb Asif
Big Data Processing

Technical writer, teacher and passionate data engineer. Love to talk, write and code with Apache Spark, Flink or anything related to data