The New Generation of Storage Solutions

Lakehouses: The Next Step in the Evolution of Storage Solutions


In this post, I am going to talk about the lakehouse, the optimal storage solution for today’s world. Let’s start with some history.

Characteristics of an Optimal Storage Solution

When we want to build a system, we typically need these kinds of features from storage:

  • Transaction Support: We often need to read and write data concurrently, so support for ACID transactions is essential.
  • Scalability and Performance: The storage solution should be able to scale out to store huge amounts of data, and latency needs to stay at an acceptable level.
  • Diverse Data Formats: Data is everywhere and can be in any format: structured, semi-structured, or unstructured. So, the storage solution should be able to support all of these formats.

Over time, different kinds of storage solutions have been proposed, each with its unique advantages and disadvantages with respect to these properties. Let’s start with databases.

Databases

For many decades, databases have been the most reliable solution for building data warehouses to store business-critical data.

Databases are designed to store structured data as tables, which can be read using SQL queries. The data must adhere to a strict schema, which allows a database management system to heavily co-optimize the data storage and processing. RDBMSs support strong transactional ACID guarantees, which gives us confidence in our data.

SQL workloads on databases can be broadly classified into two categories, as follows:

  • Online transaction processing (OLTP): OLTP workloads are typically high-concurrency, low-latency, simple queries that read or update a few records at a time. Classic RDBMSs fall into this category.
  • Online analytical processing (OLAP): OLAP workloads are typically complex queries (involving aggregates and joins) that require high-throughput scans over many records.

Well, it looks good. But databases have some limitations.

Limitations of Databases

Up until the last decade, RDBMSs were fine; they gave us what we needed. But over the last decade, data started to grow.

In 2010 there were 5 billion mobile phones in use. You can be sure that there are more today, and as I’m sure you will understand, these phones and the apps we install on them are a big source of big data, contributing to our data stores all the time, every day. And Facebook, which set a record of having one billion people log in on a single day, has more than 30 billion pieces of content shared every month. These numbers are from 2013. Can you imagine the numbers today?

So, this growth pushed technology to become data driven. Our solutions had to evolve to be able to handle that much data.

Databases are designed to work on a single machine: I am talking about RDBMSs, as you may guess. They typically run on a single machine because that is what they were designed for, and they do that job well. But today we can no longer fit all our data on a single machine, or process it there. So what did we have to do? We had to spread the data across multiple machines, which we call a cluster, potentially hundreds or thousands of machines. But hang on, databases were not designed for that. How can a database support ACID guarantees in that setting? When you think about distributed systems, there are a lot of things to consider.

Databases do not support non-SQL-based analytics very well: The other problem is that we no longer have just structured data. Text, images, videos, JSON and XML files, there are many data types we need to query, and SQL alone cannot give us what we need. We also need different ways to query and analyze these types of data.

So, these limitations of databases led to the development of a completely different approach to storing data, known as data lakes.

Data Lakes

In contrast to most databases, a data lake is a distributed storage solution that runs on commodity hardware and easily scales out horizontally.

The data lake architecture, unlike that of databases, decouples the distributed storage system from the distributed compute system. This allows each system to scale out as needed by the workload. Furthermore, the data is saved as files with open formats, such that any processing engine can read and write them using standard APIs. This idea was popularized in the late 2000s by the Hadoop Distributed File System (HDFS) from the Apache Hadoop project, which itself was heavily inspired by the research paper “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. Doug Cutting is one of the creators of Hadoop; he is also the creator of Apache Lucene. I love this guy, if you don’t know him, please look up his work :)

So, Hadoop changed everything. It all started with Hadoop, and the ecosystem evolved around it.

Organizations build their data lakes by independently choosing the following:

  • Storage system: They choose to either run HDFS on a cluster of machines or use any cloud object store (e.g., AWS S3, Azure Data Lake Storage, or Google Cloud Storage).
  • File format: Depending on the downstream workloads, the data is stored as files in either structured (e.g., Parquet, ORC), semi-structured (e.g., JSON), or sometimes even unstructured formats (e.g., text, images, audio, video).
  • Processing engine(s): Again, depending on the kinds of analytical workloads to be performed, a processing engine is chosen. This can either be a batch processing engine (e.g., Spark, Presto, Apache Hive), a stream processing engine (e.g., Spark, Apache Flink), or a machine learning library (e.g., Spark MLlib, scikit-learn, R).

This flexibility — the ability to choose the storage system, open data format, and processing engine that are best suited to the workload at hand — is the biggest advantage of data lakes over databases. On the whole, for the same performance characteristics, data lakes often provide a much cheaper solution than databases. This key advantage has led to the explosive growth of the big data ecosystem.
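
To make this concrete, here is a minimal PySpark sketch of the pattern: the compute engine (Spark here, but it could just as well be Presto, Hive, or Flink) reads and writes open-format files that live on whatever storage system you picked. The bucket, paths, and column names below are made up for illustration.

```python
# A minimal sketch of the data lake pattern: Spark is just one engine reading
# and writing open-format files on object storage. Paths and columns are
# hypothetical; any engine that understands JSON/Parquet could do the same.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-lake-demo").getOrCreate()

# Semi-structured input (JSON) landed by some upstream system.
raw = spark.read.json("s3a://my-bucket/raw/clicks/")  # hypothetical path

# A simple batch transformation.
daily = (
    raw.withColumn("date", F.to_date("timestamp"))
       .groupBy("date", "page")
       .count()
)

# Write the result back as Parquet, an open columnar format that Presto,
# Hive, Flink, or another Spark job can read with standard APIs.
daily.write.mode("overwrite").partitionBy("date").parquet(
    "s3a://my-bucket/curated/daily_clicks/"
)
```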

Limitations of Data Lakes

Is everything good with data lakes? The answer is no; they also have some limitations. With distributed systems, there is always another problem waiting to be solved.

The most important limitation is the lack of ACID support: atomicity, consistency, isolation, and durability.

Processing engines write data in data lakes as many files in a distributed manner. If the operation fails, there is no mechanism to roll back the files already written, thus leaving behind potentially corrupted data.

Lack of atomicity on failed writes further causes readers to get an inconsistent view of the data. In fact, it is hard to ensure data quality even in successfully written data. For example, a very common issue with data lakes is accidentally writing out data files in a format and schema inconsistent with existing data.

To work around these limitations of data lakes, developers employ some tricks like:

  • The schedules of data update jobs (e.g., daily ETL jobs) and data querying jobs (e.g., daily reporting jobs) are often staggered to avoid concurrent access to the data and any inconsistencies caused by it.
  • Large collections of data files in data lakes are often “partitioned” by subdirectories based on a column’s value (e.g., a large Parquet-formatted Hive table partitioned by date). To achieve atomic modifications of existing data, often entire subdirectories are rewritten (i.e., written to a temporary directory, then references swapped) just to update or delete a few records, as sketched below.
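
Here is a minimal sketch of that second workaround: rewriting a whole partition just to correct a few records. The paths and column names are hypothetical, and the final directory swap is left as a comment because it happens at the filesystem level, outside of Spark.

```python
# A minimal sketch of the "rewrite the whole partition" workaround, assuming
# a Parquet dataset partitioned by a `date` column. All paths, tables, and
# column names here are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-rewrite").getOrCreate()

events_path = "hdfs:///data/events"          # laid out as .../date=YYYY-MM-DD/
staging_path = "hdfs:///tmp/events_staging/date=2021-06-01"

# 1. Read only the partition that needs to be corrected.
day = spark.read.parquet(f"{events_path}/date=2021-06-01")

# 2. Apply the "few records" fix in memory.
fixed = day.withColumn(
    "status",
    F.when(F.col("user_id") == 42, F.lit("corrected")).otherwise(F.col("status")),
)

# 3. Write the whole corrected partition to a temporary directory.
fixed.write.mode("overwrite").parquet(staging_path)

# 4. Swap the staging directory in place of the old partition directory
#    (e.g., with an HDFS rename). Until the swap completes, readers may still
#    see the old data; there is no transaction protecting concurrent readers.
```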

Attempts to eliminate such practical issues have led to the development of new systems, such as lakehouses.

Lakehouses: The Next Step in the Evolution of Storage Solutions

The lakehouse is a new paradigm that combines the best elements of data lakes and data warehouses for OLAP workloads. Lakehouses are enabled by a new system design that provides data management features similar to databases directly on the low-cost, scalable storage used for data lakes. More specifically, they provide the following features:

  • Transaction support: Similar to databases, lakehouses provide ACID guarantees in the presence of concurrent workloads.
  • Support for diverse data types in open formats
  • Schema enforcement and governance: Lakehouses prevent data with an incorrect schema being inserted into a table, and when needed, the table schema can be explicitly evolved to accommodate ever-changing data. The system should be able to reason about data integrity, and it should have robust governance and auditing mechanisms. (See the small sketch after this list.)
  • Support for upserts and deletes: Complex use cases like change-data-capture (CDC) and slowly changing dimension (SCD) operations require data in tables to be continuously updated. Lakehouses allow data to be concurrently deleted and updated with transactional guarantees.
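
To give a feel for schema enforcement in practice, here is a small sketch using Delta Lake (one of the lakehouse systems covered below). The path and column names are made up, and the exact error message depends on the Delta version.

```python
# A sketch of schema enforcement with Delta Lake; paths and columns are
# hypothetical, and the exact exception text varies by Delta version.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("schema-enforcement")
    # Assumes the delta-spark package is on the classpath.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/users_delta"

# Create the table with a two-column schema.
spark.createDataFrame([(1, "alice")], ["id", "name"]) \
     .write.format("delta").save(path)

# A write with an extra, unexpected column is rejected by default ...
bad = spark.createDataFrame([(2, "bob", "premium")], ["id", "name", "tier"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as e:
    print("Rejected by schema enforcement:", e)

# ... unless we explicitly ask for the table schema to evolve.
bad.write.format("delta").mode("append") \
   .option("mergeSchema", "true").save(path)
```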

Currently, there are a few open source systems, such as Apache Hudi, Apache Iceberg, and Delta Lake, that can be used to build lakehouses with these properties. At a very high level, all three projects have a similar architecture inspired by well-known database principles. They are all open data storage formats that do the following:

  • Store large volumes of data in structured file formats on scalable filesystems.
  • Maintain a transaction log to record a timeline of atomic changes to the data (much like databases).
  • Use the log to define versions of the table data and provide snapshot isolation guarantees between readers and writers.
  • Support reading and writing to tables using Apache Spark.

Apache Hudi

Initially built by Uber Engineering, Apache Hudi — an acronym for Hadoop Update Delete and Incremental — is a data storage format that is designed for incremental upserts and deletes over key/value-style data. The data is stored as a combination of columnar formats (e.g., Parquet files) and row-based formats (e.g., Avro files for recording incremental changes over Parquet files). Besides the common features mentioned earlier, it supports:

  • Upserting with fast, pluggable indexing (see the sketch after this list)
  • Atomic publishing of data with rollback support
  • Reading incremental changes to a table
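
Here is a rough sketch of what an upsert looks like through Hudi’s Spark datasource, following the pattern from the Hudi quickstart. It assumes the hudi-spark bundle is on the classpath; the path, table name, and columns are hypothetical.

```python
# A sketch of a Hudi upsert through the Spark datasource (quickstart-style).
# Requires the hudi-spark bundle on the classpath; names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

path = "/tmp/hudi/users"
hudi_options = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.recordkey.field": "user_id",      # key used by the index
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest record wins
    "hoodie.datasource.write.operation": "upsert",
}

updates = spark.createDataFrame(
    [(42, "alice", "2021-06-01 10:00:00")],
    ["user_id", "name", "updated_at"],
)

# Records whose user_id already exists are updated; new ones are inserted.
updates.write.format("hudi").options(**hudi_options).mode("append").save(path)

# Read back the latest snapshot of the table.
spark.read.format("hudi").load(path).show()
```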

Apache Iceberg

Originally built at Netflix, Apache Iceberg is another open storage format for huge data sets. However, unlike Hudi, which focuses on upserting key/value data, Iceberg focuses more on general-purpose data storage that scales to petabytes in a single table and has schema evolution properties. Specifically, it provides the following additional features (besides the common ones):

  • Rollback to previous versions to correct errors
  • Schema evolution by adding, dropping, updating, renaming, and reordering of columns, fields, and/or nested structures (see the sketch after this list)
  • Serializable isolation, even between multiple concurrent writers
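
As a rough illustration, here is how schema evolution and rollback look through Spark SQL with an Iceberg catalog. The catalog name, warehouse path, and table are assumptions on my part, and the iceberg-spark runtime needs to be on the classpath.

```python
# A sketch of Iceberg schema evolution and rollback via Spark SQL. Assumes an
# iceberg-spark runtime on the classpath; catalog name, warehouse path, and
# table/columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, msg STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'hello')")

# Schema evolution: add and rename columns without rewriting the data files.
spark.sql("ALTER TABLE local.db.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE local.db.events RENAME COLUMN msg TO message")

# Every commit creates a snapshot; pick one to roll back to if needed.
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
# spark.sql("CALL local.system.rollback_to_snapshot('db.events', <snapshot_id>)")
```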

Delta Lake

Personally, this is my favourite because it was designed by Spark engineers, and I like Spark :)

Delta Lake is an open source project hosted by the Linux Foundation, built by the original creators of Apache Spark. Similar to the others, it is an open data storage format that provides transactional guarantees and enables schema enforcement and evolution. It also provides several other interesting features, some of which are unique. Delta Lake supports:

  • Streaming reading from and writing to tables using Structured Streaming sources and sinks
  • Update, delete, and merge (for upserts) operations, available in the Java, Scala, and Python APIs (see the sketch after this list)
  • Schema evolution either by explicitly altering the table schema or by implicitly merging a DataFrame’s schema to the table’s during the DataFrame’s write. (In fact, the merge operation in Delta Lake supports advanced syntax for conditional updates/inserts/deletes, updating all columns together, etc., as you’ll see later in the chapter.)
  • Time travel, which allows you to query a specific table snapshot by ID or by timestamp
  • Rollback to previous versions to correct errors
  • Serializable isolation between multiple concurrent writers performing any SQL, batch, or streaming operations
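
Here is a small sketch of the merge (upsert) and time travel features through the delta-spark Python API. The path and columns are made up for illustration.

```python
# A sketch of Delta Lake merge (upsert) and time travel using the delta-spark
# Python API. Paths, tables, and columns are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/users"

# Version 0 of the table.
spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
     .write.format("delta").save(path)

# Upsert: update matching rows, insert the rest, all in one transaction.
updates = spark.createDataFrame([(2, "bobby"), (3, "carol")], ["id", "name"])
(
    DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read the table as it was before the merge.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```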

PS: This project is called Delta Lake because of its analogy to streaming. Streams flow into the sea to create deltas, which is where all of the sediments accumulate, and thus where the valuable crops are grown. Jules S. Damji (one of the coauthors of Learning Spark) came up with this!

Apache Spark and Delta Lake work well together. I am going to write a separate article about this integration.

That’s it for this post. I hope you enjoyed it :)

See you

Resources

https://delta.io/

https://hudi.apache.org/
