Pile of Files to Lakehouse: A 10,000-Foot Overview

Yunus Durmuş
Google Cloud - Community
4 min read · Jul 19, 2023

The data analytics space is pretty generous about inventing new terms like Data Lake, Data Warehouse, and Database. Then, when these technologies eventually start to merge, we get even more clever with naming: we concatenate them into Lakehouse.

The more we organize the files, the more queryable they become. Big data engines combine these file dumps and databases.

The picture above hopefully summarizes the whole evolution. Let me explain it in words as well.

File formats

We start with lots of files in different formats. Unstructured ones are hard to work with: we mostly do simple search on them, which is not good for proper analytics. That said, you have all heard about Large Language Models, am I right? New tech like Google Cloud Enterprise Search makes searching them easier. Maybe after the Lakehouse, we will talk about LanguageLakes 😜.

The more structure we add to files, the easier they are to query. Column-based formats are ideal for analytics and storage: they keep a column’s data together in chunks, so compression works better and reading a whole column for aggregation is cheap.
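
To make this concrete, here is a minimal sketch with pyarrow (the file, columns, and values are made up): the data is written as columnar Parquet, and an aggregation then reads back only the one column it needs.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Toy table: in Parquet, each column is stored as its own compressed chunks.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["NL", "TR", "DE", "NL"],
    "revenue": [10.0, 20.0, 5.0, 7.5],
})
pq.write_table(table, "events.parquet")

# Column pruning: only the 'revenue' bytes are read, not the whole file.
revenue = pq.read_table("events.parquet", columns=["revenue"])
print(pc.sum(revenue["revenue"]).as_py())  # 42.5
```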

Data Lake

Let’s dump the above files as they are, since we don’t yet know what to do with them and storage is cheap. With custom applications and big data engines like Spark, we can still work on them.
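
As a sketch, assuming an environment where the GCS connector is available to Spark (e.g. Dataproc; the bucket, path, and field names are hypothetical), Spark can read such a dump directly and infer a schema on the fly:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Schema-on-read: nothing was enforced when the files were dumped,
# so Spark infers a schema by sampling the JSON at read time.
events = spark.read.json("gs://my-bucket/raw/events/*.json")
events.printSchema()

# Hypothetical field: works only if the dumped events carry 'event_type'.
events.groupBy("event_type").count().show()
```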

But wait: there are too many of these files, and I don’t know what is in them or what their schema is. It would be nice if I could get some metadata and enforce some standards.

Table Formats

Table formats organize the files and provide some metadata. For instance, the basic Hive partitioning format organizes files in directories whose names are partition key=value pairs, like gs://myBucket/myTable/dt=2019-10-31/lang=en/
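
For illustration, a PySpark sketch (bucket, table, and column names are hypothetical): partitionBy produces exactly this key=value directory layout, and filters on the partition columns prune whole directories at read time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-partitioning").getOrCreate()
df = spark.read.json("gs://myBucket/raw/events/*.json")  # hypothetical dump

# Produces gs://myBucket/myTable/dt=.../lang=.../part-*.parquet directories.
df.write.partitionBy("dt", "lang").parquet("gs://myBucket/myTable/")

# Partition pruning: only the dt=2019-10-31/lang=en directory is read.
spark.read.parquet("gs://myBucket/myTable/") \
    .where("dt = '2019-10-31' AND lang = 'en'") \
    .show()
```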

Advanced table formats like Iceberg, Delta, and BigQuery’s native tables provide snapshots, time travel, schema evolution, and much more.
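
As a sketch of what time travel looks like in practice, assuming a Delta table and the delta-spark package on the session’s classpath (the path and timestamp are made up); Iceberg exposes the same idea through snapshot IDs:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is available to this session.
spark = SparkSession.builder.appName("time-travel").getOrCreate()

path = "gs://myBucket/myDeltaTable"  # hypothetical Delta table location

# Read the table as of an earlier snapshot, by version or by timestamp.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
old = (spark.read.format("delta")
       .option("timestampAsOf", "2023-07-01")
       .load(path))
v0.show()
```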

Data Lakehouse

With proper table formats, we can now easily query those files with SQL. And since big data engines are quite capable, they can query not just these lakes but also SQL/NoSQL databases like MySQL, PostgreSQL, Bigtable, Spanner, etc.
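
For example, BigQuery can join a lake table with a live Cloud SQL table in a single statement via EXTERNAL_QUERY, which pushes the inner SQL down to the operational database. A sketch with the Python client (project, dataset, connection, and table names are all hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# One statement spans the lake and the operational database.
sql = """
SELECT l.user_id, l.clicks, u.plan
FROM `my_project.my_dataset.lake_events` AS l
JOIN EXTERNAL_QUERY(
  'my_project.eu.my_cloudsql_connection',
  'SELECT user_id, plan FROM users'
) AS u USING (user_id)
"""
for row in client.query(sql).result():
    print(row.user_id, row.clicks, row.plan)
```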

With a Data Lakehouse, life is much easier for a data analyst or data scientist: they can query data residing anywhere with a single tool.

Single Interface/Secure Lakehouse with BigLake API

The Lakehouse enables us to query piles of data with SQL, and hence integration with many other systems has become easier. Meanwhile, the security team is in a panic: they want to control who accesses what. They want to enforce fine-grained access control at the dataset, table, row, and column level.

The BigLake API acts as an authentication and authorization layer in front of all these table formats. By using Google Cloud’s native IAM, you are able to control access to all the data.
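
A sketch of what this looks like in BigQuery DDL, run through the Python client (project, dataset, connection, group, and column names are all hypothetical): first a BigLake table is created over Parquet files in GCS through a connection, then a row-level access policy restricts which rows a group may see.

```python
from google.cloud import bigquery

client = bigquery.Client()

# BigLake table: BigQuery reads the Parquet files through the connection's
# service account, so end users never need direct access to the bucket.
client.query("""
CREATE EXTERNAL TABLE `my_project.my_dataset.events`
WITH CONNECTION `my_project.eu.my_biglake_connection`
OPTIONS (format = 'PARQUET',
         uris = ['gs://myBucket/myTable/*.parquet'])
""").result()

# Row-level security on the same table, granted to an IAM identity.
client.query("""
CREATE ROW ACCESS POLICY nl_analysts
ON `my_project.my_dataset.events`
GRANT TO ('group:nl-analysts@example.com')
FILTER USING (lang = 'nl')
""").result()
```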

BigLake has many other features as well: metadata storage with file-level tracking (in contrast to the Hive Metastore, which tracks at the folder level, so engines have to run slow ‘ls’-style listings on object storages like GCS), a single SQL syntax, and partial data read support for the Parquet file format.

BigLake APIs act as a single point of entry to all file/table formats, handling security from a single pane of glass.

BigData engines

There are many options for processing these files. The overall trend is distributed computation, where we can increase capacity not just by vertical but also by horizontal scaling.

Google Cloud Dataproc supports many of these engines, such as Spark, Trino, and Hive. You may even run Apache Spark serverless: you simply submit a job without configuring the infrastructure at all.
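
A minimal sketch of a serverless Spark submission with the Dataproc Python client (project, region, and script URI are hypothetical):

```python
from google.cloud import dataproc_v1

region = "europe-west1"  # hypothetical
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# The batch describes only the job itself; the infrastructure is managed.
batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/etl.py"
    )
)
operation = client.create_batch(
    parent=f"projects/my-project/locations/{region}", batch=batch
)
print(operation.result().state)  # blocks until the batch finishes
```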

BigQuery, as you see in the picture, is not just an engine: it has its own file and table formats. When you use all the native features of BigQuery, you end up with a serverless architecture.

Data Warehouse

I have not shown it in the picture, but there is also the data warehouse. What is the difference between a warehouse and a lakehouse? Warehouses have their own file/table formats, and storage and computation mostly sit side by side on the same machines. Thus, they respond faster and maintain schemas strictly.

BigQuery, on the other hand, is a data warehouse by definition when used with its native storage. It becomes a lakehouse when it queries data in other storage systems like GCS/S3/Azure Blob/Cloud SQL, etc. Based on your performance requirements and the features you need, you may gradually move data from the lake level to the warehouse level.
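
As a sketch of that gradual move (bucket, dataset, and table names are hypothetical), a load job copies the lake’s Parquet files into BigQuery native storage:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Promote Parquet files from the lake into BigQuery native storage.
job = client.load_table_from_uri(
    "gs://myBucket/myTable/*.parquet",
    "my_project.my_dataset.events_native",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
job.result()  # waits for the load job to finish
```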

Summary

We have lots of data, and we don’t want to get rid of it in case we need it someday. However, we are lost among all this data: we don’t know where it is, what is inside it, what its schema is, and so on. The solution is to bring some order by using structured file & table formats and introducing metastores. When we combine these with the BigLake API, we also get a single interaction point with fine-grained access control.

Now data scientists and analysts can combine the data of the whole company in a single dashboard or model.
