Headless Lakehouse
Design considerations building a headless lakehouse
[Co-authored with Lee Hilton-Smith. Kudos to Umesh Pawar for his contributions.]
Over the last couple of decades, we have seen successive evolutions in making (large-scale) data available to consumers: from data warehouses (Azure Synapse, Redshift, BigQuery, Snowflake or ClickHouse), through data lakes (ADLS Gen2, S3 or GCS), to the latest iteration, the data lakehouse (Delta Lake, Apache Hudi or Apache Iceberg).
The data lakehouse pattern is now dominant in the industry and is often tied to a specific vendor (e.g. Databricks), compute engine (e.g. Synapse Spark), storage (e.g. ADLS Gen2), and open table format (e.g. Delta Lake). While a single compute engine can be well integrated with different query engines, table formats, storage layers, and metadata stores, switching to another compute engine can be highly complex due to inherited dependencies.
In today’s lakehouse architecture, the primary compute engine (head) can determine architectural and capability decisions, leading to problems in centralized data lakes and data mesh topologies. Additionally, each data persona (data engineer, data scientist, etc.) has a different relationship with data and tooling, with different requirements and desired outcomes, so a single head is often not ideal. This leads to three major issues:
- Limited data and metadata interoperability: Data and metadata are not always accessible from a secondary compute engine. This is a step back not only for data access but also for interoperability. For instance, querying Delta across Spark and Hive is not yet possible: Delta tables created from Spark (e.g. Azure Databricks) cannot be read from Hive (e.g. HDInsight) and vice versa (see the sketch after this list). Likewise, OSS HMS and HDInsight HMS are not fully compatible, and schema changes are needed to make them work together.
- Data access and governance: Users want to see table ownership metadata across systems. For instance, is metadata for a Delta table created from Azure Databricks available in Synapse? Similarly, how can users enable unified data access and governance across tables and datasets created from different compute engines, when Unity Catalog serves Databricks, Purview serves Synapse, and Apache Ranger serves HDInsight?
- Security, monitoring and integration: Apart from data access and governance, how well are security (e.g. RLS, CLS or dynamic data masking), monitoring (e.g. tooling and telemetry/alerting parity) and ETL supported when moving to a secondary engine? How does the new compute integrate with existing reporting (Power BI) or data science tooling (AzureML)?
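To make the interoperability gap concrete, here is a minimal PySpark sketch. It assumes a hypothetical ADLS Gen2 path and a Delta-enabled Spark setup; in reality the two sessions would run on two separate compute engines, not in one process:

```python
from pyspark.sql import SparkSession

path = "abfss://data@mylake.dfs.core.windows.net/tables/orders"  # hypothetical

# Engine A (e.g. Azure Databricks) writes a Delta table by path.
spark_a = SparkSession.builder.appName("engine-a").getOrCreate()
spark_a.range(100).write.format("delta").mode("overwrite").save(path)

# Engine B can read the same files by path, provided its Delta reader
# supports the table's protocol version.
spark_b = SparkSession.builder.appName("engine-b").getOrCreate()
df = spark_b.read.format("delta").load(path)

# A catalog lookup, however, fails unless the table was also registered in a
# metastore that engine B can see:
# spark_b.sql("SELECT * FROM orders")  # AnalysisException: table not found
```

Path-based access solves only half the problem: without shared metadata, every consumer has to know physical locations, which is exactly what a metastore is meant to abstract away.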
What’s a Headless Lakehouse?
A headless lakehouse (aka configurable compute) can be defined as a unified data architecture that provides a seamless way to access and manage data across different compute systems, storage locations, and formats. It enables different systems and users to access, analyze and use the data easily, promoting agility and scalability in data management and analysis. That way, users looking for interoperable compute engines get quick and smooth access to data from the compute engine of their choice.
What users don’t want is data locked into a particular storage layer and/or metastore and unavailable to other engines, or to spend extra time and compute cost on data transfer and format conversion. Beyond interoperable access, the capabilities of a headless lakehouse include:
- Multiple compute capabilities supporting data access via multiple query engines and workloads. This can be serverless or provisioned.
- Unified data and metadata storage avoiding redundant storage, metadata and cross-system ETL.
- Unified management and governance with user management, access controls, data lineage and quality, schema evolution, and more.
- Unified security in the data storage and all compute engines enforcing it uniformly to queries and jobs.
- Unified collaboration and data sharing across different teams, e.g. API based (see the sharing sketch after this list).
- Unified monitoring for logs, metrics and alerts including platform observability.
- Unified integration and ETL integrating with data from various sources to the unified data storage and making it available for consumers.
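As one illustration of the API-based sharing mentioned above, the open Delta Sharing protocol lets a consumer read a shared table without copying it. This is a hedged sketch using the delta-sharing Python client; the profile file and share/schema/table names are hypothetical:

```python
import delta_sharing

# Provider-issued credentials file (hypothetical name) and a
# profile#share.schema.table coordinate exposed by the provider.
profile = "config.share"
table_url = profile + "#sales_share.gold.orders"

# Load the shared table as a pandas DataFrame without moving or copying
# the underlying data.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```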
Multiple compute capabilities
In a headless lakehouse pattern, multiple compute engines can be arranged in two different topologies. In both topologies, attention to the versions of the engine, metastore and data format is critical for ensuring interoperability (see the version check sketch after this list):
- Same Query Engine: In this topology, the same query engine (e.g. Spark) is used across different compute services (e.g. HDInsight and Synapse). Each compute engine has access to the underlying data and metastore.
- Cross Query Engine: This topology involves multiple query engines (such as Spark and Hive) accessing the same storage and metastore, allowing different compute systems (e.g. HDInsight and Databricks) to utilize the same data and metadata.
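On the version point: before attaching a second engine to existing Delta tables, it is worth checking each table's protocol requirements, since an engine whose Delta reader is older than the table's minReaderVersion cannot read it. A minimal check, assuming an existing Delta-enabled `spark` session and a hypothetical table path:

```python
# Inspect the Delta protocol versions a second engine must support.
detail = spark.sql(
    "DESCRIBE DETAIL delta.`abfss://data@mylake.dfs.core.windows.net/tables/orders`"
)
detail.select("minReaderVersion", "minWriterVersion").show()
```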
Unified data and metadata store
Users can keep their data in ADLS Gen2 and read it from different compute engines. This makes data interoperable and allows switching from one platform to another. In a headless lakehouse pattern, making such unified data storage accessible from multiple systems is critical. For instance, Synapse, HDInsight and Azure Databricks all work well with ADLS Gen2.
Similarly, metadata can be stored in a separate external metastore. With a shared HMS or a sync mechanism, such metadata becomes accessible across multiple compute engines without requiring metadata migration or rebuilding the metastore from external tables.
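Both approaches below build on standard external-metastore wiring. As a sketch of what that looks like from Spark, the configuration below attaches a session to an external HMS backed by Azure SQL Database; the endpoint, credentials and versions are placeholders, and the metastore version must match the schema the database was initialized with:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("external-hms")
    # Load Hive metastore client jars matching the external HMS version.
    .config("spark.sql.hive.metastore.version", "3.1.2")
    .config("spark.sql.hive.metastore.jars", "maven")
    # JDBC connection to the shared metastore database (placeholders).
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:sqlserver://myserver.database.windows.net:1433;database=hms")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
            "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hms_user")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "<secret>")
    .enableHiveSupport()
    .getOrCreate()
)

# Tables registered through this session become visible to any other engine
# configured against the same metastore database.
spark.sql("SHOW DATABASES").show()
```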
- Shared Metastore: The shared HMS approach has a key advantage in being fully supported by the chosen compute engine (and vendor), so there is no requirement to build anything bespoke. This reduces implementation and support risk, but it does limit feature compatibility where vendors diverge from the metastore approach, as is the case with Databricks Unity Catalog. See the shared metastore article in the references for more details.
- Synced Metastore: Where there is appetite for some bespoke development, a synced metastore approach may be more appropriate. In a synced model, there would be a series of metastores, each acting as the master for a different engine of the data estate. Those metastores would then be synced by a bespoke process to provide visibility and interoperability across all the engines. For instance, this can be achieved using Azure SQL Data Sync on external HMS databases, or with a small DDL-replay job like the sketch below.
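Where Azure SQL Data Sync is not an option, the bespoke process can be as simple as replaying DDL between catalogs. A rough sketch, assuming `spark_src` and `spark_tgt` are sessions attached to two different metastores (in practice running on two different engines) and that tables are external, so only metadata moves while the data stays in the shared ADLS Gen2 account:

```python
def sync_database(spark_src, spark_tgt, db: str) -> None:
    """Replay the DDL of every table in `db` from a source catalog to a target one."""
    spark_tgt.sql(f"CREATE DATABASE IF NOT EXISTS {db}")
    for row in spark_src.sql(f"SHOW TABLES IN {db}").collect():
        # SHOW CREATE TABLE returns the full DDL, including the external LOCATION,
        # so the target engine ends up pointing at the same files in ADLS Gen2.
        ddl = spark_src.sql(f"SHOW CREATE TABLE {db}.{row.tableName}").first()[0]
        spark_tgt.sql(ddl.replace("CREATE TABLE", "CREATE TABLE IF NOT EXISTS", 1))
```

A scheduled job running this per database gives one-directional sync; conflict handling and deletes would need additional logic in a production setup.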
For any suggestions or questions, feel free to reach out :)
References:
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics (cidrdb.org)
- Data lakes — Azure Architecture Center | Microsoft Learn
- Data Mesh topologies. Design considerations for building a… | by Piethein Strengholt | Towards Data Science
- Using a Shared Hive Metastore Across Azure Synapse, HDInsight, and Databricks | by Aitor Murguzur | Mar, 2023 | Medium
- Building a Data Lakehouse Using Azure HDInsight | by Aitor Murguzur | Apr, 2023 | Medium