Lakehouse Sharing: Simplifying Data Sharing Across Table Formats

Good things are meant to be shared, especially when it comes to things like data!

Antonio Murgia
Agile Lab Engineering
7 min read · Sep 18, 2023


What’s the point of having a nice and curated Data Lakehouse full of data that is hard to share with your colleagues and partner companies? It’s 2023, shouldn’t this be a piece of cake by now?
Fueled by this spirit and these burning questions, I’ve been diving into Lakehouse Sharing. Because hey, let’s make data sharing as cool and effortless as possible!

A beautiful lake, to be shared with friends, up high in the Dolomites

What is Lakehouse Sharing?

Lakehouse Sharing is an open-source implementation and extension of the Delta Sharing protocol. The goals of Lakehouse Sharing are:

  • Prove that the Delta Sharing protocol can be implemented and easily adapted to table formats other than Delta Lake
  • Offer a practical solution for authentication, authorization, shares, and permission definition; it’s important to note that the Delta Sharing reference implementation relies on completely static permission definitions.

What is Delta Sharing?

If you have never heard about Delta Sharing, it’s pitched as “An open standard for secure data sharing: Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to share data with other organizations regardless of which computing platforms they use.”

As a companion to Delta Lake, Databricks offers Delta Sharing as part of its cloud offering, and a variety of third-party tools support it as of today, along with client libraries for several programming languages (Java, Python, Go, Node.js, R, C++, Rust).
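
To give a concrete taste of the consumer side, here is roughly what reading a shared table looks like with the open-source Python client. This is a minimal sketch: the profile path, endpoint, token, and share/schema/table names are placeholders.

```python
# Minimal sketch using the open-source `delta-sharing` Python client.
# The profile path, endpoint, token and table coordinates are placeholders.
import delta_sharing

# A "profile" file holds the server endpoint and the recipient's bearer token, e.g.:
# {
#   "shareCredentialsVersion": 1,
#   "endpoint": "https://sharing.example.com/delta-sharing/",
#   "bearerToken": "<recipient-token>"
# }
profile = "config.share"

# Discover what the server exposes to this recipient.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Read a shared table into pandas: "<profile>#<share>.<schema>.<table>"
df = delta_sharing.load_as_pandas(f"{profile}#sales_share.default.orders")
print(df.head())
```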

Our views on the Delta Sharing Protocol

At Agile Lab, we strongly believe that the Delta Sharing protocol is an excellent concept with the potential to serve as a key enabler for numerous data initiatives and platforms, including those that are not specifically built on the Databricks platform.

While I won’t delve into the intricacies of Delta Sharing architecture, as it is beyond the scope of this discussion, I would like to highlight a couple of architectural choices that have been made:

One interesting architectural choice of Delta Sharing is how it leverages the capabilities of modern object storage services. The sharing server acts as a single layer for authentication and authorization: it validates access requests and supplies clients with “ready-to-use” pre-signed URLs. This approach eliminates redundant data movement and means the client only needs the ability to read a Delta Lake table and to authenticate solely with Delta Sharing. Consequently, access control becomes significantly simpler.
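
For readers curious about what that exchange looks like on the wire, here is a rough sketch of querying a shared table through the Delta Sharing REST endpoints. The server URL, token, and table coordinates are placeholders.

```python
# Rough sketch of the exchange described above, using the Delta Sharing REST API
# directly. Server URL, token and table coordinates are placeholders.
import json
import requests

ENDPOINT = "https://sharing.example.com/delta-sharing"
HEADERS = {"Authorization": "Bearer <recipient-token>"}

# Ask the server for the files that make up the shared table. The server
# authenticates and authorizes the request, then answers with newline-delimited
# JSON: a protocol line, a metadata line, and one line per data file, each
# carrying a short-lived pre-signed URL that points straight at object storage.
resp = requests.post(
    f"{ENDPOINT}/shares/sales_share/schemas/default/tables/orders/query",
    headers=HEADERS,
    json={"predicateHints": [], "limitHint": 1000},
)
resp.raise_for_status()

for line in resp.text.splitlines():
    action = json.loads(line)
    if "file" in action:
        presigned_url = action["file"]["url"]
        # The client downloads the Parquet file directly from object storage,
        # without ever holding storage credentials of its own.
        parquet_bytes = requests.get(presigned_url).content
```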

The Delta Sharing protocol is built around the principles of Delta Lake. However, it is designed to allow for practical application with various table formats, including Iceberg. This flexibility ensures that a useful subset of the protocol can be leveraged across different table formats, expanding its applicability beyond just Delta Lake.

I would like to give a huge shout-out to Gurunath for his exceptional effort in shaping Lakehouse Sharing. It’s evident that he has taken a remarkable initial stride by laying the foundations of a tool that has the potential to become a pivotal element in data platforms and architectures.

As Gurunath openly acknowledges, Lakehouse Sharing is currently in its infancy as a highly promising proof of concept. To truly flourish, however, it requires a transition from being a project led by a single individual to becoming a community-driven endeavour. In terms of system architecture, Lakehouse Sharing excels by providing lightweight sharing for Iceberg and Delta tables. Moreover, it acts as a proxy, completely decoupling the reader from the Iceberg catalog, further enhancing its capabilities and versatility.

Unfortunately, it isn’t all puppy dogs and rainbows.

Data Sharing Challenges I’ve encountered with Lakehouse Sharing

The Delta Sharing protocol was born with Delta Lake in mind; therefore, its API is heavily influenced by how Delta Lake works.

When limited to Copy-on-Write (CoW) tables and Parquet format, clients can be unaware of the underlying table format because they simply read Parquet files. That’s not the case when dealing with Merge-on-Read (MoR) tables, because so-called “delete files” come into play (for a deep dive on CoW vs. MoR see this excellent article).
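
To make the difference concrete, here is a simplified illustration of the extra work a Merge-on-Read reader has to do. It is not any format’s actual reader: the file names and the (file_path, pos) delete layout loosely follow Iceberg’s position deletes and are made up for the example.

```python
# Simplified illustration of why Merge-on-Read breaks format-agnostic clients:
# besides the Parquet data files, the reader must also fetch "delete files"
# and filter out the logically deleted rows. File names are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

data = pq.read_table("part-00000.parquet")        # rows as originally written
deletes = pq.read_table("delete-00000.parquet")   # columns: file_path, pos

# Collect the row positions that were logically deleted from this data file.
deleted_positions = {
    pos.as_py()
    for path, pos in zip(deletes["file_path"], deletes["pos"])
    if path.as_py() == "part-00000.parquet"
}

# A Copy-on-Write reader could stop at `data`; a Merge-on-Read reader must
# apply the deletes before exposing rows to the user.
mask = pa.array([i not in deleted_positions for i in range(data.num_rows)])
visible = data.filter(mask)
```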

Furthermore, certain features like z-ordering, which are available in both Iceberg and Delta Lake, require specific handling depending on the underlying table format. When abstracting over multiple technologies there are two options: either restrict the exposed features to the smallest common subset, or address each technology individually through format-specific adapters, which preserves functionality but imposes a significant maintenance burden (a rough sketch of the adapter option follows below). It’s an inherent challenge of dealing with multiple technologies while ensuring compatibility without sacrificing functionality.
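
Purely as an illustration of that second option, a per-format adapter layer could look something like the following sketch; every class and method name here is hypothetical.

```python
# Hypothetical sketch of "one adapter per table format" behind a common interface.
from abc import ABC, abstractmethod
from typing import List, Optional


class TableFormatAdapter(ABC):
    """Common surface the sharing server exposes, implemented format by format."""

    @abstractmethod
    def data_files(self, table_uri: str, version: Optional[int] = None) -> List[str]:
        """Return the data files that make up a given table snapshot."""

    @abstractmethod
    def supports(self, feature: str) -> bool:
        """Advertise format-specific features instead of hiding them."""


class DeltaAdapter(TableFormatAdapter):
    def data_files(self, table_uri, version=None):
        raise NotImplementedError("would read the Delta transaction log")

    def supports(self, feature):
        return feature in {"time-travel", "z-ordering", "deletion-vectors"}


class IcebergAdapter(TableFormatAdapter):
    def data_files(self, table_uri, version=None):
        raise NotImplementedError("would read Iceberg metadata and manifests")

    def supports(self, feature):
        return feature in {"time-travel", "z-ordering", "hidden-partitioning"}
```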

Currently, it’s hard to pick sides in the Iceberg vs Delta (vs Hudi) war, and that’s not what we plan to do here. We still believe that common ground between the two (or three) table formats can exist.

Another challenge we encountered while assessing Lakehouse Sharing is its implementation in Python. Leaving aside the occasional lighthearted remarks about Python’s suitability for large-scale projects, the primary concern is that, at present (though not necessarily forever), table formats are developed first for the JVM (Java Virtual Machine), with adapters for other languages like Python and Rust released later. As a result, the Python libraries tend to lag behind their Java counterparts, leaving a significant gap in available features and functionality.

Another current limitation of Lakehouse Sharing is that it supports only one table format at a time. This means that if you intend to share both Delta and Iceberg tables, you need a separate server instance for each format, which restricts simultaneous sharing across table formats and requires managing multiple servers.

Why Are Features Such as UniForm Not Enough?

Delta Lake 3.0 was recently released, and it introduced UniForm. Quoting the Databricks blog: “Teams that use query engines designed to work with Iceberg or Hudi data will be able to read Delta tables seamlessly, without having to copy data over or convert it. Customers don’t have to choose a single format, because tables written by Delta will be universally accessible by Iceberg and Hudi readers. UniForm takes advantage of the fact that all three open lakehouse formats are thin layers of metadata atop Parquet data files. As writes are made, UniForm will incrementally generate this layer of metadata to spec for Hudi, Iceberg and Delta.”
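
For context, on the producer side enabling UniForm boils down to a couple of table properties. The sketch below assumes a Spark session with Delta Lake 3.0 on the classpath, and the table name and schema are made up for the example.

```python
# Minimal sketch of enabling UniForm when creating a Delta table.
# Assumes a Spark session with Delta Lake 3.0 configured; table is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE sales_orders (id BIGINT, amount DOUBLE)
    USING DELTA
    TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
# With UniForm enabled (it also requires column mapping in name mode), each
# commit additionally produces Iceberg metadata alongside the Delta log, so
# Iceberg readers can consume the table without copying or converting data.
```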

Although this is a tremendous feat of engineering and I personally love that Databricks and the Delta Lake community are focusing so much on interoperability, UniForm still has two huge limitations:

  1. Data Producer Constraints
    One of the primary challenges lies in the mandatory usage of Delta Lake for data producers. While this ensures consistency within the ecosystem, it does come with a caveat. By adopting this approach, unique features provided by alternatives such as Iceberg or Hudi become inaccessible. Consequently, users are restricted to leveraging only the features native to Delta Lake.
  2. Divergence in Open-Source Implementation
    Delta Lake is open source, which encourages community contributions and fosters innovation. However, Databricks runtime users must be aware that the version of Delta Lake provided differs from the open-source variant. This disparity may create compatibility issues when trying to consume Delta tables from external sources outside of the Databricks environment.

UniForm certainly improves interoperability in Data Lakehouse architectures, but it does so only to the advantage of Delta Lake. Databricks’ endgame here is to make Delta Lake the underlying storage format and let any Iceberg or Hudi consumer effectively read a Delta Lake table in disguise.

The Caveat with Unity Lakehouse Federation

Again, Databricks “went BIG” at the Data + AI Summit 2023 by announcing “Unity Lakehouse Federation.” This announcement represents the logical progression following Databricks’ acquisition of Okera.

Databricks users should be thrilled with this news, as Unity Lakehouse Federation offers several advantages.

It enables Databricks users to register diverse data sources in Unity and subsequently apply access control, masking, and row-level governance seamlessly, as if these sources were native Unity tables. This level of integration is achievable because Databricks controls the runtime, specifically the Apache Spark distribution, allowing it to apply filters and transformations on the fly before the data reaches user code. The approach taken by Okera, Privacera, and Immuta with their Databricks connectors is quite similar.

However, there is a caveat here as well.

By adopting Unity Lakehouse Federation, you effectively become tied to Databricks as your sole means of accessing the data. Governance policies applied within Databricks may not carry over to other platforms such as Dremio or Trino if you were to access the same data from them. This limitation might be addressed in the future by implementing improvements like policy pushdown to federated data sources. This enhancement could ensure consistent enforcement of governance policies regardless of where the data is accessed.

Consequently, the need to maintain redundant policy definitions across various governance tools and platforms would be eliminated.

For now, this development is excellent news for Databricks users, who gain a portion of Okera’s offerings as part of Databricks itself. However, non-Databricks customers will continue relying on governance platforms like the ones mentioned earlier (Okera, Privacera, Immuta) to meet their governance needs.

What We Are Exploring in the Future

In our quest to contribute a reliable data sharing mechanism to the community, we are currently exploring a couple of approaches to enable the sharing of tables in a Table Format agnostic manner:

  1. Adopting Delta Sharing as the protocol:
    - We are considering contributing to or forking the existing Lakehouse Sharing implementation to address the issues we encountered.
    - Additionally, we aim to contribute to or fork the Delta Sharing clients, enabling support for advanced features of both Delta Lake and Iceberg.
  2. Proposing a new protocol inspired by Delta Sharing:
    - This new protocol would not be built exclusively on top of the Delta API but would accommodate both Delta and Iceberg, as well as future Table Formats.
    - We plan to design the protocol to address real-world requirements, such as the ability to define shares, users, recipients, and other necessary elements at runtime.
    - To support each underlying table format, we will maintain clients for this new protocol in the major languages and frameworks.

By exploring these avenues, we aim to contribute a trustworthy and versatile tool that facilitates seamless sharing of tables, regardless of the specific Table Format used.

Antonio Murgia
Agile Lab Engineering

Computer science engineer, scala addict, coding enthusiast, distributed systems lover... I also like beers, wine and good food. Quite a good chef after all.