How to Use Delta Sharing with Google Colab

Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to securely share massive amounts of data with other organizations regardless of which computing platforms or cloud storage they use.

Frank Munz
Geek Culture
4 min read · Aug 12, 2021


Delta Sharing

Delta Sharing is an open source framework under the Linux Foundation. Picture it as a modern way of sharing massive amounts of live data from your data lake: on-premises, in the cloud, or hybrid. Because basically any receiver that supports pandas or Spark can consume the data, there is no vendor lock-in.

Delta Sharing 0.2.0

At the time of this writing, the current release of Delta Sharing is version 0.2.0.

Getting Started

As a quick smoke test, let me show you how to create a client (sometimes also called a ‘receiver’) that reads data from an existing Delta Sharing server.

I enjoy being biased :-), so I recommend using a Databricks workspace for this. Alternatively, you can also use Amazon EMR, Google Dataproc, Google Colab, a Jupyter notebook on your laptop, or a plain old standalone Python program.

Open Protocol / No vendor lock-in

Do you remember when Kelsey (working at Google) did the AWS Lambda demo at CNCF? Time to pay that back. So to prove my point that Delta Sharing is not tied to Databricks, I will go for Google Colab. My client implementation will talk to a demo server hosted by Databricks that contains public datasets.

Access or Create a Delta Sharing Server

If you want to replicate the example below, simply point your code to the same demo server hosted by Databricks to get started easily.
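For reference, a Delta Sharing profile file is a small JSON document that tells the client where the server lives and how to authenticate. A minimal sketch following the profile format from the Delta Sharing documentation; the endpoint and token below are placeholders, not real credentials:

```json
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<your-bearer-token>"
}
```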

Of course, you could also spin up your own Delta Sharing server. That is easy, since Databricks provides a reference implementation of the server. With version 0.2.0 of Delta Sharing, the sharing server is also available as a Docker image on Docker Hub.
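If you go that route, the reference server reads a small YAML config that declares which tables to expose. A minimal sketch based on the config format in the delta-sharing README; the share, schema, and table names as well as the storage location are placeholders you would replace with your own:

```yaml
# delta-sharing-server-config.yaml (sketch; names and paths are placeholders)
version: 1
shares:
- name: "my_share"
  schemas:
  - name: "default"
    tables:
    - name: "my_table"
      location: "s3a://my-bucket/path/to/delta-table"
```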

The third and certainly most comfortable option is to simply share data from your Databricks workspace.

Step by Step Instructions

To get started, go to Google Colab, and open a new notebook.

In the first cell of the notebook, install the delta-sharing Python package using pip so the package can be used in the notebook:

!pip install delta-sharing

Then, in the second cell, a few lines of code are enough to create a Delta Sharing client pointing to the server endpoint defined in the profile file, list all the shared tables, load data into a pandas DataFrame, and display a filtered subset of the data.


import delta_sharing

# Point to the profile file; the location also works with http(s).
profile_file = "https://raw.githubusercontent.com/delta-io/delta-sharing/main/examples/open-datasets.share"

# Create a SharingClient.
client = delta_sharing.SharingClient(profile_file)

# List all shared tables.
print(client.list_all_tables())

# Load data as a pandas DataFrame (or as a Spark DataFrame).
table_url = profile_file + "#delta_sharing.default.owid-covid-data"
data = delta_sharing.load_as_pandas(table_url)

# Display filtered data.
print(data[data["iso_code"] == "USA"].head(10))
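The table URL in the snippet above follows a simple scheme: the profile file location, a `#`, then `share.schema.table`. A small helper makes the scheme explicit; note that this function is my own illustration and not part of the delta-sharing API:

```python
def table_url(profile_file: str, share: str, schema: str, table: str) -> str:
    """Build a Delta Sharing table URL: <profile>#<share>.<schema>.<table>."""
    return f"{profile_file}#{share}.{schema}.{table}"

# The URL used above, assembled from its parts:
url = table_url("open-datasets.share", "delta_sharing", "default", "owid-covid-data")
print(url)  # open-datasets.share#delta_sharing.default.owid-covid-data
```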

After running the code, your output should look as follows:


Data Mesh

There is a lot of talk these days about data meshes. In this article, I would like to boil the concept down to two core building blocks.

Note how simple it is to access various external data lakes with a single line of code using the Delta Sharing client:

client1 = delta_sharing.SharingClient(profile_file1)
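To sketch what this looks like with several upstream lakes: each lake is just one more profile file. The snippet below uses only the standard library to validate hypothetical profiles before handing them to a client; the lake names, endpoints, and tokens are made up, while the field names follow the Delta Sharing profile format:

```python
# Hypothetical profiles for two upstream data lakes; in practice each would
# live in its own .share file passed to delta_sharing.SharingClient().
PROFILES = {
    "sales_lake": {
        "shareCredentialsVersion": 1,
        "endpoint": "https://sales.example.com/delta-sharing/",
        "bearerToken": "<token>",
    },
    "iot_lake": {
        "shareCredentialsVersion": 1,
        "endpoint": "https://iot.example.com/delta-sharing/",
        "bearerToken": "<token>",
    },
}

REQUIRED_FIELDS = {"shareCredentialsVersion", "endpoint", "bearerToken"}

def is_valid_profile(profile: dict) -> bool:
    # A client needs all three fields before it can talk to the server.
    return REQUIRED_FIELDS <= profile.keys()

for name, profile in PROFILES.items():
    print(name, "ok" if is_valid_profile(profile) else "incomplete")
```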

Data Governance

However, a data mesh is more than the technical ability to share data externally. It is much more about what happens within your enterprise: how you govern the various data lakes in your company.

This is where another product, Unity Catalog, enters the picture. It is the missing piece that lets you audit, secure, and manage access to your data across all your workspaces and across clouds. It works on databases, tables, views, rows, and columns using standard SQL, instead of low-level, complex, and cloud-specific IAM roles on files.

So the takeaway for you: sharing data from your lakehouse is easy with Delta Sharing, and Delta Sharing and Unity Catalog play well together. I would be happy to cover that in more detail in another post; let me know if you are interested!

More?

Please clap for this article or share it on social media if you enjoyed reading it. You can follow me for more data science, data engineering, and AI/ML-related news on Twitter: @frankmunz.

Frank Munz
Cloudy things, large-scale data & compute. Twitter @frankmunz. Former Tech Evangelist @awscloud, now Principal at Databricks. Personal opinions here. #devrel ❤️.