Azure Synapse — How to use Delta Sharing?
A couple of days ago I wrote an article about how to set up Delta Sharing on Azure [here is the link], and as promised I will keep digging into the subject. So today I am sharing this brief blog on how to use Azure Synapse Analytics to query a Lakehouse stored as Delta tables and shared by a Delta Sharing server.
As you can see in the diagram above, we will be using an Azure Synapse Analytics Spark Pool as the data recipient. We will also take advantage of these two components:
+) Apache Spark Connector: An Apache Spark connector that implements the Delta Sharing Protocol to read shared tables from a Delta Sharing Server.
+) Python Connector: A Python library that implements the Delta Sharing Protocol to read shared tables as pandas or Apache Spark DataFrames.
As my grandmother used to say, it is very hard to shave an egg 😜: these two connectors have a few system requirements that the Azure Synapse Spark Pool must meet before it can read the Delta Sharing tables.
REQUIREMENTS
+) Data Recipient side
For the Python connector: Python 3.6+
For the Apache Spark connector: Java 8+, Scala 2.12.x, Apache Spark 3+
=> Fortunately, Azure Synapse Analytics added support for Spark 3.0 this year, and we can even create a pool with version 3.1 (see the image below), which satisfies all the requirements for the two connectors.
+) Data Provider side
You can take a look at my previous blog for more details on how to set up Delta Sharing on Azure.
Once our data provider is ready to handle the data recipient's requests, we can start testing the two connectors.
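For reference, both connectors authenticate against the Delta Sharing server through a profile file issued by the data provider. Below is a minimal sketch of such a profile; the endpoint URL and bearer token are placeholders you have to replace with the values generated by your own server:
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://<your-delta-sharing-server>/delta-sharing/",
  "bearerToken": "<your-token>"
}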
SYNAPSE — APACHE SPARK CONNECTOR FOR DELTA SHARING
The documentation of the Delta Sharing project says that in order to use the Apache Spark connector we have to set up and run a Maven/sbt project or launch the Spark shell (PySpark/Scala) interactively. Sadly, neither of these two options is suitable for our use case based on the Synapse Analytics Spark Pool.
With a bit of detective work 😄 I found in the Synapse documentation that we can load Apache Spark packages from the Maven repository into our Spark Pool:
+) Manually, by downloading the jar files from the Maven repository and attaching them to the Azure Synapse workspace (to be shared with all pools) or to the Spark Pool directly. [see here for more details]
You can download the latest version of the jar file for the package io.delta:delta-sharing-spark_2.12:0.3.0 from here, then attach it to the workspace or the Spark Pool.
+) At runtime (at the Spark session level), by using the Spark session config magic command %%configure. [see here for more details]
Copy and paste the instruction below into your notebook:
%%configure -f \{"conf": {"spark.jars.packages": "io.delta:delta-sharing-spark_2.12:0.3.0"}}
Once you have loaded the needed Apache Spark package, you can start submitting queries to read the data lake tables shared by the Delta Sharing server.
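As a rough sketch of what such a query looks like (the storage path, share, schema and table names below are placeholders for your own setup), the Spark connector reads a shared table through the deltaSharing data source:
# Path to the Delta Sharing profile file (placeholder: point it to wherever you stored the file)
profile_path = "abfss://data@<your-storage-account>.dfs.core.windows.net/config/datasets.share"

# Table coordinates follow the pattern <profile>#<share>.<schema>.<table>
table_url = profile_path + "#my_share.my_schema.my_table"

# Read the shared Delta table through the Delta Sharing Spark connector
df = spark.read.format("deltaSharing").load(table_url)
df.show(10)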
SYNAPSE — PYTHON CONNECTOR FOR DELTA SHARING
For the Python connector we just need to install the delta_sharing Python library. Using the instruction below we can check whether it is already installed on the Spark Pool:
import pkg_resources

# List all Python packages installed on the Spark Pool
for d in pkg_resources.working_set:
    print(d)
By default, Apache Spark in Azure Synapse Analytics has a full set of libraries for common data engineering, data preparation, machine learning, and data visualization tasks. When a Spark instance starts up, these libraries are automatically included. Moreover, extra Python and custom-built packages can be added at the Spark pool and session level. [see here for more details]
Let's test the pool libraries installation (i.e. at the Spark pool level); you will have to provide a requirements.txt or a Conda environment.yml environment specification to install packages from repositories like PyPI, Conda-Forge, and more:
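For instance, a minimal requirements.txt for our scenario could contain nothing more than the line below (the version pin is only an example, match it to your server):
delta-sharing==0.3.0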
Once the pool is ready and has finished installing the Python library (the installation can be monitored like this), you can start reading data from the Delta Lake already shared by the Delta Sharing server:
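Here is a minimal sketch using the delta_sharing library; again, the profile path and the share/schema/table names are placeholders for your own environment:
import delta_sharing

# Profile file issued by the Delta Sharing server (placeholder path)
profile_path = "/tmp/datasets.share"

# List every table the provider has shared with us
client = delta_sharing.SharingClient(profile_path)
print(client.list_all_tables())

# Load one shared table as a pandas DataFrame
table_url = profile_path + "#my_share.my_schema.my_table"
pdf = delta_sharing.load_as_pandas(table_url)
print(pdf.head())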
That's it for this article! As always, I am happy to respond to your questions and comments.