Deploy a Delta Sharing Server on Azure
If you’ve been following along in this series, we’ve previously deployed a Delta Sharing server on AWS. Providing a similar tutorial for deploying a Delta Sharing server on Azure is only fitting. So let’s get started!
Prerequisites
Before we jump into things, let’s double-check that you have everything you’ll need to launch a Delta Sharing server in the Azure cloud.
- Azure account — You’ll need an Azure account and subscription to follow along with the examples in this article. Don’t worry if you don’t already have an account. It’s free to sign up, and you’ll have access to free-tier-eligible resources for 12 months. At the time of this writing, Microsoft is even offering a $200 credit for Azure services within the first 30 days.
- Postman (optional) — We’ll use Postman to send a few API requests to our Delta Sharing server. Postman provides a friendly UI for sending API requests. However, a simple cURL request from a terminal will work just as well.
A Simple Data Sharing Architecture
Delta Sharing is a simple protocol for securely sharing large datasets in real-time. As such, the architecture is simple as well, consisting of two major parts:
- A data provider — The data provider shares one or more datasets. The data provider can control what subset of the data is shared, with whom it is shared, and for how long.
- A data recipient — The data recipient receives access to the datasets. The data recipient can use a Delta Sharing client in the language or BI tool they prefer to access the shared data.
Create a Shared Dataset
Step 1: Create a Resource Group
Let’s begin by logging into the Azure portal. We’ll first need to create a new resource group which will organize all of the resources for the Delta Sharing server. From the top menu bar select “Resource groups” or from the search bar search for “Resource groups”. Next, click on “+ Create” to begin creating a new resource group.
In the resource group name text box, enter a meaningful name like “delta_sharing_rg”. Then click the “Review + create” button at the very bottom.
Once the validation checks pass, click on the “Create” button at the very bottom left to create the resource group. Finally, once the resource group has been created, navigate to the resource group by clicking on the “Go to resource group” button.
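If you prefer working from a terminal, you can create the same resource group with the Azure CLI. This is an optional sketch; the location (“eastus”) is an assumption, so substitute whichever region you’re using.
$ az group create --name delta_sharing_rg --location eastus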
Step 2: Create a Storage Account
Next, we’ll create a new storage account to store the shared datasets. Feel free to use an existing storage account if you already have an Azure Data Lake Storage (ADLS) Gen2-compatible storage account.
From the top search bar, search for “Storage accounts” or select the “Storage accounts” Azure service from the top menu bar of the portal home screen. Click on “+ Create” to begin creating a new storage account. Next, select the resource group you created in the previous step from the drop-down. Enter a meaningful name in the storage account name text box, like “deltasharingdatasets”. Click the “Next: Advanced >” button at the bottom right of the dialog box.
On the “Advanced” page, select the checkbox to enable a hierarchical namespace. ADLS Gen2 is recommended for all big data workloads and boosts query performance.
Click the “Review” button to accept the remaining defaults. Once all validation checks pass, create the storage account by clicking on the “Create” button on the bottom left. Next, navigate to the newly created storage account.
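Alternatively, the storage account can be created from the Azure CLI. Treat this as a sketch; the SKU, kind, and location are assumptions you may want to change. The important part is the “--hns true” flag, which enables the hierarchical namespace (ADLS Gen2).
$ az storage account create \
    --name deltasharingdatasets \
    --resource-group delta_sharing_rg \
    --location eastus \
    --sku Standard_LRS \
    --kind StorageV2 \
    --hns true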
Step 3: Create a Storage Container
After navigating to the newly created storage account, click on “Containers” to view the account’s storage containers. Next, click “+ Container” to create a new storage container. Give the storage container a meaningful name like “shareddatasets”.
Click the “Create” button on the bottom to create a new storage container.
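Here is the CLI equivalent, if you’d rather skip the portal. Note that “--auth-mode login” assumes your Azure AD identity has a data-plane role such as Storage Blob Data Contributor on the account; otherwise authenticate with an account key instead.
$ az storage container create \
    --name shareddatasets \
    --account-name deltasharingdatasets \
    --auth-mode login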
Step 4: Upload a Sample Dataset
Next, we’ll upload a sample dataset to the newly created storage container. If you don’t already have a dataset to share, feel free to use the sample dataset that accompanies this article by cloning the GitHub repo.
From the Azure portal, navigate to the storage container created in the previous step. Click on the “Upload” button to open the upload dialog screen. Next, navigate to the location of the sample Delta table after cloning the dataset or select your own existing Delta table to upload. Click the upload button.
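If the Delta table contains many files, uploading from the CLI can be easier than the portal. The following is a minimal sketch using az storage blob upload-batch; the local source path is a placeholder for wherever you cloned the sample dataset, and “--auth-mode login” again assumes a data-plane role on the account.
$ az storage blob upload-batch \
    --account-name deltasharingdatasets \
    --destination shareddatasets \
    --source ./<path/to/local/delta-table> \
    --auth-mode login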
Note: At the time of this writing, the Delta Sharing protocol only supports sharing datasets stored using the Delta Lake format.
Launch a Virtual Machine
One of the more powerful design choices of Delta Sharing is that none of the shared data traverses the sharing server. Instead, the Delta Sharing server generates a list of pre-signed file URLs that answer a Delta Sharing query. Consequently, the Delta Sharing server can be a relatively small virtual machine.
Step 5: Launch a Virtual Machine for the Sharing Server
From the Azure portal home screen, navigate to the Virtual Machines service by clicking on “Virtual machines” in the top navigation bar or by searching for “Virtual machines” in the search bar at the top.
Next, click the “+ Create” button to begin creating a new virtual machine (VM). Select the resource group you created in the first step. Give the virtual machine an appropriate name, such as “delta-sharing-server”. Select Ubuntu 20.04 LTS as the operating system and ensure that inbound port 22 is open on the virtual machine (we’ll open an SSH connection to the VM in the next step).
Lastly, select “Generate a new key pair” to download a private key for accessing the virtual machine in the next step.
Accept all the remaining defaults by clicking on the “Review + create” button at the very bottom left. Once all the validation steps have passed, click the “Create” button to provision the virtual machine.
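For reference, a roughly equivalent VM can be provisioned from the CLI. Treat this as a sketch: the image URN, VM size, and admin username are assumptions, so confirm the Ubuntu 20.04 LTS image available in your region (for example with az vm image list) before running it.
$ az vm create \
    --resource-group delta_sharing_rg \
    --name delta-sharing-server \
    --image Canonical:0001-com-ubuntu-server-focal:20_04-lts-gen2:latest \
    --size Standard_B1s \
    --admin-username deltasharingadmin \
    --generate-ssh-keys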
Install the Delta Sharing Server
Step 6: Install the Latest Delta Sharing Release
Now that you’ve launched a new virtual machine, the next step is to open an SSH connection to the machine and install the latest Delta Sharing release.
First, navigate to the newly created virtual machine by clicking on the server name. Note the public IP address of the virtual machine (you will need the public IP address of the VM in later steps). Next, open a new bash terminal and start a new SSH connection to the VM, specifying the location of the key pair file (“.pem” file) downloaded in the previous step.
$ chmod 400 <path/to/keyfile>.pem
$ ssh -i <path/to/keyfile>.pem deltasharingadmin@<public IP address of the VM>
Before installing the sharing server, we’ll need to apply a few package updates and install the unzip utility and a Java runtime (OpenJDK 8).
$ sudo apt-get update
$ sudo apt-get install zip unzip
$ sudo apt-get install openjdk-8-jdk
Next, use the wget utility to download the latest release of Delta Sharing and unzip the file contents. At the time of this writing, the latest release is Delta Sharing 0.6.2. Information regarding the Delta Sharing releases can be found on the public GitHub repo.
$ wget https://github.com/delta-io/delta-sharing/releases/download/v0.6.2/delta-sharing-server-0.6.2.zip
$ unzip delta-sharing-server-0.6.2.zip
Step 7: Update the Server Configuration
The server configuration defines important server attributes like the location of the shared datasets, the server endpoint, and an access token.
We’ll need to update the server configuration with the location of the Delta table on Azure Data Lake Storage. Begin by changing directories to the server conf directory. Next, rename the sample configuration file by removing the “.template” extension.
$ cd delta-sharing-server-0.6.2/conf
$ mv delta-sharing-server.yaml.template delta-sharing-server.yaml
$ vim delta-sharing-server.yaml
Next, open the server configuration file using your favorite editor and replace the default shares configuration block with the following. Ensure that you’ve updated the location with the storage account, container, and any sub-folders created in Steps 2 through 4.
shares:
- name: "world-cup-share"
  schemas:
  - name: "default"
    tables:
    - name: "world-cup-2022-dataset"
      location: "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<subfolder(s)>"
Lastly, add an access token by appending an “authorization” block with a “bearerToken” entry at the end of the server configuration file. Data recipients will use this access token to send API requests to the sharing server.
authorization:
  bearerToken: "ajkexOH1H"
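The token shown here is just a placeholder for this walkthrough. For anything beyond a quick test, you might generate a longer random value instead, for example:
$ openssl rand -hex 16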
The final configuration file should look similar to the following:
# The format version of this config file
version: 1
# Config shares/schemas/tables to share
shares:
- name: "world-cup-share"
  schemas:
  - name: "default"
    tables:
    - name: "world-cup-2022-dataset"
      location: "abfss://shareddatasets@deltasharingdatasets.dfs.core.windows.net/<subfolder(s)>"
# Set the hostname that the server will use
host: "localhost"
# Set the port that the server will listen on. Note: using ports below 1024
# may require a privileged user in some operating systems.
port: 8080
# Set the URL prefix for the REST APIs
endpoint: "/delta-sharing"
# Set the timeout of S3 presigned URL in seconds
preSignedUrlTimeoutSeconds: 3600
# How many tables to cache in the server
deltaTableCacheSize: 10
# Whether we can accept working with a stale version of the table. This is useful when sharing
# static tables that will never be changed.
stalenessAcceptable: false
# Whether to evaluate user-provided `predicateHints`
evaluatePredicateHints: false
# The data recipient access token
authorization:
  bearerToken: "ajkexOH1H"
Step 8: Set the Shared Access Key
Behind the scenes, the Delta Sharing server uses the hadoop-azure library to read a shared Delta table’s transaction log and data files from Azure Data Lake Storage. This library supports using a Shared Key to authenticate to ADLS Gen2.
As a result, you will need to add a new Hadoop configuration file, “core-site.xml”, containing the storage account key.
First, copy the storage account key by navigating to the storage account created in the second step and clicking on “Access keys” under the “Security + networking” section. Click on the “Show” button to reveal the value of “key1” and copy its value.
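You can also retrieve the key from the CLI rather than the portal. This sketch pulls the value of the first key for the storage account created earlier:
$ az storage account keys list \
    --resource-group delta_sharing_rg \
    --account-name deltasharingdatasets \
    --query "[0].value" \
    --output tsv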
Next, back in your bash terminal, navigate to the sharing server’s conf directory and create a new file called “core-site.xml”.
$ touch core-site.xml
$ vim core-site.xml
Then add the following contents to the newly created “core-site.xml” file. Ensure that you’ve updated the storage account name and storage account key value.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.azure.account.auth.type.STORAGE-ACCOUNT-NAME.dfs.core.windows.net</name>
    <value>SharedKey</value>
    <description>
    </description>
  </property>
  <property>
    <name>fs.azure.account.key.STORAGE-ACCOUNT-NAME.dfs.core.windows.net</name>
    <value>STORAGE-ACCOUNT-KEY</value>
    <description>
      The secret password.
    </description>
  </property>
</configuration>
Step 9: Start the Server
Finally, navigate back to the Delta Sharing server root directory and start the server by running the following command:
$ cd ..
$ bin/delta-sharing-server --config conf/delta-sharing-server.yaml
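Note that the command above runs the server in the foreground, so it will stop when you close the SSH session. For a longer-lived test you could run it in the background instead. The following is a quick sketch using nohup; a process manager such as systemd would be a more robust choice for anything real.
$ nohup bin/delta-sharing-server --config conf/delta-sharing-server.yaml > delta-sharing.log 2>&1 &
$ tail -f delta-sharing.log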
Step 10: Add an Inbound Port Rule
Before we can submit any API requests to the Delta Sharing server, we need to add a custom TCP rule that will allow incoming traffic on the port that the server is listening on. From the Azure portal, navigate to the Delta Sharing VM and click on the “Networking” tab. Click on the “Add inbound port rule” button and add a new custom TCP rule to allow all traffic on port 8080.
Click the “Add” button to add a new inbound port rule.
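If you’d rather not click through the portal, az vm open-port adds an equivalent rule to the VM’s network security group. The priority value here is an assumption; pick any priority not already in use.
$ az vm open-port \
    --resource-group delta_sharing_rg \
    --name delta-sharing-server \
    --port 8080 \
    --priority 1010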
Sending a Delta Sharing Request
We can test the new Delta Sharing server by sending an API request.
Step 11: List All Available Tables
Begin by opening Postman and creating a new request. Click on the “Authorization” tab, select “Bearer Token” from the drop-down, and enter the access token value you set in the sharing server configuration file. Next, select “GET” as the HTTP method and, in the URL bar, enter the public IP address of the virtual machine followed by the port number (8080 in this case) and the API endpoint for listing all tables under the “world-cup-share”. The full URL should look similar to the following:
http://<vm_public_ip_address>:8080/delta-sharing/shares/world-cup-share/all-tables
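If you’re not using Postman, an equivalent cURL request works just as well. The token and IP address below are the placeholders used throughout this article; substitute your own values.
$ curl -H "Authorization: Bearer ajkexOH1H" \
    http://<vm_public_ip_address>:8080/delta-sharing/shares/world-cup-share/all-tables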
Step 12: Query the Shared Delta Table
Lastly, we can submit a sharing query request to retrieve a list of pre-signed file URLs. A Delta Sharing client, like the Python connector, executes a similar API request when a function like load_as_pandas() is invoked and the Delta table is loaded as a Pandas DataFrame. From the request type drop-down, select “POST” as the HTTP method. Next, update the URL to call the query table API endpoint. The URL should look similar to the following:
http://<vm_public_ip_address>:8080/delta-sharing/shares/world-cup-share/schemas/default/tables/world-cup-2022-dataset/query
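The same request with cURL looks like the following. The empty JSON body asks for all of the table’s files; per the Delta Sharing protocol, you can optionally include fields such as predicateHints or limitHint to narrow the result.
$ curl -X POST \
    -H "Authorization: Bearer ajkexOH1H" \
    -H "Content-Type: application/json" \
    -d '{}' \
    http://<vm_public_ip_address>:8080/delta-sharing/shares/world-cup-share/schemas/default/tables/world-cup-2022-dataset/query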
Conclusion
Congratulations! You’ve now deployed your very own Delta Sharing server in Azure. By now, you should be able to:
- Upload a shared dataset on Azure Data Lake Storage Gen2
- Install the latest release of the Delta Sharing reference server
- Configure access to the shared dataset(s) using a Shared Access Key
- Submit API requests to the sharing server to list all shared tables and query the shared table files
While this example of the Delta Sharing server isn’t quite ready for production, you should now have a basic understanding of how the Delta Sharing server works at a high level and an appreciation of the strengths of the Delta Sharing protocol.
Stay tuned for future articles where we’ll add security using an API gateway, add a custom domain name, and explore a serverless architecture for the Delta Sharing server. Thanks for reading!
Get Started with Delta Sharing for Python
Want to get started using the Delta Sharing connector for Python, but don’t know where to begin? Check out these quickstart examples using Delta Sharing for Python and Apache Spark.