
Deploy a Delta Sharing Server on Azure

11 min read · Jan 6, 2023


If you’ve been following along in this series, we’ve previously deployed a Delta Sharing server on AWS. Providing a similar tutorial for deploying a Delta Sharing server on Azure is only fitting. So let’s get started!

Fig. 1.1 — In this article, we’ll be covering the steps for deploying a Delta Sharing server on Azure.

Prerequisites

Before we jump into things, let’s double-check that you have everything you’ll need to launch a Delta Sharing server in the Azure cloud.

  • Azure account — You’ll need an Azure account and subscription to follow along with the examples in this article. Don’t worry if you don’t already have an account. It’s free to sign up, and you’ll have access to free-tier eligible resources for 12 months. At the time of this writing, Microsoft is even offering a $200 credit for Azure services within the first 30 days.
  • Postman (optional) — We’ll use Postman to send a few API requests to our Delta Sharing server. Postman provides a friendly UI for sending API requests. However, a simple cURL request from a terminal will work just as well.

A Simple Data Sharing Architecture

Fig. 1.2 — The Delta Sharing architecture consists of two parts: (1) the data provider, who shares the dataset(s), and (2) the data recipient, who receives access to the shared dataset(s).

Delta Sharing is a simple protocol for securely sharing large datasets in real time. The architecture is correspondingly simple, consisting of two major parts:

  1. A data provider — The data provider shares one or more datasets. The data provider can control what subset of the data is shared, with whom the data is shared, and for how long.
  2. A data recipient — The data recipient receives access to the datasets. The data recipient can use a Delta Sharing client in the language or BI tool they prefer to access the shared data.

Create a Shared Dataset

Step 1: Create a Resource Group

Let’s begin by logging into the Azure portal. We’ll first need to create a new resource group which will organize all of the resources for the Delta Sharing server. From the top menu bar select “Resource groups” or from the search bar search for “Resource groups”. Next, click on “+ Create” to begin creating a new resource group.

Fig. 1.3 — Create a new resource group by navigating to the Resource groups Azure service and clicking “+ Create”.

In the resource group name text box, enter a meaningful name like “delta_sharing_rg”. Click the “Review + create” button at the very bottom.

Fig. 1.4 — Creating a resource group is a great way to organize all of the cloud resources for the Delta Sharing server.

Once the validation checks pass, click on the “Create” button at the very bottom left to create the resource group. Finally, once the resource group has been created, navigate to the resource group by clicking on the “Go to resource group” button.

Step 2: Create a Storage Account

Next, we’ll create a new storage account to store the shared datasets. Feel free to use an existing storage account if you already have an Azure Data Lake Storage (ADLS) Gen2-compatible storage account.

Fig. 2.1 — Navigate to “Storage accounts” by clicking on the “Storage accounts” Azure service from the top menu bar or by searching for “Storage accounts” in the top search bar.

From the top search bar, search for “Storage accounts” or select the “Storage accounts” Azure service from the top menu bar of the portal home screen. Click on “+ Create” to begin creating a new storage account. Next, select the resource group you created in the previous step from the drop-down. Enter a meaningful name in the storage account name text box, like “deltasharingdatasets”. Click the “Next: Advanced >” button at the bottom right of the dialog box.

Fig. 2.2 — Create a new storage account in the resource group you created in the previous step.

On the “Advanced” page, select the checkbox to enable a hierarchical namespace. This makes the account an ADLS Gen2 storage account, which is recommended for big data workloads and improves query performance.

Fig. 2.3 — Enable a hierarchical namespace for the new storage account to improve query performance.

Click the “Review” button to accept the remaining defaults. Once all validation checks pass, create the storage account by clicking on the “Create” button on the bottom left. Next, navigate to the newly created storage account.

Step 3: Create a Storage Container

After navigating to the newly created storage account, click on “Containers” to view its storage containers. Next, click “+ Container” to create a new storage container. Give the storage container a meaningful name like “shareddatasets”.

Fig. 3.1 — Create a storage container within the newly created storage account that will hold the shared datasets.

Click the “Create” button on the bottom to create a new storage container.

Step 4: Upload a Sample Dataset

Next, we’ll upload a sample dataset to the newly created storage container. If you don’t already have a dataset to share, feel free to use the sample dataset that accompanies this article by cloning the GitHub repo.

From the Azure portal, navigate to the storage container created in the previous step. Click on the “Upload” button to open the upload dialog screen. Next, navigate to the location of the sample Delta table in your local clone of the repo, or select your own existing Delta table to upload. Click the upload button.

Fig. 4.1 — Upload the sample Delta table that accompanies the blog article to the newly created storage container.

Note: At the time of this writing, the Delta Sharing protocol only supports sharing datasets stored using the Delta Lake format.

Launch a Virtual Machine

Fig. 5.1 — Behind the scenes, the Delta Sharing server uses Shared Key authentication to access the data files on Azure Data Lake Storage.

One of the more powerful design choices of Delta Sharing is that none of the shared data traverses the sharing server. Instead, the Delta Sharing server generates a list of pre-signed file URLs that answer a Delta Sharing query. Consequently, the Delta Sharing server can be a relatively small virtual machine.
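
To make the idea of a pre-signed URL more concrete, here is a minimal Python sketch of the general technique: a signer combines a file path and an expiry time, signs them with a secret key, and the storage side checks the signature and expiry before serving the file. This is not Azure's actual signing algorithm, nor the Delta Sharing server's code. The function names, the `SECRET_KEY` value, and the `expires` and `sig` query parameters are all hypothetical and chosen only for illustration.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical secret known only to the signer and the storage service.
SECRET_KEY = b"demo-secret-key"


def _signature(path: str, expires: int) -> str:
    """HMAC-SHA256 over the file path and the expiry timestamp."""
    message = f"{path}\n{expires}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def presign(base_url: str, path: str, ttl_seconds: int,
            now: Optional[float] = None) -> str:
    """Return a URL that grants access to `path` for `ttl_seconds`."""
    issued = time.time() if now is None else now
    expires = int(issued) + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{base_url}{path}?{query}"


def is_valid(url: str, now: Optional[float] = None) -> bool:
    """Check that the URL has not expired and has not been tampered with."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    return hmac.compare_digest(sig, _signature(parts.path, expires))
```

Because the recipient downloads each file directly from storage using such a URL, the sharing server only has to produce the list of links, which is why a small VM is enough.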

Step 5: Launch a Virtual Machine for the Sharing Server

From the Azure portal home screen, navigate to the Virtual Machines service by clicking on “Virtual machines” in the top navigation bar or by searching for “Virtual machines” in the search bar at the top.

Fig. 5.2 — Navigate to the virtual machines service by clicking on “Virtual machines” in the top navigation bar or by searching for the service in the top search bar.

Next, click the “+ Create” button to begin creating a new virtual machine (VM). Select the resource group you created in the first step. Give the virtual machine an appropriate name, such as “delta-sharing-server”. Select Ubuntu 20.04 LTS as the operating system and ensure that inbound port 22 is open on the virtual machine (we’ll open an SSH connection to the VM in the next step).

Fig. 5.3 — Create a new virtual machine within the resource group created in the very first step. The VM size can be left at the default VM type since the Delta Sharing server doesn’t need to be very large.

Lastly, select “Generate a new key pair” to download a private key for accessing the virtual machine in the next step.

Fig. 5.4 — Generate a new key pair that will be used to open an SSH connection to the virtual machine in the next step.

Accept all the defaults by clicking on the “Review + create” button at the very bottom left. Once all the validation steps have passed, click the “Create” button to provision the virtual machine.

Install the Delta Sharing Server

Step 6: Install the Latest Delta Sharing Release

Now that you’ve launched a new virtual machine, the next step is to open an SSH connection to the machine and install the latest Delta Sharing release.

First, navigate to the newly created virtual machine by clicking on the server name. Note the public IP address of the virtual machine (you will need the public IP address of the VM in later steps). Next, open a new bash terminal and start a new SSH connection to the VM, specifying the location of the key pair file (“.pem” file) downloaded in the previous step.

$ chmod 400 <path/to/keyfile>.pem
$ ssh -i <path/to/keyfile>.pem deltasharingadmin@<public IP address of the VM>

Before we get started, we’ll need to update the package index and install the zip and unzip utilities and Java 8 (OpenJDK).

$ sudo apt-get update
$ sudo apt-get install zip unzip
$ sudo apt-get install openjdk-8-jdk

Next, use the wget utility to download the latest release of Delta Sharing and unzip the file contents. At the time of this writing, the latest release is Delta Sharing 0.6.2. Information regarding the Delta Sharing releases can be found on the public GitHub repo.

$ wget https://github.com/delta-io/delta-sharing/releases/download/v0.6.2/delta-sharing-server-0.6.2.zip
$ unzip delta-sharing-server-0.6.2.zip

Step 7: Update the Server Configuration

The server configuration defines important server attributes like the location of the shared datasets, the server endpoint, and an access token.

We’ll need to update the server configuration with the location of the Delta table on Azure Data Lake Storage. Begin by changing directories to the server conf directory. Next, rename the sample configuration file by removing the “.template” extension.

$ cd delta-sharing-server-0.6.2/conf
$ mv delta-sharing-server.yaml.template delta-sharing-server.yaml
$ vim delta-sharing-server.yaml

With the server configuration file open in your favorite editor (vim is used above), replace the default shares configuration block with the following. Ensure that you’ve updated the location with the storage container and any sub-folders created in the earlier steps.

shares:
- name: "world-cup-share"
  schemas:
  - name: "default"
    tables:
    - name: "world-cup-2022-dataset"
      location: "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<subfolder(s)>"
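
The location string is easy to mistype, so here is a small Python sketch that assembles it from its parts. The helper name `abfss_location` is hypothetical, and the sub-folder used in the usage note is only an assumed example, not a path from this article.

```python
def abfss_location(container: str, storage_account: str, path: str = "") -> str:
    """Build an abfss:// URI for a Delta table directory on ADLS Gen2."""
    if not container or not storage_account:
        raise ValueError("container and storage_account are required")
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    # Drop leading and trailing slashes so the parts join cleanly.
    cleaned = path.strip("/")
    return f"{base}/{cleaned}" if cleaned else base
```

For example, `abfss_location("shareddatasets", "deltasharingdatasets", "my-table-folder")` yields a value you can paste into the `location` field, where `my-table-folder` stands in for whatever sub-folder holds your Delta table.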

Lastly, add an access token by adding an entry for “bearerToken” at the end of the server configuration file. This access token will be used by data recipients to send API requests to the sharing server.

authorization:
  bearerToken: "ajkexOH1H"
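
The sample token above is short and only meant for a demo. One way to generate a longer random value, assuming Python 3 is available on your machine, is the standard library's `secrets` module. The function name `new_bearer_token` is hypothetical.

```python
import secrets


def new_bearer_token(num_bytes: int = 32) -> str:
    """Return a random, URL-safe token built from `num_bytes` random bytes."""
    return secrets.token_urlsafe(num_bytes)
```

Calling `new_bearer_token()` once and pasting the result into the `bearerToken` field gives every recipient request a value that is much harder to guess.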

The final configuration file should look similar to the following:

# The format version of this config file
version: 1
# Config shares/schemas/tables to share
shares:
- name: "world-cup-share"
  schemas:
  - name: "default"
    tables:
    - name: "world-cup-2022-dataset"
      location: "abfss://shareddatasets@deltasharingdatasets.dfs.core.windows.net/<subfolder(s)>"
# Set the hostname that the server will use
host: "localhost"
# Set the port that the server will listen on. Note: using ports below 1024
# may require a privileged user in some operating systems.
port: 8080
# Set the URL prefix for the REST APIs
endpoint: "/delta-sharing"
# Set the timeout of S3 presigned URL in seconds
preSignedUrlTimeoutSeconds: 3600
# How many tables to cache in the server
deltaTableCacheSize: 10
# Whether we can accept working with a stale version of the table. This is useful when sharing
# static tables that will never be changed.
stalenessAcceptable: false
# Whether to evaluate user-provided `predicateHints`
evaluatePredicateHints: false
# The data recipient access token
authorization:
  bearerToken: "ajkexOH1H"

Step 8: Set the Shared Access Key

Behind the scenes, the Delta Sharing server uses the hadoop-azure library to read a shared Delta table’s transaction log and data files from Azure Data Lake Storage. This library supports using a Shared Key to authenticate to ADLS Gen2.

As a result, you will need to add a new Hadoop configuration file, called “core-site.xml”, which will contain the storage account key.

First, copy the storage account key by navigating to the storage account created in the second step and clicking on “Access keys” under the “Security + networking” section. Click on the “Show” button to reveal the value of “key1” and copy its value.

Fig. 8.1 — The Delta Sharing server will use the storage account access key to read a shared Delta table’s transaction log, and pre-sign the data file locations on ADLS Gen2.

Next, back in your bash terminal, navigate to the sharing server’s conf directory and create a new file called “core-site.xml”.

$ touch core-site.xml
$ vim core-site.xml

Then add the following contents to the newly created “core-site.xml” file. Ensure that you’ve updated the storage account name and storage account key value.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.azure.account.auth.type.STORAGE-ACCOUNT-NAME.dfs.core.windows.net</name>
    <value>SharedKey</value>
    <description>
    </description>
  </property>
  <property>
    <name>fs.azure.account.key.STORAGE-ACCOUNT-NAME.dfs.core.windows.net</name>
    <value>STORAGE-ACCOUNT-KEY</value>
    <description>
      The secret password.
    </description>
  </property>
</configuration>
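
If you would rather generate this file than edit the placeholders by hand, the following Python sketch builds the same two properties with the standard library. The function name `build_core_site` is hypothetical, and the output omits the optional stylesheet line and the description elements.

```python
import xml.etree.ElementTree as ET


def build_core_site(storage_account: str, account_key: str) -> str:
    """Return core-site.xml text that configures Shared Key auth for ABFS."""
    host = f"{storage_account}.dfs.core.windows.net"
    properties = {
        f"fs.azure.account.auth.type.{host}": "SharedKey",
        f"fs.azure.account.key.{host}": account_key,
    }
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent is only available on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(configuration)
    body = ET.tostring(configuration, encoding="unicode")
    return '<?xml version="1.0"?>\n' + body + "\n"
```

Write the returned text to `conf/core-site.xml` on the VM. Since the file holds the storage account key, it is worth keeping it out of version control and restricting its file permissions.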

Step 9: Start the Server

Finally, navigate back to the Delta Sharing server root directory and start the server by running the following command:

$ cd ..
$ bin/delta-sharing-server --config conf/delta-sharing-server.yaml

Step 10: Add an Inbound Port Rule

Before we can submit any API requests to the Delta Sharing server, we need to add a custom TCP rule that will allow incoming traffic on the port that the server is listening on. From the Azure portal, navigate to the Delta Sharing VM and click on the “Networking” tab. Click on the “Add inbound port rule” button and add a new custom TCP rule to allow all traffic on port 8080.

Fig. 10.1 — A custom inbound port rule is added to allow the Delta Sharing server to listen for incoming TCP requests on port 8080.

Click the “Add” button to add a new inbound port rule.

Fig. 10.2 — Add a custom inbound port rule for incoming TCP requests on port 8080.
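
Before moving on, it can help to confirm that the port is actually reachable from your own machine. Below is a minimal Python sketch of a TCP connectivity check. The function name `port_is_open` is hypothetical, and the check only succeeds when the sharing server is running and the inbound rule is in place.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Call it as `port_is_open("<vm_public_ip_address>", 8080)` with the VM's public IP address. A `False` result usually points to either the server not running or the inbound rule not having taken effect yet.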

Sending a Delta Sharing Request

We can test the new Delta Sharing server by sending an API request.

Step 11: List All Available Tables

Begin by opening Postman and creating a new request. Click on the “Authorization” tab, select “Bearer token” from the drop-down, and enter the access token value you entered in the sharing server configuration file. Next, in the URL bar select “GET” as the HTTP method and enter the public IP address of the virtual machine followed by the port number, 8080 in this case, and the API endpoint for listing all tables under the “world-cup-share”. The full URL should look similar to the following:

http://<vm_public_ip_address>:8080/delta-sharing/shares/world-cup-share/all-tables

Fig. 11.1 — An example HTTP GET request using Postman to list all tables under the “world-cup-share”.
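
If you prefer a script to Postman, the same request can be sketched in Python with only the standard library. The function names `list_tables_request` and `send` are hypothetical. The URL layout and bearer token header follow the steps above, and the sketch assumes the server replies with a JSON document.

```python
import json
import urllib.request


def list_tables_request(vm_ip: str, share: str, token: str,
                        port: int = 8080) -> urllib.request.Request:
    """Build the GET request that lists all tables in a share."""
    url = f"http://{vm_ip}:{port}/delta-sharing/shares/{share}/all-tables"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def send(request: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To try it, build the request with your VM's public IP address, the share name “world-cup-share”, and the token from the server configuration file, then pass it to `send`.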

Step 12: Query the Shared Delta Table

Lastly, we can submit a sharing query request to retrieve a list of pre-signed file URLs. A Delta Sharing client, like the Python connector, will execute a similar API request when a function like load_as_pandas() is invoked and the Delta table is loaded as a Pandas DataFrame. From the request type drop-down, select “POST” as the HTTP method. Next, update the URL to consume the query table API endpoint. The URL should look similar to the following:

http://<vm_public_ip_address>:8080/delta-sharing/shares/world-cup-share/schemas/default/tables/world-cup-2022-dataset/query

Fig. 12.1 — An example HTTP POST request using Postman for querying the shared Delta table “world-cup-2022-dataset”.
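
The query request can also be sketched in Python. This example builds the POST request and pulls the pre-signed file URLs out of the response text. It assumes, based on my reading of the Delta Sharing protocol, that the response is newline-delimited JSON in which each data file appears as a line with a `file` object containing a `url` field, and that the request body may carry an optional `limitHint`. The function names are hypothetical.

```python
import json
import urllib.request
from typing import List, Optional


def query_table_request(vm_ip: str, share: str, schema: str, table: str,
                        token: str, limit_hint: Optional[int] = None,
                        port: int = 8080) -> urllib.request.Request:
    """Build the POST request for the query table endpoint."""
    url = (f"http://{vm_ip}:{port}/delta-sharing/shares/{share}"
           f"/schemas/{schema}/tables/{table}/query")
    body = {} if limit_hint is None else {"limitHint": limit_hint}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def file_urls(ndjson_text: str) -> List[str]:
    """Collect the URL of every `file` line in a newline-delimited response."""
    urls = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if "file" in record:
            urls.append(record["file"]["url"])
    return urls
```

Send the request with `urllib.request.urlopen`, decode the body as UTF-8 text, and pass it to `file_urls` to see the links a client would download from storage.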

Conclusion

Congratulations! You’ve now deployed your very own Delta Sharing server in Azure. By now, you should be able to:

  1. Upload a shared dataset on Azure Data Lake Storage Gen2
  2. Install the latest release of the Delta Sharing reference server
  3. Configure access to the shared dataset(s) using Shared Key authentication (the storage account access key)
  4. Submit API requests to the sharing server to list all shared tables and query the shared table files

While this example of the Delta Sharing server isn’t quite ready for production, you should now have a basic understanding of how the Delta Sharing server works at a high level and understand the strengths of the Delta Sharing protocol.

Stay tuned for future articles where we’ll add security using an API gateway, add a custom domain name, and explore a serverless architecture for the Delta Sharing server. Thanks for reading!

Get Started with Delta Sharing for Python

Want to get started using the Delta Sharing connector for Python, but don’t know where to begin? Check out these quickstart examples using Delta Sharing for Python and Apache Spark.
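
As a starting point, a recipient needs a profile file that points the connector at the server deployed in this article. The sketch below builds that profile and the table URL string the Python connector expects. It assumes the profile uses the `shareCredentialsVersion`, `endpoint`, and `bearerToken` fields and that tables are addressed as `<profile-path>#<share>.<schema>.<table>`. The function names and the file name `config.share` are hypothetical.

```python
import json


def profile_json(endpoint: str, bearer_token: str) -> str:
    """Return the contents of a Delta Sharing profile file as JSON text."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    return json.dumps(profile, indent=2)


def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' string for the connector."""
    return f"{profile_path}#{share}.{schema}.{table}"


# Usage (requires `pip install delta-sharing`; not run here):
#   import delta_sharing
#   url = table_url("config.share", "world-cup-share", "default",
#                   "world-cup-2022-dataset")
#   df = delta_sharing.load_as_pandas(url)
```

Save the output of `profile_json("http://<vm_public_ip_address>:8080/delta-sharing", "<your token>")` to a file such as `config.share` before calling the connector.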

Written by Will Girten

Professional Data Lake Diver 🤿 | SSA @Databricks | Content Creator @Netrig Analytics | Follow me for the latest Data Lakehouse tuning tips!
