Deploy a Delta Sharing Server on AWS

Will Girten
10 min read · Nov 15, 2022

What is Delta Sharing?

Delta Sharing is the industry’s first open and secure data-sharing protocol for the real-time exchange of large datasets. Delta Sharing defines a protocol for data providers to share datasets with data recipients. Data recipients may then interact with the shared datasets using any data-sharing client that implements the Delta Sharing interface.

Fig. 1.1 — Delta Sharing defines an open protocol for data providers to securely share datasets with data recipients.

A Problem Worth Solving

Sharing data has traditionally required a lot of manual setup. For example, one way to share data involves numerous IAM configuration steps to grant users access to read a shared dataset on Amazon S3. On top of that, the data recipient must complete additional steps of their own before they can begin reading the data.

Fig. 1.2 — One method for sharing data with an individual user involves many steps to configure programmatic access to a dataset stored on Amazon S3.

For some individuals like data analysts, setting IAM access keys locally and interacting with a terminal is impractical. Plus, this solution for sharing data doesn’t scale. If you’re a lazy data engineer like myself, you loathe configuring multiple AWS profiles. Passing the --profile flag from the CLI is a pain.

Fig. 1.3 — Does your AWS config file look similar to mine? Having to keep track of multiple AWS profiles can be a real pain.

Wouldn’t it be great to bypass all the unfun configuration steps and simply load the dataset using a client in your preferred programming language? For example, loading a shared dataset as a Pandas DataFrame is super simple using Delta Sharing!

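import delta_sharing

# Path to the profile file supplied by the data provider (placeholder path)
profile_file = "<path/to/profile-file>.share"
table_url = profile_file + "#tweets-share.default.tweets"
data = delta_sharing.load_as_pandas(table_url)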

Prerequisites

Before we begin, you must have the following items to follow along with the steps for deploying a Delta Sharing server on AWS.

  1. AWS Account — You’ll need access to an AWS account. Don’t worry if you don’t already have one. It’s free to sign up, and you’ll have access to resources under the “free tier” for 12 months.
  2. Bash terminal — You’ll need access to a bash terminal to open an SSH connection to an Amazon EC2 instance and install a Delta Sharing server.
  3. (Optional) Postman — We’ll use Postman to submit API requests to the Delta Sharing server. Postman provides a friendly UI. However, sending the requests via cURL works just as well.

Create a Dataset to Share

One of the powerful strengths behind Delta Sharing is its ability to share datasets across multiple clouds. For example, a data provider could share a dataset stored on Amazon S3 while simultaneously sharing a dataset stored on Azure Data Lake Storage and another stored on Google Cloud Storage.

For simplicity’s sake, however, we’ll be sharing a single dataset stored on Amazon S3 within the same AWS account as the Delta Sharing server.

Step 1: Create a new S3 bucket

Let’s begin by creating a new S3 bucket. From the AWS console, navigate to the S3 console and click the “Create bucket” button to create a new S3 bucket. Next, give the bucket a unique name and accept the default settings by clicking the “Create bucket” button again.

Fig. 1.4 — We’ll store the shared dataset in a new S3 bucket for this demonstration.

Step 2: Upload a dataset to share

Next, we’ll upload a dataset to the newly created S3 bucket. If you don’t already have a dataset to share, you can use the sample Twitter dataset by cloning the accompanying GitHub repo.

From the S3 console, click on the name of the new S3 bucket and the upload button. Next, navigate to the location of the sample “tweets” Delta table and click the upload button. Finally, confirm the upload from the alert notification.

Fig. 2.1. — Upload the sample Delta table in the cloned repo to the S3 bucket. Select the parent folder containing the data files and the Delta transaction log and click the “Upload” button.
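If you would rather script the upload than use the console, the sketch below shows one way to do it. It assumes boto3 is installed and that your local AWS credentials can write to the bucket. The helper names (plan_upload, upload) and the "tweets" key prefix are hypothetical and not part of the original walkthrough. The important detail is that the folder layout is preserved, so the _delta_log folder ends up next to the data files.

```python
import os


def plan_upload(local_table_dir, key_prefix):
    """Map every file under a local Delta table folder to an S3 object key.

    The relative layout is preserved so the `_delta_log` folder stays
    alongside the Parquet data files.
    """
    prefix_parts = [part for part in key_prefix.split("/") if part]
    plan = []
    for root, _dirs, files in os.walk(local_table_dir):
        for name in files:
            local_path = os.path.join(root, name)
            rel_path = os.path.relpath(local_path, local_table_dir)
            # S3 keys use forward slashes regardless of the local OS.
            key = "/".join(prefix_parts + rel_path.split(os.sep))
            plan.append((local_path, key))
    # Sort by key so the plan is deterministic.
    return sorted(plan, key=lambda pair: pair[1])


def upload(plan, bucket):
    """Upload the planned files with boto3 (assumed installed and configured)."""
    import boto3  # imported lazily so planning works without boto3

    s3 = boto3.client("s3")
    for local_path, key in plan:
        s3.upload_file(local_path, bucket, key)
```

Calling plan_upload first lets you review the object keys before anything is sent to S3.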

It’s important to note that at the time of this writing, the Delta Sharing protocol only supports sharing datasets stored in the Delta format.

Provision an EC2 instance for the Sharing Server

Step 3: Create an EC2 instance profile

In order for the Delta Sharing server to answer sharing queries, it will need permission to read the shared dataset in the S3 bucket. To do so, we’ll first create an IAM policy that grants read permissions and then we’ll attach the policy to a new EC2 role that our sharing server will assume.

From the AWS console, navigate to the IAM service. Under the “Access management” section select “Roles” and click the “Create role” button. Select “AWS service” as the trusted entity type and select “EC2” as the use case. Click the “Next” button. Click the “Create policy” button and select the “JSON” tab. Enter the following inline policy to grant the Delta Sharing server permission to read the shared dataset. Be sure to replace the placeholder with the name of your S3 bucket.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:ListBucket",
      "s3:GetObject",
      "s3:GetObjectVersion"
    ],
    "Resource": [
      "arn:aws:s3:::<your_s3_bucket_name>",
      "arn:aws:s3:::<your_s3_bucket_name>/*"
    ]
  }]
}

Select the “Next” button. Give the policy a unique name and an optional description. Finally, click the “Create policy” button.

Fig. 3.1. — We’ll first create an IAM policy that will grant permission to get and list objects in the newly created S3 bucket.
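To avoid hand-editing the bucket name in two places, you could generate the policy document instead. The sketch below rebuilds the same policy shown above for a given bucket. The function name and the example bucket name are hypothetical.

```python
import json


def build_read_policy(bucket_name):
    """Return the read-only IAM policy document for a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject", "s3:GetObjectVersion"],
            # ListBucket applies to the bucket ARN; the Get* actions apply
            # to the objects inside it, hence the two resources.
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
        }],
    }


# Hypothetical bucket name, used only for illustration.
policy = build_read_policy("my-delta-sharing-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the “JSON” tab of the policy editor.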

You will be redirected back to the create IAM role wizard. Select the newly created policy and click the “Next” button. Give the IAM role a unique name and an optional description and click the “Create role” button.

Fig. 3.2. — An IAM role is created so that the Delta Sharing server can read the shared Delta table(s) and answer sharing queries.

Step 4: Select an EC2 instance type

Another powerful design choice of Delta Sharing is that none of the shared data traverses the sharing server. Instead, the Delta Sharing server generates a list of pre-signed file URLs that answer a Delta Sharing query.

Fig. 4.1 — The shared data does not traverse the Delta Sharing server. Instead, the Delta Sharing server generates a list of pre-signed file URLs that answer a sharing query.
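To make this concrete: per the Delta Sharing protocol, the server answers a table query with newline-delimited JSON, where each “file” line carries a pre-signed URL that the client then downloads directly from cloud storage. The sketch below pulls those URLs out of a response body. The sample response is made up purely for illustration (the IDs, sizes, and URLs are not real), and the function name is hypothetical.

```python
import json

# Illustrative response body only: IDs, sizes, and URLs are made up.
SAMPLE_QUERY_RESPONSE = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1}}),
    json.dumps({"metaData": {"id": "example-table-id",
                             "format": {"provider": "parquet"}}}),
    json.dumps({"file": {"id": "file-1", "size": 1024,
                         "url": "https://example-bucket.s3.amazonaws.com/part-0.parquet?sig=abc"}}),
    json.dumps({"file": {"id": "file-2", "size": 2048,
                         "url": "https://example-bucket.s3.amazonaws.com/part-1.parquet?sig=def"}}),
])


def extract_file_urls(response_text):
    """Collect the pre-signed URL from every `file` line of a query response."""
    urls = []
    for line in response_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if "file" in record:
            urls.append(record["file"]["url"])
    return urls


file_urls = extract_file_urls(SAMPLE_QUERY_RESPONSE)
```

The server only produces this small list; the Parquet bytes themselves are fetched by the client straight from S3.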

Consequently, the Delta Sharing server can be a small EC2 instance. Plus, since we’re using this instance for development and experimentation, we’ll select an EC2 instance type that is eligible for the AWS “free tier”.

From the AWS console, navigate to the EC2 service and click the “Launch instance” button. Give the EC2 instance a unique name, select Ubuntu as the operating system, and select “t2.micro” as the instance type. Next, click the link to “Create a new key pair”, select RSA as the key pair type, “.pem” as the key file type, and save the key pair to your local filesystem. This key pair will be used in the following section to open an SSH connection and install the Delta Sharing server. Next, check the checkbox to “Allow HTTP traffic from the internet” so that we can submit API requests to the sharing server. Scroll down and expand the “Advanced details” section. Click the dropdown for the IAM instance profile and select the IAM role we created in the previous step. Lastly, click “Launch instance” to accept the remaining default settings.

Fig. 4.2 — Launch a new “free tier” eligible EC2 instance for the Delta Sharing server. Ensure that you select the IAM role created earlier before launching the instance.

Installing the Delta Sharing Server

Step 5: Install the Latest Delta Sharing Release

Now that we’ve provisioned an EC2 instance, the next step is to open an SSH connection and install the latest Delta Sharing release.

First, navigate to the AWS console and click on the newly created EC2 instance ID. Note the public DNS name of the EC2 instance, as you will need it in later steps. Next, open a new bash terminal and open an SSH connection to the EC2 instance, specifying the location of the key pair file (“.pem” file) downloaded in the previous step.

$ chmod 400 <path/to/keyfile>.pem
$ ssh -i <path/to/keyfile>.pem ubuntu@<public DNS name of the EC2 instance>

Before we get started, we’ll need to refresh the package index and install the zip and unzip utilities along with Java 8 (OpenJDK).

$ sudo apt-get update
$ sudo apt-get install zip unzip
$ sudo apt-get install openjdk-8-jdk

Next, use the “wget” utility to download the latest release of Delta Sharing and unzip the file contents. At the time of this writing, the latest release is Delta Sharing 0.5.2. Information regarding the Delta Sharing releases can be found on the public GitHub repo.

$ wget https://github.com/delta-io/delta-sharing/releases/download/v0.5.2/delta-sharing-server-0.5.2.zip
$ unzip delta-sharing-server-0.5.2.zip

Step 6: Update the Server Configuration

The server configuration defines important server attributes like the location of the shared datasets, the server endpoint, and an access token.

We’ll need to update the server configuration with the location of the Delta table on Amazon S3. Begin by changing directories to the server conf directory. Next, rename the sample configuration file to remove the “.template” extension.

$ cd delta-sharing-server-0.5.2/conf
$ mv delta-sharing-server.yaml.template delta-sharing-server.yaml
$ vim delta-sharing-server.yaml

Next, open the server configuration file using your favorite editor and replace the default “shares” configuration block with the following YAML config. Ensure that you’ve updated the location with the S3 bucket and any subfolders you may have created in the first step.

shares:
- name: "tweets-share"
  schemas:
  - name: "default"
    tables:
    - name: "tweets"
      location: "s3a://<your-s3-bucket-name>/<optional-subfolder(s)>"
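YAML is indentation-sensitive, so the nesting of share, schema, and table matters here. As a rough sketch, the helper below renders the “shares” block with explicit indentation for one table. The function name is hypothetical and the bucket path in the example is a placeholder.

```python
def render_shares_block(share, schema, table, location):
    """Render the `shares` block for a single table with explicit nesting."""
    lines = [
        "shares:",
        f'- name: "{share}"',
        "  schemas:",
        f'  - name: "{schema}"',
        "    tables:",
        f'    - name: "{table}"',
        f'      location: "{location}"',
    ]
    return "\n".join(lines) + "\n"


# Placeholder bucket path, used only for illustration.
block = render_shares_block(
    "tweets-share", "default", "tweets", "s3a://my-bucket/tweets"
)
print(block)
```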

Lastly, add an access token by adding an entry for a “bearerToken” at the end of the server configuration file. This access token will be used by data recipients to send API requests to the sharing server.

authorization:
  bearerToken: "<replace_with_a_unique_token>"
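Any hard-to-guess string works as the token. One option, sketched below, is to generate it with Python’s standard secrets module. The function name and the 32-byte length are arbitrary choices, not requirements of the Delta Sharing server.

```python
import secrets


def generate_bearer_token(num_bytes=32):
    """Return a random, URL-safe token suitable for the `bearerToken` field."""
    return secrets.token_urlsafe(num_bytes)


token = generate_bearer_token()
print(token)
```

A URL-safe token also avoids characters that might need escaping in YAML or in an HTTP header.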

The final server configuration file should look similar to the following:

# The format version of this config file
version: 1
# Config shares/schemas/tables to share
shares:
- name: "tweets-share"
  schemas:
  - name: "default"
    tables:
    - name: "tweets"
      location: "s3a://netrig-analytics-sharing/tweets/new_blue_check_tweets"
# Set the hostname that the server will use
host: "localhost"
# Set the port that the server will listen on. Note: using ports below 1024
# may require a privileged user in some operating systems.
port: 8080
# Set the URL prefix for the REST APIs
endpoint: "/delta-sharing"
# Set the timeout of S3 presigned URL in seconds
preSignedUrlTimeoutSeconds: 3600
# How many tables to cache in the server
deltaTableCacheSize: 10
# Whether we can accept working with a stale version of the table. This is useful when sharing
# static tables that will never be changed.
stalenessAcceptable: false
# Whether to evaluate user-provided `predicateHints`
evaluatePredicateHints: false
# The data recipient access token
authorization:
  bearerToken: "eBhh@"

Step 7: Starting the Server

Finally, navigate back to the Delta Sharing server root directory and start the server by running the following command:

$ cd ..
$ bin/delta-sharing-server --config conf/delta-sharing-server.yaml

Step 8: Add a new Inbound Rule

Before we can submit any API requests to the Delta Sharing server, we need to add a custom TCP rule that will allow incoming traffic on the port that the server is listening on. From the AWS console, click on the “Security” tab on the EC2 instance and click the name of the security group. Click on the “Edit inbound rules” button and add a new custom TCP rule to allow all traffic on port 8080.

Fig. 8.1 — A custom TCP inbound rule is added to allow the Delta Sharing server to listen for incoming requests on port 8080.

Click the “Save rules” button to save the changes.
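The same rule can also be added programmatically. The sketch below builds the rule as a plain dictionary and passes it to boto3’s authorize_security_group_ingress call. It assumes boto3 is installed and credentialed. The helper names are hypothetical, and the security group ID must be supplied by you.

```python
def build_ingress_rule(port, cidr="0.0.0.0/0"):
    """Describe a TCP inbound rule for one port and one IPv4 CIDR range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": "Delta Sharing server"}],
    }


def add_ingress_rule(group_id, rule):
    """Attach the rule to a security group (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rule can be built without boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=[rule])


rule = build_ingress_rule(8080)
```

The default CIDR of 0.0.0.0/0 matches the “allow all traffic” rule described above. Passing a narrower range, such as your own IP address, is a reasonable precaution even for a development server.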

Sending a Delta Sharing Request

We can now test the new Delta Sharing server by sending an API request.

Step 9: List all available tables

Begin by opening Postman and creating a new request. Click on the “Authorization” tab, select “Bearer token” from the dropdown, and enter the access token value that you entered in the sharing server configuration file. Next, in the URL bar select “GET” as the HTTP method and enter the public DNS name (this can be found from the AWS console) of the EC2 instance followed by the port number, 8080 in this case, and the API endpoint for listing all tables under the “tweets-share”. The full URL should look like:

http://<ec2_public_dns_name>:8080/delta-sharing/shares/tweets-share/all-tables

Fig. 9.1 — An example HTTP GET request using Postman to list all tables under the “tweets-share”.
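If you prefer a script over Postman, the same request can be sketched with Python’s standard library. The function names are hypothetical, and the host and token are values you supply. The port and URL prefix default to the values from the server configuration used in this walkthrough.

```python
import json
import urllib.request


def build_list_tables_request(host, token, share, port=8080, prefix="/delta-sharing"):
    """Build the GET request that lists all tables in a share."""
    url = f"http://{host}:{port}{prefix}/shares/{share}/all-tables"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def send_json_request(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, send_json_request(build_list_tables_request("<ec2_public_dns_name>", "<token>", "tweets-share")) would send the request shown above once the placeholders are filled in.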

Step 10: List table metadata

Using Delta Sharing, we can also describe a shared table by requesting the metadata of the table. Update the URL with an API request to query the “tweets” table metadata. The URL should look like:

http://<ec2_public_dns_name>:8080/delta-sharing/shares/tweets-share/schemas/default/tables/tweets/metadata

Fig. 10.1 — A sample HTTP GET request using Postman for describing the table metadata for a shared Delta table called “tweets”.
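In the Delta Sharing protocol, the metadata response is newline-delimited JSON, and the table schema is embedded as a JSON string in the “schemaString” field of the “metaData” line. The sketch below extracts the column names from such a response. The sample response and its columns (“id”, “text”) are made up for illustration and are not the real schema of the tweets table.

```python
import json

# Illustrative response only: the table ID and columns are made up.
SAMPLE_METADATA_RESPONSE = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1}}),
    json.dumps({"metaData": {
        "id": "example-table-id",
        "format": {"provider": "parquet"},
        "schemaString": json.dumps({
            "type": "struct",
            "fields": [
                {"name": "id", "type": "long", "nullable": True, "metadata": {}},
                {"name": "text", "type": "string", "nullable": True, "metadata": {}},
            ],
        }),
        "partitionColumns": [],
    }}),
])


def column_names(metadata_response_text):
    """Return the column names found in the `metaData` line, if any."""
    for line in metadata_response_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if "metaData" in record:
            # The schema is itself a JSON document stored as a string.
            schema = json.loads(record["metaData"]["schemaString"])
            return [field["name"] for field in schema["fields"]]
    return []


columns = column_names(SAMPLE_METADATA_RESPONSE)
```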

Step 11: Query the shared Delta table

Lastly, we can submit a sharing query request to retrieve a list of pre-signed file URLs that make up a Delta table. A Delta Sharing client, like the Python connector, will execute a similar API request when a function such as “load_as_pandas()” is invoked and the Delta table is loaded as a Pandas DataFrame. From the request type dropdown, change the HTTP method to “POST”. Next, update the URL with the query table API endpoint. The URL should look like:

http://<ec2_public_dns_name>:8080/delta-sharing/shares/tweets-share/schemas/default/tables/tweets/query

Fig. 11.1 — An example HTTP POST request using Postman for querying the shared Delta table called “tweets”.
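As a scripted counterpart to the Postman request, the sketch below builds the POST request with Python’s standard library. The protocol accepts a JSON body with optional fields such as “limitHint”, so an empty JSON object is sent when no hint is given. The function name is hypothetical, and the host and token are values you supply.

```python
import json
import urllib.request


def build_query_request(host, token, share, schema, table,
                        limit_hint=None, port=8080, prefix="/delta-sharing"):
    """Build the POST request that asks the server for a table's file list."""
    url = (f"http://{host}:{port}{prefix}/shares/{share}"
           f"/schemas/{schema}/tables/{table}/query")
    body = {}
    if limit_hint is not None:
        body["limitHint"] = limit_hint  # optional hint; the server may ignore it
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


request = build_query_request(
    "example.com", "secret", "tweets-share", "default", "tweets", limit_hint=10
)
```

Passing the request to urllib.request.urlopen would send it. The response body is newline-delimited JSON rather than a single JSON document, so it should be parsed line by line.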

Conclusion

Congratulations! You’ve now configured your very own Delta Sharing server on AWS. By now, you should be able to:

  1. Upload a dataset to share to Amazon S3
  2. Configure an Amazon EC2 instance with permission to access the shared dataset
  3. Install the latest release of the Delta Sharing reference server
  4. Submit API requests to list all shared tables, load the metadata of a table, and query the table files on Amazon S3

While this example of the Delta Sharing server isn’t quite ready for production, you should now understand how the Delta Sharing server works at a high level and see the benefit of the Delta Sharing protocol. Stay tuned for future articles where we’ll add security using Amazon API Gateway, add a custom domain name, and explore a serverless architecture for the Delta Sharing server.

Want to get started using the Delta Sharing connector for Python? Check out these quickstart examples using Delta Sharing for Python and Apache Spark.
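To connect the Python connector to the server deployed above, a recipient needs a profile file containing the server endpoint and the bearer token. The sketch below writes such a file and builds the table URL used by load_as_pandas. The helper names and the file name are hypothetical, and the endpoint and token are placeholders to replace with your own values. It assumes the delta-sharing package is installed before load_tweets is called.

```python
import json


def write_profile(path, endpoint, bearer_token):
    """Write a Delta Sharing profile file for a data recipient."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(profile, handle, indent=2)
    return path


def build_table_url(profile_path, share, schema, table):
    """Combine the profile path and table coordinates into a table URL."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_tweets(profile_path):
    """Load the shared table as a Pandas DataFrame (needs `delta-sharing`)."""
    import delta_sharing  # imported lazily; install with pip first

    url = build_table_url(profile_path, "tweets-share", "default", "tweets")
    return delta_sharing.load_as_pandas(url)
```

For the server in this walkthrough, the endpoint would be http://<ec2_public_dns_name>:8080/delta-sharing and the token would be the bearerToken value from the server configuration.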

Are you interested in joining the Delta Sharing community? You can follow along by joining the public Slack channel.

Will Girten

Professional Data Lake Diver 🤿 | SSA @Databricks | Content Creator @Netrig Analytics | Follow me for the latest Data Lakehouse tuning tips!