Private Cross-Cloud Delta Sharing using Databricks

Hari Selvarajan
Databricks Platform SME
9 min read · Mar 15, 2024

Authors:
Te Tan, Gustav Byberg Skyle, Nikolay Ulmasov, Andrew Weaver, Hari Selvarajan

Introduction

In today’s dynamic business landscape, organizations are increasingly embracing multi-cloud strategies to optimize functionality, enhance security, and achieve scalability. By harnessing the distinctive capabilities of multiple clouds, companies can streamline workloads, boost agility, and reduce costs, effectively avoiding vendor lock-in and enabling negotiations for favourable deals from diverse providers. Despite these benefits, challenges like data fragmentation and duplication may arise.

In the realm of cloud storage, many organizations choose to restrict public access to their data, driven by heightened concerns about data security and privacy. This intentional decision aims to safeguard sensitive information from unauthorized access and cyber threats. Consequently, as organizations implement multi-cloud strategies, they grapple with the need to strike a balance between accessibility and robust security measures.

To address these concerns, Delta Sharing, an open protocol for secure data sharing, emerges as a crucial asset. It facilitates secure data sharing across different clouds without replication, enabling organizations to maintain a nuanced balance between accessibility and stringent security.

Secure Delta Sharing Challenge

The diagram below shows the Delta Sharing architecture. As it shows, Delta Sharing requires the recipient to have direct access to the producer's storage. While the Delta Sharing protocol ensures that only authorized recipients can access the data, the storage itself must still be openly accessible, which many customers do not accept.

To address network security, we have looked at different ways to secure the network route to ensure secure delta sharing takes place cross-cloud.

In general, we observe four types of policies that can be adopted by organizations pursuing a cross-cloud strategy:

  1. Using IP allowlists so that a specific portion of storage or a data lake grants access to external consumers outside the cloud network. For most customers this is already the simplest and best option.
  2. Using Cloudflare R2 with Databricks to share data with joint customers and reduce egress costs. More details here.
  3. Implementing a cross-cloud VPN to access data provider storage, allowing for a no-public-access policy while routing traffic through the public internet in an encrypted mode.
  4. Establishing an on-premises edge connection to access data provider storage, ensuring a no-public-access policy over a completely private connection.

In this article, we will look at the third option above. We will explain how to set up a cross-cloud VPN connection and use it to set up private Delta Sharing between two entities. We will also look at the added cost of such infrastructure for an example use case.

The details of implementing each cloud-to-cloud VPN connection can be found in the GitHub repo here.

Solution

Prerequisites

Before we go into the details of the secure Delta Sharing architecture, it's important to list the prerequisites for the Databricks workspace/account setup and the cloud infrastructure that should be in place.

Databricks:

  • UC metastore created in the respective cloud account console with delta sharing enabled
  • Databricks workspace set up on two clouds (aws-azure, aws-gcp, gcp-azure) and mapped to the metastore
  • D2D (Databricks-to-Databricks) Delta Share set up between the two cloud workspaces:
    — Add consumer DB ID as a recipient
    — Create a share
    — Add tables to the share
    — Grant access to that share to the recipient
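
The four D2D steps above map onto a handful of Databricks SQL statements. The following is a minimal sketch that only builds those statements as strings in the same order; the share name, recipient name, sharing identifier, and table name are hypothetical placeholders, and how you run the statements (notebook, SQL editor, or API) is left open.

```python
def d2d_share_statements(share, recipient, sharing_id, tables):
    """Return the SQL statements for a D2D share, in the order listed above.

    All argument values are supplied by the caller; nothing here talks to
    Databricks, it only assembles statement text.
    """
    statements = [
        # 1. Register the consumer metastore as a recipient via its sharing identifier.
        f"CREATE RECIPIENT IF NOT EXISTS {recipient} USING ID '{sharing_id}'",
        # 2. Create the share.
        f"CREATE SHARE IF NOT EXISTS {share}",
    ]
    # 3. Add each table to the share.
    statements += [f"ALTER SHARE {share} ADD TABLE {table}" for table in tables]
    # 4. Grant the recipient access to the share.
    statements.append(f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}")
    return statements


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    for stmt in d2d_share_statements(
        share="sales_share",
        recipient="azure_consumer",
        sharing_id="<cloud>:<region>:<metastore-uuid>",
        tables=["main.sales.orders"],
    ):
        print(stmt + ";")
```

The statements are run on the provider side; the consumer then sees the share as a catalog it can mount in its own metastore.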

Cloud Infrastructure:

  • Customer VNet/VPC set-up
  • Data stored in customer storage (s3, adls, gcs)
  • Securing the cloud storage (explained below in the respective section)
  • Cloud permission to create the necessary infrastructure for secure VPN gateway connectivity

Network Architecture

The below diagram shows the VPN set up between Azure (as the data producer) and AWS (as the data consumer).

In addition to the Databricks infrastructure mentioned in the prerequisites, the additional setup requires two main activities: securing the storage and setting up the VPN connection.

Private Storage Network

Cloud storage often comes with rich security features to protect the data from unintended access. For each cloud there are different ways to implement restrictions to cloud storage on the network level and force network traffic to go through the cloud vendor backbone. Here we will introduce the ways to make cloud storage “private” in Azure, AWS and GCP.

Azure Storage Account

To ensure the storage account in Azure Databricks is private, we disable public access on the storage account and create dedicated private endpoints between the workspace VNet and the storage account (see the doc "Use private endpoints for Azure Storage").

AWS S3 Bucket

  1. In the AWS case, we first need to ensure the "Block public access" configuration of the S3 bucket is enabled. It denies unauthorized incoming access to data but does not block any network traffic from the internet.
  2. To restrict access based on source IP, we can configure the bucket policy with the Deny effect, with a condition that whitelists the following IP addresses:
    — VPC CIDR of Databricks workspace on AWS (e.g. 10.10.0.0/16)
    — Network CIDR of Azure Databricks workspace or Databricks workspace on GCP (e.g. 10.20.0.0/16)
    — Public IP of control plane NAT in your region (check here for your region)
    — Public CIDR of Databricks standby infrastructure in your region (check here for your region)
    — Further IPs which need to be whitelisted in your corporate (check here as a reference: link)
  3. To enforce network traffic accessing the S3 bucket through the AWS backbone, we will configure the S3 VPC endpoint for our VPC. Here are the steps:
    — If your VPC only contains 2 subnets dedicated to Databricks, create a subnet to host the VPC endpoints.
    — Create a Gateway type of VPC endpoint to be used by workloads within AWS VPC
    — Since the Gateway type endpoint is not reachable from other networks (it has no routable private IP address), we should create an Interface type S3 VPC endpoint for cross-cloud access (check the AWS doc for Interface type VPC endpoint creation here). Enable "private DNS only for inbound endpoints" for the endpoint. Make sure the security group allows traffic from the network CIDR of the Azure or GCP workspace.

In this way, traffic from the AWS VPC goes through the Gateway type endpoint (which is not billed), while inbound traffic from Azure and GCP over the VPN goes to the Interface type endpoint (see here for more details).
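
As an illustration of the Deny-based bucket policy in step 2, here is a minimal sketch that assembles such a policy as a Python dictionary. It rests on assumptions not stated above: public addresses are matched with the `aws:SourceIp` condition key, private CIDRs arriving through a VPC endpoint are matched with `aws:VpcSourceIp` (since `aws:SourceIp` is not available for requests made through a VPC endpoint), and the bucket name and public IP are hypothetical. A Deny on all S3 actions can lock you out of the bucket, so test any such policy on a non-production bucket first.

```python
import json


def private_bucket_policy(bucket, public_cidrs, private_cidrs):
    """Build a bucket policy that denies requests from outside the allowlist.

    Assumption: both keys sit under one NotIpAddress operator, so the Deny
    applies only when the request matches neither the public allowlist nor
    the private (VPC endpoint) allowlist.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnlessFromAllowedNetworks",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {
                    "NotIpAddress": {
                        "aws:SourceIp": public_cidrs,
                        "aws:VpcSourceIp": private_cidrs,
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = private_bucket_policy(
        bucket="example-delta-sharing-bucket",  # hypothetical bucket name
        public_cidrs=["203.0.113.10/32"],       # hypothetical control plane NAT IP
        private_cidrs=["10.10.0.0/16", "10.20.0.0/16"],  # example CIDRs from the list above
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be reviewed and then attached to the bucket with whatever tooling you already use for bucket policies.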

Google Cloud Storage

  1. Set up a Service Perimeter to restrict access to Google Cloud Storage:
    — Create a Service Perimeter. In our specific case we only secure the Cloud Storage service, so select "Cloud Storage API" as "Restricted Service".
    — Follow the steps here to create an Access Level that includes the following public IPs, which need to be whitelisted for accessing Cloud Storage APIs:
      — Public IP of control plane NAT in your region (check here)
      — Further IPs which need to be whitelisted in your corporate (check here as a reference: link)
    — Follow the steps here to configure the Ingress Policy of our Service Perimeter, which will allow the Access Level created above to access the Cloud Storage API.
  2. Create a Private Service Connect endpoint in the VPC to enable private access to Google APIs including storage (see Google doc here). As we are targeting Cloud Storage service, we can select either “All Google APIs” or “VPC-SC” as “Target” when creating the endpoint.

Cross-Cloud Connection Setup

The approach to building a site-to-site VPN across clouds varies between cloud providers. Here we only describe the steps at a high level. For cloud-specific instructions please check the conclusion where links will be given.

  1. Create VPN gateways: VPN gateways are virtual devices you need to create in the cloud as a prerequisite to building VPN tunnels. Create VPN gateways in both clouds with the following considerations:
    — Align with your network architecture: Put the VPN gateway in its own subnet, which is different from the ones dedicated to Databricks workspaces. In AWS you can choose between Virtual Private Gateway and Transit Gateway. If you have the vision to set up a Hub-Spoke model, Transit Gateway is the option.
    — Routing configuration: Depending on the cloud vendor, you may configure static routing or BGP as the routing mechanism. If you enable BGP, you need to specify a valid ASN value. The ASN values of the two VPN gateways must differ.
    — Enable active-active mode for high availability.
  2. Create “customer gateways” to host VPN gateway information: You will need to create an abstract object (named customer gateway in AWS, local network gateway in Azure, peer gateway in GCP) in the cloud to host the information of the VPN gateway in another cloud. The information to gather is mainly public IP addresses assigned to VPN gateways. This acts as a representation of “foreign” gateway devices in the cloud.
  3. Create VPN tunnels: Create multiple VPN tunnels between the VPN gateway and the “customer gateway”. You need to do this in both clouds.
  4. Configure routing rules: We want to force cross-cloud traffic to storage through the VPN tunnels. If the VPN tunnel is BGP enabled, the VPN gateway can dynamically learn the IP addresses of the connected network in the other cloud. If the VPN tunnel relies on static routing, we need to explicitly provide the CIDR of the connected network.
    In addition, unlike in Azure and GCP, in AWS you need to specifically adjust the routing table associated with the cluster subnets, making the VPN gateway the destination for cross-cloud traffic.
  5. Configure private DNS: Lastly, data consumer clusters need to resolve the FQDN of cloud storage to the IP address of the private interface in another cloud. So we need to configure private DNS and assign it to the VPC of data consumer clusters.
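
Before creating any gateways, it can help to sanity-check the planned values. The sketch below uses only the Python standard library and checks two things from the steps above: that the two BGP ASNs differ, and that an IP address the consumer resolves for the storage FQDN falls inside the provider network. It also checks that the two network CIDRs do not overlap, which is an assumption added here (overlapping ranges generally cannot be routed across a site-to-site VPN) rather than something stated above. All function names, ASNs, and addresses are hypothetical.

```python
import ipaddress


def validate_vpn_plan(local_cidr, remote_cidr, local_asn, remote_asn):
    """Return a list of problems found in a planned cross-cloud VPN setup."""
    problems = []
    local_net = ipaddress.ip_network(local_cidr)
    remote_net = ipaddress.ip_network(remote_cidr)
    # Assumption: the two networks must not overlap to be routable over the VPN.
    if local_net.overlaps(remote_net):
        problems.append(f"CIDRs overlap: {local_cidr} and {remote_cidr}")
    # The two VPN gateways cannot share the same ASN when BGP is enabled.
    if local_asn == remote_asn:
        problems.append(f"Both gateways use ASN {local_asn}")
    return problems


def resolves_privately(resolved_ip, provider_cidr):
    """True if the IP resolved for the storage FQDN lies in the provider network."""
    return ipaddress.ip_address(resolved_ip) in ipaddress.ip_network(provider_cidr)


if __name__ == "__main__":
    # Example CIDRs from the S3 section; the ASNs are hypothetical private ASNs.
    print(validate_vpn_plan("10.10.0.0/16", "10.20.0.0/16", 64512, 65010))
    print(resolves_privately("10.10.3.7", "10.10.0.0/16"))
```

An empty list from `validate_vpn_plan` means no problem was detected by these checks; it does not replace testing the tunnel and DNS resolution from an actual cluster.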

Cost Estimates

The Delta Sharing solution described in this document will likely be more expensive than a standard Delta Sharing solution (in which the Data Consumer fetches the Delta Sharing files directly from the Data Provider's bucket). This incremental cost will vary depending on the cloud providers used by the Data Provider and Data Consumer, but the main cost components should be the same for any provider/consumer pair. The list below is not exhaustive and includes only incremental cost contributors; compute and storage costs, for example, should be the same or similar for any type of Delta Sharing:

  1. Storage egress costs
  2. VPN (or Azure Express Route/AWS Direct Connect/GCP Interconnect) costs
  3. Private Link costs
  4. DNS costs

There could also be cost savings resulting in reduced incremental costs. For example, using S3 VPC Endpoint means there is no direct transfer out from S3 to the internet, likely resulting in savings that may cover data transfer costs over VPN.

Example use case cost

As an example, here is an estimated incremental cost for sharing a 1GB table stored in S3 for 1 month via the Private Delta Sharing solution described here over an Azure-to-AWS VPN connection * (as opposed to standard Delta Sharing). The Data Provider is on AWS (US East (Ohio)) and the Data Consumer is on Azure (North Europe) **.

Let’s say that this table is used in a dashboard refreshed once every hour.

  1. Total amount of data transferred: 30 * 24 * 1GB = 720GB
  2. AWS VPN (without Accelerators):
    - Connection: 30 * 24 * $0.05 = $36
    - Data transfer out: 100GB * $0 + 620GB * $0.09 = $55.80
  3. AWS S3 Interface VPC Endpoint:
    - VPC Endpoint: 30 * 24 * $0.01 = $7.20
    - Data transfer: 720GB * $0.01 = $7.20
  4. Azure VPN Gateway (basic): 30 * 24 * $0.04 = $28.80
  5. Azure DNS (no Private Resolver): $1

__________________________________________________

Sub-Total: $136

S3 Transfer OUT savings: 100GB * $0 + 620GB * $0.09 = -$55.80

__________________________________________________

Total: $80.20 (per month)

__________________________________________________

* Calculations are made for basic VPN features on both clouds. Enabling extra VPN features or additional services will result in higher connectivity costs (e.g. Accelerated Connection or Transit Gateway on AWS, non-basic Gateway Types or Availability Zones on Azure, etc).

** Prices used in this calculation were valid at the time the calculation was made. For the latest prices check the following links
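
The arithmetic of the example can be reproduced with a short script. This is a sketch using the unit prices quoted in the example (which may be out of date) and the assumption, taken from the calculation itself, that the first 100GB of data transfer out is free. Amounts are kept in integer cents to avoid floating-point rounding; the variable names are illustrative.

```python
# All prices are in US cents and come from the example above.
HOURS_PER_MONTH = 30 * 24   # 720 hours
TABLE_GB = 1                # table size, refreshed once per hour
FREE_EGRESS_GB = 100        # first 100GB of transfer out is free

data_gb = HOURS_PER_MONTH * TABLE_GB            # 720GB transferred per month
billable_gb = max(data_gb - FREE_EGRESS_GB, 0)  # 620GB billed at the egress rate

costs_cents = {
    "aws_vpn_connection": HOURS_PER_MONTH * 5,       # $0.05 per hour
    "aws_vpn_data_transfer_out": billable_gb * 9,    # $0.09 per GB
    "s3_interface_endpoint_hours": HOURS_PER_MONTH,  # $0.01 per hour
    "s3_interface_endpoint_data": data_gb,           # $0.01 per GB
    "azure_vpn_gateway_basic": HOURS_PER_MONTH * 4,  # $0.04 per hour
    "azure_dns": 100,                                # $1 flat
}

subtotal_cents = sum(costs_cents.values())
# Data no longer leaves S3 directly to the internet, so that egress is saved.
s3_egress_savings_cents = billable_gb * 9
total_cents = subtotal_cents - s3_egress_savings_cents

print(f"Sub-total: ${subtotal_cents / 100:.2f}")  # Sub-total: $136.00
print(f"Total: ${total_cents / 100:.2f}")         # Total: $80.20
```

Changing `TABLE_GB` or the refresh frequency shows how quickly the data transfer components grow relative to the fixed hourly charges.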

Conclusion

In this article, we have shown how organizations can set up private Delta Sharing using Databricks. The approach described above helps organizations share data between clouds securely without incurring significant costs. You can find the implementation details of the VPN between clouds here.
