Isolake — A simple deployment design for an isolated Databricks Lakehouse on AWS

JD Braun
Databricks Platform SME
18 min read · Dec 8, 2023

A massive thank you to my co-contributors to this blog post and Terraform repo: Wenxin Liu and Ryan Gordon.

ChatGPT-generated image of an isolated Lakehouse

TL;DR — Can we isolate the Lakehouse?

For readers who want to dive directly into the code without the background, architecture, and demo walkthrough.

What is Isolake:

  • Isolake is a simple and specialized Databricks workspace deployment design on AWS that isolates users and workloads from the public internet, utilizing Unity Catalog and AWS PrivateLink as its foundational architectural components. While this approach is already adopted by many Databricks customers, I am very excited to document, Terraform, and share this pattern with the broader tech community.

Highlights:

  • No outbound internet connectivity.
  • Scoped-down policies for VPC endpoints, storage credentials, and the cross-account role.
  • Integrated with Databricks enterprise security features like customer-managed keys.
  • Optional access solely via Amazon AppStream.

Total Cost of Ownership (TCO):

  • Databricks enterprise tier pricing based on uptime for Databricks clusters.
  • Uptime on underlying AWS compute infrastructure (EC2, EBS, etc).
  • Uptime and throughput on AWS networking components (e.g., PrivateLink endpoints).
  • KMS costs for two customer-managed keys.
  • Data storage costs in S3.
  • Optional — Uptime on AWS AppStream Fleet.

GitHub Repo — Terraform Code

Background — Complete Flexibility to Utter Isolation

Ever since I started using Databricks in 2018, I’ve been a fan of the flexibility it offered. I enjoyed how easy it was to spin up a cluster, open a notebook, start coding away in SQL or Python, download any and all packages from PyPI, and use whichever public API was available to me.

However, while I was enjoying this total freedom to analyze, enrich, and publish datasets, I wasn’t considering the trade-off I was making with security. I wasn’t vetting Python packages, verifying the public APIs I was using, or auditing my actions.

Was this because I was a malicious user? Absolutely not. I was simply unaware of the potential risks when given the flexibility to execute code in an environment that can connect to the public internet.

But, a lot has changed since 2018. Security teams are now much more aware of how to handle data tooling in a cloud environment. Over the past two years, I’ve worked closely with security teams to harden their Databricks Lakehouses. From implementing data exfiltration firewalls and writing Terraform scripts for deploying Databricks with security best practices to analyzing CloudTrail logs, the plethora of tools to mitigate risk has been continually growing.

However, during this time, I’ve seen some recurring use cases:

  • A data analyst for an insurance company running SQL queries on highly sensitive data with read-only permissions.
  • A data scientist training a large language model on GPU clusters and needing to de-identify PII data in the process.
  • A data engineer writing native PySpark code to process customer data, with a stringent requirement that all customer data be processed within their virtual private cloud (VPC).
  • A consultant or a third party handling sensitive data that isn’t allowed to be copied onto their local machine.

In each of these use cases, there is a key trend — absolutely no interaction with the public internet. If that’s the case, why do we still offer these personas the opportunity to go outbound to the World Wide Web?

Can we simplify this architecture and significantly mitigate data leakage from non-security-conscious users?

Can we isolate the lakehouse?

An Isolated Deployment Design — Isolake

The answer is: yes.

The Isolake deployment design attempts to do just that by hardening a Databricks workspace on AWS with no outbound internet connectivity, scoped-down policies, enterprise security Databricks features, and optionally accessible only through AWS AppStream.

Let’s talk about why Databricks makes this level of isolation so simple to achieve. It boils down to three areas:

  • Extensive built-in features for every data persona like a SQL interface, machine learning training, multi-language capability, and flexibility in cluster selection.
  • As a platform-as-a-service, a very simple networking story which includes no public IPs on instances, no inbound communication to the clusters, and generally available support for backend connectivity to the Databricks control plane through PrivateLink.
  • A centralized governance tool in Unity Catalog that handles fine-grained controls and short-term credential generation to access data from S3 buckets.

These cornerstones are what enable this solution to be so straightforward. Without this feature set, here’s what we are looking at:

  • Piecing together multiple tools to fit each persona like a data warehouse, a notebook service, etc., or forcing a round peg into a square hole by making personas do their day-to-day work in environments that aren’t familiar to them.
  • If the underlying instances required public IPs, we’d be looking at a much different networking design story. We’d be managing complex security group rules and have concerns over inbound connections from the public internet.
  • Without Unity Catalog, managing data permissions would become burdensome. Instead of defining the Identity Access Management (IAM) permissions once and then using Unity Catalog’s fine-grained controls, we’d either be adding very permissive IAM roles to cover each use case or adding a lot of IAM roles for each persona.

If our goal is to prevent someone like past data engineer JD from accidentally leaking data or interacting with malicious public internet assets, we’ve hit the mark.

In our next sections, we’ll be taking a deeper look at this architecture from the backend perspective — clusters interacting with Databricks and AWS services. Then from the frontend perspective — users interacting with the workspace.

At the end of the blog post, I outline experimental hardening methods. These include scoped-down network access control lists (NACLs) on the private subnets and bucket policies for the root storage of Databricks, Unity Catalog, and the data bucket.

Before we continue — Is this design for everyone?

The answer is: no.

Clearly, this type of design will not fit many use cases, and I would never recommend it in all scenarios. There are obvious downsides to isolating users from the internet, as there are countless safe assets that can be used to enrich datasets.

If you’re looking to harden your Databricks deployments but do not require this level of isolation, take a look at the following amazing resources:

Building Isolake — Backend Architecture:

Image of backend network diagram of the Isolake architecture

Backend network diagram of the Isolake architecture

So, how does a Databricks cluster spin up in the Isolake design?

Let’s use the situation where a user logs into Databricks and spins up a cluster in the console:

  • A cross-account role is assumed by Databricks to spin up the Elastic Compute Cloud (EC2) instances in a private subnet.
  • The EC2 instance, with a Databricks Amazon Machine Image (AMI), grabs artifacts from a Databricks-hosted Simple Storage Service (S3) bucket over the S3 gateway endpoint.
  • The cluster resolves connections to the control plane and to Security Token Service (STS) and Kinesis using the Virtual Private Cloud (VPC) interface endpoints.

These are three easy steps with traffic only ever leaving the boundaries of the VPC to head to the S3 gateway endpoint. Once the cluster shows the green checkmark in the workspace console, the user can start going about their day-to-day job. All the code they write will be passed through the secure cluster connectivity relay, and Unity Catalog will handle any credential generation, given they have the permissions.

You might be thinking, wait a second, what about the Hive Metastore? That’s clearly called out in the Databricks documentation as not being supported by the PrivateLink endpoint.

And that is correct, it’s not. Instead, to cover this gap, in the classic cluster and SQL warehouse configuration, we’ll be using Apache Derby.

Apache Derby is an embedded metastore, which can be used when you only need to retain table metadata during the life of the cluster. Since we’re going to be using Unity Catalog for all things metadata and governance, we don’t need that outbound connection to a Hive metastore — maintaining our isolation from the public internet.

Using Derby lets us move past any errors that we may see from Spark’s dependencies on the Hive metastore, like not being able to see the sample data that’s included in workspace deployments.

To include Apache Derby, you only need to add the following lines to either the Spark configuration of classic compute or the data access configuration of the SQL warehouse and click run:

spark.hadoop.javax.jdo.option.ConnectionUserName admin
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:derby:memory:myInMemDB;create=true
spark.hadoop.javax.jdo.option.ConnectionDriverName org.apache.derby.jdbc.EmbeddedDriver
  • WARNING — Temporary Hive Metastore: Please note that since we are using a temporary Hive metastore with Apache Derby, any data saved to the Hive metastore will be lost when the cluster is terminated. It’s crucial to use Unity Catalog for all data-related activities to avoid data loss. Alternatively, consider using an external Hive metastore within the same VPC as a more permanent solution until Unity Catalog becomes the only option within the platform.
  • WARNING — Serverless SQL and External Hive Metastore Capability: Currently, Serverless SQL does not support an external Hive metastore. Adding the Derby configuration to the data access settings in the SQL admin console will lead to issues in spinning up Serverless SQL warehouses. If you are using SQL warehouses, we recommend opting for serverless by default and not including the Derby configuration. This has numerous benefits, including maintaining the isolation boundary by avoiding internet connectivity. If your corporate policy dictates that serverless cannot be used, it is advisable to add the Derby configurations.

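If you manage compute through the Terraform repo rather than the console, the same Derby settings can be passed through the Databricks provider's spark_conf map. Below is a minimal sketch, not taken verbatim from the Isolake repo; the cluster name, runtime version, and node type are placeholder values.

resource "databricks_cluster" "isolake_classic" {
  cluster_name            = "isolake-classic"   # hypothetical name
  spark_version           = "14.3.x-scala2.12"  # any supported LTS runtime
  node_type_id            = "i3.xlarge"
  num_workers             = 1
  autotermination_minutes = 30

  spark_conf = {
    # Embedded Derby metastore so the cluster never needs an outbound Hive metastore connection
    "spark.hadoop.javax.jdo.option.ConnectionUserName"   = "admin"
    "spark.hadoop.javax.jdo.option.ConnectionURL"        = "jdbc:derby:memory:myInMemDB;create=true"
    "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "org.apache.derby.jdbc.EmbeddedDriver"
  }
}
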
Let’s go ahead and walk through a demo of this backend connection in a series of pictures. I’ll show the networking configuration on the EC2 instance, the S3 gateway endpoint policy, the interface endpoints, and bring it together by performing a basic task that a data engineer and analyst would do — query data, write it to Unity Catalog, and then query it with SQL.

This demo will visually guide you through the set-up, but if at this point you want to dig into the code and get going, be sure to check out my Terraform repo.

A quick note on the high availability of AWS services:

As is common with cloud services, most API calls start by trying to reach their regional endpoint, such as s3.us-east-1.amazonaws.com. However, should that not be available, they will fall back to the global endpoint, s3.amazonaws.com.

In this deployment design, we are dependent on regional endpoints for a functioning workspace. Therefore, we recommend that, if your workspace needs an added layer of availability, you follow the Data Exfiltration example above and allowlist those AWS services’ global names as needed.

In rare cases, some Databricks platform APIs may attempt to use the global name of AWS services. If this occurs, set the relevant environment variables to the regional endpoint. For example, with MLFlow:

%sh
export MLFLOW_S3_ENDPOINT_URL='https://s3.<region>.amazonaws.com'

Building Isolake — Backend Demo:

Databricks cluster — EC2 instances:

Let’s start at our lowest common denominator, the underlying EC2 instance. I’ve gone ahead and spun up a cluster using the workspace console to check out the associated security group.

Image of the Security Groups

The security groups, or the firewall around the EC2 instance, only allow for communication between assets within the same security group and outbound to the prefix list for the S3 gateway endpoint. This means any traffic trying to reach anything but these assets is blocked — a meaningful first step in keeping our traffic isolated.
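
For reference, here is a rough Terraform sketch of what a security group like this could look like; the name and the variables (VPC ID, prefix list ID) are placeholders rather than values from the Isolake repo.

resource "aws_security_group" "isolake_cluster" {
  name   = "isolake-cluster-sg"  # hypothetical name
  vpc_id = var.vpc_id

  # Clusters (and the interface endpoints sharing this group) can talk to each other
  ingress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }
  ingress {
    from_port = 0
    to_port   = 65535
    protocol  = "udp"
    self      = true
  }

  # Outbound is limited to members of the same security group...
  egress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }
  egress {
    from_port = 0
    to_port   = 65535
    protocol  = "udp"
    self      = true
  }

  # ...and to the prefix list of the S3 gateway endpoint over HTTPS
  egress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    prefix_list_ids = [var.s3_gateway_prefix_list_id]
  }
}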

Going outbound route tables:

Even though the security group already blocks traffic to anything outside of it, let’s take a look at the route table to see where network traffic can be routed.

Image of the Route Table

In the route table, we see traffic can only go to private IPs within the same VPC, as well as to the prefix list for the S3 gateway endpoint. In most route tables in an AWS setting, you’ll see a route for 0.0.0.0/0: a catch-all that sends any remaining traffic on to a destination such as an internet or NAT gateway. We don’t have this in Isolake, so effectively any traffic not destined for the local VPC CIDR of 10.0.0.0/23 or the prefix list of IPs for the S3 gateway endpoint is going up a cul-de-sac.
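
In Terraform terms, the trick is simply to never define that catch-all route. A minimal sketch, assuming placeholder variable and resource names:

# No route block pointing at an internet or NAT gateway is ever declared,
# so only the implicit local route exists in this table.
resource "aws_route_table" "isolake_private" {
  vpc_id = var.vpc_id  # the 10.0.0.0/23 VPC in this example
}

# Associating the S3 gateway endpoint with the route table is what injects
# the prefix-list route seen in the screenshot.
resource "aws_vpc_endpoint_route_table_association" "s3_gateway" {
  vpc_endpoint_id = aws_vpc_endpoint.s3.id  # the gateway endpoint, shown later
  route_table_id  = aws_route_table.isolake_private.id
}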

Going to the data — S3 gateway endpoint policy:

With our security group rules and our route tables, we’ve covered outbound public internet traffic. But, despite not being able to access the public internet, what about accessing rogue buckets to exfiltrate data?

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Grant access to Databricks Root Bucket",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "s3:PutObject",
        "s3:ListBucket",
        "s3:GetObjectVersion",
        "s3:GetObject",
        "s3:GetBucketLocation",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::jd-dbfs-uc/*",
        "arn:aws:s3:::jd-dbfs-uc"
      ],
      "Condition": {
        "StringEquals": {
          "aws:PrincipalAccount": "414351767826"
        },
        "StringEqualsIfExists": {
          "aws:SourceVpc": "vpc-0cedbf325cfe57e27"
        }
      }
    },
    {
      "Sid": "Grant access to Databricks Unity Catalog Metastore Bucket",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "s3:PutObject",
        "s3:ListBucket",
        "s3:GetObjectVersion",
        "s3:GetObject",
        "s3:GetBucketLocation",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::jd-isolake-uc/*",
        "arn:aws:s3:::jd-isolake-uc"
      ]
    },
    {
      "Sid": "Grant read-only access to Data Bucket",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObjectVersion",
        "s3:GetObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket name>/*",
        "arn:aws:s3:::<bucket name>"
      ]
    },
    {
      "Sid": "Grant Databricks Read Access to Artifact and Data Buckets",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObjectVersion",
        "s3:GetObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::databricks-prod-artifacts-us-east-1/*",
        "arn:aws:s3:::databricks-prod-artifacts-us-east-1",
        "arn:aws:s3:::databricks-datasets-virginia/*",
        "arn:aws:s3:::databricks-datasets-virginia"
      ]
    },
    {
      "Sid": "Grant access to Databricks Log Bucket",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "s3:PutObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::databricks-prod-storage-virginia/*",
        "arn:aws:s3:::databricks-prod-storage-virginia"
      ],
      "Condition": {
        "StringEquals": {
          "aws:PrincipalAccount": "414351767826"
        }
      }
    }
  ]
}

This is where our restrictive S3 gateway endpoint policy takes over. This prevents users from writing to buckets not covered in the list. In this policy, we include the needed S3 buckets from a Databricks perspective, as well as the data bucket, in this case, my CloudTrail logs bucket.
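
To tie this back to the infrastructure, the sketch below shows how a policy like this could be attached to the gateway endpoint itself; the policy file path and variables are placeholders, and the route table association was covered in the earlier route table sketch.

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"

  # The restrictive endpoint policy shown above, stored alongside the module (hypothetical file name)
  policy = file("${path.module}/s3_gateway_endpoint_policy.json")
}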

How is this even working — AWS PrivateLink:

AWS PrivateLink allows traffic to traverse the AWS backbone to native and third-party services by creating endpoints that exist within your VPC.

Image of the list of the VPC endpoints

The endpoints needed for a functioning cluster are:

  • S3 — gateway endpoint
  • Kinesis — interface endpoint
  • STS — interface endpoint
  • Databricks backend: REST API — interface endpoint
  • Databricks backend: Secure Cluster Connectivity Relay — interface endpoint

The interface endpoints are located within the cluster subnets and have the same security group, shown above, attached to them. For Kinesis and STS, we have included scoped-down policies that you can find in their respective Terraform files in the GitHub repo.
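
As a rough illustration of how these interface endpoints might be declared and, for the Databricks backend ones, registered with the account, here is a hedged sketch; the variable names are placeholders, and the relay service name would come from Databricks’ published list of regional PrivateLink endpoint services.

# STS interface endpoint; Kinesis follows the same shape with the
# com.amazonaws.<region>.kinesis-streams service name.
resource "aws_vpc_endpoint" "sts" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.sts"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.cluster_subnet_ids
  security_group_ids  = [aws_security_group.isolake_cluster.id]
  private_dns_enabled = true
}

# Secure cluster connectivity relay endpoint; the REST API endpoint is declared the same way.
resource "aws_vpc_endpoint" "backend_relay" {
  vpc_id             = var.vpc_id
  service_name       = var.databricks_relay_service_name  # regional VPC endpoint service name from the Databricks docs
  vpc_endpoint_type  = "Interface"
  subnet_ids         = var.cluster_subnet_ids
  security_group_ids = [aws_security_group.isolake_cluster.id]
}

# Databricks backend endpoints also get registered in the account
# so they can be referenced by the workspace's network configuration.
resource "databricks_mws_vpc_endpoint" "backend_relay" {
  account_id          = var.databricks_account_id
  aws_vpc_endpoint_id = aws_vpc_endpoint.backend_relay.id
  vpc_endpoint_name   = "isolake-relay"
  region              = var.region
}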

Now what — working with your data:

Let’s embrace the full Lakehouse and explore and clean a dataset (CloudTrail logs), create a table in Unity Catalog, and use a SQL warehouse to query it. The first step is to triple-check we’re not making any outbound connections by attempting a GET call against api.ipify.org.

Image of a failed call to a public-facing IP

Nothing, we’re not making our way to the internet. Next up, utilizing Unity Catalog as our centralized governance tool, we’ll query our CloudTrail logs and create a cleaned-up DataFrame.

Image of queried CloudTrail logs in a tabular format

Now that we have queried the data, let’s create a table in Unity Catalog and use the great new AI features to summarize the table and add comments to the columns.

Image of writing the table to Unity Catalog
Image of the table in Unity Catalog

The data has been cleansed, a managed table has been created in Unity Catalog, and now let’s use a SQL warehouse to run standard ANSI-SQL to query it.

Image of querying the data with a SQL warehouse

Where do we go from here — continuing on the Databricks Lakehouse Journey:

With our isolated Databricks Lakehouse, we’re free to let data analysts, engineers, and scientists make key business decisions with a drastically reduced risk that a non-security-conscious user could exfiltrate data or interact with malicious public internet assets.

If you’re sitting here wondering about bucket policies and network access control lists, be sure to check out my experimental section at the end of the blog post.

But…

What if we wanted to take it a step further? What if we only wanted users to access Databricks from a machine that we can control and monitor, and on which we can prevent common activities like copying, pasting, or downloading results onto a local machine?

Onto building the optional frontend architecture!

Building Isolake — Frontend Architecture:

Image of a frontend network diagram of the Isolake architecture

Now that we’ve covered the backend of Isolake, let’s dive into the optional frontend of Isolake and why it’s important for certain organizations.

While working within a web browser, a user typically still has administrative control of their local machine. This means they can copy code, paste data from their clipboard, or download the results of a DataFrame if they have enough permissions.

This level of flexibility, much like running arbitrary code with access to the public internet, isn’t inherently a bad thing. However, it can create accidental risks, like someone moving a subset of code to a spreadsheet to share with coworkers or moving code to non-version-controlled systems.

So, this type of solution will be specific to organizations where this is a clear concern. A few use cases have included:

  • Remote data personas that need to access sensitive data to perform analysis, but for whom managing DNS resolution on or off a VPN becomes too burdensome or complex.
  • Consultants or other third parties that need to access the workspace but do not have a corporate-approved laptop.
  • Intra-organization data sharing and analysis, where analysts who work across business groups need access to the workspace for a short period to run certain jobs or analyze certain data points.

And let me be clear, while this may mitigate copying and pasting from the environment, it will not stop a screenshot from the computer, a phone picture, or someone looking over their shoulder and writing notes. This is meant to be another mitigation tactic to isolate the frontend of the Lakehouse.

Now, as a final reminder, this is entirely optional in the Terraform code. Amazon AppStream was selected based on my prior experience in building these secure data environments. This can be interchangeable based on your toolkit. Before we do the demo of this, let’s talk a little more about AppStream.

AppStream is a secure, fully managed application streaming service that lets you deploy streaming URLs that can be used directly from a user’s desktop. While I’m only using Firefox embedded in the sample image, these images are fully customizable to fit your organization’s specific needs.

Now, onto the demo!

Building Isolake — Frontend Demo:

A wrapper — Amazon AppStream

AppStream has two key concepts: Fleets and Stacks.

  • A Fleet is the set of streaming instances that run the image you specify.
  • A Stack consists of an associated fleet, user access policies, and storage configurations.

In our Fleet example below, we specify an on-demand type with a stream.standard.small instance type. This instance is configured to run within a given networking setup pictured below.

The Stack is where we configure specific user settings that are key to our Isolake isolation principles like disabling file download, disabling file upload, and preventing clipboard copying from local devices.

From the Fleet pane, we can generate a streaming URL to access the image. You can also use a User Pool to generate a URL for the user as well as a welcome message.
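
For those following along in Terraform, a Fleet and Stack along these lines might look like the sketch below; the image name, subnets, and security group are placeholder variables, and the exact settings in the Isolake repo may differ.

resource "aws_appstream_fleet" "isolake" {
  name          = "isolake-fleet"           # hypothetical name
  instance_type = "stream.standard.small"
  fleet_type    = "ON_DEMAND"
  image_name    = var.appstream_image_name  # e.g. a Firefox-only custom image

  compute_capacity {
    desired_instances = 1
  }

  vpc_config {
    subnet_ids         = var.appstream_subnet_ids
    security_group_ids = [var.appstream_security_group_id]
  }
}

resource "aws_appstream_stack" "isolake" {
  name = "isolake-stack"  # hypothetical name

  # Disable the data paths discussed above
  user_settings {
    action     = "FILE_DOWNLOAD"
    permission = "DISABLED"
  }
  user_settings {
    action     = "FILE_UPLOAD"
    permission = "DISABLED"
  }
  user_settings {
    action     = "CLIPBOARD_COPY_FROM_LOCAL_DEVICE"
    permission = "DISABLED"
  }
}

resource "aws_appstream_fleet_stack_association" "isolake" {
  fleet_name = aws_appstream_fleet.isolake.name
  stack_name = aws_appstream_stack.isolake.name
}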

Image of an AppStream Fleet
Image of an AppStream Network Details
Image of an AppStream Stack

Once we’ve logged into our AppStream instance, as pictured below with the workspace, we can use Firefox to navigate to our workspace URL. But there are two crucial steps to ensure that connection works: PrivateLink and DNS resolution.

How is this working — PrivateLink and DNS Resolution

There are two key pieces that need to be in place for an individual entering through AppStream to reach the workspace:

  1. An interface endpoint for the Databricks REST Endpoint. This interface endpoint, similar to how it works for backend connectivity, allows traffic to access the Databricks workspace WebApp on the AWS backbone, without needing to access the WebApp through a public IP address.
  2. A private hosted zone (PHZ) associated with the AppStream VPC. This resolves any records going out to the workspace to the Private IP of the interface endpoint, instead of to the public IP.

With this configured, you’ll be able to access the workspace by going into AppStream, opening up Firefox, and entering your domain name.
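
As a hedged Terraform sketch of that wiring, assuming placeholder variables for the AppStream VPC, subnets, security group, workspace deployment name, and the regional workspace endpoint service name from the Databricks PrivateLink documentation:

# Front-end (workspace/REST API) interface endpoint in the AppStream VPC
resource "aws_vpc_endpoint" "frontend_rest" {
  vpc_id             = var.appstream_vpc_id
  service_name       = var.databricks_workspace_service_name
  vpc_endpoint_type  = "Interface"
  subnet_ids         = var.appstream_subnet_ids
  security_group_ids = [var.appstream_endpoint_security_group_id]
}

# Private hosted zone attached to the AppStream VPC
resource "aws_route53_zone" "databricks_phz" {
  name = "cloud.databricks.com"
  vpc {
    vpc_id = var.appstream_vpc_id
  }
}

# Resolve the workspace URL to the private IPs of the interface endpoint
resource "aws_route53_record" "workspace" {
  zone_id = aws_route53_zone.databricks_phz.zone_id
  name    = "${var.workspace_deployment_name}.cloud.databricks.com"
  type    = "A"

  alias {
    name                   = aws_vpc_endpoint.frontend_rest.dns_entry[0].dns_name
    zone_id                = aws_vpc_endpoint.frontend_rest.dns_entry[0].hosted_zone_id
    evaluate_target_health = false
  }
}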

Image of PrivateLink endpoints
Image of a Private Hosted Zone

Locking it up — Private Access Setting

This is great, but what about individuals trying to access the workspace from the public internet? Won’t that still be available?

This is where the Databricks Private Access Setting (PAS) object comes into play. This object is a must-have when using a PrivateLink-enabled workspace. Within it, there is a configurable setting called “Public access”. When public access is enabled, the workspace will also accept inbound connections that do not come through a PrivateLink endpoint. When it is disabled, the workspace will only accept connections through VPC endpoints: either any endpoint registered in the account or only the endpoints you specify, depending on the private access level.
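
A locked-down version of that object, sketched with the Databricks Terraform provider (resource names, variables, and the referenced front-end endpoint registration are hypothetical, and the provider is assumed to be configured with your Databricks account ID), could look something like this:

resource "databricks_mws_private_access_settings" "isolake_pas_lockdown" {
  private_access_settings_name = "isolake-pas-lockdown"
  region                       = var.region

  # Reject any connection that does not arrive over PrivateLink
  public_access_enabled = false

  # Only the listed, registered endpoints may connect ("ACCOUNT" would allow any endpoint in the account)
  private_access_level     = "ENDPOINT"
  allowed_vpc_endpoint_ids = [databricks_mws_vpc_endpoint.frontend_rest.vpc_endpoint_id]  # registration analogous to the relay one shown earlier
}

# The workspace resource would then reference this object via its private_access_settings_id.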

In the following images, we can see the private access setting object, accessing the workspace from the AppStream instance, and the denial that I got when I tried to access the workspace from a browser over the public internet.

Note: As I mentioned, this is an optional setting in the Terraform code. If you switch it to true to enable it, be sure to swap out the private access setting from isolake-pas to isolake-pas-lockdown within the Databricks account console.

Image in Private Access Setting Object in Databricks
Image of accessing a Databricks workspace in an AppStream Instance
Image of accessing Databricks in a normal web browser

Now with that settled, we have implemented isolation from both the backend and frontend of the Lakehouse.

Can we have more though? Can we implement more hardening into the platform features? I mentioned it above, so let’s dig into it with the next section on experimental hardening methods for Databricks workspaces.

Experimental Hardening Methods

WARNING:

In this section, we’ll be discussing experimental hardening methods. These are methods that have a distinct possibility of breaking certain functionalities of the workspace or causing user lockout from certain assets (e.g., S3 buckets).

PROCEED WITH CAUTION.

Cluster Private Subnets — Network Access Control Lists (NACL):

While security groups manage traffic at the instance level, NACLs manage traffic at the subnet level. Because NACLs are stateless, they can become challenging to manage at scale, which is why there is a preference toward managing network traffic with security groups and egress firewalls.

However, some organizations have policies that prohibit 0.0.0.0/0 allow rules on NACLs, even when the traffic is not routable to any public internet destination.

In this experimental feature, after the workspace has been deployed, I restrict the NACLs to only allowing traffic out of the VPC to the IPs found in the prefix list for the S3 gateway endpoint.

Since Databricks workspace validation checks if NACLs are set up correctly, this action can only be taken post-deployment. Thus, in your account console, you will see error messages around the networking configuration.
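
One way to express this post-deployment restriction in Terraform, as a hedged sketch with placeholder variables and rule numbers, is to look up the S3 prefix list and fan its CIDRs out into NACL egress rules:

# Look up the regional S3 prefix list backing the gateway endpoint
data "aws_ec2_managed_prefix_list" "s3" {
  name = "com.amazonaws.${var.region}.s3"
}

locals {
  s3_cidrs = [for entry in data.aws_ec2_managed_prefix_list.s3.entries : entry.cidr]
}

# Allow HTTPS egress only to the S3 prefix list CIDRs.
# Intra-VPC allow rules for 10.0.0.0/23, plus matching ingress rules for ephemeral
# return ports (NACLs are stateless), are assumed to exist at other rule numbers.
resource "aws_network_acl_rule" "egress_s3" {
  count          = length(local.s3_cidrs)
  network_acl_id = var.cluster_nacl_id
  rule_number    = 200 + count.index
  egress         = true
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = local.s3_cidrs[count.index]
  from_port      = 443
  to_port        = 443
}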

Unity Catalog Bucket Policy:

Unity Catalog metastore buckets typically come without a bucket policy. In this feature, we restrict the bucket policy to only allow certain actions from the Databricks control plane, the VPC of the clusters, and the role and source IP of the admin who is responsible for creating and deleting the bucket.

This policy can cause lockout if the system IP and role are incorrect on deployment.

Databricks Workspace Root Storage Bucket Policy:

The Databricks root bucket has a specific bucket policy that allows the Databricks control plane to write various notebook assets like job results, long query results, MLFlow, DBFS, etc. In this bucket policy, post-deployment, we restrict it to certain prefixes that can be written to.

This list of S3 prefixes expands based on which features write to these buckets, and thus can cause downstream issues if a new feature requires a new pathway.

Data Bucket Policy:

To further enforce a layer of read-only access, the bucket policy for the data bucket in question is updated to only allow reading from the VPC that houses the clusters. This, paired along with the restrictive S3 gateway endpoint policy, creates a defense-in-depth scenario.

This policy can cause lockout if the system IP and role are incorrect on deployment.
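
As one concrete illustration of the pattern, a data-bucket policy along these lines might be expressed in Terraform as below. The bucket name, VPC ID, and admin role ARN are placeholder variables, this is a simplified sketch rather than the exact policy in the repo, and the Unity Catalog and root bucket variants presumably follow a similar deny-unless shape.

resource "aws_s3_bucket_policy" "data_bucket_read_only" {
  bucket = var.data_bucket_name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "DenyAccessOutsideClusterVpcExceptAdmin"
        Effect    = "Deny"
        Principal = "*"
        Action    = "s3:*"
        Resource = [
          "arn:aws:s3:::${var.data_bucket_name}",
          "arn:aws:s3:::${var.data_bucket_name}/*"
        ]
        # Deny anything that is neither coming from the cluster VPC nor from the admin role;
        # read-only access itself is still enforced by the gateway endpoint policy and IAM.
        Condition = {
          StringNotEquals = {
            "aws:SourceVpc"    = var.cluster_vpc_id
            "aws:PrincipalArn" = var.admin_role_arn
          }
        }
      }
    ]
  })
}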

Wrapping it all up:

In conclusion, Databricks’ flexibility in executing code can be challenging for security teams to manage in certain situations. In the Isolake deployment design, we simplify that pattern to isolate the backend, and optionally the frontend, from public assets while enabling the power of the built-in features of the Databricks Lakehouse.

Please take a look at our GitHub repository for the Terraform code to implement this today. As always, if you see something, say something — I’d gladly respond to any GitHub issues.

Till next time… In the meantime, I’ll be hiding away in my isolated Lakehouse.

Disclaimer: This blog post is my personal opinion and in no way represents the beliefs of my employer.
