The Balanced Lakehouse: Load-Balancing Databricks Workspaces Using Cloud CDNs

Greg Wood
Databricks Platform SME
Jan 9, 2024

This article was co-written with Ganesh Rajagopal as part of the Databricks Platform articles on HA/DR topics. Keep an eye out for more posts soon!

Intro

Over the past several months, we’ve heard from many customers who need cross-region, or even cross-cloud, availability. There are tools, such as the excellent Terraform Exporter, that allow the simple export and import of workspace assets (e.g., notebooks, jobs, and cluster configurations). However, a thorn in the side of any existing DR strategy is the need to switch URLs when a workspace failover occurs; this has the potential to impact every single user and downstream consumer of Databricks, which could number in the thousands (or even tens of thousands!). Instead, it would be ideal to have an abstracted URL in front of your Lakehouse that you control, and that does not change. In this blog, we’ll provide a simple, cloud-native approach to achieving stable URLs for your Databricks Lakehouse.

Background: Why Stable URLs?

A stable URL is an important hallmark of the SaaS age: it provides a single point of entry for users, applications, and automation tools that won’t change over time. The more central an application is, the more important it is that its URL does not change; for applications at the core of a business’s data architecture, a single path change could impact petabytes of data and thousands of consumers. As the core of the Lakehouse architecture, Databricks often falls into this position: hundreds of downstream third-party services and applications may rely on a Databricks SQL Warehouse or JDBC endpoint to process mission-critical workloads. This makes recovering from a disaster, already a difficult process in the best of times, even more difficult.

Ideally, what most customers would like is an immutable, customizable front-end URL that abstracts away the complexity of different workspaces, regions, and services. This is exactly what we were able to accomplish using cloud-native Content Delivery Network (CDN) services such as AWS CloudFront and Azure Front Door.

A CDN Crash Course

As the name suggests, a CDN is a tool that can be used to deliver content to consumers in an optimized way; historically, CDNs have been used to provide caching services at globally distributed locations that are often “closer” to the end user than the source application. This often consists of a network of serving devices that users will directly interact with, along with backend services that control caching behavior, security, and custom logic that needs to be built into the delivery process. A simple abstraction is shown below.

A simplified CDN architecture.

In this case, a user in New York is attempting to access an application in London; while the user might be able to directly access the application, there may be a few problems:

  • Traffic going directly from the user to the application would go over public internet routes, which may not be optimized for latency and could include additional hops.
  • The application may not have a permanent address; for example, an ephemeral Databricks cluster URL might change weekly, daily, or even hourly. The user would need to update their connection parameters every time the URL changed, and they might not even know when that happened.
  • The application owner may not want to allow direct traffic; this could pose a security threat.
  • Directly accessing the application means caching may not occur, so there would be a full round trip every time the user wanted to access the application.

In this case, the CDN solves these problems by providing an abstraction layer; the user accesses the CDN endpoint instead of the application itself and lets the backend decide how to route the traffic appropriately using the high-speed cloud backbone. In many cases, the CDN will use a cache instead of performing a full round-trip to the app, meaning a faster experience. Most Cloud CDNs also include security features such as WAFs to filter unwanted traffic.

When it comes to Databricks URLs, there are three components to consider: the workspace itself, clusters, and SQL Warehouses. The workspace URL is used by users (for day-to-day access to the UI) and by any API calls (e.g., automating job creation and scheduling). Clusters and SQL Warehouses are most often accessed by third-party tools such as Tableau or Power BI; their URLs are typically stable for the lifetime of the cluster or warehouse but are not guaranteed to remain stable indefinitely.
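
To make this concrete, the three URL types look roughly like the following (the hostnames and IDs here are placeholders, and the exact formats vary by cloud):

# Workspace URL: UI access and REST API calls
https://dbc-a1b2c3d4-e5f6.cloud.databricks.com

# Cluster HTTP path: JDBC/ODBC clients, tied to one cluster
/sql/protocolv1/o/<workspace-id>/<cluster-id>

# SQL Warehouse HTTP path: BI tools and JDBC/ODBC, tied to one warehouse
/sql/1.0/warehouses/<warehouse-id>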

So, how do Cloud CDNs help us with Stable URLs on Databricks? The next section will outline an example pattern.

Implementing Load-Balanced Workspaces with CDN

For the purpose of Databricks Stable URLs, we forgo the caching mechanisms of CDNs; we’re interested only in the ability of a CDN to provide a front-end URL that we can point to various back-end services without end users needing to take special steps. AWS (CloudFront), Azure (Front Door), and GCP (Google Cloud CDN) all have CDN services capable of providing this functionality, although the implementations differ slightly. The diagram below shows a generic pattern for CDN-backed Stable URLs.

A high-level CDN architecture for load-balanced Databricks workspaces.

Starting from the bottom of the diagram and moving up, we have two Databricks workspaces; typically, these would be in separate regions to provide redundancy in the event of a regional outage. Each of these workspaces has a SQL Warehouse (and would likely have many; we show one for clarity). The two workspaces are kept in sync via a tool such as the Terraform Exporter; data would also be synced between regions using, e.g., Delta Deep Clone. The full implementation of this replication is out of scope for this article; the Databricks Blog has several examples of reference deployments.

Moving up the diagram, we have routes to the active workspace/warehouse and, in the case of a failover, routes to the secondary workspace/warehouse. These come from the CDN endpoint; in this case, we use a single endpoint, although multiple endpoints could be used. In our simple implementation, we control the warehouse routes using cloud-native tools such as AWS CloudFront Functions or Azure Front Door Rule Sets; more complicated logic can be implemented using AWS Lambda or Azure Functions as well. Warehouses need to be abstracted because every warehouse has a different unique ID, even if it is created from an identical Terraform template; for the user to access a single, stable URL, we hide the warehouse itself behind a path such as /warehouse. This path could be more descriptive (e.g., /finance-warehouse) to allow for many warehouses leveraged by different business units or use cases. The path is appended to the chosen front-end URL; most CDNs allow custom CNAMEs so that your users can access a “friendly” URL.
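
As a concrete (and hypothetical) example, a BI tool that previously pointed directly at a warehouse would instead point at the CDN; the warehouse ID below is a placeholder:

# Direct connection (breaks on failover):
Host:      primary-ws.cloud.databricks.com
HTTP path: /sql/1.0/warehouses/abc123def456

# CDN-backed connection (stable across failovers):
Host:      my-endpoint.cdn.net
HTTP path: /warehouse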

Finally, we have automation tooling in place to control whether users are routed to the primary or secondary workspace. This will vary depending on the cloud, as will the way you store the various parameters needed to point to different warehouses. Whatever the chosen solution, two main things need to happen when a disaster occurs:

  • The origin needs to be updated from the primary URL (e.g., primary-ws.cloud.databricks.com) to the secondary URL (e.g., secondary-ws.cloud.databricks.com).
  • The warehouse ID(s) need to be updated so that when a user visits my-endpoint.cdn.net/warehouse, the appropriate warehouse is targeted; in AWS, this can be done using a KeyValueStore, and in Azure, it can be accomplished with a Rule Set or Azure Key Vault.
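
To make these two steps concrete, here is a minimal Python sketch of the AWS flavor of this automation, using boto3 to swap the CloudFront origin and update the KeyValueStore read by the CloudFront Function. The distribution ID, KVS ARN, hostname, and warehouse ID are placeholders, and error handling is omitted:

import boto3

# Placeholder identifiers; substitute your own.
DISTRIBUTION_ID = "E1EXAMPLE"
KVS_ARN = "arn:aws:cloudfront::123456789012:key-value-store/example"
SECONDARY_HOST = "secondary-ws.cloud.databricks.com"
SECONDARY_WAREHOUSE_ID = "abc123def456"

def fail_over():
    # Step 1: point the CloudFront origin at the secondary workspace.
    cf = boto3.client("cloudfront")
    current = cf.get_distribution_config(Id=DISTRIBUTION_ID)
    dist_config = current["DistributionConfig"]
    # Assumes a single origin; select the right one if you have several.
    dist_config["Origins"]["Items"][0]["DomainName"] = SECONDARY_HOST
    cf.update_distribution(
        Id=DISTRIBUTION_ID,
        DistributionConfig=dist_config,
        IfMatch=current["ETag"],  # ETag acts as an optimistic lock
    )

    # Step 2: update the KeyValueStore so /warehouse redirects to the
    # secondary workspace and warehouse. (The KVS data-plane API uses
    # SigV4A signing, which may require installing botocore[crt].)
    kvs = boto3.client("cloudfront-keyvaluestore")
    etag = kvs.describe_key_value_store(KvsARN=KVS_ARN)["ETag"]
    etag = kvs.put_key(
        KvsARN=KVS_ARN, Key="workspace", Value=SECONDARY_HOST, IfMatch=etag
    )["ETag"]
    kvs.put_key(
        KvsARN=KVS_ARN, Key="warehouse", Value=SECONDARY_WAREHOUSE_ID, IfMatch=etag
    )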

The cloud function in this simple example is very basic: it is invoked when a user hits the /warehouse path and simply forms the appropriate redirect URL. That is, when a user visits my-endpoint.cdn.net/warehouse, the function pulls the active warehouse ID and forms a redirect header pointing the user to that URL. In Azure, this can be done using a Rule Set; in AWS, a simple CloudFront Function like the one below can be used.

import cf from 'cloudfront';

// KeyValueStore holding the active workspace/warehouse IDs
const kvsId = '<my-kvs-id>';
const kvsHandle = cf.kvs(kvsId);

async function handler(event) {
    // Get the active warehouse/workspace IDs from the KeyValueStore
    const whId = await kvsHandle.get('warehouse');
    const wsId = await kvsHandle.get('workspace');
    // Form the redirect URL pointing at the active SQL Warehouse
    const newUrl = `https://${wsId}/sql/1.0/warehouses/${whId}`;

    // Return a 302 so the client is redirected to the active warehouse
    const response = {
        statusCode: 302,
        statusDescription: 'Found',
        headers: {
            location: { value: newUrl }
        }
    };
    return response;
}
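
One design note on the sketch above: it returns an HTTP 302, so clients land on the real workspace URL after following the redirect. Most browsers and BI tools handle this transparently, but the workspace hostname is still visible to the client; keeping the stable hostname end to end would instead require a rewrite at the CDN layer, which (at the time of writing) means heavier tooling such as AWS Lambda@Edge to select the origin per request.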

See this repo for sample Terraform code for Azure and AWS to deploy CDN infrastructure for a Stable URL.

The Fine Print: Costs & Gaps

We’ve laid out a simple solution for Stable URLs above; this pattern could be expanded to cover many scenarios of varying complexity. Especially when cloud functions are added, there are many interesting applications that could be implemented without much additional effort. However, there are a few overarching points you should consider when thinking about this architecture:

  • Nothing is free on the cloud. CDN services are useful and usually fairly simple, but you should consider how much traffic will be going through the service and plan your costs accordingly. In this architecture, your data likely will not traverse the CDN in any meaningful way, but all user traffic will, and this can add up. Use the AWS, Azure, or GCP pricing calculator to plan your costs and make sure a CDN is a viable option.
  • Users who leverage Personal Access Tokens (PATs) will not be able to use the same PAT across different workspaces. Today, PATs exist at a workspace level only. If possible, use OAuth to authenticate instead, since it can work against a stable CDN URL (see the sketch after this list).
  • Any system can get complex when hundreds of moving pieces are involved, this one included. If you plan on backing hundreds of warehouses with this approach, be sure to come up with a good mapping scheme. Doing the 1:1 mapping by hand probably isn’t the best choice.
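
To illustrate the OAuth point above, here is a minimal sketch using the Databricks SQL Connector for Python with OAuth user-to-machine authentication; the hostname and HTTP path are the hypothetical stable values from earlier, and it is worth verifying that your clients follow the CDN redirect transparently:

from databricks import sql

# Placeholder values: the stable CDN hostname and warehouse path.
with sql.connect(
    server_hostname="my-endpoint.cdn.net",
    http_path="/warehouse",
    auth_type="databricks-oauth",  # OAuth U2M instead of a workspace-scoped PAT
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchone())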

Conclusions

In this blog, we laid out a simple solution for achieving Stable URLs across multiple Databricks workspaces. Those workspaces might be in the same region, across regions, or even on different clouds, and users don’t necessarily need to know the difference. This can be a lifesaver when you need to switch workspaces and want to avoid hundreds or thousands of end users changing their connection strings, and it provides a useful abstraction layer for the Lakehouse. If you’re interested in trying this pattern out, check out our Terraform template, which creates the simple Azure Front Door or AWS CloudFront architecture shown above.

Happy Lakehousing!
