Databricks Job Clusters: Transient Workflows

Matt Weingarten
3 min read · Aug 29, 2022


Databricks > Snowflake. Yeah, I said it

Introduction

When I wrote about Databricks best practices a few weeks ago, I mentioned that having an isolated cluster for job runs was a good approach, since it keeps scheduled work separate from the interactive queries and development people are doing on a daily basis. While that's true, job clusters take that idea a step further. Let's explore.

Job Clusters

Job clusters are isolated to a particular job, which is useful when one job needs a different configuration than the others (larger nodes, different Spark settings, etc.). Each run of a job gets its own cluster, as opposed to multiple jobs sharing the same cluster, which is what I had described in that best practices post. The result is even less resource contention, which is a good thing.

Furthermore, job clusters are transient: the cluster spins up, the job runs, and then the cluster terminates. This is what we were doing with EMR to keep costs optimal, so it makes sense to extend the same logic to Databricks. We won't be paying for the idle period before a cluster shuts itself down (although you can make that timeout shorter than the default on an interactive cluster if you want to).
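For comparison, here's roughly where that idle timeout lives on an interactive cluster definition (the autotermination_minutes field on the Clusters API). This is a minimal sketch; the cluster name, node type, and sizing below are illustrative placeholders, not recommendations.

```python
# A minimal sketch of an interactive cluster definition (Databricks Clusters API,
# POST /api/2.0/clusters/create). All values here are illustrative.
interactive_cluster = {
    "cluster_name": "shared-dev",
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    # Idle minutes before the cluster shuts itself down; lowering this shrinks
    # the window you keep paying for after the last query finishes.
    "autotermination_minutes": 30,
}
```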

Configuring Job Clusters

Job clusters can be configured in the Databricks UI when setting up a job (or editing an existing one). Change the cluster to a shared job cluster and fill in the configuration details (the frustrating part is that there's no way to copy the settings from an existing cluster, so there's a lot of copy/paste needed to recreate the same basic settings as an interactive cluster). After that, you're pretty much good to go.
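For reference, here's a minimal sketch of the same setup expressed against the Jobs API (2.1) instead of the UI: a shared job cluster defined once under job_clusters and referenced by a task. The workspace URL, token, job name, notebook path, and cluster sizing are all placeholder assumptions.

```python
import os
import requests

# Hypothetical workspace URL and token, read from the environment for illustration.
HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]

# A job with a shared job cluster: the cluster is defined once under "job_clusters"
# and referenced by each task via "job_cluster_key". Values are illustrative.
job_payload = {
    "name": "nightly-etl",
    "job_clusters": [
        {
            "job_cluster_key": "etl_cluster",
            "new_cluster": {
                "spark_version": "10.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
                "spark_conf": {"spark.sql.shuffle.partitions": "200"},
                "custom_tags": {"team": "data-eng", "env": "prod"},
            },
        }
    ],
    "tasks": [
        {
            "task_key": "run_etl",
            "job_cluster_key": "etl_cluster",
            "notebook_task": {"notebook_path": "/Repos/data-eng/etl/nightly"},
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_payload,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```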

One caveat is that you might have additional libraries on your interactive cluster that you'd like on your job cluster as well. There's no way to install them the old-fashioned way, since the cluster only exists for the duration of the run. Instead, you need to add them as dependent libraries in the job definition, so that they're installed on the cluster as it spins up.
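A hedged sketch of what that looks like in a task definition: the Jobs API libraries field attaches PyPI packages, wheels, or jars, and they're installed as the job cluster starts. The package names and paths below are placeholders.

```python
# A task definition with dependent libraries attached (Jobs API 2.1 "libraries" field).
# The libraries are installed on the job cluster as it spins up; package names and
# artifact paths here are placeholders.
task_with_libraries = {
    "task_key": "run_etl",
    "job_cluster_key": "etl_cluster",
    "notebook_task": {"notebook_path": "/Repos/data-eng/etl/nightly"},
    "libraries": [
        {"pypi": {"package": "boto3==1.24.0"}},                                  # PyPI package
        {"whl": "dbfs:/FileStore/wheels/internal_utils-1.0-py3-none-any.whl"},   # in-house wheel
        {"jar": "dbfs:/FileStore/jars/custom-connector.jar"},                    # JVM library
    ],
}
```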

Another point worth noting is that runtimes will go up for jobs on job clusters, since you have to wait for a cluster to spin up before the code runs. With an interactive cluster that's almost always on, you don't lose that time. If SLAs are tight, you might need a different approach.

The Future State of Databricks Jobs

We're relatively early in our adoption of Databricks as our day-to-day tool, so we're learning best practices as we go until we become more established. Currently, setting up these job clusters takes a lot of manual work (multiplied by however many jobs exist).

Obviously, a more automated solution is necessary, and that's where more structured job deployments come into the picture. Jobs shouldn't be created through the UI. Rather, a CI/CD tool should take a notebook or codebase (stored in version control) and use it, along with other parameters (cluster tags, the role to use, Spark settings, etc.), to create the job. With this approach, it's fairly easy to spin up a job and put a job cluster into effect (a one-time YAML definition, perhaps?).
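As a sketch of that idea, a pipeline step could read a job definition from version control and translate it into a Jobs API call. The YAML schema here is entirely made up for illustration; real tooling (dbx, Terraform, and the like) defines its own.

```python
import os
import requests
import yaml  # PyYAML

# Hypothetical job.yaml kept in version control:
#   name: nightly-etl
#   notebook_path: /Repos/data-eng/etl/nightly
#   spark_version: 10.4.x-scala2.12
#   node_type_id: i3.xlarge
#   num_workers: 4
#   tags:
#     team: data-eng
with open("job.yaml") as f:
    cfg = yaml.safe_load(f)

# Translate the YAML into a Jobs API 2.1 payload with a single shared job cluster.
payload = {
    "name": cfg["name"],
    "job_clusters": [{
        "job_cluster_key": "main",
        "new_cluster": {
            "spark_version": cfg["spark_version"],
            "node_type_id": cfg["node_type_id"],
            "num_workers": cfg["num_workers"],
            "custom_tags": cfg.get("tags", {}),
        },
    }],
    "tasks": [{
        "task_key": "main",
        "job_cluster_key": "main",
        "notebook_task": {"notebook_path": cfg["notebook_path"]},
    }],
}

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
resp.raise_for_status()
```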

That's the future state where I'd like to see our Databricks work. Ad-hoc analysis can stay in the UI, but anything being productionized should go through a normal deployment process, with proper code review as well. Keep production use of the UI to a minimum.

Conclusion

Job clusters are the way to go when it comes to controlling Databricks costs and keeping environments isolated. While they can be frustrating to set up, a CI/CD approach will make that a simple process in the future.


Matt Weingarten

Currently a Data Engineer at Samsara. Previously at Disney, Meta, and Nielsen. Bridge player and sports fan. Thoughts are my own.