Parameterization In Databricks

Matt Weingarten
3 min read · May 4, 2024


Klay Thompson: new Databricks CEO

Introduction

We maintain a growing number of Databricks jobs. To cut down on hardcoding, we aim for parameterization as much as possible. This has definitely been helpful when migrating to different environments, such as new Snowflake databases and new Databricks workspaces.

There are a few different ways to approach parameterization in Databricks, so I wanted to give an overview of those approaches before discussing how we do it currently.

Job Parameters vs. Task Parameters

The two main types of parameters in Databricks workflows are job parameters and task parameters. The naming pretty much explains how they’re used: job parameters are set at the job level and can be accessed by all tasks, while task parameters are scoped to a single task. Job parameters keep their configured values from run to run (unless you override them manually), so they should be reserved for more static values that you always want to be able to reference. If you pass a job parameter and a task parameter with the same name, the job parameter’s value takes precedence.
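For notebook tasks, both kinds of parameters show up as notebook widgets, so they can be read the same way in code. A minimal sketch (the parameter names here are made up):

```python
# Reading parameters inside a notebook task. Both job parameters and notebook
# task parameters arrive as widgets, so dbutils.widgets.get() works for either.
# "run_mode" and "target_table" are illustrative names, not real ones.

# Declaring the widget gives the notebook a default for interactive runs.
dbutils.widgets.text("run_mode", "incremental")

run_mode = dbutils.widgets.get("run_mode")
target_table = dbutils.widgets.get("target_table")  # must be passed by the job

print(f"Running in {run_mode} mode against {target_table}")
```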

There’s also the concept of task values, which are values you can set within one task of a job and then access from all other tasks in that job. This is helpful if you want to compute something at runtime and use it everywhere else. Databricks also provides dynamic value references so you can inject values such as the run ID and start time/date into a job. We use a few of those to ensure automated runs operate on the current date.
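As a rough illustration of both ideas (the task and key names are hypothetical):

```python
# Task values: one task publishes a value, downstream tasks read it.
# In a task named "setup":
dbutils.jobs.taskValues.set(key="run_date", value="2024-05-04")

# In any later task in the same job run:
run_date = dbutils.jobs.taskValues.get(
    taskKey="setup",            # the task that set the value
    key="run_date",
    default="1970-01-01",       # returned if the key was never set
    debugValue="2024-05-04",    # used when running the notebook interactively
)

# Dynamic value references live in the job/task configuration rather than in
# code, e.g. a task parameter "process_date" set to "{{job.start_time.iso_date}}"
# or "run_id" set to "{{job.run_id}}"; the notebook then reads them like any
# other parameter:
process_date = dbutils.widgets.get("process_date")
```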

Our Approach

When we were first getting our feet wet with Databricks, we didn’t consider job parameters. I can’t really remember if it was naivete or just thinking it’d be easier to control parameters at run time. We went with a task parameter approach, and in multi-task workflows we used the first task to assign task values for all of our parameters so they could easily be reused across tasks (see the sketch below). This cut down on duplication.
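A rough sketch of that pattern, with made-up task and parameter names (not our actual code):

```python
# First task ("parameters"): read the task parameters it was given and
# republish them as task values so every downstream task can reuse them.
PARAM_NAMES = ["snowflake_database", "s3_output_path", "process_date"]

for name in PARAM_NAMES:
    dbutils.jobs.taskValues.set(key=name, value=dbutils.widgets.get(name))

# Any downstream task can then do, for example:
# s3_output_path = dbutils.jobs.taskValues.get(
#     taskKey="parameters", key="s3_output_path", debugValue="s3://dev-bucket/out/"
# )
```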

We created a few helper functions that would take either a task parameter or a task value and retrieve the associated value, falling back to a specified default (determined based on context) if it couldn’t be found. What we eventually realized, though, was that we were using a lot of the same parameters in different jobs, which meant we’d have to make updates in many different places whenever those values changed. How can we do better?
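A simplified sketch of what such a helper could look like (illustrative, not our exact implementation):

```python
def get_param(name, default=None, task_key="parameters"):
    """Return a task parameter if present, otherwise a task value published
    by `task_key`, otherwise the supplied default."""
    try:
        return dbutils.widgets.get(name)
    except Exception:
        pass  # widget not defined / parameter not passed to this task
    try:
        return dbutils.jobs.taskValues.get(
            taskKey=task_key, key=name, default=default, debugValue=default
        )
    except Exception:
        return default
```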

A global variables file seemed like a good solution. Essentially, we store all of our common variables (Snowflake connections, S3 locations, database and table names, etc.) in this file. We modified the helper functions so that if they couldn’t find a parameter passed to the task itself, they’d look in the global file to see if it was defined there. Only if it couldn’t be found in either place would the old default value be used. This cut down on the number of parameters we have to pass to each job and also makes bulk updates easier. Since all of our job notebooks are stored in version control, the file is easy to use: it sits at the same relative path as everything else.
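In sketch form, the lookup chain ends up looking something like this (file and variable names are illustrative):

```python
# globals.py (hypothetical), stored alongside the job notebooks in the repo:
GLOBALS = {
    "snowflake_database": "ANALYTICS",
    "s3_base_path": "s3://team-bucket/data/",
    "audit_table": "audit.job_runs",
}
```

```python
# In the shared helper: task parameter -> globals file -> default.
from globals import GLOBALS

def get_param(name, default=None):
    try:
        return dbutils.widgets.get(name)  # an explicit task parameter wins
    except Exception:
        pass
    return GLOBALS.get(name, default)     # then the globals file, then the default
```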

Retrospective

It took us some time to realize how useful a global variables file would be, but I’m glad we tackled the issue of parameter sprawl (if only we could go back in time and have it from the beginning). Now we don’t have to worry about specifying so many variables every time we set up a new job, and we can keep the job code environment-agnostic, so dev jobs pick up dev values and prod jobs pick up prod values.
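One way to get that dev/prod behavior (a sketch, not necessarily how we do it) is to key the shared values by environment and select the right block from a single parameter:

```python
# Hypothetical: pick dev or prod values based on one "env" parameter.
ENVIRONMENTS = {
    "dev":  {"snowflake_database": "ANALYTICS_DEV",  "s3_base_path": "s3://team-bucket-dev/"},
    "prod": {"snowflake_database": "ANALYTICS_PROD", "s3_base_path": "s3://team-bucket-prod/"},
}

env = dbutils.widgets.get("env")   # e.g. "dev" or "prod", set per workspace/job
GLOBALS = ENVIRONMENTS[env]
```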

We also should have used job parameters for a few of the more static variables (or at least the ones that are unlikely to change within a workspace). There’s an argument for having everything be configurable, but a middle ground is probably the best answer ultimately.

Conclusion

Yes, a lot of rambling on a basic topic, but approaching Databricks with a proper software engineering mentality can be challenging sometimes. I’m curious to know how others have tackled this task (pun intended) over time.

