Using Variables in Databricks Workflows

Patrick Pichler
Published in Creative Data
Jul 21, 2022 · 3 min read

How to pass values to downstream tasks in Databricks Workflows by leveraging job task values


Introduction

Not so long ago, Databricks introduced its fully managed orchestration service, Databricks Workflows. It allows you to orchestrate any combination of notebooks, SQL, Spark, and ML models, as well as to integrate with other tools and systems via web services. Before this, you mostly had to rely on third-party orchestration tools to implement more complex workflows, or you implemented dependencies right inside the notebooks by using %run or dbutils.notebook.run(), an approach known as Notebook Workflows.

Problem

In any case, using variables across notebooks/tasks has always been something of a challenge, and at the time of this writing Databricks Workflows doesn’t provide anything more streamlined either. You still mostly either persist values/data to storage and pick them up somewhere downstream, or you use the aforementioned Notebook Workflows, which let you pass values to the input widgets of another notebook. However, building complex workflows entirely with notebooks gets messy very quickly: firstly, it lacks a visual representation of the workflow, and secondly, there are no built-in monitoring or error-handling capabilities. Therefore, in this article I will show you how to keep the Databricks Workflows experience while still using variables across tasks.
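
For reference, the classic Notebook Workflows pattern looks roughly like the sketch below; the notebook path ./child_notebook and the parameter name run_date are made up for illustration:

# Parent notebook: call a child notebook and pass a parameter (old approach).
result = dbutils.notebook.run("./child_notebook", 600, {"run_date": "2022-07-21"})

# Child notebook: read the parameter via an input widget and return a value.
run_date = dbutils.widgets.get("run_date")
dbutils.notebook.exit("some-return-value")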

Solution

The solution to this problem is called task values and is part of the jobs utility of Databricks Utilities (dbutils.jobs). The idea is quite simple but does not yet seem to be widely used.

The utility we use is actually a sub-utility, taskValues, that is only available for Python and allows you to set and get arbitrary values during a job run. These values are called task values within a job and, once a task has completed and set them, they are even visible in Databricks Workflows:

[Image by Author: task values displayed in the Databricks Workflows UI]

A task value is created with the dbutils.jobs.taskValues.set() method and is identified by a unique key together with the name of the task in Databricks Workflows that set it; each task can set or get multiple task values. The dbutils.jobs.taskValues.get() method lets you access those values in any downstream task during the same job run. For debugging outside of Databricks Workflows, you also need to provide a debug value, and you can optionally define a default value that is returned if the key cannot be found.

set example:

# Set a task value in the current task, identified by the key "variable-name".
dbutils.jobs.taskValues.set(key="variable-name", value="value")

get example:

# Get a task value that was set by the upstream task "task-name-dbx-workflow".
# "default" is returned if the key cannot be found within the job,
# "debugValue" is used when the notebook runs outside of a job.
dbutils.jobs.taskValues.get(taskKey="task-name-dbx-workflow",
                            key="variable-name",
                            default=7,
                            debugValue="debug-value")
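
To make the mechanics concrete, here is a minimal two-task sketch; the task names prepare_data and train_model, the key row_counts, and the values themselves are hypothetical:

# Task "prepare_data" (upstream): compute something at runtime and publish it.
row_counts = {"train": 80000, "test": 20000}
dbutils.jobs.taskValues.set(key="row_counts", value=row_counts)

# Task "train_model" (downstream): read the value set by the upstream task.
row_counts = dbutils.jobs.taskValues.get(taskKey="prepare_data",
                                         key="row_counts",
                                         default={},
                                         debugValue={"train": 1000, "test": 250})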

You can set up to 250 task values per job run, and each value must be representable internally as JSON, with a maximum serialized size of 48 KiB.
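
If you are unsure whether a value will fit, you can check its serialized size yourself before setting it. This is just a defensive sketch, not an official API; the key etl_summary and the payload are made up:

import json

payload = {"status": "ok", "rows_processed": 123456}

# Task values must be JSON-serializable and at most 48 KiB when serialized.
if len(json.dumps(payload).encode("utf-8")) <= 48 * 1024:
    dbutils.jobs.taskValues.set(key="etl_summary", value=payload)
else:
    raise ValueError("Payload too large for a task value; persist it to storage instead.")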

Conclusion

This approach is especially interesting for passing values that are determined only at runtime, such as intermediate results. However, you could also combine it with static job task parameters: define a widget only once in the first task/notebook and assign its value to a task value, so it can be used throughout the job without having to populate widget parameters manually for each task, as sketched below.
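
As a rough sketch of that pattern (the widget name environment and the task name init_task are made up): the first task reads the job parameter once via a widget and republishes it as a task value, which downstream tasks then read without defining widgets of their own.

# First task/notebook: read the job parameter once and republish it as a task value.
dbutils.widgets.text("environment", "dev")        # hypothetical job parameter
environment = dbutils.widgets.get("environment")
dbutils.jobs.taskValues.set(key="environment", value=environment)

# Any downstream task: no widget needed, just read the task value.
environment = dbutils.jobs.taskValues.get(taskKey="init_task",
                                          key="environment",
                                          default="dev",
                                          debugValue="dev")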

Altogether, Databricks Workflows already provides a great built-in orchestration experience, but it still lacks quite a few features compared to other dedicated orchestration tools. Personally, I believe another very basic missing feature is the ability to control the conditions under which downstream tasks run in relation to one another. In Apache Airflow, for instance, this concept is called trigger rules; in SQL Server Integration Services (SSIS), it is known as precedence constraints.
