Improve job cluster utilization with Azure Data Factory and Azure Databricks workflows

Umesh Pawar
5 min read · Jun 26, 2023


Introduction:

In today’s data-driven world, efficiently managing and orchestrating workload jobs is crucial. Azure Databricks provides several options for running automated jobs within its environment. In this blog post, we will explore how to orchestrate Databricks workloads and jobs using Azure Data Factory (ADF) in a cost-efficient manner.

Git repo for ADF pipeline template: umeshpawar2188/Azure-data-factory (github.com)

Options for Running Automated Jobs in Azure Databricks:

Using Jobs in Databricks: One way to automate jobs in Azure Databricks is by utilizing its built-in job scheduling feature. Jobs can be scheduled directly within the Databricks environment, allowing for easy automation and execution of tasks.

Using Databricks Workflows: Databricks workflows enable the stitching together of multiple individual JAR, notebook, or Python jobs. This approach provides flexibility and allows for the coordination of complex data processing tasks within the Databricks environment.

Azure Data Factory: Another option for orchestrating Databricks tasks is by leveraging Azure Data Factory. ADF allows you to schedule and monitor Databricks tasks using Notebook or JAR activities. This approach offers the additional benefit of centralizing monitoring for all aspects of data processing in a single location.

While the choice of the correct option for orchestrating Databricks jobs depends on specific scenarios, utilizing ADF to orchestrate tasks offers distinct advantages. Here are some key benefits:

Centralized Monitoring: Using ADF allows you to monitor all parts of the data processing pipeline in a single place. This centralized view enhances visibility and simplifies troubleshooting, enabling effective management of the entire workflow.

Cost Efficiency: In a production environment, opting for a job cluster in Databricks proves to be a cost-efficient choice. Job clusters are typically more economical compared to interactive clusters, ensuring optimal resource allocation and budget management.

In an ADF pipeline, when we call different Databricks tasks using Notebook or JAR activities on a job cluster, each Spark task from a different ADF activity runs on a separate job cluster.

Although job clusters are cheap, creating a separate job cluster for each task of the ADF pipeline adds extra startup delay and cost.

This additional cost and delay can be avoided by re-using the same job cluster for different Spark tasks from the ADF pipeline. To achieve this, we will be using a combination of Databricks workflows and Azure Data Factory.

Databricks job automation with Data Factory
  1. In the example below, I have 3 different notebooks to be orchestrated. Instead of adding 3 individual Notebook activities to the Data Factory pipeline, I have stitched these 3 notebooks together as 1 Databricks workflow; a sketch of an equivalent job definition follows the screenshot.
Databricks workflow with Job id
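
For reference, a multi-task workflow like this can also be defined through the Jobs API (the multi-task format is part of Jobs API 2.1). Below is a minimal Python sketch of such a job definition; the job name, notebook paths, and cluster spec are illustrative assumptions rather than the exact workflow in the screenshot. The key point is that all three notebook tasks share a single job cluster via job_cluster_key, so the cluster start-up cost is paid only once.

import requests

ADB_WORKSPACE_URL = "https://<adb-workspace-url>"  # placeholder workspace URL
TOKEN = "<bearer-token>"  # in practice, retrieve this from Key Vault

# Illustrative multi-task job: three notebooks chained on one shared job cluster
job_definition = {
    "name": "adf-orchestrated-workflow",  # hypothetical job name
    "job_clusters": [
        {
            "job_cluster_key": "shared_job_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # illustrative runtime
                "node_type_id": "Standard_DS3_v2",    # illustrative VM size
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "notebook_1",
            "notebook_task": {"notebook_path": "/Workspace/notebooks/notebook_1"},  # hypothetical path
            "job_cluster_key": "shared_job_cluster",
        },
        {
            "task_key": "notebook_2",
            "depends_on": [{"task_key": "notebook_1"}],
            "notebook_task": {"notebook_path": "/Workspace/notebooks/notebook_2"},
            "job_cluster_key": "shared_job_cluster",
        },
        {
            "task_key": "notebook_3",
            "depends_on": [{"task_key": "notebook_2"}],
            "notebook_task": {"notebook_path": "/Workspace/notebooks/notebook_3"},
            "job_cluster_key": "shared_job_cluster",
        },
    ],
}

# Create the job; the response contains the job_id used later by run-now
response = requests.post(
    f"{ADB_WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_definition,
)
print(response.json())  # e.g. {"job_id": 906999927399400}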

2. ADF pipeline to trigger the Databricks workflow and track its status

ADF pipeline

3. Simplified flowchart of the above ADF pipeline

4. The Databricks Jobs REST API is used to trigger and monitor the jobs. Read more about the Jobs API here: Jobs API 2.0 — Azure Databricks | Microsoft Learn

5. A Web activity in ADF is used to submit/trigger the Databricks workflow job.

Trigger job/workflow

URL: https://<adb-workspace-url>/api/2.0/jobs/run-now

Get this job_id from the Databricks workspace after creating the workflow, as shown in the step 1 screenshot.

Request body (job_id): {"job_id": 906999927399400}

Generate a bearer token and use it for authorization. As a good practice, use Azure Key Vault to store and retrieve any credentials. The same trigger call can also be tried outside ADF, as sketched below.
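
Here is a minimal Python sketch of that run-now request, useful for verifying the job_id and token before wiring up the Web activity. The workspace URL, token retrieval, and job_id are placeholders to be replaced with your own values; it mirrors the POST the Web activity makes.

import requests

ADB_WORKSPACE_URL = "https://<adb-workspace-url>"  # placeholder workspace URL
TOKEN = "<bearer-token>"  # in practice, retrieve this from Key Vault
JOB_ID = 906999927399400  # job_id of the workflow created in step 1

# Trigger the workflow; this is the same POST the Web activity issues
response = requests.post(
    f"{ADB_WORKSPACE_URL}/api/2.0/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID},
)
run_id = response.json()["run_id"]  # run-now returns the run_id for status tracking
print(run_id)

In the ADF pipeline itself, this run_id comes back in the Web activity output, which is what step 6 stores in a pipeline variable.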

6. Step 5 returns a run_id, which is later used to track the status of the job. To access the run_id easily, store it in a pipeline variable.

7. Once the job is submitted, another GET REST API call is used to track the status of the job.

URL: https://<adb-workspace-url>/api/2.0/jobs/runs/get?run_id=@{variables('run_id')}

8. The REST API call used to submit/trigger the workflow does not wait for the job to complete; it returns a run_id, which is used to track the status of the job as shown in step 7.

9. After the Databricks workflow is triggered, it takes some time to start the job cluster before the Spark notebooks actually run on it. So we need to wait and keep checking until the job completes.

10. To check the status, use an Until activity with the expression below, which keeps checking the status of the job until it changes from RUNNING to something else:

@not(equals(activity('Check Job Status_copy1').output.state.life_cycle_state,'RUNNING'))

Here, a Wait activity is added to avoid overly frequent calls to the Databricks Jobs API for the status.

And the Web activity “Check Job Status_copy1” gets the status of the job using the REST API call with run_id as the query parameter, as shown in step 7. A standalone sketch of this polling logic follows.
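
As a rough standalone equivalent of what the Until loop, the Wait activity, and the “Check Job Status_copy1” Web activity do together, here is a short Python sketch. The poll interval is an assumption, and the sketch also keeps polling while the run is still PENDING (cluster starting up), which in the ADF version is covered by the Wait activity; the pipeline above remains the actual implementation.

import time
import requests

ADB_WORKSPACE_URL = "https://<adb-workspace-url>"  # placeholder workspace URL
TOKEN = "<bearer-token>"  # in practice, retrieve this from Key Vault
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def wait_for_run(run_id: int, poll_seconds: int = 60) -> dict:
    """Poll jobs/runs/get until the run leaves the PENDING/RUNNING states."""
    while True:
        # Same call as the "Check Job Status_copy1" Web activity (step 7)
        run = requests.get(
            f"{ADB_WORKSPACE_URL}/api/2.0/jobs/runs/get",
            headers=HEADERS,
            params={"run_id": run_id},
        ).json()
        if run["state"]["life_cycle_state"] not in ("PENDING", "RUNNING"):
            return run  # e.g. TERMINATED, INTERNAL_ERROR, SKIPPED
        time.sleep(poll_seconds)  # plays the role of the Wait activity

# Example usage with the run_id returned by run-now in step 5:
# run = wait_for_run(run_id)
# succeeded = (run["state"]["life_cycle_state"] == "TERMINATED"
#              and run["state"].get("result_state") == "SUCCESS")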

11. Once the job status changes from RUNNING to anything else, execution comes out of the Until loop and runs the next If Condition activity, which checks whether the job succeeded or failed.

The If condition is:

@and(equals(activity('Check Job Status_copy1').output.state.life_cycle_state,'TERMINATED'), 
equals(activity('Check Job Status_copy1').output.state.result_state,'SUCCESS'))

If the above condition is true, the job succeeded; otherwise, it failed.

In the failure scenario, to capture additional information, we can append the status with details such as the reason for failure, the run_page_url, or other fields from the run, as sketched below.
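
As a sketch of what that could look like, continuing the polling example above, the relevant fields can be pulled from the same runs/get response (state_message and run_page_url are part of that response):

# Helper that summarizes a failed run from the jobs/runs/get response used above
def summarize_failure(run: dict) -> dict:
    state = run["state"]
    return {
        "result_state": state.get("result_state"),    # e.g. FAILED
        "state_message": state.get("state_message"),  # failure reason reported by Databricks
        "run_page_url": run.get("run_page_url"),      # link to the run in the workspace UI
    }

In the ADF pipeline, the same fields are available on the “Check Job Status_copy1” activity output, so they can be appended to the status before surfacing the failure.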

Conclusion:

In the example provided, a simplified ADF pipeline triggers the Databricks workflow and tracks its status. Using the Databricks Jobs REST API and a Web activity in ADF, the workflow is submitted and its status is monitored. A Wait activity is incorporated to reduce the frequency of status checks, improving efficiency.

By implementing this approach, organizations can streamline their Databricks workload orchestration, minimizing costs and maximizing productivity. It is crucial to ensure that the job status is checked for success or failure, allowing for additional information to be captured in case of failures.
