Running Databricks Notebook with Airflow

Harshal Pagar · Published in Apache Airflow · Mar 3, 2024

Today I came across a requirement to run a Databricks notebook from an Airflow DAG, so let's dive into it.

Prerequisites:

Before we start writing the Airflow DAG, let's have the notebook ready.

1. Create a Databricks notebook in your Databricks workspace [I assume you know how to do this].
# accept country parameter
dbutils.widgets.text("country",'')
country = dbutils.widgets.get("country")
# accept week parameter
dbutils.widgets.text("week",'')
week = dbutils.widgets.get("week")
# print the input parameters
print("country {}".format(country))
print("week {}".format(week))
Sample Databricks notebook

2. Create a Databricks Job, which we will then trigger from Airflow. In the Databricks workspace, navigate to: Workflows -> Jobs -> Create Job.

Provide the notebook path from the workspace and the cluster details.

Once done, you will see the job's details on the Jobs page; note down the Job ID after creation.
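If you would rather script this step than click through the UI, the same job can be created through the Databricks Jobs API 2.1. The snippet below is only a sketch: the workspace URL, token, notebook path, and cluster ID are placeholders you would replace with your own values.

import requests

# Sketch: create the job via the Databricks Jobs API 2.1 instead of the UI.
# All values below are placeholders -- substitute your own workspace URL,
# personal access token, notebook path, and cluster ID.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "run_country_week_notebook",
    "tasks": [
        {
            "task_key": "notebook_task",
            "notebook_task": {
                "notebook_path": "/Workspace/Users/<you>/country_week_notebook",
                "base_parameters": {"country": "IND", "week": "1"},
            },
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
# The returned job_id is the Job ID you need to note down for Airflow.
print("Created job with id:", resp.json()["job_id"])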

3. Now let's move to Airflow.

For this POC, I explored the operators provided by Airflow. Airflow's Databricks provider ships several operators, and we can pick the one that fits our use case.

Airflow Databricks Operator

For our use case, I chose DatabricksRunNowOperator.

Note: Create a Databricks connection in Airflow. Use your Databricks workspace URL as the host and your personal access token as the password.
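If you manage connections as code rather than through the UI, here is a minimal sketch (assuming Airflow 2.x) that builds the same connection and prints its URI, which Airflow can also read from an AIRFLOW_CONN_* environment variable. The host and token values are placeholders.

from airflow.models.connection import Connection

# Sketch: build the Databricks connection programmatically.
# Host and password below are placeholders for your workspace
# hostname and personal access token.
conn = Connection(
    conn_id="databricks_default",
    conn_type="databricks",
    host="<your-workspace>.cloud.databricks.com",
    password="<personal-access-token>",
)
# Print the connection URI; it can be exported as
# AIRFLOW_CONN_DATABRICKS_DEFAULT instead of creating the connection in the UI.
print(conn.get_uri())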

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# Define the default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
}

# Define the DAG
dag = DAG(
    'databricks_trigger_dag',
    default_args=default_args,
    description='Run Databricks notebook with parameters',
    schedule=None,
    params={
        "country": "IND",
        "week": 2
    }
)

# Define the notebook task
notebook_task = DatabricksRunNowOperator(
    task_id='run_databricks_notebook',
    databricks_conn_id='databricks_default',  # Connection ID to Databricks
    job_id="1061107672020870",  # Job ID of the notebook
    dag=dag,
    notebook_params={"country": "{{ params.country }}", "week": "{{ params.week }}"}  # Parameters to pass to the notebook
)

# Set task dependencies
notebook_task

I used DatabricksRunNowOperator to trigger a notebook that accepts two parameters.

Now that we have all the components ready, let's test it by triggering the DAG from the Airflow UI.
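Alternatively, the run can be triggered programmatically through Airflow's stable REST API (assuming the API and basic auth are enabled on your webserver). The host, user, and password below are placeholders; the conf values override the DAG's default params.

import requests

# Sketch: trigger the DAG with custom params via Airflow's stable REST API.
# Assumes the REST API and basic auth are enabled; host/credentials are placeholders.
AIRFLOW_HOST = "http://localhost:8080"

resp = requests.post(
    f"{AIRFLOW_HOST}/api/v1/dags/databricks_trigger_dag/dagRuns",
    auth=("<airflow-user>", "<airflow-password>"),
    json={"conf": {"country": "USA", "week": 5}},
)
resp.raise_for_status()
# Print the id of the run that was just created.
print(resp.json()["dag_run_id"])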

After triggering, we can go to Databricks and see that the job was triggered and the notebook executed.

You can find the DAG at: https://github.com/harshalpagar/airflow-examples/tree/master/dags/databricks

Thanks and Happy Learning 👌
