Building an end-to-end data pipeline using Azure Databricks (Part-7)

Alonso Medina Donayre
6 min read · Sep 16, 2022


Data Factory

In this article, we are going to create our Data Factory service to orchestrate, schedule, and monitor our process. We will build a simple Data Factory pipeline without validations, because covering them would be too much for a single article.

According to the official documentation:

Azure Data Factory is the platform that solves such data scenarios. It is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale.

Step 1 — Set up Data Factory service

  • Search for the Data Factory service in your Azure portal and click Create.
  • On the Git configuration tab, select Configure Git later, since it is not necessary for this exercise.
  • Finally, click Review + Create > Create.

Step 2 — Generate Access Token for Data Factory

The proper way to give Data Factory permission to use our cluster is through a managed service identity, but when I tried it, I got a permission error. So we are going to use an access token instead.

  • Go to your Databricks workspace and select Settings > User Settings.
  • Select Generate New Token, set the lifetime in days, and click Generate.
  • Copy the token somewhere safe because we will need it later. If you prefer to script this step, see the sketch below.
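As a scripted alternative to the UI, the token can also be created with the Databricks Token API. Here is a minimal Python sketch; the workspace URL and the credential used to authenticate are placeholders you would replace with your own:

```python
import requests

# Placeholders -- replace with your workspace URL and an existing credential
# (a personal access token or an Azure AD token for the workspace).
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
AUTH_TOKEN = "<existing-pat-or-aad-token>"

# Create a new personal access token via POST /api/2.0/token/create.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    json={
        "lifetime_seconds": 90 * 24 * 3600,  # 90-day lifetime
        "comment": "Token for the Data Factory linked service",
    },
)
response.raise_for_status()

# The token value is returned only once -- copy it now for the linked service.
print(response.json()["token_value"])
```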

Step 3 — Create Pipeline

Go to your Azure Data Factory resource and open Azure Data Factory Studio.

  • Go to Author > Pipelines, click the three dots, and then click New pipeline.
  • Set a name for your pipeline and the following configurations:

Settings:
— Concurrency = 1
Parameters:
— Create a “p_processing_date” parameter of type String.

  • Search for the Databricks Notebook activity and drag and drop it onto your pipeline canvas.
  • We need to create a Databricks linked service. Select your activity, go to the Azure Databricks tab, and click New.
  • Follow the configuration below (do not forget to paste your Databricks access token), test your connection, and click Create.

Note: sometimes your cluster name takes a while to appear. For batch processing we should use a Job Cluster, but since we are using an Azure free account we are not able to, so stay with the Interactive Cluster.

  • Our linked service has been created.
  • Now select your notebook activity, go to Settings, click Browse, find your customer notebook inside the ingestion folder, and click OK.
  • Go to the Base parameters section and add a parameter named “p_file_date” (note: this parameter must have the same name we use in our notebooks on Databricks; the notebook side is sketched after the expression below). Click on Add dynamic content.
  • Select your parameter, use @formatDateTime(date, format) to cast it from a date to a string, and click OK.

@formatDateTime(pipeline().parameters.p_processing_date, 'yyyy-MM-dd')
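As a reminder of how this value arrives on the Databricks side: base parameters passed by the Databricks Notebook activity surface in the notebook as widgets. A minimal sketch of the receiving cell (the default value is only an assumption for manual runs; your ingestion notebook from the earlier parts likely looks much like this already):

```python
# Databricks notebook cell: read the value that Data Factory sends through
# the "p_file_date" base parameter. The default is only a fallback for
# running the notebook manually outside the pipeline.
dbutils.widgets.text("p_file_date", "2022-09-10")
p_file_date = dbutils.widgets.get("p_file_date")

print(f"Processing files for date: {p_file_date}")
```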

Debug Pipeline

  • You can debug your pipeline by clicking the Debug option on the top bar and setting a valid date (for example: 2022-09-10).

This debug run can take between 7 and 10 minutes because the cluster needs to start before the notebook executes. If it takes longer, cancel it and run it again.

Configuring the rest of the notebooks

  • Copy and paste your activity (right-click), rename it, and select the corresponding notebooks from the ingestion and enrichment folders.
  • After you finish creating them, link the activities together and debug the pipeline again (a rough sketch of the resulting pipeline definition follows).
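For reference, once the activities are linked the pipeline ends up looking roughly like the sketch below, written here as a Python dictionary mirroring the pipeline’s JSON. The pipeline, activity, and notebook names are illustrative, and the linked service reference is omitted for brevity:

```python
# Rough sketch of the pipeline definition after linking the activities.
# Names and notebook paths are illustrative; linkedServiceName is omitted.
pipeline_definition = {
    "name": "pl_process_data",  # hypothetical pipeline name
    "properties": {
        "concurrency": 1,
        "parameters": {"p_processing_date": {"type": "string"}},
        "activities": [
            {
                "name": "ingest_customer",
                "type": "DatabricksNotebook",
                "typeProperties": {
                    "notebookPath": "/ingestion/customer",  # hypothetical path
                    "baseParameters": {
                        "p_file_date": "@formatDateTime(pipeline().parameters.p_processing_date, 'yyyy-MM-dd')"
                    },
                },
            },
            {
                "name": "enrich_customer",
                "type": "DatabricksNotebook",
                # dependsOn is what the green "on success" link creates.
                "dependsOn": [
                    {"activity": "ingest_customer", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {
                    "notebookPath": "/enrichment/customer",  # hypothetical path
                    "baseParameters": {
                        "p_file_date": "@formatDateTime(pipeline().parameters.p_processing_date, 'yyyy-MM-dd')"
                    },
                },
            },
        ],
    },
}
```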

Our pipeline is working correctly. Now we have to create a trigger for it.

Step 4 — Create Trigger

A trigger will run our pipeline automatically according to our configurations.

  • Go to Manage > Triggers > New.
  • Follow the configuration in the picture below and click OK (a rough sketch of the resulting trigger definition follows).
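I cannot reproduce the screenshot here, but since the next step uses windowEndTime, I am assuming a tumbling window trigger that runs once a day. Its definition boils down to something like the sketch below, again written as a Python dictionary mirroring the trigger’s JSON; the name, start time, and recurrence are illustrative:

```python
# Sketch of a daily tumbling window trigger. Name, start time, and
# recurrence are illustrative values -- adjust them to your schedule.
trigger_definition = {
    "name": "tr_daily_processing",  # hypothetical trigger name
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 24,           # one 24-hour window per day
            "startTime": "2022-09-10T00:00:00Z",
            "maxConcurrency": 1,      # process one window at a time
        },
    },
}
```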

Step 5 — Associate Trigger and Run

  • Our trigger has been created; now we need to associate it with our pipeline. Go to your pipeline and click Add Trigger > New/Edit > Choose trigger.
  • After clicking OK, the next screen asks for a value for our pipeline parameter; set it to "@trigger().outputs.windowEndTime" and click OK.

@trigger().outputs.windowEndTime returns the end of the window associated with the trigger run (see the official documentation).
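Under the hood, the association made in this step simply adds a pipeline reference and the parameter binding to the trigger definition, roughly like this (continuing the hypothetical sketch from Step 4):

```python
# The association adds a pipeline reference plus the parameter binding
# to the trigger definition (hypothetical names continued from above).
trigger_definition["properties"]["pipeline"] = {
    "pipelineReference": {
        "referenceName": "pl_process_data",
        "type": "PipelineReference",
    },
    # The end of each window becomes the processing date for that run.
    "parameters": {
        "p_processing_date": "@trigger().outputs.windowEndTime",
    },
}
```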

  • Before publishing our pipeline, I strongly recommend deleting the folders from your silver and gold containers, so you can verify the pipeline is working correctly.
  • Publish the pipeline (Publish > Publish). After publishing, your pipeline will trigger automatically.
  • You can go to Monitor and verify that your pipeline is running.
  • We can validate by going to our gold container and checking that the files and partitions arrive; a quick way to do that from a notebook is sketched below.
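A quick way to do that last check from a Databricks notebook, assuming the gold container is still mounted the way we set it up earlier in the series (the mount point and folder names are assumptions; use the ones you chose):

```python
# Databricks notebook cell: list what landed in the gold container after
# the triggered run. "/mnt/gold" is an assumed mount point -- use your own.
for entry in dbutils.fs.ls("/mnt/gold"):
    print(entry.path)

# Drill into one of the folders to confirm the partitions arrived.
for entry in dbutils.fs.ls("/mnt/gold/customer"):  # hypothetical folder name
    print(entry.path, entry.size)
```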

END

This is the end of this series of articles. I hope you enjoyed it! Leave me your comments and follow me on LinkedIn.

  1. Requirements
  2. Set up azure services
  3. Mount azure storage containers to Databricks
  4. Use case explanation
  5. Data Ingestion and Transformation
  6. Data Enrichment
  7. Pipeline using Data Factory
