End-to-end Azure data engineering project — Part 3: Creating data pipelines and scheduling using Azure Data Factory

Patrick Nguyen
4 min read · Jun 4, 2023


This is a series of four articles demonstrating an end-to-end data engineering process on the Azure platform, using Azure Data Lake, Databricks, Azure Data Factory, Python, Power BI and Spark. In Part 3, we will use Azure Data Factory to build data pipelines and schedule them.

Please review the whole series here:

End-to-end Azure data engineering project — Part 1: Project Requirement, Solution Architecture and ADF reading data from API

End-to-end Azure data engineering project — Part 2: Using DataBricks to ingest and transform data

End-to-end Azure data engineering project — Part 3: Creating data pipelines and scheduling using Azure Data Factory

End-to-end Azure data engineering project — Part 4: Data Analysis and Data Visualization (Power BI)

7. Create Data Pipelines in Azure Data Factory to orchestrate Databricks Notebooks

First, create a new Azure Data Factory service in Azure, create a new pipeline in ADF, and drag the Databricks Notebook activity onto the pipeline canvas.

Set up a new linked service for Databricks as below. For Managed Service Identity and User-Assigned Managed Identity authentication, grant the Contributor role to both identities in the Azure Databricks resource's Access control (IAM) menu so that you can select the existing interactive cluster. Test the connection and, if it succeeds, click Create.
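
For reference, the linked service that ADF generates can be viewed as JSON. A minimal sketch for the Managed Service Identity option is shown below; the linked service name ls_databricks is my own choice, and the workspace URL, resource ID and cluster ID are placeholders, not values from the original article:

```
{
    "name": "ls_databricks",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-<workspace-id>.<n>.azuredatabricks.net",
            "authentication": "MSI",
            "workspaceResourceId": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<workspace>",
            "existingClusterId": "<interactive-cluster-id>"
        }
    }
}
```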

In the Notebook path, select the Databricks notebook you want to run. In this case, I choose the Ingest Circuits notebook.

After completion, you can click Debug and then publish this activity. Next, create the same activities for the rest of the notebooks in the Ingestion folder by cloning the existing one and changing the notebook path. All notebooks can be set to run in parallel, as sketched below.
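
Behind the UI, each of these is a DatabricksNotebook activity in the pipeline JSON. A trimmed sketch of two of them follows; the pipeline name, second activity name and notebook paths are illustrative assumptions, and leaving dependsOn empty is what lets the activities run in parallel:

```
{
    "name": "pl_ingest_formula1_data",
    "properties": {
        "activities": [
            {
                "name": "Ingest Circuits",
                "type": "DatabricksNotebook",
                "dependsOn": [],
                "typeProperties": { "notebookPath": "/ingestion/ingest_circuits" },
                "linkedServiceName": { "referenceName": "ls_databricks", "type": "LinkedServiceReference" }
            },
            {
                "name": "Ingest Races",
                "type": "DatabricksNotebook",
                "dependsOn": [],
                "typeProperties": { "notebookPath": "/ingestion/ingest_races" },
                "linkedServiceName": { "referenceName": "ls_databricks", "type": "LinkedServiceReference" }
            }
        ]
    }
}
```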

You may want to check whether the raw container contains any files and trigger the Databricks notebooks only if it does. To do that, add two more activities: Get Metadata and If Condition. In the Get Metadata activity, create a linked service to Azure Data Lake Storage, set the container to raw, leave the Directory and File name blank, and choose Child items as the argument in the Field list.
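
In JSON terms, the Get Metadata activity points at a dataset for the raw container and requests the childItems field. A minimal sketch is below; the activity and dataset names (Get Raw Folder Details, ds_formula1_raw) are placeholders of mine:

```
{
    "name": "Get Raw Folder Details",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": { "referenceName": "ds_formula1_raw", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
    }
}
```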

Set up the dependency by dragging the arrow from the Success output of Get Metadata to the If Condition activity, and move all of the Databricks notebook activities into the True condition. To run them only when the raw container contains files, go to the Activities tab of the If Condition activity and enter an expression that checks the output of Get Metadata.
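
Since the screenshot is not reproduced here, the expression is the key part: it checks whether Get Metadata returned any child items. A sketch of the If Condition activity, reusing my placeholder activity names, could look like this:

```
{
    "name": "If Raw Files Exist",
    "type": "IfCondition",
    "dependsOn": [
        { "activity": "Get Raw Folder Details", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "expression": {
            "value": "@greater(length(activity('Get Raw Folder Details').output.childItems), 0)",
            "type": "Expression"
        },
        "ifTrueActivities": []
    }
}
```

The Databricks notebook activities you moved into the True branch end up inside ifTrueActivities.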

Great, now your pipeline runs the notebooks only if the raw container contains files. Next, we will create another pipeline to execute the Databricks transformation notebooks.

In ADF, create another pipeline (you can clone the first one) and rename it. To execute the transformation, we have to run the Ingestion pipeline first, so search for the Execute Pipeline activity, drag it onto the canvas and select the Ingestion pipeline.
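
The Execute Pipeline activity simply references the ingestion pipeline by name and, with waitOnCompletion set to true, holds the transformation activities until ingestion finishes. A sketch with my placeholder names:

```
{
    "name": "Execute Ingestion Pipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": { "referenceName": "pl_ingest_formula1_data", "type": "PipelineReference" },
        "waitOnCompletion": true
    }
}
```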

Next, create three Databricks Notebook activities to execute the three transformation notebooks we created in Part 2. As you may remember, the driver_standings and constructor_standings notebooks use race_results as their input, so I create a dependency here.
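
In the pipeline JSON, that dependency appears as a dependsOn entry on each downstream activity, so driver_standings and constructor_standings wait for the race_results notebook to succeed. A sketch of one of them, with illustrative activity names and notebook path:

```
{
    "name": "Transform Driver Standings",
    "type": "DatabricksNotebook",
    "dependsOn": [
        { "activity": "Transform Race Results", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": { "notebookPath": "/transformation/driver_standings" },
    "linkedServiceName": { "referenceName": "ls_databricks", "type": "LinkedServiceReference" }
}
```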

Publish the changes. The ADF pipelines are now ready, so we can create a trigger to run them at a certain date and time. You can choose a Schedule trigger or a Tumbling Window trigger.

After that, add this trigger to the Transformation pipeline. Because its first step is the Execute Pipeline activity, this pipeline also triggers the Ingestion pipeline.
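
For a Schedule trigger, the exported JSON looks roughly like the sketch below: a recurrence plus the pipeline it starts. The trigger name, schedule and pipeline name here are my own illustrative values, not ones from the original article:

```
{
    "name": "tr_formula1_weekly",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Week",
                "interval": 1,
                "startTime": "2023-06-11T22:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": { "referenceName": "pl_transform_formula1_data", "type": "PipelineReference" }
            }
        ]
    }
}
```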

You can test the trigger and monitor the pipeline process in the Monitor menu.

Congratulations on completing the first three parts of the project. We went through the solution design, reading data from the source, data ingestion and transformation, and data pipeline orchestration. At this point, you have completed most of the data engineering tasks needed to have transformed data ready for data analysts and scientists. In the next part, I will show how data consumers use the transformed data for analysis and visualization. By understanding that, data engineers can coordinate better with data analysts to build good data pipelines.

Patrick N

If you like what you read, consider joining Medium and reading many more articles. A portion of your fee goes to support authors like me. Click here to join.
