Orchestrate & Build ETL pipeline using Azure Databricks and Azure Data Factory v2 (Part - 1)

Pujith Vaddi
ETL with PySpark/SparkSQL
5 min read · Oct 22, 2020

With the growing need to structure large volumes of data (big data) and derive insights from it, ETL plays a major role, and it is also a tedious task. Traditional big data technologies such as Hadoop MapReduce and Apache Hive laid a solid path toward this goal, but with the arrival of Spark the process has improved considerably in terms of performance, usability, development time, and learning curve.

ETL as a process involves:

  1. Extracting various kinds of data (homogeneous or heterogeneous) from multiple sources
  2. Cleaning and transforming it into the required structure (depending on the requirement)
  3. And finally storing it in destination locations such as data lakes, data marts & data warehouses, which act as a central repository for all the historical information on which further data analysis can be performed. (A minimal PySpark sketch of these three stages follows below.)
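To make these three stages concrete, here is a minimal, generic PySpark sketch. The paths, column names (order_date, amount, order_year) and file formats are hypothetical placeholders, not the dataset used later in this series:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook `spark` is already defined; building it here keeps the sketch self-contained
spark = SparkSession.builder.appName("minimal-etl-sketch").getOrCreate()

# Extract: read raw CSV files from a (hypothetical) source location
raw_df = spark.read.option("header", "true").csv("/mnt/source/raw/")

# Transform: clean and reshape the data into the required structure
clean_df = (raw_df
            .dropna(subset=["order_date", "amount"])            # drop incomplete rows
            .withColumn("amount", F.col("amount").cast("double"))
            .withColumn("order_date", F.to_date("order_date"))
            .withColumn("order_year", F.year(F.col("order_date"))))

# Load: write the curated result to a destination folder, partitioned by year
clean_df.write.mode("overwrite").partitionBy("order_year").parquet("/mnt/destination/curated/")
```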

What is Azure Databricks?

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. (Source: Microsoft documentation).

Assuming you have an active Azure subscription and are familiar with the pricing of the Azure services involved, let’s dive into creating the ETL pipeline using Azure Databricks and Azure Data Factory.

This blog is divided into 2 parts; in the first part we will discuss creating Python notebooks in Azure Databricks and executing them on a Spark cluster.

Note: Before proceeding any further, be aware that you will get charged for deploying any of the resources used in this article. So, make sure you understand all the resources you are deploying, along with their pricing, before deploying them.

Assuming you have a sound knowledge of PySpark/SparkSQL, below are the topics that will be covered in this blog:

  1. Setting up Azure Databricks and mounting ADLS Gen2
  2. Creating a cluster & a Python notebook in Azure Databricks, and
  3. Finally, implementing some transformations and loading the final data (using PySpark and SparkSQL) into the ADLS Gen2 destination folder (Part 2)

Create Databricks Workspace from Azure portal:

Create an Azure Databricks workspace from the Azure portal and select the pricing tier based on your requirement. Three pricing tiers are available: Standard, Premium, and Trial (Premium with 14 days of free DBUs).

If you are creating a Databricks workspace for the first time in your Azure account, you get the option to choose the 14-day free trial pricing tier. The catch here is that only the DBUs are free; you will still be charged for the Virtual Machines that are running. So, please check the pricing documentation before creating the resource.

Once the resource is deployed, navigate to the resource page and click on Launch Workspace. You will be redirected to the homepage of your Databricks workspace in a new window.

Creating a Spark cluster and Setting up Python Notebook:

Once you are on the homepage of your workspace, navigate to Clusters in the left side panel. Click on the “+ Create Cluster” button and you will see a page where you provide the cluster configuration, such as driver and worker node config, cluster name, cluster mode, autoscaling, initialization scripts, etc.

Once you provide the required details, click on Create Cluster and your cluster will be up and running in no time. Now move back to the homepage of your workspace, select the New Notebook option available under Common Tasks, and choose the Python notebook option in the popup window that appears.

Once you provide the name, default language, and cluster, hit Create. Once the notebook is created, you will be redirected to it.

Now that our notebook is ready and the cluster is up and running, let’s get our hands dirty and write some PySpark/SparkSQL code. But before that, let’s mount ADLS Gen2 (the source) to our workspace so that we can extract the data from the source and perform the required transformations. Once all the transformations are complete, we will load the resultant data to the same ADLS Gen2 storage account, but to a different folder.

Mount ADLS Gen2 to Azure Databricks DBFS:

Mounting a storage system to your Databricks File System (DBFS) is a one-time activity. This means you have to execute the mounting code only once in your workspace to mount your storage account, not every time you execute a particular notebook. Even if you try, it will throw an error like “Directory already mounted”.
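For reference, here is a minimal sketch of an idempotent mount, assuming authentication with a service principal whose client secret is stored in a Databricks secret scope; the storage account, container, tenant, scope, and mount point names below are placeholders:

```python
# `dbutils` is available by default in a Databricks notebook
# OAuth configuration for a service principal (all bracketed names are placeholders)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-client-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<secret-scope>", key="<client-secret-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

mount_point = "/mnt/source"

# Mount only if not already mounted, to avoid the "Directory already mounted" error
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
        mount_point=mount_point,
        extra_configs=configs,
    )
```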

Refer to my article below for a detailed implementation and explanation of how to mount ADLS Gen2 to your Databricks workspace:

https://link.medium.com/tQoOhjAtNab

You can also find the different ways of mounting various storage systems to your workspace in the official Databricks documentation.

In the next part of this article we will write some PySpark/SparkSQL code to perform transformations on our source data and finally load it into our destination folder in the ADLS Gen2 storage account.

Here is a sneak peek of the PySpark/SparkSQL code written for this logic, which we will discuss and implement in the next article.

[Image: PySpark/SparkSQL code written for the same business logic]
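Since the screenshot does not reproduce well in text, here is a representative sketch of the kind of transformation the next part walks through; the mounted paths, the temporary view name (sales), and the columns (region, amount) are hypothetical stand-ins for the actual dataset:

```python
from pyspark.sql import functions as F

# Read the source data from the mounted ADLS Gen2 folder (path is a placeholder)
source_df = spark.read.option("header", "true").csv("/mnt/source/input/")

# Register a temporary view so the same data can be queried with SparkSQL
source_df.createOrReplaceTempView("sales")

# Example transformation in SparkSQL: aggregate revenue per region (columns are hypothetical)
summary_df = spark.sql("""
    SELECT region,
           SUM(CAST(amount AS DOUBLE)) AS total_revenue,
           COUNT(*)                    AS order_count
    FROM sales
    GROUP BY region
""")

# Load the result into a different folder of the same ADLS Gen2 account
summary_df.write.mode("overwrite").parquet("/mnt/source/output/summary/")
```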

We will also see how to run multiple notebooks from a single notebook using Databricks utilities (dbutils); a quick sketch follows below.
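As a preview, chaining notebooks with dbutils is essentially a one-liner; the notebook path, timeout, and arguments below are placeholders:

```python
# Run a child notebook from the current notebook (path and arguments are placeholders)
result = dbutils.notebook.run(
    "/Users/your.name@example.com/transform_notebook",  # path to the child notebook
    600,                                                 # timeout in seconds
    {"input_path": "/mnt/source/input/"},                # read via dbutils.widgets in the child
)
print(result)  # value returned by dbutils.notebook.exit(...) in the child notebook
```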

That’s all for now!!!

Please leave your valuable comments.

Thank you for showing interest in reading this article.
