Building an end-to-end data pipeline using Azure Databricks (Part-2)

Alonso Medina Donayre
4 min read · Sep 10, 2022

Set up Azure Services

In this article, we will create all the resources we need before we start implementing our solution.

Step 1 — Create Resource Group

We need to create a resource group for all our resources.

  • Select your subscription
  • Write a valid name for the resource group
  • Select your region
  • Click on Review + create
  • Click on Create

Finally, our resource group has been created.
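If you prefer to script this step instead of clicking through the portal, the same resource group can be created with the Azure SDK for Python. This is a minimal sketch, assuming the azure-identity and azure-mgmt-resource packages are installed and you are signed in (for example via az login); the subscription ID, group name, and region are placeholders to replace with your own.

```python
# Minimal sketch: create the resource group with the Azure SDK for Python.
# Subscription ID, group name, and region below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"  # placeholder
credential = DefaultAzureCredential()       # picks up your az login / env credentials

resource_client = ResourceManagementClient(credential, subscription_id)

# Equivalent to "Review + create" -> "Create" in the portal.
rg = resource_client.resource_groups.create_or_update(
    "rg-databricks-pipeline",   # any valid resource group name
    {"location": "eastus"},     # pick your region
)
print(f"Resource group '{rg.name}' provisioned in {rg.location}")
```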

Step 2 — Create Azure Databricks Cluster

Before we create our cluster, we need to create an Azure Databricks workspace.

  • Search for Azure Databricks in the search box and then click on Create. Follow the configuration in the picture below; in this case, for Azure Databricks we only need to modify the Basics tab, then click on Review + create.
  • Once you have created your Azure Databricks workspace, you can launch it by clicking on the Launch Workspace button; this will open a new window.
  • Once you are in your Databricks workspace, click on Compute in the left panel and then on Create Cluster; this will load a new view.
  • We are going to use a Single Node cluster because of the limitations of our Azure free subscription, but it is enough to develop our solution.
  • Follow the configuration shown in the picture and click on Create, then wait 4 to 7 minutes until your cluster is created (a scripted alternative is sketched after this list).
  • Once your cluster is created, you will see a green check mark, which means your cluster is running. After 20 minutes of inactivity it will shut down automatically; if this happens, just go to Compute, click on your cluster name, and then on the Start button. You can also edit your cluster configuration and change values such as the termination time after inactivity.
  • We will leave our cluster for later and start configuring our containers.
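As mentioned above, the same single-node cluster can also be created through the Databricks Clusters REST API instead of the UI. This is a hedged sketch: the workspace URL, personal access token, node type, and runtime version are placeholders (valid values can be listed via the clusters/spark-versions and clusters/list-node-types endpoints), and the single-node settings mirror what the UI applies.

```python
# Sketch: create a single-node cluster via the Databricks REST API.
# WORKSPACE_URL and TOKEN are placeholders; generate a personal access
# token under User Settings inside your workspace.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<your-personal-access-token>"                                # placeholder

cluster_spec = {
    "cluster_name": "single-node-cluster",
    "spark_version": "11.3.x-scala2.12",   # assumed runtime; check clusters/spark-versions
    "node_type_id": "Standard_DS3_v2",     # assumed VM size; check clusters/list-node-types
    "num_workers": 0,                      # single node: the driver does all the work
    "autotermination_minutes": 20,         # matches the 20-minute shutdown mentioned above
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```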

Step 3 — Create Azure Storage

We need to create a storage account and then the layers we need for our pipeline (bronze, silver, gold), following the Delta Lake architecture.

  • In the search box, search for storage account; after selecting it, click on Create.
  • We need to make some changes in the Basics, Advanced, and Data protection tabs; set the configuration for them as shown in the pictures below.
  • Once you have set the configuration for the 3 tabs, click on Review and then Create. A scripted equivalent is sketched after the tab captions below.

Basics

Advanced

Data protection
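Since the exact tab settings live in the screenshots, here is a hedged scripted equivalent using azure-mgmt-storage. The account name is a placeholder (it must be globally unique and lowercase), and the settings are assumptions standing in for the pictures: a StorageV2 account on Standard LRS with the hierarchical namespace enabled (the ADLS Gen2 option from the Advanced tab), which is the usual choice for Delta Lake layers.

```python
# Sketch: create the storage account with azure-mgmt-storage.
# Account name and settings are assumptions standing in for the screenshots.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<your-subscription-id>"  # placeholder
credential = DefaultAzureCredential()
storage_client = StorageManagementClient(credential, subscription_id)

poller = storage_client.storage_accounts.begin_create(
    "rg-databricks-pipeline",   # the resource group from Step 1
    "dlpipelinestorage",        # placeholder: globally unique, lowercase, 3-24 chars
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        "is_hns_enabled": True,  # hierarchical namespace (Advanced tab, ADLS Gen2)
    },
)
account = poller.result()       # blocks until provisioning completes
print(f"Storage account '{account.name}' is ready")
```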

Step 4 — Create Containers

Now that our storage account is created, we need to create our containers based on the Delta Lake architecture. We need three: bronze, silver, and gold.

  • In the Bronze layer we will store our raw data.
  • In the Silver layer we will store transformed data.
  • In the Gold layer we will store grouped (aggregated) data.
  • Click on Containers, located in the left panel under the Data storage section.
  • After clicking on Containers, click on the “+ Container” option; a modal will appear on your right. Enter the name of the corresponding container and click on Create. Repeat this process for each container (bronze, silver, gold).
  • Once you finish, you should have 3 containers, as in the picture below; a scripted version of this step follows.
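The three containers can also be created in one go with the azure-storage-blob package. A minimal sketch, assuming the (placeholder) account name used above and an identity holding a data-plane role such as Storage Blob Data Contributor:

```python
# Sketch: create the bronze/silver/gold containers with azure-storage-blob.
# The account URL is a placeholder built from the assumed account name above.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

account_url = "https://dlpipelinestorage.blob.core.windows.net"  # placeholder
service = BlobServiceClient(account_url, credential=DefaultAzureCredential())

# One container per Delta Lake layer, mirroring the "+ Container" clicks.
for layer in ("bronze", "silver", "gold"):
    service.create_container(layer)  # raises ResourceExistsError if it already exists
    print(f"Created container: {layer}")
```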

That wraps up this section: we have set up all the resources we need. In the next part, you will learn how to mount your containers on your Databricks cluster.

  1. Requirements
  2. Set up Azure services
  3. Mount azure storage containers to Databricks
  4. Use case explanation
  5. Data Ingestion and Transformation
  6. Data Enrichment
  7. Pipeline using Data Factory
