Building an end-to-end data pipeline using Azure Databricks (Part-2)
Set up Azure Services
In this article, we will create all the resources we need before we start implementing our solution.
Step 1 — Create Resource Group
We need to create a resource group for all our resources.
- Select your subscription
- Write a valid name for the resource group
- Select your region
- Click on review + create
- Click on create
Finally, our resource group has been created.
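If you prefer the command line, the same resource group can be created with the Azure CLI. A minimal sketch, assuming you are already signed in with `az login`; the group name `rg-databricks-pipeline` and region `eastus` are placeholders, so use your own values:

```shell
# Create a resource group in the chosen region (name and region are placeholders)
az group create \
  --name rg-databricks-pipeline \
  --location eastus

# Confirm the group was created
az group show --name rg-databricks-pipeline --output table
```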
Step 2 — Create Azure Databricks Cluster
Before we create our cluster we need to create an Azure Databricks Workspace.
- Search for Azure Databricks in the search box and click on Create. Follow the configuration in the picture below; in this case we only need to modify the Basics tab, then click on Review + create.
- Once you have created your Azure Databricks workspace, you can launch it by clicking the Launch Workspace button; this will open a new window.
- Once you are in your Databricks workspace, click on Compute in the left panel and then on Create Cluster; this will load a new view.
- We are going to use a Single Node cluster because of the limitations of our Azure free subscription, but it is enough to develop our solution.
- Follow the configuration shown in the picture and click on Create, then wait 4 to 7 minutes until your cluster is created.
- Once your cluster is created, you will see a green check mark, which means your cluster is running. After 20 minutes of inactivity it will shut down automatically; if this happens, go to Compute, click on your cluster name, and then click the Start button. You can also edit your cluster configuration and change values such as the termination time after inactivity.
- We will leave our cluster for later and start configuring our containers.
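For reference, the same Single Node cluster can be described as a JSON payload for the Databricks CLI. A sketch under assumptions: the `spark_version` and `node_type_id` values below are examples and must be replaced with values your workspace actually offers, while the `spark.master = local[*]` setting and the `ResourceClass: SingleNode` tag are what make the cluster single-node:

```shell
# Sketch: create a single-node cluster with the Databricks CLI.
# spark_version and node_type_id are assumptions; list valid values with
# `databricks clusters spark-versions` and `databricks clusters list-node-types`.
databricks clusters create --json '{
  "cluster_name": "single-node-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 0,
  "autotermination_minutes": 20,
  "spark_conf": {
    "spark.master": "local[*]",
    "spark.databricks.cluster.profile": "singleNode"
  },
  "custom_tags": { "ResourceClass": "SingleNode" }
}'
```

Note that `autotermination_minutes` matches the 20-minute inactivity shutdown described above; you can raise or lower it here instead of editing the cluster in the UI.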
Step 3 — Create Azure Storage
We need to create a storage account and then the layers we need for our pipeline (bronze, silver, gold), according to the Delta Lake architecture.
- In the search box, search for storage account and, after selecting it, click on Create.
- We need to make some changes in the Basics, Advanced, and Data protection tabs; set their configuration as shown in the pictures below.
- Once you have set the configuration for the three tabs, click on Review and then Create.
Basics
Advanced
Data protection
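The portal steps above map to a single CLI command. A hedged sketch: the account name `stdatabrickspipeline` and resource group `rg-databricks-pipeline` are placeholders, and the hierarchical namespace flag is an assumption based on the Advanced tab settings, since ADLS Gen2 is the usual choice when mounting containers from Databricks:

```shell
# Create a general-purpose v2 storage account.
# Account names must be globally unique: 3-24 lowercase letters and digits only.
az storage account create \
  --name stdatabrickspipeline \
  --resource-group rg-databricks-pipeline \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true
```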
Step 4 — Create Containers
Now that our storage account has been created, we need to create our containers based on the Delta Lake architecture; we need three (bronze, silver, and gold).
- In the Bronze layer we will store our raw data.
- In the Silver layer we will store transformed data.
- In the Gold layer we will store grouped data.
- Click on Containers, which is located in the left panel under the Data storage section.
- After clicking on Containers, click on the "+ Container" option; a modal will appear on the right. Enter the name of the corresponding container and click on Create. Repeat this process for each container (bronze, silver, gold).
- Once you finish, you should have three containers, as in the picture below.
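The three containers can also be created in one loop with the Azure CLI. A sketch assuming the placeholder account name `stdatabrickspipeline` and that your signed-in identity has a data-plane role on the account (which is what `--auth-mode login` relies on):

```shell
# Create the bronze, silver, and gold containers (account name is a placeholder)
for layer in bronze silver gold; do
  az storage container create \
    --name "$layer" \
    --account-name stdatabrickspipeline \
    --auth-mode login
done

# List the container names to confirm all three exist
az storage container list \
  --account-name stdatabrickspipeline \
  --auth-mode login \
  --query "[].name" --output tsv
```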
That concludes this section; we have set up all the resources we need. In the next part, you will learn how to mount your containers on your Databricks cluster.