Data Lake Creation In Azure

Sarath Sasidharan
11 min read · Mar 11, 2019


This blog focuses on how to create a data lake on Azure using a script. This improves developer productivity and enables quick deployment of data lakes using the concept of IaC (Infrastructure as Code).

This covers:

  • Creation of storage
  • Creation of an Orchestrator / Workflow Engine
  • Creation of a relational database (OLTP)
  • Creation of an ETL / Compute Engine
  • Creation of a Massively Parallel Processing Data Warehouse (OLAP)
  • A single shell script that, when executed, creates all of the above resources and bundles them in a resource group.

Prerequisites

An IDE, in this case Visual Studio Code.

An Azure account.

Create a new project

Create a new folder.

Open Visual Studio Code from inside this folder by typing:

code .

Create a folder inside this project named infra and create a script inside it named data_lake.sh. This script will contain the steps required to set up the infrastructure needed for a data lake.
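
Put together, a minimal terminal sketch of this setup could look like the following (the project folder name is just an example, not something prescribed by this blog):

# Create the project skeleton (project folder name is an example)
mkdir -p data-lake-azure/infra
cd data-lake-azure
touch infra/data_lake.sh
# Open the project in Visual Studio Code
code .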

In this tutorial we use az CLI commands to create the resources. There are other ways in which resources can be created in Azure, including:

  • ARM templates
  • PowerShell
  • Terraform
  • Azure REST APIs

The first step in the data lake creation is to create a data lake store.

Creation of Storage

The required qualities of the storage layer are:

  • Easily scalable
  • Cheap storage
  • Easy integration with services (both open source and proprietary)
  • POSIX-based folder restrictions to control access at the folder level
  • Easy sink for all types of data sources (streaming as well as batch)

For this blog, we use Azure Data Lake Storage Gen2 as the raw storage layer for all data flowing into our data lake.

An introduction to ADLS Gen2 can be found in the Azure documentation.

ADLS Gen2 supports POSIX-like file system permissions that can be set using Azure Active Directory entities, which helps control access in a scalable and elegant manner.

Create a resource group, which is a logical grouping of the resources spun up on Azure; see the Azure documentation to learn more about resource groups.

Copy the snippet of code below and paste it inside the data_lake.sh file, replacing the placeholder values with the correct values.

#!/bin/bash
DATA_LAKE_RG='<REPLACE_WITH_DATA_LAKE_RESOURCE_GROUP_NAME>'
DATA_LAKE_LOCATION='<REPLACE_WITH_DATA_LAKE_REGION>'
# Create a new Resource Group
az group create \
--name $DATA_LAKE_RG \
--location $DATA_LAKE_LOCATION

This command uses the az CLI to request creation of a new resource group with the name specified by the "--name" flag and the location specified by the "--location" flag. The location is the region of the Azure data center where the resources will be spun up during creation.
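
If you are unsure which value to use for the location, the available regions can be listed with the az CLI; a minimal sketch:

# List the Azure regions available to your subscription
az account list-locations --output table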

To run this on Azure, you can:

  1. Run it from your terminal by using az login and then executing the shell script (if you have a Linux machine)
  2. Log in to the Azure portal (portal.azure.com) and use the cloud shell. There are two options in the cloud shell: one is a bash shell and the other is PowerShell.

In this blog we use option 2 with the bash shell, but feel free to play around with option 1 if you have a Linux machine.
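
If you do go for option 1, a minimal local sketch could look like this (the subscription placeholder is yours to fill in):

# Log in interactively and select the subscription to deploy into
az login
az account set --subscription '<YOUR_SUBSCRIPTION_NAME_OR_ID>'
# Run the script from the infra folder so relative paths used later resolve
cd infra
./data_lake.sh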

Go to the portal and then click on the cloud shell icon to open a new window where you can execute bash commands; choose the bash option when prompted to request a new cloud shell from the portal.

Script Execution

Once you have the shell, import the script (data_lake.sh) using the upload option in the cloud shell window.

Upload the file data_lake.sh.

If you use a Windows machine, issue the dos2unix command to fix the line endings, and also grant execute rights on the script. To do this, run the following commands.

dos2unix data_lake.sh
chmod +x data_lake.sh

Run the shell script; this will create a new resource group with the specified name in the specified Azure location.

./data_lake.sh

After the script succeeds, execute the list command to see if the newly created resource group is present.

az group list

This can also be viewed in the Azure portal at portal.azure.com.
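
As a quick sanity check, the resource group can also be queried directly; a minimal sketch:

# Check whether the resource group exists (prints true/false)
az group exists --name '<REPLACE_WITH_DATA_LAKE_RESOURCE_GROUP_NAME>'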

Set up a storage account within this resource group; this will be the raw storage layer for the data lake. Add this snippet to the data_lake.sh.

Before running this, execute the following command on the cloud shell.

az extension add --name storage-preview

This adds the Azure CLI extension needed to create ADLS Gen2 storage accounts.
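
To confirm the extension is registered, the installed CLI extensions can be listed; a small sketch:

# List installed az CLI extensions; storage-preview should appear
az extension list --output table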

Copy this snippet into the data_lake.sh script, replacing the storage name with the right value.

DATA_LAKE_STORAGE='<REPLACE_WITH_DATA_LAKE_STORAGE_NAME>'
# Create a new storage (ADLS Gen2) account
az storage account create \
--name $DATA_LAKE_STORAGE \
--resource-group $DATA_LAKE_RG \
--location $DATA_LAKE_LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--hierarchical-namespace true

Repeat the "Script Execution" steps from above.

This should create a new storage account. View the resource using the following command.

az resource list -g <NAME_OF_RESOURCE_GROUP>

The new storage account should be visible here; it can also be viewed from the Azure portal under the Storage accounts blade on the left side of the portal.
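
At this point you could also create a container (which acts as the file system in an ADLS Gen2 account) to land raw data into; a hedged sketch, where the container name is just an example:

# Create a container / file system for raw data (name is an example)
# Assumes your identity has data-plane access; otherwise use --account-key
az storage container create \
--name raw \
--account-name $DATA_LAKE_STORAGE \
--auth-mode login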

Creation of an Orchestrator / Pipeline / Workflow builder

Once the storage account is ready, the next step is to create a tool which can help in loading data into the storage layer.

What we need to keep in mind here are:

  • Availability of connectors for varying input sources, ranging from legacy systems like mainframes to newer sources such as Snowflake
  • Ease of usage
  • Quick / Rapid pipeline development
  • Lineage
  • Availability of Triggers (Time Based / Event Based / File Trigger / Rolling Window etc.)
  • Support for Streaming and Batch data flows
  • Connectors to push back data to different sinks
  • Integration with version control

For this blog we use Azure Data Factory V2. Read through the documentation before you proceed further to familiarize yourself with what ADF V2 can and cannot do; this will help during the selection of the right tool.

Az CLI commands for creating Azure Data Factory are not yet available, so we will use an ARM template to create this resource.

To learn more about ARM templates, refer to the Azure documentation.

Create a new folder named "arm" inside the infra folder, so the template will live under infra/arm/.

This folder holds the ARM template used to create the data factory instance. Create a new JSON file (for example data_factory.json, as referenced later in the script) inside the arm folder and copy the following contents into it.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "name": {
      "type": "string",
      "defaultValue": "<NAME_OF_DATA_FACTORY>"
    },
    "location": {
      "type": "string",
      "defaultValue": "<REPLACE_WITH_DATA_FACTORY_REGION>"
    },
    "apiVersion": {
      "type": "string",
      "defaultValue": "2018-06-01"
    }
  },
  "resources": [
    {
      "apiVersion": "[parameters('apiVersion')]",
      "name": "[parameters('name')]",
      "location": "[parameters('location')]",
      "type": "Microsoft.DataFactory/factories",
      "identity": {
        "type": "SystemAssigned"
      }
    }
  ]
}

For an explanation of this JSON and its elements, refer to the documentation link above.

Replace the parameter values in the JSON and save the file. The deployment also uses a properties (parameters) file from which the run-time values of the parameters are obtained.

Create a new folder named conf at the root level of the project and create a new file inside this folder (for example data_factory_prop.json, as referenced later in the script).

Copy these contents into the properties file.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "name": {
      "value": "<NAME_OF_DATA_FACTORY>"
    },
    "location": {
      "value": "<REPLACE_WITH_DATA_FACTORY_REGION>"
    },
    "apiVersion": {
      "value": "2018-06-01"
    }
  }
}

These values are substituted at deployment time for the parameters declared in the main ARM template.
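
Before wiring the deployment into the script, the template and parameters can be validated without deploying anything; a minimal sketch, assuming the same relative paths used below and that the command is run from the infra folder:

# Validate the ARM template against the resource group without deploying
az group deployment validate \
--resource-group $DATA_LAKE_RG \
--template-file arm/data_factory.json \
--parameters ../conf/data_factory_prop.json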

Once the templates are ready, we need to add the deployment instruction to the script, infra/data_lake.sh.

DATA_LAKE_WORKFLOW='datalakeorchestrator'
ARM_LOCATION='arm/data_factory.json'
ARM_PROPS_LOCATION='../conf/data_factory_prop.json'
# Create Data Factory Version 2
az group deployment create \
--name $DATA_LAKE_WORKFLOW \
--resource-group $DATA_LAKE_RG \
--template-file $ARM_LOCATION \
--parameters $ARM_PROPS_LOCATION

This command deploys an Azure Data Factory V2 instance using the template referenced by the "--template-file" argument; the properties file is referenced by "--parameters".

The best way to transport this to the cloud shell is to commit it to a git repository and clone that repository in your Azure cloud shell.
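
As a rough sketch (the repository URL is a placeholder), the flow in the cloud shell could be:

# Clone your repository in the cloud shell and run the script from infra,
# so the relative arm/ and ../conf/ paths resolve correctly
git clone https://github.com/<YOUR_ACCOUNT>/<YOUR_REPO>.git
cd <YOUR_REPO>/infra
./data_lake.sh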

Check if the data factory has been created. This can also be viewed in the portal.

az resource list -g <NAME_OF_RESOURCE_GROUP>

Create a Relational Database

Source systems could be Relational Databases which can release data to the data lake.

Orchestrators / workflow engines can pull data from these systems onto the data lake storage to create an enterprise-wide data sharing platform, a.k.a. an enterprise data lake.

For this part we use Azure SQL Database (managed SQL Server) as our relational database.

In order to create this, add the following to the infra/data_lake.sh script under the infra folder.

Replace the variable values with appropriate names. Ideally these variables would be stored in a configuration file, and secrets would be stored in and retrieved from Azure Key Vault, as sketched below.
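
A minimal sketch of the Key Vault approach (the vault and secret names are examples, not part of the script used in this blog):

# Create a Key Vault and store the SQL admin password as a secret
az keyvault create --name '<KEYVAULT_NAME>' --resource-group $DATA_LAKE_RG --location $DATA_LAKE_LOCATION
az keyvault secret set --vault-name '<KEYVAULT_NAME>' --name 'sql-admin-passwd' --value '<ENTER_PASSWORD>'
# Retrieve it at deployment time instead of hard-coding it in the script
ADMIN_PASSWD=$(az keyvault secret show --vault-name '<KEYVAULT_NAME>' --name 'sql-admin-passwd' --query value -o tsv)

For simplicity, the snippet below keeps plain placeholder variables instead.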

DATA_LAKE_SERVER_NAME='<SQL_SERVER_NAME>'
DATA_LAKE_DATABASE_NAME='<SQL_DB_NAME>'
ADMIN_USERNAME='<ENTER_USERNAME>'
ADMIN_PASSWD='<ENTER_PASSWORD>'
## Create a logical server in the resource group
az sql server create \
--name $DATA_LAKE_SERVER_NAME \
--resource-group $DATA_LAKE_RG \
--location $DATA_LAKE_LOCATION \
--admin-user $ADMIN_USERNAME \
--admin-password $ADMIN_PASSWD
## Create a database in the server
az sql db create \
--resource-group $DATA_LAKE_RG \
--server $DATA_LAKE_SERVER_NAME \
--name $DATA_LAKE_DATABASE_NAME \
--sample-name AdventureWorksLT \
--service-objective S0
# Configure a firewall rule for the server
az sql server firewall-rule create \
--resource-group $DATA_LAKE_RG \
--server $DATA_LAKE_SERVER_NAME \
-n AllowYourIp \
--start-ip-address 0.0.0.0 \
--end-ip-address 0.0.0.0

Once this is ready, execute the following to see the resources.

az resource list -g <NAME_OF_RESOURCE_GROUP>
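
To check connectivity details of the newly created database, a connection string template can also be retrieved; a small sketch, where the ado.net client type is just an example:

# Print an ADO.NET connection string template for the new database
az sql db show-connection-string \
--client ado.net \
--server $DATA_LAKE_SERVER_NAME \
--name $DATA_LAKE_DATABASE_NAME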

Create an ETL Engine

Once the storage and orchestrator have been defined, the next core component in the data lake is an ETL / ELT tool.

Crucial topics in choosing an ETL engine:

  • Should be scalable / On Demand
  • Should provide flexibility to process both structured and unstructured data
  • Should be easy to develop / script jobs
  • Integration with source control systems
  • Should be able to handle stream and batch flow of data

For this blog, we go with Azure Databricks; refer to the Azure Databricks documentation for details.

Similar to Azure Data Factory, we will use an ARM (Azure Resource Manager) template to create an Azure Databricks instance.

Under the /infra/arm folder, create a new file with a name of your choice, for example databricks.json.

Copy the following snippet into the file.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "workspaceName": {
      "type": "string",
      "metadata": {
        "description": "The name of the Azure Databricks workspace to create."
      }
    },
    "pricingTier": {
      "type": "string",
      "defaultValue": "premium",
      "allowedValues": [
        "standard",
        "premium"
      ],
      "metadata": {
        "description": "The pricing tier of workspace."
      }
    },
    "location": {
      "type": "string",
      "defaultValue": "[resourceGroup().location]",
      "metadata": {
        "description": "Location for all resources."
      }
    }
  },
  "variables": {
    "managedResourceGroupName": "[concat('databricks-rg-', parameters('workspaceName'), '-', uniqueString(parameters('workspaceName'), resourceGroup().id))]"
  },
  "resources": [
    {
      "type": "Microsoft.Databricks/workspaces",
      "name": "[parameters('workspaceName')]",
      "location": "[parameters('location')]",
      "apiVersion": "2018-04-01",
      "sku": {
        "name": "[parameters('pricingTier')]"
      },
      "properties": {
        "ManagedResourceGroupId": "[concat(subscription().id, '/resourceGroups/', variables('managedResourceGroupName'))]"
      }
    }
  ],
  "outputs": {
    "workspace": {
      "type": "object",
      "value": "[reference(resourceId('Microsoft.Databricks/workspaces', parameters('workspaceName')))]"
    }
  }
}

This is the code which requests a new Databricks workspace from Azure. There is also a properties file which needs to be added under the /conf/ directory. Create a new file, for example databricks_prop.json, add the following snippet, and substitute the name of the Databricks instance.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "workspaceName": {
      "value": "<ENTER_NAME_OF_DATABRICKS_INSTANCE>"
    }
  }
}

Once this is done, add the trigger for this deployment to our /infra/data_lake.sh script.

Copy this snippet of code and append it to the script.

DATA_LAKE_WORKFLOW_DB='<ENTER_ANY_DEPLOYMENT_NAME>'
ARM_LOCATION_DB='arm/<NAME_OF_ARM_TEMPLATE>'
ARM_PROPS_LOCATION_DB='../conf/<NAME_OF_PROPERTY_FILE>'
# Create Azure Databricks
az group deployment create \
--name $DATA_LAKE_WORKFLOW_DB \
--resource-group $DATA_LAKE_RG \
--template-file $ARM_LOCATION_DB \
--parameters $ARM_PROPS_LOCATION_DB

If you use git, commit this code, pull it from the cloud shell, and execute the script. On successful execution, check the resources using the command below. The newly created Databricks instance should be visible now; this can also be seen via the portal.

az resource list -g <NAME_OF_RESOURCE_GROUP>
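
The ARM deployment itself can also be inspected; for example, the provisioning state and the workspace object declared in the template's outputs section can be queried, as in this sketch:

# Show the provisioning state and outputs of the Databricks deployment
az group deployment show \
--name $DATA_LAKE_WORKFLOW_DB \
--resource-group $DATA_LAKE_RG \
--query "{state: properties.provisioningState, outputs: properties.outputs}"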

Creation of a Massively Parallel Processing Data Warehouse (OLAP)

Once the data is prepared and ready to be served, an MPP data warehouse needs to be provisioned to expose the data to visualization tools.

The core requirements would be:

  • Highly scalable data warehouse
  • Easy to tune performance up, and to tune it down when not needed, to avoid extra costs
  • Easy connectors to the raw storage
  • SQL-based query engine

For this part of the blog, Azure SQL Data Warehouse is used.

This can be created using the az command; append the following snippet to the script data_lake.sh.

DATA_LAKE_DWH_NAME='<ENTER_SQL_DWH_NAME>'
# Create a SQL Data Warehouse
az sql dw create \
--resource-group $DATA_LAKE_RG \
--server $DATA_LAKE_SERVER_NAME \
--name $DATA_LAKE_DWH_NAME

Commit this back into git and execute the script again, then check the list of resources inside the resource group. This should include the newly created SQL Data Warehouse.
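
Since the warehouse incurs compute cost while running, it can be paused when not in use and resumed later; a small sketch using the same variables as above:

# Pause the data warehouse to stop compute billing when it is not needed
az sql dw pause \
--resource-group $DATA_LAKE_RG \
--server $DATA_LAKE_SERVER_NAME \
--name $DATA_LAKE_DWH_NAME
# Resume it again when queries need to run
az sql dw resume \
--resource-group $DATA_LAKE_RG \
--server $DATA_LAKE_SERVER_NAME \
--name $DATA_LAKE_DWH_NAME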

Adjust the script accordingly and do the required cleanup; refer to the sample git repository used for this blog.

This script, /infra/data_lake.sh, is the single-click deploy script for the data lake. It deploys a storage account, an Azure Data Factory instance, an Azure Databricks instance, a SQL Database and a SQL Data Warehouse.

This script can be included in a build pipeline (Azure DevOps), which can enable automated rollouts to Test / Acceptance / Production, or serve as a template for teams within the organization to start building their own data lakes.

The next blog focuses on :

  • Data loading / Extraction
  • Data Preparation
  • Data Sink

The blogs to follow will build on this one with other factors like monitoring, controls, nomenclature, DevOps practices for the data lake, data management, etc. As a follow-up to this blog, you can go here for part 2 of this blog series.
