An Introduction to Streaming ETL on Azure Databricks using Structured Streaming & Databricks Delta — Part II

Nicholas Hurt
Feb 22, 2019 · 9 min read


As discussed in part I, we’re going to set up the following architecture to provide an end-to-end streaming ETL demonstration.

Azure Key Vault

Before creating any other Azure resources, we should start with a Key Vault. Instead of exposing secret keys in the code, we'll store our keys in the vault and reference them using Azure Databricks secrets backed by Key Vault.

Log into the Azure portal and create a resource group, or create one as part of creating the Key Vault. Pick a region and use the same region and resource group throughout this setup. The name of the Key Vault is not important, but for ease of reference here is the list of secrets I reference in the code. If you use the same secret names and store your keys against them as you proceed through the setup, there will be fewer changes to make to the code when you run the demo.

Blob Storage Account key: blobstorage
Data Lake Storage Account key: datalake
AAD app registration key (client secret): databrickssp
SQL Server password: sqldb
SQL Data warehouse password: sqldw
Cosmos DB key: cosmosmaster
Event hub shared access policy key: eventhub

At the very end of the setup you will have a Key Vault with the following secrets created. To add one, simply click Generate/Import, enter the name and paste your secret value.
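For context, once a Key Vault-backed secret scope has been created in Databricks, these secrets are retrieved in a notebook roughly as follows. This is just a sketch; the scope name "keyvaultscope" is my assumption here, and you can call yours whatever you like.

# Sketch: retrieving the secrets above from a Key Vault-backed secret scope
# (the scope name "keyvaultscope" is an assumption; use whatever scope you create)
blob_key     = dbutils.secrets.get(scope = "keyvaultscope", key = "blobstorage")
datalake_key = dbutils.secrets.get(scope = "keyvaultscope", key = "datalake")
sqldw_pwd    = dbutils.secrets.get(scope = "keyvaultscope", key = "sqldw")
eventhub_key = dbutils.secrets.get(scope = "keyvaultscope", key = "eventhub")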

Streaming Source

The first thing we will need is some streaming data. If you have a data warehousing scenario, this may not be so easy. Data may be coming at you from one or more source systems, but moving from batch to real-time, always-on ETL can be complex and difficult. Continuously extracting (pushing or pulling) the data can place a burden on these source systems, so it is something you will have to consider carefully, and at times you may need to be a little creative, for example by running these processes against your hot DR or read-only replica database. You could consider proprietary tools or write your own application to publish these insert/update/delete events (records) to a stream. This is commonly called a "producer" application.

For our demonstration, we will simulate streaming data using a json data generator tool. This useful little application allows you to define a custom JSON schema, generate a "stream" of events and send them to an output destination, one of which is IoT Hub. Unfortunately there was no option to send directly to Event Hub, but this became a good excuse to demonstrate some basic capabilities of Stream Analytics later. If you want to follow along with this blog, download the latest tar file and use 7zip to extract the contents, including the jar file.

The generator requires two files to run: a config file and a workflow file. Download both of these from my github repository and save them in the "conf" folder just below the jar. Open the config file and you will notice it needs a connection string for IoT Hub. If you would like to follow these steps with how-to videos, please refer to the following clips.

Let’s start by provisioning the IoT Hub…

Azure IoT Hub

Once the IoT Hub is ready, drop into the Explorers > IoT Devices section and add a device, which the generator requires. Give it a name, click Save, then click on the device and you will see the connection string needed for the config file.

If you have java installed, cd to the directory where you extracted the jar (above) and run the following command:

java -jar json-data-generator-1.4.1.jar generatorConfig.json

If all is well you should see the generator logging message events…

and over in the metrics section of your IoT Hub you should see an increase in telemetry messages sent…

Shut down the event generator for now and let's continue with the setup. Next we'll need an Event Hub and a Cosmos DB instance before we wire up a Stream Analytics job.

Azure Event Hub

First create the event hub namespace — remember to use the same resource group — which is essentially a container for one or more event hubs. Here you define the pricing tier and throughput units.

Once the resource is ready, add a new event hub, give it a name and click Create. Once the event hub has been created, click into it, choose Shared Access Policies and add a policy for Databricks so that it will be able to read from the stream. Make a note of the connection string key as you'll need this later.
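To give you an idea of where this connection string ends up, here is a rough sketch of how Databricks Structured Streaming reads from Event Hubs using the azure-eventhubs-spark connector. The connector library must be attached to the cluster, the secret scope name is my assumption, and newer connector versions require the connection string to be encrypted first.

# Sketch: reading the Impressions stream from Event Hubs in a Databricks notebook
connection_string = dbutils.secrets.get(scope = "keyvaultscope", key = "eventhub")
eh_conf = {"eventhubs.connectionString": connection_string}
# Recent versions of the connector expect the string to be encrypted instead:
# eh_conf = {"eventhubs.connectionString":
#            sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)}
raw_stream = spark.readStream.format("eventhubs").options(**eh_conf).load()
impressions = raw_stream.selectExpr("CAST(body AS STRING) AS json")  # payload arrives in the binary body column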

Azure Cosmos DB

Next, add a new Cosmos DB. You start by creating a Cosmos DB account using the SQL API as this is currently the only supported configuration with Stream Analytics.

Under Containers — Browse, add a new collection with the following settings:
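For reference, once documents land in this collection they can be read back into Databricks with the azure-cosmosdb-spark connector. A minimal sketch, assuming the connector library is attached to the cluster and using placeholder account, database and collection names:

# Sketch: reading the Adverts collection from Cosmos DB (SQL API)
cosmos_config = {
  "Endpoint":   "https://<cosmos-account>.documents.azure.com:443/",
  "Masterkey":  dbutils.secrets.get(scope = "keyvaultscope", key = "cosmosmaster"),
  "Database":   "<database-name>",
  "Collection": "<collection-name>"
}
adverts = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**cosmos_config).load()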

Azure Stream Analytics

Now we're ready to configure the Stream Analytics job, which will read from IoT Hub as an input source, parse the messages and route each message to one of the above outputs depending on the Entity Type field within it.

Click into the Inputs blade and add a new input from IoT Hub, giving it an alias. Everything else should already be pre-configured for you. Note that Stream Analytics can parse JSON, CSV and Avro.

Fun fact: Stream Analytics supports joins across two or more input sources, including lookups against reference data! The reference data should be slowly changing in nature, must reside in Blob storage and can be at most 300 MB in size.

Next, click on the Outputs section and create one output sink for Event Hub and another for Cosmos DB. Give each output an alias; all the other details should be pre-configured, except for the collection/container name and the document ID, which in our case is advertID. Entering the document ID ensures that Stream Analytics won't create duplicates but will instead perform an upsert. Then open the Query blade and write a query similar to the one below to send the Adverts to Cosmos DB and the Impressions to Event Hub.

SELECT *
INTO "output-cdb"
FROM "input-iothub"
WHERE EntityType='Adverts'
SELECT *
INTO "output-evhub"
FROM "input-iothub"
WHERE EntityType='Impressions'

Click Save. If your query is valid, you should be able to click back to the Overview panel and start the job. Start up the data generator again and, after a short while, watch the monitoring chart in the Stream Analytics overview panel update…

Verify that you can see some data in the Data Explorer section of your Cosmos DB and in the message metrics of your Event Hub namespace. If everything is working, stop the data generator and your Stream Analytics job, and let's set up the final data source, Azure SQL DB.

Azure SQL DB

Create this resource, specify a blank database, add a new server and drop down to the basic pricing tier for this demo.

In the SQL Database resource, open the Query Editor and run this script to populate a small table of domains.
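That table can later be read from Databricks over JDBC. A quick sketch, assuming placeholder server, database and table names (the actual table name comes from the script above) and the secret scope from earlier:

# Sketch: reading the domains lookup table from Azure SQL DB over JDBC
jdbc_url = "jdbc:sqlserver://<server-name>.database.windows.net:1433;database=<database-name>"
domains = (spark.read.format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "<schema>.<domains-table>")
  .option("user", "<admin-user>")
  .option("password", dbutils.secrets.get(scope = "keyvaultscope", key = "sqldb"))
  .load())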

Storage

At this point you can choose to run the demo using Blob storage or Data Lake storage. Bearing in mind that SQL DW uses Azure Blob storage, you will need to create at least one Blob storage account anyway, so go ahead and create that first.

Then create a storage container so that we can create a mount point with DBFS: scroll down to Blobs and create a container. Then head over to Access Keys and make a note of your secret keys.
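Mounting that container in DBFS looks roughly like this. It's only a sketch: the container, account and mount point names are placeholders, and the secret scope name is my assumption.

# Sketch: mounting the Blob storage container with DBFS
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account>.blob.core.windows.net",
  mount_point = "/mnt/blob",
  extra_configs = {
    "fs.azure.account.key.<storage-account>.blob.core.windows.net":
      dbutils.secrets.get(scope = "keyvaultscope", key = "blobstorage")
  })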

Using Azure Data Lake Storage with Databricks requires a service principal with delegated permissions. Start by creating a Gen2 storage account with the hierarchical namespace enabled, and make a note of the keys. Then follow the steps to create a service principal by registering an application, and assign the "Storage Blob Data Contributor" role to the service principal (the registered application) at the level of the resource, i.e. in the access control of the Data Lake storage account. I initially assigned the role at the subscription or resource group level, but this resulted in the error described at the bottom of the FAQ.
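Once the service principal has the role assignment, the Gen2 account can be mounted with an OAuth configuration along these lines. Again, a sketch only: the application ID, tenant ID, filesystem and account names are placeholders.

# Sketch: mounting ADLS Gen2 using the service principal (client credentials)
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "keyvaultscope", key = "databrickssp"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
dbutils.fs.mount(
  source = "abfss://<filesystem>@<storage-account>.dfs.core.windows.net/",
  mount_point = "/mnt/datalake",
  extra_configs = configs)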

Azure SQL Data Warehouse

Next, let's create the Azure SQL Data Warehouse to which we'll stream our data once it has been transformed into a star schema. Note, to keep costs down for this demo I've chosen Gen1 and scaled it down to 100 DWU. To save further on costs, pause it until you're ready to start writing into it, and please remember to pause it again whenever you're not running the demo!

Once that’s created you need to create the master key, so click on Open in Visual Studio, and run the following command.

CREATE MASTER KEY
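The master key is needed by the Databricks SQL DW connector, which stages data in Blob storage and loads it via PolyBase using a database-scoped credential. For orientation, a write with that connector looks roughly like this; it's a sketch with placeholder names, a hypothetical target table, and the Blob account created earlier acting as the staging area.

# Sketch: writing a transformed DataFrame to SQL DW with the Databricks connector
# Make the Blob staging account key available so the connector can forward it
spark.conf.set(
  "fs.azure.account.key.<storage-account>.blob.core.windows.net",
  dbutils.secrets.get(scope = "keyvaultscope", key = "blobstorage"))

# df: a DataFrame produced earlier in the pipeline (placeholder)
(df.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<server-name>.database.windows.net:1433;database=<dw-name>;user=<user>;password=" + dbutils.secrets.get(scope = "keyvaultscope", key = "sqldw"))
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "<schema>.<fact-table>")  # hypothetical target table
  .option("tempDir", "wasbs://<container-name>@<storage-account>.blob.core.windows.net/tempdir")
  .save())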

Azure Databricks

Lastly, let's create our Databricks workspace. For this demo you can take advantage of the trial option with 14 days of free DBUs. Note you are only billed for the time your cluster is running, and you can use the pricing calculator to get an idea of costs. Essentially though, you're paying for the underlying VMs (which are provisioned in an automatically created resource group) and for DBUs, which are units of processing capability per hour. If you take the free trial, you will only pay for the VMs. There are two other aspects to pricing which require some explanation…

In the pricing calculator there are two different workload types: Data Analytics and Data Engineering. Data Engineering workloads are cheaper purely because these are your (scheduled) production jobs which don't require the interactive notebook experience. The billing will automatically be adjusted depending on what type of workload you're running.

The other feature which determines price is the cluster mode. High concurrency is more expensive than standard because (by my understanding) it optimises resource utilization for multi-user/concurrent query scenarios. For our purposes standard is going to be fine.

Once it's provisioned, click Launch Workspace. This takes you to the default homepage of the Databricks workspace.

If you’re new to Databricks, Spark, or notebooks, check out the Getting Started Guide, LinkedIn Learning or Databricks Academy.

A final note before we dive into the demo. To ensure none of the keys from the above resources are exposed in the code, I used a secret scope backed by Azure Key Vault. You could run the demo without doing this; simply replace the dbutils.secrets.get(scope = "", key = "") API calls with your keys.

In the 3rd and final part to this blog I’ll walk you through the notebooks and we’ll run the demo from start to end.
