How to analyze streaming data generated by IoT devices

Ran Gong
Slalom Technology
Nov 2, 2018

The world is now flooded with real-time data, with billions of IoT devices streaming zettabytes of it. Extracting useful insights and making the right business decisions based on all of this data has become a huge challenge. Recently, I had the pleasure of assisting a client on an IoT migration related to this problem (click here to learn more about edge computing in IoT). During this effort, we gathered the streaming data generated by the devices into Azure Data Lake Store (an Azure repository for analytics, more on this later). Then a question stood in front of us: what is the best way to analyze this real-time data and integrate it with historical data stored in other parts of the system, such as Azure SQL Data Warehouse?

To add more context, let's walk through a real-life use case. Assume you own a fitness tracker company: your customers wear your trackers daily, and the health data gathered from the trackers (such as calories burnt, workout activities, steps, etc.) is stored in Azure Data Lake Store. If you want to run a daily analytics report covering everything collected that day, what would you do? This is where Azure Data Lake Store, Azure Data Factory, and Azure Databricks come in!

Why the trio?

Azure Data Lake Store (ADLS): ADLS is an enterprise-wide, highly scalable repository for big data analytics workloads. Unlike Azure Blob Storage, which is designed for a wide variety of general-purpose storage scenarios, ADLS Gen 1 is a hyper-scale repository optimized specifically for big data analytics. In addition, ADLS imposes no limits on the amount of data or on file sizes. This means we can not only store massive amounts of real-time data, but also arrange it in a folder structure by date, which makes later analytics much easier.

Azure Data Factory: Azure Data Factory lets us build, schedule, and monitor data-driven workflows (called data pipelines). It can move and transform data across services such as ADLS, Azure SQL Data Warehouse, Azure Databricks, and Azure Machine Learning. This makes it a perfect fit for joining the streaming data from IoT devices with our historical data in the warehouse for analysis.

Azure Databricks: We chose Azure Databricks to run the analysis logic because it is fast, it integrates directly with Azure Active Directory (AAD) for security, and it is great for team collaboration thanks to its shared workspaces, clusters, and jobs. You can also trigger a Databricks notebook directly from an Azure Data Factory pipeline!

In this tutorial, you will learn how to:

  1. Create a notebook in Azure Databricks to show all the user data retrieved on the current day
  2. Set up a Data Pipeline in Azure Data Factory to execute this notebook
  3. Run a trigger to execute the data pipeline daily

Prerequisites:

  1. An Azure subscription
  2. Credentials for a service principal, or permission in Azure Active Directory to create one (here’s how)
  3. An existing ADLS account. In this tutorial, my ADLS file structure looks like this: tutorialdls/data/<current_year>/<current_month>/physical_data_<current_day>.csv (see the small path-builder sketch below)
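As a quick sanity check on that naming convention, here is a small Python sketch (purely illustrative, not used elsewhere in the tutorial) that builds the expected path for a given date:

from datetime import date

def physical_data_path(day: date) -> str:
    # Follows the data/<year>/<month>/physical_data_<day>.csv layout described above
    return f"data/{day:%Y}/{day:%m}/physical_data_{day:%d}.csv"

print(physical_data_path(date(2018, 11, 2)))  # data/2018/11/physical_data_02.csv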

Let’s get started!

1. Create a Databricks resource

Go to the Azure dashboard and click Create a resource in the top left. Search for Databricks and click Create. After the service is deployed, click Go to resource and then Launch Workspace.

2. Create a cluster

Click Clusters in the left navigation bar, then click Create Cluster. Name your cluster tutorialCluster, leave everything else at the default settings, and click Create Cluster.

3. Create a new notebook

Go back to the home page by clicking Azure Databricks in the left menu. Click Create a Blank Notebook. In the popup window, name your notebook tutorialNotebook and leave the language as Python.

4. Create a Service Principal for Databricks

Create a service principal in Azure Active Directory (here’s how), and save the application (client) ID, the client secret (credential), and the refresh URL for later use.
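For reference, here is how those values line up with the OAuth configuration keys used in the next step (the token endpoint format is the standard AAD one; substitute your own tenant ID):

# Mapping from service principal values to the ADLS OAuth settings used below:
#   application (client) ID      -> dfs.adls.oauth2.client.id
#   client secret (credential)   -> dfs.adls.oauth2.credential
#   refresh URL (token endpoint) -> dfs.adls.oauth2.refresh.url,
#                                   typically https://login.microsoftonline.com/<tenant_id>/oauth2/token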

5. Mount ADLS in Databricks

Open the notebook you just created and enter the following code.

Add the service principal's credentials here to give Databricks access to ADLS:

# OAuth settings for the service principal that grants Databricks access to ADLS Gen 1
configs = {
  "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
  "dfs.adls.oauth2.client.id": "<client_id>",
  "dfs.adls.oauth2.credential": "<client_credentials>",
  "dfs.adls.oauth2.refresh.url": "<refresh_url>"
}

Mount an existing ADLS:

Add the source path based on your ADLS account's name. For example, my source is adl://tutorialdls.azuredatalakestore.net/data/.

dbutils.fs.mount(
  source = "adl://<data_lake_store_name>.azuredatalakestore.net/<folder_name>/",
  mount_point = "/mnt/<name_your_mount_point>",
  extra_configs = configs)

Press Shift+Enter to run the cell. You will see a result indicating the folder has been mounted successfully.
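To confirm the mount worked, you can list the mounted folder (using whatever mount point name you chose above); you should see the date-based folders from the ADLS structure:

# List the mounted ADLS folder to verify the mount
display(dbutils.fs.ls("/mnt/<name_your_mount_point>"))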

6. Get input value from Data Pipeline in Databricks

Later in the tutorial, we will configure Data Factory to pass the name of the file we want to read into the notebook as a parameter called physicalDataDb. Here we retrieve that value:

# Build the full path to today's file from the mount point and the value passed in by Data Factory
physicalDataDb = "/mnt/<name_your_mount_point>/" + getArgument("physicalDataDb")
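getArgument reads the value that Data Factory supplies at run time. If you want to test the notebook interactively before the pipeline exists, one option is a widget with a default value (the sample file name below is just an example that follows the folder structure from the prerequisites):

# Optional: a text widget lets the notebook run interactively with a sample file name
dbutils.widgets.text("physicalDataDb", "2018/11/physical_data_02.csv")
physicalDataDb = "/mnt/<name_your_mount_point>/" + dbutils.widgets.get("physicalDataDb")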

7. Display all daily physical data

# Read the CSV from ADLS into a Spark DataFrame and display it
physicalData = sqlContext.read.format('com.databricks.spark.csv') \
  .options(header='true', inferSchema='false').load(physicalDataDb)
physicalData.show()
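From here you can run whatever analysis the daily report needs. As a minimal sketch, assuming the CSV contains columns such as user_id, steps, and calories (invented column names for illustration; use your own schema), a per-user daily roll-up could look like this:

from pyspark.sql import functions as F

# Total steps and calories per user for the day (column names are assumptions)
dailySummary = (physicalData
    .groupBy("user_id")
    .agg(F.sum(F.col("steps").cast("long")).alias("total_steps"),
         F.sum(F.col("calories").cast("double")).alias("total_calories")))
dailySummary.show()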

8. Create an Azure Data Factory

Create a new data factory and name it tutorialDataFactory. Click Author & Monitor to open the dashboard.

9. Create a Data Pipeline

On the home page, click Create pipeline. Or, from the edit (pencil) button in the left menu, click '+' to add a new Data Factory resource and select Pipeline. Name it DailyDataPipeline.

10. Add a parameter to Data Pipeline

Switch to the Parameters tab and click + New. Create a parameter named startDate of type String; it will carry the date and time of the trigger run and is referenced by the notebook path expression later in the tutorial.


11. Generate an access token in Databricks

To connect Data Factory to Databricks, we need to generate an access token. Go to the Databricks workspace, click the user icon in the top right, and select User Settings. Click Generate New Token, fill in a description for the token, and copy the generated token somewhere safe for later use.

12. Connect with Databricks Service

Go back to Azure Data Factory and open the pipeline you just created. Under Databricks in the Activities pane, drag a Notebook activity onto the canvas. Switch to the Azure Databricks tab and click the + New button.

In the new window, name the Databricks linked service TutorialDatabricksService, then select your Azure subscription and Databricks workspace. Select a cluster and paste the access token generated in the last step into the form. Then click Finish.

Don't forget to click Test Connection to verify that the connection between Databricks and Data Factory has been established.

13. Connect with your notebook in Databricks Service

Navigate to the Settings tab and click Browse to choose the notebook we created earlier. Then, under Base parameters, add the physicalDataDb parameter that the notebook reads, and build its value from the pipeline's startDate parameter. Type in:

@{formatDateTime(pipeline.parameters.startDate,'yyyy')}/@{formatDateTime(pipeline.parameters.startDate,'MM')}/physical_data_@{formatDateTime(pipeline.parameters.startDate,'dd')}.csv

For a startDate of November 2, 2018, this resolves to 2018/11/physical_data_02.csv, matching the folder structure in ADLS.

14. Create a trigger in Azure Data Factory to run the notebook logic daily

In the pipeline editor, click Add trigger and create a new schedule trigger that recurs once a day. When attaching the trigger, map the pipeline's startDate parameter to the trigger's scheduled time (for example, @trigger().scheduledTime) so that each run picks up that day's file.

15. View result

Click the Monitor button in the Data Factory workspace to view all the pipeline runs.

Click on one of the runs to see its output and result.

Summary:

In this tutorial, you have learned how to run a daily analysis against the streaming data generated by IoT devices with the following steps:

  1. Load and process data from ADLS in Azure Databricks
  2. Connect to Azure Databricks from an Azure Data Factory pipeline
  3. Set up a trigger to run the pipeline daily

Hope you enjoyed the tutorial!

Special shout-out to Ivan Campos and Rishabh Saha, who kindly offered so much constructive advice on this article. And thanks so much to Aishwarya Bhadouria, who encouraged me to write and share my story here!
