Azure event ingestion with Event Hub, Data Lake and SQL Database — Part I

Francisco Beltrao
Jan 8, 2018 · 5 min read


In this series I will create a simple application in Azure that sends events to an ingestion service and, over time, parses the raw data and saves it into a SQL database.

Part I: Streaming data to Event Hub and capturing it on Data Lake
Part II: Transforming captured data to CSV and moving it to SQL Azure

Azure Components

Application Diagram
  • Event Hub: event ingestion service, comparable to Kafka in terms of responsibilities
  • Data Lake Store: scalable repository for big data analytics workloads. It is compatible with WebHDFS (the Hadoop Distributed File System REST API), allowing Hadoop to access it
  • Data Lake Analytics: on-demand analytics jobs on top of Data Lake Store
  • Data Factory: ETL service in Azure
  • SQL Azure: hassle-free SQL databases in Azure
  • Azure Container Instances: an easy way to run container-based workloads
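
All of these components can be created from the Azure portal or from the Azure CLI. As a small sketch (the resource group name and location are placeholders of my own, not part of the original setup), everything used in this series can live in a single resource group:

az group create --name <resource group> --location <azure location>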

Scenario

We will collect information from the Meetup Live API and ingest it into Event Hub. In Event Hub we will enable capture, which copies the ingested events at a regular time interval to a Storage account or a Data Lake Store.

Event Hub will save the files into Data Lake Store. These files are in Avro format and their content is JSON. We will then transform all the small files into a single CSV file using Data Lake Analytics.

Why save events in files instead of a database, you might ask? The first reason is price: keeping a high volume of data in a storage service is cheap compared to a database service, and you don’t need to worry about deleting historical data in such a setup. Data Lake Store lets you store the data cheaply and, combined with additional services, process, extract and/or aggregate it. The second reason is retention: Event Hub currently keeps events for at most 7 days, while captured files can be kept as long as you want.

Once the data is available in CSV format we will move it to a SQL Azure database using Azure Data Factory.

Event ingestion with Event Hub

Before we collect Meetup events, let’s create the ingestion part in Azure.

1. Create an Event Hubs namespace named “meetup”. We will send Meetup events to it to be processed.
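
If you prefer the Azure CLI over the portal, a minimal sketch of this step could look like the following (the namespace and resource group names are placeholders; note that capture, used later on, requires at least the Standard tier):

az eventhubs namespace create --resource-group <resource group> --name <namespace name> --location <azure location> --sku Standard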

2. Create a Data Lake Store account. Afterwards, in Data Explorer, create a folder called “meetup”, which will be our working folder for this example.
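
The CLI equivalent, using the Data Lake Store (Gen1) commands and placeholder names, would be roughly:

az dls account create --account <data lake account> --resource-group <resource group> --location <azure location>
az dls fs create --account <data lake account> --path /meetup --folder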

3. Still in Data Explorer, select the root folder and click “Access”. We need to allow Event Hub to access Data Lake Store. Add a new access entry, searching for the “Microsoft.EventHubs” user. After selecting the user, add execute permission, applied as a default entry and to all children.

4. Now let’s give write permission on the “meetup” folder. In Data Explorer, click the “meetup” folder and grant Microsoft.EventHubs read, write and execute permissions recursively (children and default).
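
The portal flow above is the easiest way to set these permissions. If you want to script it instead, the rough CLI equivalent is sketched below; treat it as an assumption-laden sketch: you first need to look up the object id of the Microsoft.EventHubs service principal yourself, and unlike the portal option these commands only touch the paths listed, they do not recurse over existing children.

# Find the object id of the Microsoft.EventHubs service principal (assumption: take it from this output)
az ad sp list --display-name "Microsoft.EventHubs"
# Execute permission on the root folder, as an access and a default entry
az dls fs access set-entry --account <data lake account> --path / --acl-spec "user:<object id>:--x"
az dls fs access set-entry --account <data lake account> --path / --acl-spec "default:user:<object id>:--x"
# Read, write and execute on the meetup folder, as access and default entries
az dls fs access set-entry --account <data lake account> --path /meetup --acl-spec "user:<object id>:rwx"
az dls fs access set-entry --account <data lake account> --path /meetup --acl-spec "default:user:<object id>:rwx"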

5. Go back to the Event Hubs namespace created in step 1. Create the event hub “meetup” inside it with capture enabled. You will have to select the Data Lake Store created earlier as the capture destination.

Once created, Event Hub will save the events as Avro files in Data Lake Store every 5 minutes (if you did not change the default capture options). Next step: send events to the hub.

Sending data to Event Hub

In order to send data to Event Hub we need a connection string. A way to obtain one is to create a “Shared access policy” in the meetup event hub (with Manage, Send and Listen rights, to keep things simple). Once created, click on it and copy the “Connection string–primary key” value for later usage.
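
The same policy and connection string can also be created and retrieved with the CLI; a sketch, with the policy name as a placeholder of my own:

az eventhubs eventhub authorization-rule create --resource-group <resource group> --namespace-name <namespace name> --eventhub-name meetup --name <policy name> --rights Manage Send Listen
az eventhubs eventhub authorization-rule keys list --resource-group <resource group> --namespace-name <namespace name> --eventhub-name meetup --name <policy name> --query primaryConnectionString -o tsv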

There are a few ways to send data to Event Hub. For this example I created a .NET Core console app that listens to HTTP streams and sends them to an Event Hub. Source code is available here. This console app is also available as a docker image on https://hub.docker.com/r/fbeltrao/httpstreamer/.

To run locally you can invoke docker as follows:

docker run -t -d --env HTTPSTREAM_EVENTHUB="{event hub connection string}" --env HTTPSTREAM_URL="/2/open_events" --env HTTPSTREAM_HOST="stream.meetup.com" fbeltrao/httpstreamer:0.3

To run this application in Azure, the easiest option is Azure Container Instances (ACI). The snippet below assumes you are familiar with the Azure CLI; it runs the docker image in a container instance:

az container create -g <resource group> --name <name> --image fbeltrao/httpstreamer:0.3 --environment-variables HTTPSTREAM_EVENTHUB="<event hub connection string>" HTTPSTREAM_URL="/2/open_events" HTTPSTREAM_HOST="stream.meetup.com" --location <azure location>

To verify that the container instance is running, use this CLI command:

az container list -o table

Once it has succeeded, you can check the logs:

az container logs -n <aci name> -g <resource group>

After 5 minutes you should start seeing files being copied to the Data Lake Store.
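
You can also list the captured Avro files from the CLI (the account name is a placeholder):

az dls fs list --account <data lake account> --path /meetup -o table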

Conclusion

In part I of this series we sent Meetup data to Event Hub and, with simple configuration, made the same data available on Data Lake Store. Once the content is available there, we can run analytics jobs to extract information from the raw data. Check part II of this series to see one way to accomplish it.
