Telemetry & usage data collection with Azure ETL tools (Part 1)

Shay Schwartz
11 min read · Aug 31, 2020


Microsoft ETL and workflow tools: Azure Logic Apps & Azure Data Factory

Target Audience: developers & DevOps engineers who are interested in understanding Azure ETL tools — Azure Logic Apps and Azure Data Factory. The article discusses ETL with a focus on ETL for logs, telemetry, and usage data collection.

Overview:

A cloud service should be alive, healthy, and responsive 24/7, with good performance and low latency. The product should be monitored to achieve high responsiveness and availability, and the team should be able to understand the origin and impact of any issue as quickly as possible.

In addition, stakeholders, PMs, and developers count on usage data to understand which features are being used and how, and to make data-driven decisions about ongoing features and the priorities of new features. Usage data is also required to verify the different system KPIs.

Here are some of the challenges we face when collecting telemetry and usage data:

  • Large amounts of data
  • Data should be collected from different components using different technologies
  • Data might be stored in different formats across different data sources and geo-locations
  • Data reliability and completeness are highly important
  • Data should be stored in a format that is efficient to read

The focus of this article is collecting and processing the data from these various sources.

How do ETL tools assist with telemetry and usage statistics?

ETL and data migration tools assist with building a pipeline (or workflow) for copying data from different sources, transforming its format, enriching it, aggregating it, and eventually storing it in a single place and format. In addition, these tools have strong capabilities for monitoring pipeline execution, reporting errors, handling retries, and rerunning. We get all of these benefits without writing any code. It’s hard to find a one-size-fits-all tool, though, so these tools should provide flexible extension points.

Azure provides a few advanced SaaS tools to do just that, with a large set of out-of-the-box capabilities that assist with our mission.

In this article we will cover Azure Logic Apps and Azure Data Factory and see how we can use each of them to manage our telemetry and usage data processing.

What are the requirements for such tools?

In a nutshell, here is a list of requirements that such a data pipeline tool should provide.

  1. Built-in support for multiple data sources
  2. Data pipeline / workflow management
  3. Triggers for pipelines
  4. Handling large amounts of data
  5. Out-of-the-box functionality vs. running custom code
  6. Monitoring and error handling
  7. Security
  8. Deployment — manual and continuous delivery using infrastructure as code

(Sections 6–8 will be discussed in Part 2).

About Azure Logic Apps

Azure Logic Apps is a cloud service that helps you schedule, automate, and orchestrate tasks, business processes, and workflows when you need to integrate apps, data, systems, and services.

Example: when a new report file is created in OneDrive, parse the file, save it into a database, and automatically send an email summarizing the report.

Logic Apps — when a file is created, process it, send it to a queue, and send an email

Logic Apps is a powerful platform that supports hundreds of built-in connectors, each providing triggers and actions for a specific product:

  • Database connectors
  • Storage connectors
  • Flow Management connectors
  • Transform & Parsing Connectors
  • Connectors for Ticketing systems and other 3rd parties

About Azure Data Factory (ADF):

Azure Data Factory is Azure’s cloud ETL service for scale-out serverless data integration and data transformation.

Using ADF you can easily create and manage pipelines: a set of actions that defines a complete process of copying, transforming, enriching, filtering, and aggregating data. To be precise, it’s not exactly a set but a DAG (directed acyclic graph), in which actions may depend on one another and can run serially or in parallel.
Advanced or heavy procedures can be executed using SSIS, Azure Databricks, Spark, and others.

ADF provides connectors for over 100 databases, file storage systems, services, and apps.

An example of an ADF pipeline:

ADF DAG — Copy logs from various sources

Azure Logic Apps vs. Azure Data Factory — choosing the right one for telemetry and usage data

Built-in support for multiple data sources — different clouds and on-premises

A cloud solution is composed of different components, built by different teams and groups, and each one keeps its telemetry in a different place. Some components might run on a customer’s on-premises environment or on other vendors’ clouds. The first step in building a telemetry and usage data solution is to collect all log data into a central repository.

Whether the logs are in CSV files, an RDBMS, a NoSQL DB, or another service, our solution should be able to connect and collect the data.

Both solutions (Logic Apps & ADF) have many built-in connectors designed to retrieve data from databases, blob storage accounts, OneDrive, Google Drive, and other sources on Azure, AWS, Google Cloud, and more. In addition, both solutions support generic data sources through ODBC, OData, and REST APIs.

To access on-premises data, Logic Apps provides the “on-premises data gateway for Azure Logic Apps” and ADF provides the self-hosted integration runtime.
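To make the generic data-source support concrete, here is a minimal sketch in plain Python (not a Logic Apps or ADF feature) of what such a collection step conceptually does: pull telemetry from a REST endpoint and land it in Azure Blob Storage. The endpoint URL, container name, and connection string are hypothetical placeholders.

```python
# Conceptual "copy from a REST source to blob storage" step.
# The endpoint, connection string, and container are placeholders.
import json

import requests
from azure.storage.blob import BlobServiceClient


def collect_telemetry(api_url: str, conn_str: str, container: str, blob_name: str) -> None:
    # Pull raw telemetry from a generic REST API.
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    events = response.json()

    # Land the raw payload in blob storage for downstream transformation.
    blob_service = BlobServiceClient.from_connection_string(conn_str)
    blob_client = blob_service.get_blob_client(container=container, blob=blob_name)
    blob_client.upload_blob(json.dumps(events), overwrite=True)


collect_telemetry(
    api_url="https://example.com/api/telemetry",    # hypothetical source
    conn_str="DefaultEndpointsProtocol=https;...",  # placeholder connection string
    container="raw-telemetry",
    blob_name="events-2020-08-31.json",
)
```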

In the next step, we need to transform the data format and enrich our telemetry and usage data with information from other services (for example, customer name or geo-location).
To support such scenarios, both solutions offer a wide range of connectors. Logic Apps offers more connectors than ADF; ADF, on the other hand, supports working with Spark, MapReduce, Hive, and machine learning tools.

Data pipeline — workflow management

According to Wikipedia:

A data pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in a time-sliced fashion.

A pipeline can also be described as a DAG in which the actions are the nodes and the edges are the dependencies between actions; together they describe the complete workflow.

In this article I will use Pipeline and Workflow as synonyms.
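To make the DAG idea concrete, here is a tiny Python sketch (independent of both products, with illustrative action names) that models a pipeline as actions with dependencies and executes them in dependency order:

```python
# Toy pipeline-as-DAG: nodes are actions, edges are dependencies.
# Each action runs only after all of its dependencies have finished.
from graphlib import TopologicalSorter  # Python 3.9+


def copy_data():  print("copy raw telemetry to a central DB")
def transform():  print("normalize the data into a single structure")
def aggregate():  print("aggregate to reduce storage size")
def enrich():     print("add customer name / geo-location metadata")


# action -> the set of actions it depends on
dag = {
    transform: {copy_data},
    aggregate: {transform},
    enrich:    {transform},  # aggregate and enrich could also run in parallel
}

for action in TopologicalSorter(dag).static_order():
    action()
```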

To prepare our telemetry and usage data, we need to design a data pipeline with the following steps:

  1. Copy data to a central DB — collect all telemetry
  2. Data transformation — normalize the data into a single structure
  3. Data aggregation — reduce the storage size and make the data better suited for information retrieval
  4. Enrichment — add metadata for later use, for example the client’s geo-location

Each of the above steps consists of one or more actions. An action is the most basic unit of work; for example: copy data from table X to table Y.
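As a rough illustration of what these steps do to the data (pure Python, with hypothetical field names), here are a couple of raw events flowing through normalization, aggregation, and enrichment:

```python
from collections import Counter

# Hypothetical raw events from two components with different schemas.
raw = [
    {"cust": "c1", "evt": "login", "ts": "2020-08-31T00:05:00Z"},
    {"CustomerId": "c1", "EventName": "login", "Time": "2020-08-31T00:07:00Z"},
]

# Transformation: normalize both schemas into a single structure.
def normalize(event):
    return {
        "customer_id": event.get("cust") or event.get("CustomerId"),
        "event_type": event.get("evt") or event.get("EventName"),
    }

normalized = [normalize(e) for e in raw]

# Aggregation: many rows become one count per (customer, event type).
counts = Counter((e["customer_id"], e["event_type"]) for e in normalized)

# Enrichment: attach metadata (here a hypothetical geo lookup) for later use.
geo = {"c1": "EU"}
usage = [
    {"customer_id": c, "event_type": t, "count": n, "region": geo.get(c)}
    for (c, t), n in counts.items()
]
print(usage)  # [{'customer_id': 'c1', 'event_type': 'login', 'count': 2, 'region': 'EU'}]
```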

It’s highly important that our telemetry is reliable: data must be as accurate and complete as possible. Only when all actions complete successfully, in the right order, can we be certain the data processing succeeded and mark the pipeline execution as done. The next article (Part 2) will discuss how to monitor and handle failures.

Both solutions, Logic Apps & ADF, provide a wide set of sequence-management actions, including loops and conditions, serial and parallel execution, and great visualization tools to build and monitor the sequence.

Two differences should be pointed out in this section:

  1. A Logic App action usually performs a single operation, like executing a read query or writing a row, while ADF supports heavier actions, like copying and transforming a large data set in a single action. It may sound like a small difference, but it makes our life much easier for our purpose. (More details in the “Handling large amounts of data” section.)
  2. When it comes to large workflows, in my opinion, ADF pipelines provide better sequence management, which is more flexible and clear. Both Logic App playbooks and ADF pipelines can be broken into smaller units that call each other, passing parameters in both directions. But when you split the playbook or pipeline you lose context, which makes tracking and debugging less convenient.

See the following Logic App playbook example:

Logic Apps — a pipeline triggered by a sliding window trigger that reads data from SQL & Kusto, transforms it, and loads it

ADF was designed as a pipeline-management tool. As such it fully supports workflow management — actions for ETL, dependencies between actions, conditions, and loops over operations. Its visualization is very clear and helpful even for complex workflows.

An example ADF pipeline:

ADF — move data between tables, manipulate it, and send it to machine learning tools

In the above example I used a Copy Data activity, a Mapping Data Flow, and a Machine Learning activity.

Data flows allow us to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters.

Here is the data flow action I have created for that task (Merge T2 and T4 into T5):

ADF Data Flow — codeless data transformation
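For readers who are curious what such a codeless data flow translates to, here is a rough PySpark sketch of the same merge. The table paths, storage account, and join key are hypothetical; the actual Spark code behind a mapping data flow is generated and managed by ADF.

```python
# Rough PySpark equivalent of the "merge T2 and T4 into T5" data flow.
# Paths, storage account, and the join column "customer_id" are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-t2-t4").getOrCreate()

t2 = spark.read.parquet("abfss://telemetry@mystore.dfs.core.windows.net/T2")
t4 = spark.read.parquet("abfss://telemetry@mystore.dfs.core.windows.net/T4")

# Join the two inputs on a shared key to produce one unified table.
t5 = t2.join(t4, on="customer_id", how="inner")

t5.write.mode("overwrite").parquet("abfss://telemetry@mystore.dfs.core.windows.net/T5")
```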

Pipeline trigger:

A pipeline execution can be time-driven or event-driven. For telemetry and usage data we would prefer time-driven execution, for example every hour, day, or week.

In most cases we would like the pipeline to run over sliding time windows. Telemetry usually requires a high execution frequency, for example every 5 minutes, while for usage data once a day is more than enough in most cases.

Please keep in mind that the processed data has its own start and end time (a “logical time”), which differ from the actual execution time of the pipeline. For example, if our pipeline runs each night at 02:00 AM and takes 15 minutes, the pipeline execution time is 02:00–02:15 AM, while the data window may start at midnight the day before and end at midnight today.
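Here is a small sketch (plain Python, hypothetical schedule) of deriving that logical window from the wall-clock execution time of a nightly run:

```python
# For a pipeline that runs nightly at 02:00 and processes the previous
# calendar day, compute the (start, end) of the data window it should read.
from datetime import datetime, time, timedelta, timezone


def logical_window(execution_time: datetime):
    end = datetime.combine(execution_time.date(), time.min, tzinfo=timezone.utc)
    start = end - timedelta(days=1)
    return start, end


start, end = logical_window(datetime(2020, 8, 31, 2, 0, tzinfo=timezone.utc))
print(start, end)  # 2020-08-30 00:00:00+00:00  2020-08-31 00:00:00+00:00
```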

Both Logic Apps & ADF support schedule triggers, sliding window triggers (tumbling window triggers in ADF), event-based triggers, and HTTP triggers. Logic Apps provides dozens of event-based triggers, while ADF provides only Azure Blob storage event triggers. For telemetry and usage data we would mostly use the schedule or sliding window triggers.

Handling large amounts of data

Let’s say that our enterprise cloud service retrieves and processes 1 million events per hour, per customer, and that we have 1,000 customers. That adds up to a billion telemetry events per hour. How could you collect, process, transform, enrich, aggregate, and persist such a large amount of data?

Luckily, Azure services provide as much compute and storage as you can ask for. Still, it’s your responsibility to design an efficient and cost-effective pipeline, based on the following recommendations:

  • Aggregation queries — if possible, consider saving aggregated data instead of raw data. Aggregation queries may reduce the amount of data from billions of rows to a few thousand, saving storage, processing cost, and time.
  • Use a single action to read and write — for example, a SQL “INSERT INTO … SELECT” query is more efficient than using one action to read all the data, a second action to transform it, and a third action to write it. This is possible only in some scenarios, depending on the data’s source and destination and on user permissions.
  • Paging / batching — extract-transform-load a small number of rows at a time, to prevent timeouts and keep memory usage bounded (see the sketch after this list).
  • Integration Runtime — the Integration Runtime (IR) is the compute infrastructure Azure Data Factory uses to provide its data integration capabilities. It bridges activities and linked services, and provides the compute environment where an activity runs. This way, the activity can be performed in the region closest to the target data store or compute service, giving the best performance while meeting security needs.
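Here is a minimal paging/batching sketch with pyodbc; the DSNs, table, and column names are hypothetical. It reads the aggregated telemetry for one logical window in pages and writes it to the central store in bounded-size batches, combining the aggregation and batching recommendations above:

```python
# Read aggregated telemetry in pages and write it in bounded batches.
# DSNs, tables, and columns are placeholders.
from datetime import datetime

import pyodbc

BATCH_SIZE = 10_000
window_start = datetime(2020, 8, 30)  # logical window, see the trigger section
window_end = datetime(2020, 8, 31)

src = pyodbc.connect("DSN=telemetry_source")  # hypothetical source DSN
dst = pyodbc.connect("DSN=central_store")     # hypothetical destination DSN

read_cur = src.cursor()
read_cur.execute(
    "SELECT customer_id, event_type, COUNT(*) AS cnt "
    "FROM events WHERE event_time >= ? AND event_time < ? "
    "GROUP BY customer_id, event_type",
    window_start, window_end,
)

write_cur = dst.cursor()
while True:
    rows = read_cur.fetchmany(BATCH_SIZE)  # page through the result set
    if not rows:
        break
    write_cur.executemany(
        "INSERT INTO hourly_usage (customer_id, event_type, cnt) VALUES (?, ?, ?)",
        [tuple(r) for r in rows],
    )
    dst.commit()  # commit one bounded batch at a time
```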

Out-of-the-box functionality vs. running custom code

While developers like to design and implement their own solutions exactly as they imagine, there are proven benefits to using state-of-the-art frameworks and minimizing custom code.

Using frameworks and proven tools speeds up the development process, reduces bugs and maintenance costs, and ultimately saves time and money.

ETL / data migration tools should provide common built-in functionality so that we can focus on our business logic. Built-in functionality includes:

  • Flow management — conditions, loops, splitting flows into parallel steps or merging them back into serial steps
  • Quick & easy integration with multiple data sources and APIs
  • Handling massive amounts of data
  • Data transformers
  • Scheduling, retry & error handling

There are cases where the built-in functionality is not enough and we must extend it with our own logic. For example, at one of my former companies we developed an analytical engine. Our pipeline included executing our own algorithms to analyze the data and rank users’ scores. The system design was based on a workflow pipeline that synchronized all the different tasks, but each task was implemented as a small unit of Java code.

When discussing ETL for telemetry & usage data, the built-in functionality should be enough and no code should be required. However, we still want our tools to support custom code for the rare cases that require it. For example, one of my colleagues had to parse the content of HTML emails as part of the pipeline. To implement this scenario, he chose to write the logic in Python as an Azure Function and call it from within the Logic App playbook.

Here are some options to execute your own code in Logic Apps & ADF: Logic Apps can call Azure Functions or run small inline JavaScript code actions, while ADF offers (among others) the Azure Function activity, Databricks notebook activities, stored procedure activities, and custom activities running on Azure Batch.
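As an illustration of the Azure Functions option, here is a hedged sketch of an HTTP-triggered function (Python, v1 programming model) of the kind a Logic App playbook could call to extract the text of an HTML email body. The extraction logic and response shape are hypothetical:

```python
# HTTP-triggered Azure Function (Python v1 model) that strips HTML tags
# from the request body and returns the visible text as JSON.
import json
from html.parser import HTMLParser

import azure.functions as func


class _TextExtractor(HTMLParser):
    """Collects the visible text chunks of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def main(req: func.HttpRequest) -> func.HttpResponse:
    html_body = req.get_body().decode("utf-8", errors="replace")
    extractor = _TextExtractor()
    extractor.feed(html_body)
    # Return the extracted text so the playbook can store or route it.
    return func.HttpResponse(
        json.dumps({"text": " ".join(extractor.chunks)}),
        mimetype="application/json",
    )
```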

Summary and Recommendations (for Part 1):

Both tools are impressive and provide strong capabilities for creating, monitoring, and executing workflows. Both have built-in integration with many data sources and 3rd-party applications, and the number of actions and connectors keeps growing over time.

Both provide rich visual tools for creating and monitoring workflows, and can also be fully managed through PowerShell and ARM templates.

User permissions and access to resources and data sources are supported via all the standard Azure methods and best practices (spoiler for Part 2).

Prefer to use Logic Apps when:

  • The trigger for the playbook or pipeline is event-based (e.g., a new email arrives, or a new event is sent to a queue).
  • Your pipeline requires integration with many 3rd-party applications; it’s more likely that those applications have connectors for Logic Apps.

Prefer to use ADF when:

  • Your pipeline needs to copy & transform large amounts of data.
  • Your pipeline includes a long list of actions; ADF provides a better way to track and define the relations between the different actions.
  • Your pipeline requires integration with machine learning capabilities.

For our use case, telemetry and usage data, we would prefer ADF because of the large amount of data that needs to be processed. The Azure Integration Runtime provides compute on demand to handle that volume, and ADF lets us easily control the read/write batches.

ADF was designed to solve ETL problems, and as such it is better suited to the task.

That said, Logic Apps and ADF are complementary products: each can execute the other and use its results as part of its own pipeline.

In the next article (Part 2) we will discuss production aspects of Logic Apps & ADF.
