Telemetry & usage data collection with Azure ETL tools (Part 2)

Shay Schwartz
6 min read · Sep 6, 2020


Microsoft ETL and workflow tools: Azure Logic Apps & Azure Data Factory

Following the previous article (Telemetry & usage data collection with Azure ETL tools, Part 1), this article discusses the production aspects of our ETL tools: monitoring, deployment and security.

What are the requirements for such an infrastructure?

In a nutshell, here is a list of capabilities that such a data pipeline tool should provide.

  1. Built-in support for multiple data sources
  2. Data pipeline / workflow management
  3. Triggers for pipelines
  4. Handling large amounts of data
  5. Out-of-the-box functionality vs. running custom code
  6. Monitoring and error handling
  7. Security
  8. Deployment — manual and continuous delivery using infrastructure as code

Items 1–5 were discussed in Part 1.

Azure Logic Apps vs. Azure Data Factory — choosing the right one for telemetry and usage data (cont.)

Monitoring and error handling

As discussed in previous sections, having reliable, complete telemetry is critical for a cloud service. If the telemetry is not trustworthy, we are not aware of the real system health. Usage statistics might be less critical, but without reliable data our decisions will not be good enough.

Therefore, the telemetry & usage data pipeline must also be monitored.

If one or more actions fail, we expect workflow tools to support the following abilities:

  • Retry mechanism for an action: overcome temporary downtime.
  • Rerun for a date or range of dates — if we discover issues with the data, or lost data, we probably want to fix the origin of the failure and rerun the lost day(s). Before rerunning a day that has already been executed, remember to remove or mark the old data for that date to prevent duplicate rows (a PowerShell sketch of such a rerun appears below).
  • “Debug” tools — visibility into what happened during the pipeline execution, including execution parameters, action results, point of failure and error message.
  • Dashboards, monitors & alerts — report on failures, send an email or open a ticket in your ticketing system, such as Azure DevOps, Jira or ServiceNow.
Comparison of the monitoring, retry and rerun abilities of Logic Apps and ADF
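To illustrate the rerun and “debug” abilities on the ADF side, here is a minimal PowerShell sketch using the Az.DataFactory module. It assumes a pipeline that exposes a date parameter (here called windowDate); the resource group, factory, pipeline and parameter names are hypothetical placeholders, not values from this article.

```powershell
# Minimal sketch: rerun a lost day and inspect the run history in ADF.
# Resource group, factory, pipeline and parameter names are hypothetical placeholders.
$rg       = "telemetry-rg"
$factory  = "telemetry-adf"
$pipeline = "CopyTelemetryDaily"

# Rerun the pipeline for a lost day (remove or mark the old rows for that date first).
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $rg `
    -DataFactoryName $factory `
    -PipelineName $pipeline `
    -Parameter @{ windowDate = "2020-08-30" }

# List recent pipeline runs and their status (Succeeded / Failed / InProgress).
Get-AzDataFactoryV2PipelineRun -ResourceGroupName $rg -DataFactoryName $factory `
    -LastUpdatedAfter (Get-Date).AddDays(-7) -LastUpdatedBefore (Get-Date)

# Drill into the activity runs of the rerun to find the point of failure and the error message.
Get-AzDataFactoryV2ActivityRun -ResourceGroupName $rg -DataFactoryName $factory `
    -PipelineRunId $runId `
    -RunStartedAfter (Get-Date).AddDays(-1) -RunStartedBefore (Get-Date).AddDays(1)
```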

Security

In this section we will discuss two types of permissions:

  • Pipeline management and execution permissions.
  • Access to data sources and remote APIs

Pipeline management & execution permissions:

Both Logic Apps playbooks & ADF are Azure resources. Any access to those resources is protected by Azure role-based access control (RBAC). A user or service that wants to perform any operation (view, edit, execute) through the Azure Portal or Azure Resource Manager is required to have the relevant roles.
A role is a set of permissions on Azure resources.
For example, you can find the following roles:

  • Data Factory Contributor
  • Logic App Contributor
  • Logic App Operator

One difference between Logic Apps playbooks & ADF is the scope of the permissions. Each Logic App playbook is its own resource in Azure, while a single ADF may contain multiple pipelines. As a result, system administrators can assign different users to different playbooks, but all pipelines in a single ADF will be used by the same users.
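For example, here is a hedged PowerShell sketch (Az.Resources module) of how such roles could be granted. The user names, resource group and playbook name are hypothetical placeholders.

```powershell
# Sketch only: users, resource group and resource names are hypothetical placeholders.
$rg = "telemetry-rg"

# Grant a user the Data Factory Contributor role on the whole resource group
# (she can then manage every pipeline in every factory in that group).
New-AzRoleAssignment -SignInName "dana@contoso.com" `
    -RoleDefinitionName "Data Factory Contributor" `
    -ResourceGroupName $rg

# Grant another user permission to run (but not edit) a single Logic App playbook.
New-AzRoleAssignment -SignInName "omer@contoso.com" `
    -RoleDefinitionName "Logic App Operator" `
    -Scope "/subscriptions/<subscription-id>/resourceGroups/$rg/providers/Microsoft.Logic/workflows/telemetry-playbook"
```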

Access to data sources and remote APIs:

In order to access databases, storage accounts, the ARM API or external APIs, the playbook or pipeline must identify itself with credentials. There are a few supported authentication methods. In addition, all passwords / secrets / certificates must be stored in a secure way.

In ADF you create a linked service for each data source. Each linked service configuration holds the connection details and credentials. Depending on the connector, you may be asked to provide one of the following credentials:

  • Service Principal / Managed Identity (Azure)
  • Access Key (Amazon)
  • SQL Authentication or Connection String (databases)

Note 1: The secret (or password) should be stored in Azure Key Vault. ADF should not store the password directly but should access the Key Vault at runtime to retrieve the secret.

Note 2: When using the Web activity to retrieve or send data from within ADF, you can use basic authentication, managed identity or a client certificate.
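To make Note 1 concrete, here is a minimal sketch of an Azure SQL linked service whose connection string is resolved from Key Vault at runtime, deployed with the Az.DataFactory module. It assumes a Key Vault linked service already exists in the factory; the factory, linked service, vault and secret names are hypothetical placeholders.

```powershell
# Sketch: an Azure SQL linked service that pulls its connection string from Key Vault at runtime.
# Factory, linked service, vault and secret names are hypothetical placeholders, and a Key Vault
# linked service named "TelemetryKeyVault" is assumed to exist already.
$definition = @'
{
    "name": "TelemetrySqlDb",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "TelemetryKeyVault",
                    "type": "LinkedServiceReference"
                },
                "secretName": "telemetry-sql-connection-string"
            }
        }
    }
}
'@
Set-Content -Path ./TelemetrySqlDb.json -Value $definition

# Create (or update) the linked service in the factory from the JSON definition file.
Set-AzDataFactoryV2LinkedService -ResourceGroupName "telemetry-rg" `
    -DataFactoryName "telemetry-adf" `
    -Name "TelemetrySqlDb" `
    -DefinitionFile ./TelemetrySqlDb.json
```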

In a Logic Apps playbook, each action supports one or more authentication methods. By default, if not specified, the credentials are the credentials of the user who created the playbook.

Other options (depending on the selected action) are:

  • Service Principal / Managed Identity (Azure)
  • SQL Authentication or Windows Authentication (databases)
  • OAuth 2, SAML 2.0

Note 1: When using the HTTP action to retrieve or send data from within the playbook, you can use basic authentication, managed identity, Active Directory OAuth or a client certificate.

Note 2: Logic Apps provides a Key Vault connector which supports a “Get secret” action. Using the “Get secret” action, your workflow can retrieve the required secret in a secure way and use it to connect to your data sources.
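If you choose to authenticate the Key Vault connector (or an HTTP action against Key Vault) with the playbook’s managed identity, that identity needs “get” permission on secrets. A minimal sketch with the Az.KeyVault module, assuming the vault uses access policies (rather than Azure RBAC); the vault name, secret name and identity object id are hypothetical placeholders.

```powershell
# Sketch: allow a Logic App's system-assigned managed identity to read secrets from Key Vault.
# Vault name, secret name and the identity's object id are hypothetical placeholders; the object id
# can be copied from the Logic App's Identity blade in the portal.
$vaultName          = "telemetry-kv"
$logicAppIdentityId = "<object-id-of-the-playbook-managed-identity>"

# Grant only the 'get' permission on secrets, keeping the access as narrow as possible.
Set-AzKeyVaultAccessPolicy -VaultName $vaultName `
    -ObjectId $logicAppIdentityId `
    -PermissionsToSecrets get

# Quick sanity check from PowerShell (requires your own account to have 'get' as well).
Get-AzKeyVaultSecret -VaultName $vaultName -Name "telemetry-sql-connection-string"
```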

Deployment — Manual and continuous delivery using infrastructure as code

Both Logic Apps and ADF provide a great visual user experience. Creating a workflow by drag & drop is easy, quick, and leaves little room for mistakes.

All configuration management and workflow management are available through the Azure Portal. In addition, both tools provide visual monitoring and “debugging” tools to track execution history and understand failures.

ADF also supports optional use of a Git repository for history and CI/CD. Any time the user hits “Save”, the change is saved as JSON definition files in the collaboration branch, and when the user clicks “Publish”, the generated ARM templates are pushed to the publish branch. That way you can track change history and use GitHub Actions for ADF deployment.

Manual management through the portal is a great way to start, but if you have more than one environment you probably want to develop your pipelines in a development environment, test them in a staging one, and then deploy them to the production environment(s) as part of your CI/CD pipeline.

According to Wikipedia:

Infrastructure as code (IaC) is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools.

Azure offers IaC using ARM templates. Both Logic Apps and ADF fully support ARM templates, including deploying from an ARM template and exporting a resource to an ARM template. This is the recommended and safe way to deploy resources in Azure.
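As a sketch, deploying an exported ARM template to another environment could look like this with the Az.Resources module. The resource group, template and parameter file names are hypothetical; the parameter file is where per-environment values (connection details, Key Vault URLs and so on) would live.

```powershell
# Sketch: deploy a Logic App or ADF ARM template to a target environment.
# Resource group and file names are hypothetical placeholders.
New-AzResourceGroupDeployment -ResourceGroupName "telemetry-prod-rg" `
    -TemplateFile ./arm/telemetry-pipeline.template.json `
    -TemplateParameterFile ./arm/telemetry-pipeline.parameters.prod.json `
    -Mode Incremental
```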

Both tools also support working with PowerShell: Logic Apps through the Az.LogicApp module (New-AzureRmLogicApp in the older AzureRM module), and ADF through the Az.DataFactory module.
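For example, a sketch with hypothetical names and definition files (the JSON files here are assumed to contain a valid workflow definition and pipeline definition, respectively):

```powershell
# Sketch: create a Logic App playbook and update an ADF pipeline from JSON definition files.
# All names and file paths below are hypothetical placeholders.
New-AzLogicApp -ResourceGroupName "telemetry-rg" `
    -Name "telemetry-playbook" `
    -Location "westeurope" `
    -DefinitionFilePath ./logicapp/telemetry-playbook.definition.json

Set-AzDataFactoryV2Pipeline -ResourceGroupName "telemetry-rg" `
    -DataFactoryName "telemetry-adf" `
    -Name "CopyTelemetryDaily" `
    -DefinitionFile ./adf/CopyTelemetryDaily.json
```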

I would recommend starting with the Portal, and after you get the result you want, moving on to work with ARM templates.

Summary and Recommendations:

In this article, we found that both solutions provide state-of-the-art abilities to secure the workflow and the access to data sources, with methods like basic authentication, role-based access control (RBAC), managed identity, service principals, OAuth, and SAML. In addition, ADF allows storing the secrets in Azure Key Vault, which is the recommended way to store secrets in Azure, and Logic Apps supports a Key Vault connector that can be used for the same purpose.

Both solutions can be deployed through ARM templates (an Infrastructure as Code setup) and through PowerShell scripts, but they also provide a great UX through the Azure Portal.

ADF resources can be persisted in a Git repository to keep version history.

In the “Monitoring and error handling” section we reviewed each solution’s monitoring, debugging and rerun abilities. Both solutions answer those requirements well, but IMO ADF has better dashboards and alerting mechanisms to notify when errors take place. ADF also supports rerunning for a specific time, which is sometimes required.

In the Part 1 summary we concluded that:

Prefer to use Logic Apps when:

  • The trigger for the playbook or pipeline is event based (e.g. a new email arrives, a new event is sent to a queue, and so on).
  • Your pipeline requires integration with many 3rd party applications. It’s more likely that those 3rd party applications have connectors for Logic Apps.

Prefer to use ADF when:

  • Your pipeline needs to copy & transform large amounts of data.
  • Your pipeline includes a long list of actions that should be done. ADF provides a better way to track and define the relations between the different actions.
  • Your pipeline requires integration with machine learning abilities.

For our use case, telemetry and usage data, we would prefer ADF due to the large amount of data that needs to be processed. The Azure Integration Runtime provides compute on demand to handle that large amount of data, and ADF allows us to control the data read/write batches in an easy way.

ADF was designed to solve ETL problems, and as such it is better suited to the task.
