Re-platforming from Hadoop-Based Systems into Azure, Part 1

Lackshu Balasubramaniam
5 min read · Aug 8, 2021


Introduction

I’m penning some thoughts on migrating from Hadoop to Azure after a recent project where I helped design and implement the frameworks for re-platforming an Impala on Cloudera Data Platform workload into Azure. This is to share my learnings and thoughts around lift-and-shift approaches.

High-Level Architecture of the Azure Target Platform

In the example above, we have various components that could be implemented in a target platform. These are not recommendations and are for illustrative purposes only; the components will differ based on the needs of the business and the project.

Some examples of different approaches could be:

  • the pipeline could be implemented end-to-end on Azure Databricks (ADB).
  • an event hub or another queueing mechanism could be in place to pass events across layers.
  • the data serving layer could be a Lakehouse, a hub-and-spoke DW, or a Synapse DW + Analysis Services approach rather than purely a Synapse DW.

Considerations are as follows:

  • Metadata Driven Approach vs Custom Approach
  • Orchestration
  • Logging
  • Storage
  • Ingestion
  • Transforms including SQL Language Conversions
  • Data Volume
  • Data Delivery
  • Testing the Platform

I’ll cover these considerations as a multi-part series, starting with this article.

Metadata Driven Data Engineering

I tend to prefer a metadata-driven approach because it allows for quicker delivery.

The clients I have worked with tend to use boilerplate approaches in their ETL/ELT processes, and it’s possible, even recommended, to leverage configurations from past data engineering processes.

Custom code does not scale well as the number of source systems and source entities increases. There would also be boilerplate code at various levels of maturity implemented across different sources and entities, which is not ideal.

The key elements of metadata-driven configuration are (a sketch follows the list):

  • modeling the source and target systems.
  • modeling the source and target entities at most layers, if not all.
  • the transformations required at the different layers, which drive the data engineering processes.
  • the schema of entities at the different layers, which allows for validation and schema-on-write at the refined layer.
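
To make this concrete, here is a minimal sketch of what the metadata for a single source entity might look like. All of the names (source system, paths, columns, transformation types) are hypothetical placeholders; in practice this would live in configuration tables or files rather than inline code.

```python
# A minimal sketch of metadata-driven configuration for one source entity.
# All names (source system, paths, columns) are hypothetical placeholders.
entity_config = {
    "source_system": "sales_crm",              # registered source system
    "source_entity": "dbo.customer",           # table/view to pull
    "load_type": "incremental",                # full | incremental
    "watermark_column": "last_modified_ts",    # drives incremental pulls
    "raw_path": "raw/sales_crm/customer/",     # landing location in the lake
    "structured_path": "structured/sales_crm/customer/",
    "schema": [                                # enables validation / schema-on-write
        {"name": "customer_id", "type": "int", "nullable": False},
        {"name": "customer_name", "type": "string", "nullable": True},
        {"name": "last_modified_ts", "type": "timestamp", "nullable": False},
    ],
    "transformations": [                       # applied when moving raw -> structured
        {"type": "deduplicate", "keys": ["customer_id"]},
        {"type": "cast_types"},
    ],
}
```

A generic ingestion or transformation process would read this configuration and behave accordingly, rather than hard-coding the logic for each entity.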

Orchestration

I feel Azure Data Factory (ADF) pipelines serve well as an end-to-end orchestration layer.

ADF has a rich set of control flow activities and excellent integration with Azure Databricks. This allows for passing parameters from ADF into ADB and passing events back into ADF from ADB. If Synapse is the target data repository, we can also integrate with Synapse via ADF or ADB.
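
On the Databricks side, the hand-off is typically a notebook that reads the parameters ADF passes in and returns a payload ADF can branch on. A minimal sketch is below; it assumes it runs inside a Databricks notebook invoked by an ADF Databricks Notebook activity, and the widget names and return payload are hypothetical.

```python
# Runs inside a Databricks notebook invoked by an ADF Databricks Notebook activity.
# dbutils is available in the notebook context; widget names are hypothetical.
import json

source_system = dbutils.widgets.get("source_system")   # parameters passed from ADF
source_entity = dbutils.widgets.get("source_entity")
run_id = dbutils.widgets.get("run_id")

# ... ingestion / transformation logic for the entity would go here ...

# Return a JSON payload to ADF; it surfaces as runOutput in the activity output,
# so downstream ADF activities can branch on it.
dbutils.notebook.exit(json.dumps({
    "status": "succeeded",
    "entity": f"{source_system}.{source_entity}",
    "run_id": run_id,
    "rows_written": 0,  # placeholder
}))
```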

Paired with a metadata-driven approach, the orchestration pipelines can be generalized to handle different sources and scenarios across layers.

Triggering of pipelines can be schedule-based or event-driven. The approaches are not mutually exclusive and can be used in combination within the platform.

Schedule Driven

For database sources, schedule-based pulls work well because we generally know when the source entities will be ready for pickup, or when the source’s quiet times are.

There are also opportunities to set up tumbling window triggers when there are dependencies between pipelines, or self-dependencies.

There are exceptions where we could poll a database (DB) for the readiness of tables/views before pulling the data, e.g. after snapshots. However, in this case the signal would be better delivered via a queue rather than by polling the DB regularly over a period of time before timing out and raising alerts.
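
If polling is unavoidable, the check is usually a loop against a readiness flag with a timeout, raising an alert when the deadline passes. A rough sketch, assuming a hypothetical dbo.snapshot_status view on the source DB and pyodbc connectivity:

```python
import time
import pyodbc

POLL_INTERVAL_SECS = 300      # check every 5 minutes
TIMEOUT_SECS = 4 * 60 * 60    # give up after 4 hours

def wait_for_snapshot(conn_str: str, snapshot_date: str) -> bool:
    """Poll a hypothetical readiness view until the snapshot is flagged complete."""
    deadline = time.time() + TIMEOUT_SECS
    query = "SELECT is_ready FROM dbo.snapshot_status WHERE snapshot_date = ?"
    while time.time() < deadline:
        with pyodbc.connect(conn_str) as conn:
            row = conn.cursor().execute(query, snapshot_date).fetchone()
        if row and row.is_ready:
            return True
        time.sleep(POLL_INTERVAL_SECS)
    # Timed out -- the caller should raise an alert to the operations team.
    return False
```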

Event Driven

ADF can also be triggered when files drop into a Blob Storage or ADLS location, i.e. a storage event trigger. This proves useful when we want to trigger a master pipeline as files land. There are also options to build custom triggers, which might be worth exploring.

Queue Driven

This is a publish-subscribe pattern: as an entity becomes ready for processing, the source or event-triggered pipeline drops an event into a queue.

ADF can then run a master pipeline at regular intervals, which retrieves events from the queue and kicks off child pipelines based on each event.

Scenarios where events are dropped into a queue before they are processed are ideal for:

  • reducing direct dependencies between pipelines.
  • maintaining predictable capacity as we process entities.

The queue could be Event Hub based or any other component that’s amenable to maintaining queues reliably.
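
As a sketch of the producer side, the pipeline that detects an entity is ready could publish an event along the following lines using the azure-eventhub SDK; the connection string, hub name, and payload shape are hypothetical, and the master pipeline would consume these events on its own schedule.

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

def publish_entity_ready(conn_str: str, event_hub_name: str, entity: str, path: str) -> None:
    """Publish an 'entity ready' event for the master pipeline to pick up."""
    payload = {"event_type": "entity_ready", "entity": entity, "path": path}
    producer = EventHubProducerClient.from_connection_string(
        conn_str=conn_str, eventhub_name=event_hub_name
    )
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(payload)))
        producer.send_batch(batch)
```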

Dependency Driven

The orchestration can get complex further downstream. Generally speaking, we need multiple entities to land in the structured zone before we can start building an entity in the refined zone. Thus the process to enrich the data and perform calculations/aggregations would kick off based on dependencies. Think of this stage as building the dimension or fact tables.

Thus we need to maintain a register of items that have arrived for the day via logging, and the process can be kicked off once the prerequisites have been fulfilled.
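
A simplified sketch of that prerequisite check, assuming a hypothetical etl.entity_load_log table of arrivals and a metadata-defined set of prerequisites for the refined entity:

```python
import pyodbc

def prerequisites_met(conn_str: str, load_date: str, required_entities: set) -> bool:
    """Return True when every prerequisite entity has landed in the structured zone."""
    query = (
        "SELECT entity_name FROM etl.entity_load_log "
        "WHERE load_date = ? AND zone = 'structured' AND status = 'succeeded'"
    )
    with pyodbc.connect(conn_str) as conn:
        arrived = {row.entity_name for row in conn.cursor().execute(query, load_date)}
    return required_entities.issubset(arrived)

# Example: only build the refined customer dimension once its sources have landed.
# if prerequisites_met(conn_str, "2021-08-08", {"sales_crm.customer", "billing.account"}):
#     ...kick off the refined build for dim_customer...
```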

Logging

The purpose of logging is two-fold: we need to be able to troubleshoot as processes break down, and we need to orchestrate based on events that have occurred.

Monitoring and Alerting

Ideally there should be an enterprise logging capability we can leverage, so that exceptions raised as issues occur flow into alerts to the operations team. If it’s REST API based, it is easier to integrate with from ADF. The ADF pipeline can then log success or failure as key stages of the process complete for each entity.
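
From ADF this would typically be a Web activity; from a notebook it could be as simple as the sketch below, which posts a status record to a hypothetical enterprise logging REST endpoint.

```python
import requests

def log_pipeline_status(endpoint: str, run_id: str, entity: str, stage: str, status: str) -> None:
    """Post a status event to a hypothetical enterprise logging REST endpoint."""
    payload = {
        "run_id": run_id,
        "entity": entity,
        "stage": stage,        # e.g. 'raw_to_structured'
        "status": status,      # 'succeeded' | 'failed'
    }
    response = requests.post(endpoint, json=payload, timeout=30)
    response.raise_for_status()
```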

Azure Monitor and Azure Log Analytics are one possible capability to leverage if an enterprise capability does not exist. There are also options to monitor telemetry from various other components.

Orchestration Logging

There is an option to log the events we need for orchestration purposes in Azure SQL DB or Event Hub. I tend to prefer Azure SQL DB as it’s easier to work with, and most developers would find it easier to troubleshoot issues via log tables. I feel Event Hub works better at the ingestion stage.

The log tables would track entity movement across zones in the Data Lake (including retries) and into Synapse. This allows the data engineering framework to move entities through the various stages of processing.
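
A minimal sketch of writing such a tracking record, reusing the hypothetical etl.entity_load_log table from the earlier dependency check:

```python
import pyodbc

def log_entity_movement(conn_str: str, load_date: str, entity: str, zone: str,
                        status: str, attempt: int) -> None:
    """Record an entity's movement into a zone (raw, structured, refined, synapse)."""
    insert = (
        "INSERT INTO etl.entity_load_log "
        "(load_date, entity_name, zone, status, attempt, logged_at) "
        "VALUES (?, ?, ?, ?, ?, SYSUTCDATETIME())"
    )
    with pyodbc.connect(conn_str) as conn:
        conn.cursor().execute(insert, load_date, entity, zone, status, attempt)
        # The connection context manager commits on successful exit.
```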

The logs for the structured-to-refined stage would be especially important, as this is where various entities converge into one entity. Essentially, the refined zone is driven by dependencies to build each refined entity.

Next section in Part 2.


Lackshu Balasubramaniam

I’m a data engineering bloke who’s into books. I primarily work on Azure and Databricks. My reading interest is mostly around psychology and economics.