Re-platforming from Hadoop-based Systems into Azure: Part 2

Lackshu Balasubramaniam
5 min read · Aug 13, 2021


A photo taken from Great Ocean Road by Tim Hart

Introduction

This article is a follow-up to my first article on re-platforming from Hadoop to Databricks on Azure. In this article I’ll cover the following considerations for the data platform:

  • Storage
  • Ingestion
  • Transforms including SQL Language Conversions

Storage

Storage is a central piece of the puzzle and some key decisions need to be made. One set of decisions is about zones within ADLS that store entity data at different stages of the transformations. These decisions are also driven by usage patterns.

Secondly, some thought needs to go into the data lake taxonomy to define how folders and files are laid out in each zone. The design decisions around taxonomy have implications further down the road, so careful consideration is required around ingestion, transforms and usage patterns.

There is more on this in an article I wrote some time back on Design Patterns for Data Lakes. As a starting point, I generally feel ADLS should be laid out into the following containers as zones:

  • Pre-Parse
  • Raw
  • Structured
  • Refined and Trusted

In Databricks parlance, Raw would be Bronze, Structured would be Silver, and Refined would be Gold.
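As a purely illustrative, hypothetical example (not a prescribed standard), the taxonomy within those containers could encode source system, entity and load date along these lines:

```
pre-parse/vendor_feed/2021/08/13/feed_001.xml     # files needing pre-processing
raw/salesdb/customer/2021/08/13/customer_001.csv  # Bronze: data as received
structured/salesdb/customer/                      # Silver: cleansed, Delta format
refined/sales_mart/dim_customer/                  # Gold: consumption-ready, Delta format
```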

Another recommendation I have is to leverage the Databricks Delta format in the Structured zone and downstream. Parquet format should only be used when required, i.e. for compatibility with external systems which consume the data.

In my opinion, the three key benefits of leveraging Delta Lake are:

  • ACID Transactions
  • Performance
  • Time travel

The time travel capability of Delta is worth exploring as it allows for easier rollback and is a big time saver.
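As a minimal sketch, assuming a hypothetical Delta table path in the Structured zone and placeholder data, an earlier version of a table can be read back or restored as follows:

```python
# Minimal Delta time-travel sketch; the table path and sample data are hypothetical.
from delta.tables import DeltaTable

structured_path = "abfss://structured@mydatalake.dfs.core.windows.net/salesdb/customer"

# Placeholder batch standing in for a real Raw-to-Structured load
df = spark.createDataFrame([(1, "Acme")], ["customer_id", "name"])
df.write.format("delta").mode("overwrite").save(structured_path)

# Read the table as it was at an earlier version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(structured_path)

# Or roll the table back to that version (available in recent Delta Lake / Databricks runtimes)
DeltaTable.forPath(spark, structured_path).restoreToVersion(0)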

Ingestion

There is a question of whether we should ingest data via Azure Data Factory (ADF) pipelines or Azure Databricks (ADB) pipelines. I tend to prefer ADF for cases where connectors are available. The reasons are:

  • repeatable patterns driven by metadata
  • ability to generalize
  • visual pipelines that are more navigable

Since ingestion pipelines can be generalized and simplified via a metadata-driven approach, you would end up with master pipelines to handle generalities and worker pipelines to deal with specific cases.
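As a purely hypothetical sketch (the names and fields are illustrative, not a standard), the metadata driving a master pipeline could be a set of entries like the ones below, surfaced to ADF via a Lookup activity and iterated with a ForEach activity that calls the appropriate worker pipeline:

```python
# Hypothetical ingestion metadata; in practice this could live in a control
# table or JSON file that an ADF Lookup activity reads and a ForEach iterates over.
ingestion_metadata = [
    {
        "source_system": "salesdb",        # hypothetical source
        "entity": "customer",
        "source_type": "sql_server",
        "load_type": "incremental",
        "watermark_column": "modified_date",
        "target_container": "raw",
    },
    {
        "source_system": "crm_api",
        "entity": "contacts",
        "source_type": "rest",
        "load_type": "full",
        "target_container": "raw",
    },
]
```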

It is simpler to edit and manage ADF pipelines as opposed to a hand-coded approach. Azure Data Factory also tends to do well with databases and files. There is some work required when hitting web interfaces, e.g. REST APIs, but this can be mitigated via a metadata-driven approach.

On the flip side, ADF doesn’t do well with large and/or complex XML files. There could also be scenarios requiring custom interactions where ADF falls short. There are options to go down the path of Azure Function activities, Webhook activities or custom activities for these cases. However, ADB would work better in these instances as it’s free-form.

Ingestion-wise, Databricks is more versatile and has a rich set of JDBC connectors. The disadvantage is the amount of coding required to accomplish some tasks. For example, some JDBC libraries only return a Java ResultSet when queries are run, which then has to be converted to a DataFrame.
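Where a Spark JDBC driver is available the coding is still fairly modest; a minimal sketch of a JDBC ingestion in a Databricks notebook, with hypothetical connection details, secret scope and paths:

```python
# Minimal JDBC ingestion sketch; URL, secret scope, table and paths are hypothetical.
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=salesdb"

raw_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.customer")
    .option("user", dbutils.secrets.get("ingest-scope", "sql-user"))
    .option("password", dbutils.secrets.get("ingest-scope", "sql-password"))
    .load()
)

raw_df.write.format("delta").mode("append").save(
    "abfss://raw@mydatalake.dfs.core.windows.net/salesdb/customer"
)
```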

One other consideration is whether the dataset being ingested is very large and has significant skew. In this case, Databricks could be considered as an option for ingestion as it has more parameters to tweak to deal with skew. This could be done on an exception basis.
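Those knobs include the JDBC reader’s partitioning options; a hedged sketch building on the connection above, assuming a roughly uniform numeric key to partition on (the bounds and partition count would need tuning per source):

```python
# Parallel JDBC read sketch; bounds and partition counts are illustrative only.
customer_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)                    # from the previous sketch
    .option("dbtable", "dbo.customer")
    .option("partitionColumn", "customer_id")   # assumes a roughly uniform numeric key
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "16")
    .option("fetchsize", "10000")
    .load()
)
```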

All in all it’s a balancing act to figure out what works best given the different scenarios in the project.

Streaming Data

When it comes to streaming data, I tend to fall back to Azure Databricks (ADB) based ingestion when Databricks is the transformation engine. It’s more natural to work with streamed data end-to-end within Databricks, with ADF playing a purely orchestration role. The exception would be when Stream Analytics comes into the picture.
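A minimal sketch, assuming a Kafka-compatible source (for example Event Hubs’ Kafka endpoint) with hypothetical broker, topic and paths; authentication options are omitted for brevity:

```python
# Structured Streaming sketch; broker, topic and paths are hypothetical, and
# authentication options (e.g. SASL for Event Hubs' Kafka endpoint) are omitted.
stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "telemetry")
    .option("startingOffsets", "latest")
    .load()
)

(
    stream_df.selectExpr("CAST(value AS STRING) AS body", "timestamp")
    .writeStream.format("delta")
    .option("checkpointLocation", "abfss://raw@mydatalake.dfs.core.windows.net/_checkpoints/telemetry")
    .start("abfss://raw@mydatalake.dfs.core.windows.net/telemetry")
)
```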

Pre-parsing

There are also other exception cases where Azure Databricks (ADB) performs ingestion rather than ADF. This is usually when we need to leverage ADB for the following reasons:

  • strip content from or pre-process flat files
  • pre-process complex XML files
  • when the service being connected to is a legacy system which ADF doesn’t have connectors for.
  • when we need to interact with a service which requires a handshake process while authenticating that ADF doesn’t support.

Note: An example of pre-parsing is when we receive large XML files and need to pull out only the relevant details. It could also be cases like stripping out non-printable characters from flat files or handling data formats that are non-standard.
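As a minimal sketch of the flat-file case, stripping non-printable characters in ADB before landing the files in the Raw zone could look like this (the paths are hypothetical, and the character class would need adjusting if legitimate extended characters are expected):

```python
# Pre-parsing sketch; paths are hypothetical.
from pyspark.sql import functions as F

preparse_path = "abfss://pre-parse@mydatalake.dfs.core.windows.net/vendor_feed/2021/08/13/"
raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/vendor_feed/2021/08/13/"

lines_df = spark.read.text(preparse_path)

# Remove anything outside the printable ASCII range (plus tabs)
cleaned_df = lines_df.withColumn(
    "value", F.regexp_replace("value", r"[^\x20-\x7E\t]", "")
)

cleaned_df.write.mode("overwrite").text(raw_path)
```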

Transforms

For heavy lifting, which most transforms tend to be, Databricks is my preferred core engine. While ADF has Data Flow as a possible option, I gravitate towards ADB as it’s a mature and proven platform. Azure Synapse Analytics is another option that could work for scenarios where the client is geared towards a pure Microsoft architecture.

ADB can also retrieve metadata and be driven by it when making decisions around transforming data from the Raw to the Structured zone. It does get more complex in terms of defining metadata for transformations from Raw to Structured and downstream. It would be beneficial to carry over any pre-existing configurations from the existing platform into the metadata, as they would have stood the test of time.
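As a hedged sketch, that metadata could sit in a small control table (hypothetical here, as are the column names) which the notebook looks up for the entity being processed:

```python
# Hypothetical control-table lookup driving the Raw-to-Structured transform.
from pyspark.sql import functions as F

entity_name = dbutils.widgets.get("entity")   # parameter passed in by the master notebook

config = (
    spark.table("control.transform_metadata")  # hypothetical control table
    .where(F.col("entity") == entity_name)
    .first()
)

load_mode = config["load_mode"]           # e.g. incremental / full / filtered_subset
key_column = config["key_column"]         # e.g. customer_id
filter_clause = config["filter_clause"]   # only used for filtered subsets
```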

Raw to Structured Load Modes

For Raw to Structured loads, there would be three general classes of transformations based on the data being ingested for the entities. The ADB worker notebooks would generally be divided along these lines, with a master notebook doing a metadata lookup to decide which process applies to the entity being processed (a sketch of these load modes follows the list below).

  • Incremental
  • Full
  • Filtered Subset
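Continuing the hypothetical metadata lookup above, a minimal sketch of the three load modes against a Delta table in the Structured zone; the target path and incoming_df (the prepared batch from the Raw zone) are assumptions:

```python
# Load-mode sketch; target_path and incoming_df are hypothetical, and load_mode,
# key_column and filter_clause come from the metadata lookup sketched earlier.
from delta.tables import DeltaTable

target_path = "abfss://structured@mydatalake.dfs.core.windows.net/salesdb/customer"

if load_mode == "incremental":
    # Upsert the incoming batch on the business key from metadata
    (
        DeltaTable.forPath(spark, target_path).alias("t")
        .merge(incoming_df.alias("s"), f"t.{key_column} = s.{key_column}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
elif load_mode == "full":
    incoming_df.write.format("delta").mode("overwrite").save(target_path)
elif load_mode == "filtered_subset":
    # Replace only the slice of the table covered by the metadata filter
    (
        incoming_df.where(filter_clause)
        .write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", filter_clause)
        .save(target_path)
    )
```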

Some entities could require special handling, and this can be accommodated in the generalized notebooks. However, care should be taken to ensure the code doesn’t get cluttered and the logic flow doesn’t get broken as a result.

Structured to Refined Transforms

Structured to Refined transforms would generally be more free-form unless the processes can be templatized. The structure of the notebooks could be templatized based on orchestration patterns; however, unless the refined tables follow a set pattern, it would be fairly difficult to model the transformations for entities as metadata-driven transforms.
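Even when the transforms themselves stay free-form, the orchestration wrapper can be templatized; a hedged sketch of a master notebook invoking worker notebooks, with hypothetical notebook paths and entity names:

```python
# Orchestration sketch; notebook paths, entity list and parameters are hypothetical.
refined_entities = ["dim_customer", "fact_sales"]

for entity in refined_entities:
    # One worker notebook per refined entity, run with a shared parameter set
    result = dbutils.notebook.run(
        f"/Transforms/Refined/{entity}", 3600, {"run_date": "2021-08-13"}
    )
    print(f"{entity}: {result}")
```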

SQL Language Conversions

Structured to Refined transforms tend to be Spark SQL heavy, as lift-and-shift projects bring across existing SQL code and attempt to convert it to Spark SQL.

While Spark SQL has largely the same syntax as HiveQL, or for that matter Impala SQL, there are differences in how functions and some commands work; in some cases equivalent functions or syntax won’t exist in Spark SQL. There are also some differences in time/date functions which can get nuanced. It would be good to flesh this out early and document the differences for developers to reference.
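As an illustrative (and deliberately hedged) sketch of the kind of mapping worth documenting, a timestamp-truncation expression written for Impala or Hive may need rewriting for Spark SQL, where date_trunc takes the unit as its first argument; each mapping should be verified against the actual source dialect and Spark version in use:

```python
# Illustrative conversion sketch; the source-dialect form in the comment is an example
# to verify against the actual legacy code base, not an exhaustive mapping.

# Impala-style:  SELECT trunc(order_ts, 'DD'), dayofweek(order_ts) FROM orders
converted_df = spark.sql("""
    SELECT
        date_trunc('DAY', order_ts)          AS order_day,
        dayofweek(order_ts)                  AS order_dow,
        date_format(order_ts, 'yyyy-MM-dd')  AS order_date_str
    FROM structured.orders                    -- hypothetical table
""")
```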

The next set of considerations will follow in Part 3.

Areas to be covered are:

  • Data Volume
  • Data Delivery
  • Testing the Platform
