Breaking (for good) a data pipeline: Part 1

Bharani
4 min read · Jul 18, 2024


Introduction:

One of the projects I am working on is a typical ELT, Lakehouse-architecture workload built on Azure. The focus of this blog is the 3rd-party API provider (highlighted in yellow in the architecture diagram below) and how it adds complexity that forces us to split the data pipeline into two parts.

High-level architecture

Situation:

One of the enrichment steps in this project uses a 3rd-party API to add data to the transformed records and enrich them for business use. At present this API is transactional, meaning it is called once for each row of transformed data.

The API was built for a different purpose, but considering various business and budget constraints we are repurposing it, calling it asynchronously from Python code.
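For context, here is a minimal sketch of what those asynchronous per-row calls could look like. The endpoint, credential, and `enrich_rows` helper are hypothetical, not the actual provider's API:

```python
import asyncio
import aiohttp

API_URL = "https://api.example.com/enrich"   # hypothetical endpoint
API_KEY = "<api-key>"                        # hypothetical credential

async def enrich_row(session: aiohttp.ClientSession, row: dict) -> dict:
    # One API call per row of transformed data
    async with session.post(API_URL, json=row,
                            headers={"Authorization": f"Bearer {API_KEY}"}) as resp:
        resp.raise_for_status()
        return await resp.json()

async def enrich_rows(rows: list[dict]) -> list[dict]:
    # Fire the per-row calls concurrently instead of one after another
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(enrich_row(session, r) for r in rows))

# enriched = asyncio.run(enrich_rows(transformed_rows))
```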

Our team asked the provider for a batch API that would take all the transformed input data and return the enriched data at once, rather than requiring a call per record, to save on performance, cost, and processing time.

The ask was answered and we were given a batch API that accepts records in batches of 1,000 rows per call (if a given ingestion run has more rows than that, say 5k, we have to split the rows and call in batches). Once we call this batch API, we get back a batch ID with which we can retrieve the results later, once the batch API has finished processing all the batches for a given ingestion pipeline run.
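A rough sketch of the submission side is below; the `/batch` endpoint, payload shape, and response field names are assumptions for illustration, not the provider's actual contract:

```python
import requests

BATCH_URL = "https://api.example.com/enrich/batch"   # hypothetical endpoint
BATCH_SIZE = 1000                                     # provider's limit per call

def submit_in_batches(rows: list[dict]) -> list[str]:
    """Split the ingestion run into 1,000-row batches and submit each one."""
    batch_ids = []
    for start in range(0, len(rows), BATCH_SIZE):
        chunk = rows[start:start + BATCH_SIZE]
        resp = requests.post(BATCH_URL, json={"records": chunk}, timeout=60)
        resp.raise_for_status()
        # The provider returns a batch ID we will need later to fetch results
        batch_ids.append(resp.json()["batch_id"])
    return batch_ids
```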

Why break the pipeline into two:

When the API was transactional, despite its demerits, we got instant results for each row and the data pipeline ran to completion in one stretch. With the batch API, the challenges are:

  1. API requests are queued by the API processor and run in sequence
  2. Once we invoke a batch, we get back a batch ID that has to be recorded in a store so we can identify the ingestion pipeline run later, when the batch results are ready
  3. When the results are ready, we are notified (with the batch ID) through a custom webhook service that we have to build and expose to the API provider.
Working of the batch API request/response flow

4. This custom webhook is implemented using Azure Logic Apps (Standard).

5. It has a 'When a HTTP request is received' trigger which, when invoked, executes the subsequent actions in the flow.

6. By default, when you create this logic app, you get a URL once you save the workflow. This URL has a 'sig' query parameter and an associated access key, which is like a SAS token but not encrypted. A logic app created this way can be called anonymously, without authentication; only the 'sig' token and the name of the workflow have to match. This is not secure, so the custom webhook logic apps need proper authentication (which is what we discuss in detail in part 2 of this blog).
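To make the concern concrete: anyone holding the callback URL can invoke the workflow, because the 'sig' value embedded in the URL is the only check. A hypothetical example (the URL and payload shape are made up):

```python
import requests

# Hypothetical Logic Apps (Standard) callback URL as generated on save.
# The 'sig' query parameter is the only thing protecting the endpoint.
WEBHOOK_URL = (
    "https://my-logicapp.azurewebsites.net/api/batch-callback/triggers/manual/"
    "invoke?api-version=2022-05-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=<sig-value>"
)

# No Authorization header required: the call succeeds as long as the sig matches.
resp = requests.post(WEBHOOK_URL, json={"batch_id": "abc-123", "status": "completed"})
print(resp.status_code)
```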

Because of this batch API, our data pipeline no longer runs from start to end in one stretch; we have to break it into two parts. The first part runs up to the stage where we invoke the batch API to do the required enrichment. After this call, we record the pipeline run in a data store and suspend / stop execution of the data pipeline.
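A minimal sketch of what the end of part 1 could look like in a Databricks notebook, assuming the run state is tracked in a Delta table; the table name and columns are illustrative, not the project's actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def suspend_pipeline(pipeline_run_id: str, batch_ids: list[str]) -> None:
    """Record the pending batches so part 2 can pick the run up later, then stop."""
    state = [(pipeline_run_id, b, "SUBMITTED") for b in batch_ids]
    (spark.createDataFrame(state, ["pipeline_run_id", "batch_id", "status"])
         .write.mode("append")
         .saveAsTable("ops.pipeline_run_state"))   # illustrative table name
    # Part 1 ends here; nothing else runs until the webhook fires.
```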

Once the batch API results are ready, the API provider notifies us (as in the picture above) through our custom webhook (built using Logic Apps), and this logic app has an action to run a Databricks notebook as a job, resuming the remaining steps (the 2nd part of the data pipeline) to complete the full ingestion pipeline run (identified by a unique batch ID and pipeline run ID).
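One way to picture what that Logic App action does is a call to the Databricks Jobs 'run-now' endpoint, passing the identifiers along as notebook parameters. The workspace URL, job ID, and parameter names below are placeholders, and it is shown in Python only for readability:

```python
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
DATABRICKS_TOKEN = "<pat-or-aad-token>"                                # placeholder

def resume_part_two(job_id: int, batch_id: str, pipeline_run_id: str) -> int:
    """Trigger the notebook job that runs the second half of the pipeline."""
    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "job_id": job_id,
            "notebook_params": {
                "batch_id": batch_id,
                "pipeline_run_id": pipeline_run_id,
            },
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]   # run ID of the newly started job run
```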

Coming up:

In the next part of this blog series, we will see how to introduce authentication for this logic app, which gave me a truckload of headaches as there were not many articles available on the internet.

Securing a logic app that uses the HTTP action/trigger is easy, as it accepts headers as part of the API request, with which we can use either basic auth (username, password) or Azure AD based auth. But for the 'When a HTTP request is received' trigger, it is not that straightforward.
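For comparison, this is what header-based auth looks like from the caller's side against an endpoint that accepts it (the endpoint URL is a placeholder); part 2 looks at how to get the request trigger to enforce something equivalent:

```python
import requests

# Placeholder endpoint; the caller simply supplies credentials in the request headers.
resp = requests.post(
    "https://example.com/protected-endpoint",
    json={"batch_id": "abc-123"},
    auth=("svc-user", "s3cret"),   # basic auth (username, password)
    # or: headers={"Authorization": f"Bearer {aad_token}"} for Azure AD based auth
)
resp.raise_for_status()
```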
