Data Lake Workloads Synchronization Using Google PubSub

Keerthipriyan
Walmart Global Tech Blog
5 min read · Apr 7, 2022

This blog was co-authored by Sridhar Leekkala & Rohit Bhogaraju

Relay race depicting workload dependencies

As data engineers at Walmart Global Tech, we run multiple workloads in our environments at a large scale, processing more data than ever before. Like in a relay race, we watch over numerous source team workloads and wait for their completion. Once the baton is handed over to us, we initiate our workloads, then hand over to the next team whose work is dependent on our data. There is always a race to meet the Service Level Agreement (SLA), and these workload interdependencies play a crucial role in data integrity and the timely delivery of data to the business. We will discuss common patterns used to manage these interdependencies and how our workflow synchronization framework enables us to address some key challenges.

Expectations for the workload synchronization

These days, data is treated as a product: it should be available at the right place at the right time, and there is a huge shift towards self-serve data infrastructure, connecting data lakes through a data mesh. With these distributed architectures, workloads are more connected than ever. Workloads need richer information about data availability than an acknowledgement based on a touch file. Moreover, creating, watching and managing touch files specific to a workload is challenging at scale.

Workload interdependency patterns

Another common practice is time-based scheduling. Most of the time, we see workloads complete well ahead of schedule; if the dependent workloads are notified as soon as the parent workloads finish, we can deliver business-critical reports ahead of time. Sometimes a data refresh is communicated by email, and the dependent teams follow up on those emails before initiating their workloads.

Applications and BI tools demand push-based notification to initiate data refresh rather than polling throughout the day. At the same time, we need to be able to store the synchronization data so we can run analytics to optimize the workloads.

To address the above requirements, we built an event-based messaging framework wrapped around Google Pub/Sub to synchronize the workloads irrespective of the schedulers or applications on which they run. The following sections detail the framework.

Workflow synchronization framework

The workflow synchronization framework is built on the idea of abstracting the messaging middleware and enabling producers and subscribers on different platforms to synchronize their workflows. It provides the flexibility to onboard projects managed by the message producer teams and to create topics. For example, Project A and Project B can be owned by different teams, and each of them can create their own topics, manage subscriptions and define access controls. The following paragraphs provide more details on each component of the framework.

Workflow synchronization framework

Message producers

Message producers publish messages to a topic through the REST API. An API proxy redirects the messages to the relevant topics created in different projects. Producers can self-onboard by registering in a service registry; each is assigned a unique consumer id, which must be passed in the headers of the REST API calls. In addition, the service registry provides capabilities to attach policies on throughput, latency, etc. for incoming requests.
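As a rough sketch of the producer side, the snippet below assembles the kind of REST call described above. The proxy URL, path and header name are hypothetical; the post does not specify the framework's actual endpoint or header conventions, only that a consumer id must accompany each request.

```python
import json
from urllib.request import Request  # stdlib only; we build the request without sending it

# Hypothetical endpoint -- the real API proxy URL and topic path are
# internal to the framework and not given in the post.
API_PROXY = "https://events.example.com/v1/topics/sales_dl_rpt/publish"

def build_publish_request(consumer_id: str, payload: dict) -> Request:
    """Assemble the REST call a producer would make through the API proxy."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        API_PROXY,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Header name is an assumption; the consumer id is assigned
            # when the producer self-onboards in the service registry.
            "X-Consumer-Id": consumer_id,
        },
        method="POST",
    )

req = build_publish_request("team-a-1234", {"status": "completed"})
```

Building the request as an object (rather than firing it immediately) keeps the sketch self-contained; a real producer would hand `req` to `urllib.request.urlopen` or an HTTP client of choice.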

The request then passes through a publisher abstractor, which intakes the request, validates it, and posts the message to the topic through the Pub/Sub API. Once the message is posted to the topic, the API returns a message id as an acknowledgement, and the messaging framework takes care of delivering the messages to the subscribers.
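The validate-then-acknowledge flow can be sketched as below. The required-field list and class shape are assumptions for illustration; the stand-in publish callable fabricates a message id the way the real Pub/Sub API acknowledges a publish.

```python
import json
import uuid

# Assumed minimal contract; the post does not enumerate required fields.
REQUIRED_DATA_FIELDS = {"schemaName", "tableName", "status"}

class PublisherAbstractor:
    """Sketch of the abstraction layer: validate, then hand off to Pub/Sub.

    `pubsub_publish` stands in for the real Pub/Sub client call; by default
    it just fabricates a message id, mimicking the publish acknowledgement.
    """

    def __init__(self, pubsub_publish=None):
        self._publish = pubsub_publish or (lambda topic, body: uuid.uuid4().hex)

    def publish(self, topic: str, message: dict) -> str:
        # Reject messages that break the contract before they reach the topic.
        missing = REQUIRED_DATA_FIELDS - message.get("data", {}).keys()
        if missing:
            raise ValueError(f"message rejected, missing fields: {sorted(missing)}")
        body = json.dumps(message).encode("utf-8")
        return self._publish(topic, body)  # message id acknowledges the publish
```

Injecting the publish callable keeps the abstractor testable and lets the middleware behind it be swapped, which is the point of abstracting the messaging layer.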

Message subscribers

Subscribers can consume messages through the REST API or through the client libraries. Just like message producers, subscribers can self-onboard to the list of topics through the service registry. On registration, a subscription is created for that topic, and each subscriber is assigned a unique consumer id. During onboarding, subscribers have the option to configure the subscription as shown below.

A sample subscription configuration for a daily load
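Google Pub/Sub subscriptions can filter on message attributes with expressions such as `attributes.subjectArea = "sales"`. The sketch below mimics that attribute matching locally, using the filter a subscriber might configure at onboarding (the filter keys come from the sample message later in the post; the function itself is illustrative, not part of the framework).

```python
def matches(required_attributes: dict, message: dict) -> bool:
    """True if every configured attribute equals the message's attribute value,
    approximating a Pub/Sub subscription filter on attributes."""
    attrs = message.get("attributes", {})
    return all(attrs.get(k) == v for k, v in required_attributes.items())

# A daily-load subscriber might only want sales messages from the data lake.
daily_sales_filter = {"subjectArea": "sales", "source": "datalake"}

msg = {"attributes": {"source": "datalake",
                      "subjectArea": "sales",
                      "storageLayer": "object store"}}

print(matches(daily_sales_filter, msg))  # True
```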

Message format

The message format (schema) acts as a contract between the producers and consumers and evolves as required. The framework provides the list of available topics and their message formats, which helps subscribers apply filters and parse the messages at their end.

{
  "data": {
    "schemaName": "sales_dl_rpt",
    "tableName": "item_sales",
    "status": "completed",
    "loadType": "incremental",
    "loadedTime": 1641270222
  },
  "attributes": {
    "source": "datalake",
    "subjectArea": "sales",
    "storageLayer": "object store"
  }
}

The above sample message format shows that a particular table has been loaded. The message itself carries the details on schema, table, status and load time. There are a few fields inside attributes, which subscriptions can use to filter messages as needed. The framework doesn't enforce a schema while creating a topic; the message structure can also be defined in the producer abstraction layers.
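On the subscriber side, a message following this format might be deserialized into a typed record before triggering a dependent workload. This is a sketch under the sample schema above; the class and field names are our own, not part of the framework.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TableLoadEvent:
    """Typed view of the 'data' section of the sample message format."""
    schema_name: str
    table_name: str
    status: str
    load_type: str
    loaded_time: datetime  # epoch seconds in the message, parsed to UTC

def parse_event(raw: str) -> TableLoadEvent:
    d = json.loads(raw)["data"]
    return TableLoadEvent(
        schema_name=d["schemaName"],
        table_name=d["tableName"],
        status=d["status"],
        load_type=d["loadType"],
        loaded_time=datetime.fromtimestamp(d["loadedTime"], tz=timezone.utc),
    )
```

A subscriber would then gate its workload on something like `event.status == "completed"` for the table it depends on.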

Monitoring and logging

The framework provides a centralized place to monitor the topics, messages produced, messages subscribed, lags on projects and audited operations, such as creation of topics, data access and system events. Teams can also build their own monitoring dashboards and consoles to check logs.

Moving forward

We’ve used this framework for a while and have successfully synchronized downstream teams’ workloads and applications as soon as the source workloads complete. The web portal helps producers and subscribers self-onboard, and as a next step, we’re adding new features to provide visibility into the lineage of the workloads.


Keerthipriyan, Staff Software Engineer, Finance Data Factory, Walmart Global Tech