How to chain and observe interdependent Data Products on Google Cloud

Yunus Durmuş
Google Cloud - Community
3 min read · Aug 24, 2023

authors: Connor Lynch, Yunus Durmus

In this blog post, we will show you how to create a distributed workflow architecture on Google Cloud in which data products owned by several different teams can trigger each other.

The image below shows an example case. We want to observe customer behaviour by correlating customer profiles and purchases. We are a large company with online and in-store sales data from several countries. Several teams, distributed across different countries, pull their relevant data via ETL jobs and make it available as data products (DPs). The problem is that these DPs depend on each other, so each team should schedule its ETL jobs around the refresh cycles of the others. Upstream DPs may fail or incur delays. How can downstream jobs synchronise? How can the distributed teams become aware of problems in each other's ETL jobs?

Data products are refreshed with different schedules. Aggregated data products should update themselves based on the fresh data.

Distributed Orchestration for Interdependent Data Products

To solve the above issues, we need an orchestration system. What features do we need from such a workflow?

  • Teams should be able to work independently.
  • Events should follow a standard so that integrations work easily.
  • Some DPs need immediate updates about events (streaming), while others need to check historic events (batch/poll).
  • For observability, the operations team should be able to check the status of relevant data product pipelines.

The publish-subscribe pattern is a great fit for distributed orchestration. Teams work independently: each publishes only its own events, and interested parties subscribe to those event messages. Google Cloud even has a product with this very name: Pub/Sub.

Data product teams send their updates to a central Pub/Sub topic managed by a central platform team. Downstream data products consume relevant messages from Pub/Sub after filtering. The messages are also written to permanent storage for dashboards and for DPs that poll earlier messages.

As shown in the image above, Pub/Sub is at the core of the architecture. Teams publish their events to a shared global Pub/Sub topic with attributes and ordering keys. Interested downstream product teams create their own subscriptions with filters so that they receive only relevant messages.
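As a rough sketch, a downstream team could set up such a filtered subscription with `gcloud`. The topic and subscription names and attribute values here are illustrative assumptions, not names from the architecture above:

```shell
# Shared global topic owned by the platform team (name is hypothetical).
gcloud pubsub topics create dp-events

# Subscription for the Customer 360 team: message ordering enabled,
# filtered on the "data_product" attribute so only events from the
# team's upstream DPs are delivered.
gcloud pubsub subscriptions create customer360-sub \
  --topic=dp-events \
  --enable-message-ordering \
  --filter='attributes.data_product = "store-sales-de" OR attributes.data_product = "online-sales"'
```

Note that a subscription filter is fixed at creation time, so teams should plan their attribute vocabulary up front.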

An SQL database for history and polling. Some data products depend on several others; in that case, they need the state of each upstream data product before performing any calculation. We propose keeping permanent storage of the full event history. Data products can poll this database to check historic events. Any operational database, such as Spanner or Cloud SQL, will do.
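The "poll before you run" check could look like the following sketch, which uses SQLite as a local stand-in for Cloud SQL or Spanner. The table schema, column names, and data product identifiers are illustrative assumptions:

```python
import sqlite3

# In-memory SQLite as a stand-in for the operational event-history database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dp_events (
        data_product TEXT,   -- which DP emitted the event
        status       TEXT,   -- e.g. SUCCESS / FAILED
        event_date   TEXT    -- logical date of the refresh
    )
""")
conn.executemany(
    "INSERT INTO dp_events VALUES (?, ?, ?)",
    [
        ("store-sales-de", "SUCCESS", "2023-08-24"),
        ("online-sales",   "SUCCESS", "2023-08-24"),
        ("store-sales-nl", "FAILED",  "2023-08-24"),
    ],
)

def upstreams_ready(conn, upstreams, run_date):
    """True if every upstream DP published a SUCCESS event for run_date."""
    placeholders = ",".join("?" for _ in upstreams)
    (count,) = conn.execute(
        f"SELECT COUNT(DISTINCT data_product) FROM dp_events "
        f"WHERE status = 'SUCCESS' AND event_date = ? "
        f"AND data_product IN ({placeholders})",
        [run_date, *upstreams],
    ).fetchone()
    return count == len(upstreams)

print(upstreams_ready(conn, ["store-sales-de", "online-sales"], "2023-08-24"))
print(upstreams_ready(conn, ["store-sales-nl"], "2023-08-24"))
```

A downstream DP would run a check like this on a schedule and only start its own pipeline once all its upstreams report success for the target date.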

Common message format. A single message format for all teams enforces standardisation and increases adoption through shared libraries.
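One possible shape for such a shared event, sketched as a Python dataclass; the field names are illustrative assumptions, not a published standard:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DataProductEvent:
    data_product: str            # unique DP identifier, also used for filtering
    status: str                  # e.g. STARTED | SUCCESS | FAILED
    event_time: str              # ISO-8601 timestamp of the pipeline event
    run_id: str                  # correlates events from the same pipeline run
    source_systems: list = field(default_factory=list)  # upstreams, for lineage

event = DataProductEvent(
    data_product="store-sales-de",
    status="SUCCESS",
    event_time="2023-08-24T06:00:00Z",
    run_id="2023-08-24-001",
    source_systems=["sap-de"],
)

# The JSON payload becomes the Pub/Sub message body; data_product and
# status are duplicated as message attributes so subscriptions can filter
# without parsing the body.
payload = json.dumps(asdict(event)).encode("utf-8")
attributes = {"data_product": event.data_product, "status": event.status}
```

Shipping a dataclass like this in a small shared library keeps every team's events parseable by the platform's database writer and dashboards.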

APIs for event ingestion and consumption. Although DPs could interact with the backend services directly, an API for ingestion into Pub/Sub and for reading from the database adds an abstraction that eases long-term changes. The platform team can easily apply versioning by introducing an API.

Observability via Looker/Looker Studio. A Looker dashboard on top of the data in the SQL database lets all teams debug the pipelines. When the owners of Customer 360 suspect an issue with a report, they can quickly check the upstream pipelines via these Looker dashboards. Moreover, since the messages record their source systems, we can display the lineage of data products in Looker as well.

The central platform team owns the infrastructure. The architecture is distributed and allows teams to work independently, but the Pub/Sub topics, APIs, database, and even the message format should be owned and maintained by a platform team.

To conclude, this distributed orchestration lets us easily chain data products. DP teams can trigger their pipelines using Cloud Functions or Workflows invoked by Pub/Sub, and we increase observability as well. You may even go further and create alerts when an upstream pipeline fails, again by triggering Cloud Functions from Pub/Sub.
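A Pub/Sub-triggered Cloud Function on the downstream side could look roughly like this. It uses the 1st-gen Python event signature (a dict with base64-encoded `data`); `start_pipeline`-style actions are stubbed out as return values, and the message fields are the same hypothetical ones assumed above:

```python
import base64
import json

def on_dp_event(event, context=None):
    """Entry point for a Pub/Sub-triggered Cloud Function (1st-gen signature).

    In a real deployment the branches would start a pipeline or page the
    team; here they just return a string describing the action taken.
    """
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if message["status"] == "SUCCESS":
        return f"triggering pipeline for run {message['run_id']}"
    if message["status"] == "FAILED":
        return f"alerting: upstream {message['data_product']} failed"
    return "ignored"

# Local simulation of the event envelope Pub/Sub delivers to the function.
fake_event = {
    "data": base64.b64encode(
        json.dumps({
            "data_product": "online-sales",
            "status": "SUCCESS",
            "run_id": "2023-08-24-001",
        }).encode("utf-8")
    ).decode("utf-8")
}
print(on_dp_event(fake_event))
```

Because the subscription filter already drops irrelevant messages, the function body stays small: decode, branch on status, act.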

If you want to learn more about implementing Data Mesh on Google Cloud, check out this white paper from my colleagues.
