Our first steps towards Data Mesh on Google Cloud Platform

Ulf Hagmann
Published in Nordnet Tech
Mar 28, 2023

Two years ago Nordnet started to architect a Data and Analytics platform on Google Cloud. The year before, the architecture for our operational microservices had been revamped and adapted for GCP, inspired by this Uber article. The engineering organisation was already domain-oriented, with autonomous development teams and a cloud team focused on enabling the development teams through a self-service platform. Our microservices communicated synchronously using REST, but we had started to encourage an event-based architecture. We had already used message queues for asynchronous communication, but now we started to publish messages whenever an interesting event occurred, such as a new customer being onboarded or an account being opened.

When taking a stab at the Data and Analytics platform architecture, we immediately felt that we wanted to preserve the distributed, domain-oriented nature of our microservice architecture. Concepts like CI/CD, DevOps and Configuration-as-code were already firmly established in the minds of the developers building the new Data and Analytics platform. We came from a monolithic architecture and had seen all of its drawbacks: spaghetti dependencies, difficulties in delivering incremental changes, the list goes on. This was before Data Mesh had become a paradigm talked about at conferences and in blog posts. When we found Zhamak Dehghani’s paper on Data Mesh, it resonated very well with our initial thoughts on how to move forward with building a new Data and Analytics platform on the cloud.

We were still a very small group of engineers involved in this work, so a true distributed-data organisation felt like too early a step to take. We needed to focus on building the foundations. However, if we just kept distribution and clear dependency tracking in the back of our minds when designing the architecture, we could more easily move towards a distributed model in the future. In the architecture, we picked up two of the Data Mesh principles, Data as a Product and Domain Ownership. Within those principles we focused on two concepts:

1. Shifting responsibility left (assuming data flows from left to right) and letting data be served to the Data and Analytics platform.

2. Making the platform domain-oriented in order to give clear domain ownership of the data.

We considered the GCP stack as more or less a given. Our operational microservices were already running on Google Kubernetes Engine using Cloud SQL and publishing events on Pub/Sub. It was very easy to continue with Dataflow for data processing and BigQuery for storage.

Serving data

We came from an environment where the preceding data warehouse was fetching data using SQL with direct access to the tables of the operational systems. In the new architecture we let the microservices of our operational platform serve data to our Data and Analytics platform. Shifting the responsibility made it necessary for our development teams to think about what data they wanted to make available and to model it accordingly. In some cases the data model was close to the operational model, but at least the teams gave it serious thought. We did not call the served data “raw” data, since it’s simply not raw. Simple principles, like representing a timestamp as TIMESTAMP and not as an integer, have been valuable. We also created basic modelling guidelines for the engineering teams to use when modelling their data. Once the data is modelled, we use a BigQuery JSON Schema as the contract. We did try to find a more implementation-agnostic schema format, such as JSON Schema or AVRO, but decided to keep it simple. We made sure to have proper descriptions of each field, maintained by the team serving the data.
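As an illustration, a contract for a hypothetical account-opened event could look like the following BigQuery JSON Schema (the field names here are invented for the example; each field carries the description maintained by the serving team):

```json
[
  {
    "name": "account_id",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Unique identifier of the account."
  },
  {
    "name": "account_type",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Type of account, e.g. savings or trading."
  },
  {
    "name": "opened_at",
    "type": "TIMESTAMP",
    "mode": "REQUIRED",
    "description": "When the account was opened, served as a proper TIMESTAMP rather than an integer."
  }
]
```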

For our Java services it was easy to publish events on Pub/Sub, but the same serving approach was used for any system: we have our JIRA system publishing to Pub/Sub using HTTP/REST, Microsoft Dynamics using the C# client, and bash scripts using the gcloud CLI. We picked JSON as the message format since it was consistent with the message formats we already used in our Java microservices, so it was a very natural choice. We considered other formats such as AVRO, but the simplicity and easy adoption of JSON made the decision straightforward.
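As a rough sketch of what serving such an event from a Java service can look like (the project, topic and payload below are made up for the example), using the Pub/Sub client library:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class AccountOpenedPublisher {

    public static void main(String[] args) throws Exception {
        // Hypothetical project and topic names, for illustration only.
        TopicName topic = TopicName.of("my-gcp-project", "account-opened");
        Publisher publisher = Publisher.newBuilder(topic).build();
        try {
            // The payload is plain JSON, following the schema agreed with the consumers.
            String json = "{\"account_id\":\"1234\",\"account_type\":\"savings\","
                    + "\"opened_at\":\"2023-03-28T10:15:30Z\"}";
            PubsubMessage message = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8(json))
                    .build();
            // Publish asynchronously and block until the message id is returned.
            String messageId = publisher.publish(message).get();
            System.out.println("Published message " + messageId);
        } finally {
            publisher.shutdown();
        }
    }
}
```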

Streaming first

Since our initial sources were streaming, we simply continued down the same path. The majority of our data is created in real time, not in batches, so why serve it as batches? Therefore, most services publish their data on Pub/Sub. In many cases, streaming data turned out to be more useful than initially thought. We did discuss whether streaming was really necessary, since we do most of the analysis on historical data anyway. However, having streaming data enabled us, for example, to closely follow the rollout of new features.

There is often an assumption that streaming is more complex than batch. However, batch processing results in a need to coordinate between batch jobs, which is not really necessary in a streaming-only environment. Unfortunately, databases are generally not well suited for stream processing. Once the data is inside a data warehouse, it is difficult to process it in a streaming way using SQL. Our solution is to process data incrementally at a high frequency, such as every five minutes, effectively micro-batching the transformations inside the data warehouse.
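As an illustration of the micro-batching idea (table and column names are hypothetical, and this is only one way of expressing such an incremental transformation), a scheduled MERGE can pick up only the rows that are newer than what the target table already contains:

```sql
-- Hypothetical incremental transformation, scheduled e.g. every 5 minutes.
-- Only rows newer than the target table's high-water mark are processed.
MERGE `customer_domain.accounts_curated` AS target
USING (
  SELECT account_id, account_type, opened_at
  FROM `customer_domain.account_opened_events`
  WHERE opened_at > (
    SELECT COALESCE(MAX(opened_at), TIMESTAMP '1970-01-01')
    FROM `customer_domain.accounts_curated`
  )
) AS source
ON target.account_id = source.account_id
WHEN MATCHED THEN
  UPDATE SET account_type = source.account_type, opened_at = source.opened_at
WHEN NOT MATCHED THEN
  INSERT (account_id, account_type, opened_at)
  VALUES (source.account_id, source.account_type, source.opened_at);
```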

State, events or both

Serving state and events

Given our striving towards an event-based microservice architecture, it was easy to favour events over state, and the primary way of serving data is therefore events on Pub/Sub. Events are great for immutable time-series data. For slowly changing mutable data, however, there is a need to “bootstrap” the initial state, as events going back to the beginning of time are not available. Our oldest customers have been with us for more than 20 years and have not necessarily been updated since then, and are hence not available as events.

We introduced a state dump concept: a batch dump of the current state of all data. A state dump is an event like any other, with the same message format, except that it is triggered not by a change in the data but by the system itself, either on request or on a schedule. Initially, we published the state dump on Pub/Sub (and still do in some cases), basically pushing out the whole dataset on Pub/Sub. After a while we also introduced state dumps in Cloud Storage, writing all messages as JSON lines to a bucket. The latter gave the consumer more control over when to load a state dump and also improved performance. The logical structure of the messages, however, remained exactly the same.
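A minimal sketch of what writing such a state dump to Cloud Storage could look like (the bucket, object name and payload are invented for the example), using the Cloud Storage Java client:

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.nio.charset.StandardCharsets;
import java.util.List;

public class AccountStateDump {

    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Each line is one message, with the same logical structure as the Pub/Sub events.
        List<String> messages = List.of(
                "{\"account_id\":\"1234\",\"account_type\":\"savings\",\"opened_at\":\"2001-05-02T09:00:00Z\"}",
                "{\"account_id\":\"5678\",\"account_type\":\"trading\",\"opened_at\":\"2010-11-17T14:30:00Z\"}");
        String jsonLines = String.join("\n", messages);

        // Hypothetical bucket and object name; one dump file per run.
        BlobId blobId = BlobId.of("my-state-dump-bucket", "accounts/state-dump-2023-03-28.jsonl");
        storage.create(BlobInfo.newBuilder(blobId).setContentType("application/json").build(),
                jsonLines.getBytes(StandardCharsets.UTF_8));
    }
}
```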

Depending on the nature of the data, it is served as state, as events, or both. Small, slowly changing datasets tend to be easier to serve as state dumps and are served that way on a daily basis. Larger datasets are served as events, with a one-off state dump whenever necessary.

Domain-orientation — light

At Nordnet, we are (still) a central data team developing and maintaining the Data and Analytics platform. Even though the data team is not domain-oriented, the sources of data are. It would be a missed opportunity not to let this domain organisation propagate through to the Data and Analytics platform.

Domains

We have created a structure where each domain has its own module within a central repository. The domains can be deployed independently of each other. While these domains are still maintained by a central team, it would be an easy shift to put the responsibility on the domain teams. In practice, this is implemented on GCP as one BigQuery dataset per domain, one service account per domain, and so on. We have domains like market-data, insurance and mortgage. However, we found that some data is difficult to categorise into a business domain; for example, our tracking data spans multiple (business) domains and therefore has no natural home. Hence, we recognise a need for platform domains in addition to the business domains.

Cross-domain access

One of the purposes of the data warehouse is to enable insights where data from different sources is easily joined. How can we do this without creating spaghetti dependencies? Our approach was to version tables in BigQuery and basically treat them as APIs. The team owning a domain can roll out changes without coordinating with all consumers of the data. They make sure the old version and the new version are both available for a period of time, during which consumers can be shifted over to the newer version. Once all consumers have upgraded, the previous version can be removed.
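One way this versioning can look in practice (the dataset and column names are hypothetical, and keeping the old version as a view over the new table is just one possible option) is to let both versions live side by side in the domain’s dataset during the migration period:

```sql
-- Version 2 of the accounts table: the new contract, with an extra column.
CREATE TABLE IF NOT EXISTS `customer_domain.accounts_v2` (
  account_id STRING NOT NULL,
  account_type STRING,
  opened_at TIMESTAMP
);

-- Version 1 is kept available, here as a view over the new table, while consumers migrate.
CREATE OR REPLACE VIEW `customer_domain.accounts_v1` AS
SELECT account_id, opened_at
FROM `customer_domain.accounts_v2`;

-- Once all consumers query accounts_v2 directly, the old version can be dropped:
-- DROP VIEW `customer_domain.accounts_v1`;
```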

Final words

Nordnet comes from a history of business intelligence and analytics and is still at the start of using its data for ML/AI and more product-focused use cases. This has also influenced the architectural design, and our Data and Analytics platform could probably be better labelled a domain-oriented data warehouse than a data mesh. However, with the architecture described here, we believe we can more easily scale up our data initiatives in the future and begin shifting towards a truly distributed Data Mesh when we need to.

Since the first version of this article was written, many months have passed. GCP has introduced a number of services, such as Analytics Hub and Dataplex, which support many of the ideas behind Data Mesh. I have also moved on and no longer work for Nordnet, but I know that the initial ideas and the architecture behind the Analytics Platform are well taken care of.
