Changing Enterprise Data Processing

In a Service Oriented Architecture based on Domain Driven Design

Tim Meeuwissen
Published in Jumbo Tech Campus
Jul 18, 2022 · 11 min read


Big enterprises generate and consume a tonne of data. Jumbo Supermarkets, for example, has to process a lot of orders created by customers. But we also have to run analytics over them, so we can make proper forecasts and reduce waste across the entire chain.

We didn't become the Netherlands' second biggest grocery retailer overnight. This means that this capability has grown over time. And as with any architecture, organic growth helps you get there, but it also reduces your future ability to grow and to expand its usage and opportunities if you aren't open to architectural improvement.

As I've written in many of my previous blogposts, we are transitioning strongly towards a Service Oriented Architecture, based on Domain Driven Design practices. In short, this means that we strive to segment our software solutions into smaller pieces that can be worked on per team. Each team has the ability to work on a piece of the puzzle without harming the solutions of other teams. This maximises team velocity by reducing impediments, and it also brings focus and quality to the solution itself. Pieces that can be owned can be guarded in functional as well as non-functional (security, performance, observability, etc.) aspects. Because we build them in a Domain Driven way, we make sure that each service resonates well with the business, and that Technology and Business share a ubiquitous language.

In order to deduplicate responsibilities, any system could perhaps supply suggestions for content, but only the service remains solely responsible for the domain. If you want to know the state of, let's say, a product, you should never look beyond the Product Service, because that is the state that counts, and nothing else.

How does that relate to Data?

When data volumes, their variety and the number of teams grow, it becomes harder and harder to maintain a standard over data structures. Was it 'product' or 'article'? Was it 'Number' or 'ID'? Stuff like that, as well as the formatting of content. Did we express weight in kilograms or grams? Did we send a datetime, and in which ISO format, or did we send Unix timestamps?

But that's only one part of the problem. The other part is which attributes we actually have of (for example) a product. Does it contain nutritional facts, does it contain information about the producer? And so forth.

There is a solution

Having a single responsible entity in your landscape can solve that problem. But when you do so, it creates a myriad of other problems along the way, which I address in this blogpost.

Because whenever you say 'single', this inherently means that changes to that entity have a ripple effect throughout the entire landscape. So how do we tackle that, while still retaining all flexibility for downstream consumers? And how do we prevent breaking downstream reports and systems?

Ownership

As I've stated in the introduction, there should be only one 'system' or 'service' responsible for one entity, or body of information if you want. Since these services are developed in a Domain Driven way, they expose and govern a single aspect of the data and the business functionality. This is required for operational systems and the teams working on them.

Given that requirement, and given that ownership lies there, it is only logical that this ownership isn't duplicated within the data realm. Data should flow down from a service to the data landscape, and there shouldn't be any remodelling as soon as it arrives there, simply because that would create shared responsibility, which causes friction in the organisation and therefore becomes the root of outages, misunderstandings or endless meetings on definitions of things.

Automated Transfers

We need to make sure that all data is available for analysis. We cannot know upfront which data will be used in analysis and reporting. Therefore we should consume everything. Sometimes data doesn't seem interesting enough to do analytics with, but when you analyse it over time, it becomes apparent that it can give you valuable insights on rates or, for example, a year-on-year prognosis.

In order to do this, we need automatic exposure of data and its changes in a standard way. We can do that by ensuring each service includes a standard library that provides an API endpoint exposing the Jumbo Data Model of its entity in its entirety, combined with an endpoint that exposes the definition of the data structure and its expected contents.
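To make this concrete, here is a minimal sketch of what such a standard library could expose, assuming a Python/FastAPI service. The endpoint paths, field names and version number are illustrative assumptions, not our actual implementation:

```python
# Hypothetical sketch of the standard library's two endpoints, assuming a
# Python/FastAPI service. Paths, fields and the version number are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Product(BaseModel):
    # Illustrative fields only; the real model is defined by the domain.
    productNumber: str
    name: str
    weightInGrams: int

SCHEMA_VERSION = "2.1.0"  # assumed semantic versioning of the data structure

_PRODUCTS = {
    "12345": Product(productNumber="12345", name="Semi-skimmed milk 1L", weightInGrams=1040),
}

@app.get("/products/{product_number}")
def get_product(product_number: str) -> Product:
    # Expose the entity in its entirety, in the company's own data model.
    return _PRODUCTS[product_number]

@app.get("/definitions/product")
def get_definition() -> dict:
    # Expose the structure and expected contents, so a generic adapter can
    # validate and version whatever it ingests.
    return {"version": SCHEMA_VERSION, "schema": Product.model_json_schema()}
```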

A generic adapter then listens in on changes in that service, consumes them and sends them over to a central data repository, where they are stored with a date stamp and a reference to the definition of the data structure. If that version of the data structure didn't exist yet, it inserts it.
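A rough sketch of what that generic adapter could look like, again with illustrative URLs and an in-memory stand-in for the central repository:

```python
# Hypothetical sketch of the generic adapter: it reacts to a change in a service,
# fetches the entity and its definition, and stores the record in a central
# repository with a timestamp and a reference to the data structure version.
import datetime
import requests  # assumed HTTP transport; in practice this could be event-driven

SERVICE_BASE_URL = "https://product-service.internal"   # illustrative
DATA_REPOSITORY = {"schemas": {}, "records": []}         # stand-in for real storage

def on_entity_changed(entity_id: str) -> None:
    definition = requests.get(f"{SERVICE_BASE_URL}/definitions/product").json()
    entity = requests.get(f"{SERVICE_BASE_URL}/products/{entity_id}").json()

    version = definition["version"]
    if version not in DATA_REPOSITORY["schemas"]:
        # First time we see this version of the data structure: insert it.
        DATA_REPOSITORY["schemas"][version] = definition["schema"]

    DATA_REPOSITORY["records"].append({
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "schema_version": version,
        "payload": entity,
    })
```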

Preventing pollution

It's easy to think of services as a black box that contains business logic, or as a black box that relays the logic towards external systems. However, there should be some guards in place that prevent these services from becoming a hot-spot for integrations. The issue is that when you integrate directly from your service, its capabilities and ownership get mingled with the capabilities of the upstream vendors.

Vendors might have thought long and hard about the data that they expose, as well as the way in which they do that (iDoc, XML, JSON, gRPC, you name it), but it is paramount that your organisation thinks about what it deems logical within the realm of that service, combined with the set standards. E.g. what does your business owner deem logical as properties of the data that he or she is responsible for? Once you start looking at what you deem logical, rather than what you've been handed, you'll see that there is more to the subject than you might have thought initially.

Remember that IT is only there to enable business to thrive.

This is why any ingress or egress, coming to or leaving the service, should always occur in your company's own data model, and all logic within that service should work with that data.

This alone prevents you from consuming data that "was already available from the upstream resource" but has nothing to do with your functional domain. You'll refrain from feature creep, drifting scope and unclear responsibilities.

In order to translate the vendor's way of communicating to your own data model, you can now create an adapter that is responsible for the translation. That's right: it adapts way of talking A to your data model B, or the other way around, by whichever means possible. The only rule is that it is not allowed to apply any data logic, nor is it allowed to perform intermediate storage. That would make it a service ;-).
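As a sketch, assuming a vendor that speaks in its own field names and units, such an adapter is nothing more than a pair of translation functions (all field names here are invented for the example):

```python
# Hypothetical vendor adapter: pure translation between the vendor's
# representation and the company data model. No business logic, no storage.
def vendor_to_company(vendor_payload: dict) -> dict:
    return {
        "productNumber": str(vendor_payload["ARTICLE_ID"]),
        "name": vendor_payload["DESCR"],
        "weightInGrams": int(float(vendor_payload["WEIGHT_KG"]) * 1000),  # kg -> g
    }

def company_to_vendor(product: dict) -> dict:
    return {
        "ARTICLE_ID": product["productNumber"],
        "DESCR": product["name"],
        "WEIGHT_KG": product["weightInGrams"] / 1000,
    }
```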

Let’s see what our solution now looks like.

But Tim, what about non-streaming data?

Of course there are processes that require bulk ETL (Extract, Transform and Load) operations and cannot be done in a streaming way, as we do with the services.

It’s always good to focus on doing streaming analysis of your data. It makes everything more predictable, and we obtain real-time insights. However, you can imagine that there are processes (like order forecasting in order to predictively replenish the warehouses with enough products to fill our stores) that have to run at a certain moment and produce a bulk of data. We need to cater for these data streams as well.

And what about historical routes?

Well, at the moment we have a so-called 'conformed layer' in our data warehouse. This layer contains the data in a normalised way, ready for downstream consumption. With the new way of working, we won't need that layer anymore, since data will already be ingested conformed to the service's standard, which by then is the only standard we really care about.

However, there are of course a gazillion reports dependent on this data. You can't, and you shouldn't, create a business impact that large for the sake of 'new architecture'. It's impossible to transition to a new structure overnight. So we need to keep supporting it for as long as we need to, before we deprecate it.

But I need one definition, not a list of versions!

No you don't. This is where we often confuse what we need with what we want. The project you are working on depends on specific properties of the data, but not on all of it. This is where we introduce a big change. You have to write down which data you actually need for your team to be able to do the calculations you want to make. No more, no less.

In order to get that data, you'll have to configure a file for your project. That file contains references to the fields you require, and whether you need them materialised (because you need a performant dataset, or this data product will be accessed frequently) or virtualised (slightly more expensive on compute, slightly cheaper on storage, more accurate / real-time).
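A hypothetical example of such a project configuration, written here as a Python structure; the keys, entity names and fields are assumptions for illustration only:

```python
# Hypothetical project configuration: which fields the team actually needs,
# from which entities, and how the result should be delivered.
STOCK_PROJECT_CONFIG = {
    "project_space": "stock",              # pointer to the team's own project space
    "sources": [
        {
            "entity": "product",
            "schema_version": "2.x",       # the structure versions we accept
            "fields": ["productNumber", "weightInGrams"],
        },
        {
            "entity": "order",
            "schema_version": "1.x",
            "fields": ["productNumber", "quantity", "orderedAt"],
        },
    ],
    # materialised: precomputed and stored, for performant or frequent access.
    # virtualised: computed on read; more compute, less storage, more real-time.
    "materialisation": "virtualised",
}
```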

Ownership within downstream consumers

If all teams constantly have to work with the same data and create more and more dependencies on it, it will become hard for anybody to understand what the implications of any change are.

So as a team working with the data, it is important to have your own space that holds your interpretation of the data and the queries or mutations you’ve employed for your data products.

This is why the configuration holds a pointer to a project space. The project space can, for example, be all about stock. That doesn't mean there can only be one product coming from that space. It's like Domain Driven Design all over again, but this time within the data realm.

Your space can be all about stock, yet deliver many products: stock at store level, excess analysis, forecasting insights, etc. It's a space with a name that herds together all products of a specific type. It shouldn't be so big that it has to contain all products, but it shouldn't be too narrow either, since you should be able to hand it over to another team once the load on your team becomes too much.

Because you prevent this project from using all data all the time, it will be easy for anybody to understand how the data came to be and how it is shipped.

Data Products for Consumption

When a Data Product becomes something that can be used by other teams as well, it becomes a capability of the organisation. It then becomes clear that the data should be exposed over a service. This way other systems can also consume the newly acquired wisdom, by processing it further or simply displaying it in a dashboard.

But remember? A service contains that default library which exposes its model to the data layer. This means we have created a cyclical flow. It serves its purpose rather beautifully, because it allows other data products to work with your aggregated data without duplicating the knowledge or the data that was required to acquire that wisdom.
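To illustrate that cyclical flow, here is a hedged sketch of a data product promoted to a service: it exposes its aggregated model and its definition just like any other service, so the generic adapter can feed it back into the data landscape. Names and fields are invented for the example:

```python
# Hypothetical sketch of a data product promoted to a service. It exposes its
# aggregated model and its definition like any other service (cyclical flow).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StoreStockInsight(BaseModel):
    # Illustrative aggregated fields produced by the stock project space.
    storeNumber: str
    productNumber: str
    forecastedDemand: int

_INSIGHTS = [
    StoreStockInsight(storeNumber="001", productNumber="12345", forecastedDemand=40),
]

@app.get("/definitions/store-stock-insight")
def get_definition() -> dict:
    return {"version": "1.0.0", "schema": StoreStockInsight.model_json_schema()}

@app.get("/store-stock-insights/{store_number}")
def get_insights(store_number: str) -> list[StoreStockInsight]:
    # Other teams and dashboards consume the result here instead of recomputing
    # it themselves, which avoids discrepancies between reports.
    return [i for i in _INSIGHTS if i.storeNumber == store_number]
```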

Take financial reports, for example. One team can create those reports, and other teams (when allowed) can consume the result. This ensures that there will never be discrepancies in any report.

Self service

When you do not use all of the data in your project space, it might become hard to do self-service on your data. We have many business stakeholders who need to look at data and do analytics themselves. They take all kinds of data products and cross-relate information to do their jobs successfully.

For example, EAN product information isn't relevant for calculating stock data products, but when you work in a Distribution Centre, knowing which EANs belong to a product is of the utmost value.

In order to do that, we require the data to adhere to the language standards and the key identifiers, with the help of linters in pipelines. You can create all kinds of data products, but as long as we stick to productNumber (and not articleID or something else), the internal self-service customer remains able to cross-reference the data with the original product data and build whichever report he or she needs.
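A minimal sketch of what such a linter could look like in a pipeline; the synonym list is invented for illustration:

```python
# Hypothetical pipeline linter that enforces the shared language: key
# identifiers must use the agreed names.
BANNED_SYNONYMS = {
    "articleID": "productNumber",
    "articleNumber": "productNumber",
    "ean_code": "eanCode",
}

def lint_field_names(field_names: list[str]) -> list[str]:
    """Return a human-readable violation for every field that breaks the standard."""
    return [
        f"use '{BANNED_SYNONYMS[name]}' instead of '{name}'"
        for name in field_names
        if name in BANNED_SYNONYMS
    ]

# Example: run in CI against the fields a data product exposes.
violations = lint_field_names(["articleID", "quantity"])
if violations:
    raise SystemExit("naming standard violated: " + "; ".join(violations))
```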

Let's see what that amounts to:

Are we there yet?

Not quite. There's one thing that becomes a killer when we walk this route: the chain of dependencies on the data model. See, the development team does not know about the existence of the data products downstream. The organisation is simply too big for anyone to know about all of these things. Not knowing what will happen to downstream consumers when you (have to) change a structure might still break everything you hold dear.

In order to solve these issues, we rely on something the development community has relied on for a long, long time: deployment pipelines.

  1. Whenever the service changes its definitions, the pipeline checks whether that breaks anything that consumes it downstream.
  2. It first does so by checking which configurations leverage that data. If the data is used, the change requires a new major version (a sketch of this check follows after this list).
  3. Developers should be as reluctant as can be to bump the major version, because it will force them to maintain two versions for some time to come. Instead, they can re-assess whether the change is worth it, whether there's another way that makes more sense, or whether we are okay with it because we can introduce a new port-back configuration in the data adapter layer together with the team that engineered that config initially.
  4. It might also be the case that a self-service report used the data, which would break as well. Since these aren't defined in code, we can do a lineage check and report that back.
  5. We thus prevent regression whenever models change, and we make sure that we catch issues before they arise.
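A simplified sketch of the configuration check from step 2, assuming the hypothetical project configuration format sketched earlier; the helper name is made up:

```python
# Simplified sketch of the pipeline check: compare the fields in the service's
# new definition against the fields downstream configurations rely on.
def find_broken_consumers(new_schema_fields: set[str],
                          project_configs: list[dict]) -> list[str]:
    """Return the project spaces whose required fields no longer exist."""
    broken = []
    for config in project_configs:
        for source in config["sources"]:
            missing = set(source["fields"]) - new_schema_fields
            if missing:
                broken.append(f"{config['project_space']}: missing {sorted(missing)}")
    return broken

# Example: the product service wants to drop 'weightInGrams'.
configs = [{"project_space": "stock",
            "sources": [{"entity": "product",
                         "fields": ["productNumber", "weightInGrams"]}]}]
if find_broken_consumers({"productNumber", "name"}, configs):
    # A consumer would break: require a major version bump (and possibly a
    # port-back configuration in the adapter layer) before this change ships.
    raise SystemExit("breaking change detected; bump the major version")
```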

However, it might be the case that the landscape receives unexpected data, or data in unexpected volumes.

This is why we decouple data ingestion from data processing. The raw storage layer consumes and immediately returns, without doing any processing. Even if every data domain behind it is completely rebuilt, the data will still arrive in an orderly manner, well-versioned and stored with timestamps.
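A small sketch of that decoupling, with an in-memory queue standing in for the real raw storage:

```python
# Hypothetical sketch of the decoupling: ingestion only appends the payload with
# a timestamp and schema version and returns immediately; processing happens
# later, asynchronously, even while downstream domains are being rebuilt.
import datetime
import json
import queue

RAW_STORAGE: "queue.Queue[str]" = queue.Queue()  # stand-in for object storage / a log

def ingest(payload: dict, schema_version: str) -> None:
    record = {
        "received_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "schema_version": schema_version,
        "payload": payload,
    }
    RAW_STORAGE.put(json.dumps(record))  # append only: no validation, no processing

def process_next() -> dict:
    # Runs on its own schedule; a backlog here never blocks ingestion.
    return json.loads(RAW_STORAGE.get())
```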

To make it complete ;-) this is the diagram we are eventually working towards:

We are currently working on implementing these things, some of which we already have in place to varying degrees.

Did you enjoy this read? Please join the Jumbo Tech Campus on Medium, or follow my personal profile!

Tim Meeuwissen
Jumbo Tech Campus

Seriously passionate in understanding how stuff works