μ Architecture

Daniel Buchta
11 min read · Jul 24, 2022


Source: https://en.wiktionary.org/wiki/%CE%BC

Data mesh by Zhamak Dehghani can be seen as a new framework for intensive data distribution.

Data mesh objective is to create a foundation for getting value from analytical data and historical facts at scale — scale being applied to constant change of data landscape, proliferation of both sources of data and consumers, diversity of transformation and processing that use cases require, speed of response to change.

Zhamak Dehghani, https://martinfowler.com/articles/data-mesh-principles.html

As per the definition:

Data mesh is a decentralized sociotechnical approach to share, access, and manage analytical data in complex and large-scale environments — within or across organizations.

Zhamak Dehghani, Data Mesh

At the end of the day, it has to be backed by a proper data architecture pattern.
We saw the rise of lambda architecture and kappa architecture during the big data period.

Lambda architecture. Source: https://isdanni.com/streaming-system/
Kappa architecture. Source: https://isdanni.com/streaming-system/

Both lambda architecture and kappa architecture were oriented toward delivering continuous, near-real-time pipelines. By that I mean data tubes that transport data from the source point to the destination where value is derived.

I think it's time to come up with a new architecture to catalyse the data mesh principles.
Let’s call it μ architecture.
It has one and only one main proposition:

Queries are passive and data are active.

μ Architecture

This idea is not new; data streaming, for example, uses it as a paradigm.

In a traditional database, the data sits passively and waits for an application or person to issue queries that are responded to. In stream processing, this is inverted: the data is a continuous, active stream of events, fed to passive queries that simply react and process that stream.

Jay Kreps, https://www.confluent.io/blog/every-company-is-becoming-software/

Let's find out what it means. Until now, the query has been the value bringer. You have the data store, and your main goal is to write a query that fits your app/API. In fact, we don't care whether all the data are there, nor what value they have.

This status quo is quite natural because, until now, a typical enterprise data architecture has consisted of a set of data-producing apps with their local data models/data stores and some DWH-like centralized storage. Local data models are there mainly to serve OLTP processes, i.e. to keep the apps going. The DWH-like storage is the only source of truth, i.e. there isn't anything else against which it can be validated.

Then, the only thing we aim to achieve most of the time is to write the query so that it returns the result its user expects. And that user is typically our colleague.

Moreover, as DWH-like storages are often used for long-term aggregations, for example How many customers bought our product during the last 3 months?, there is no room for thinking about dynamics.
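As an illustration, such a question typically looks like the following batch-style sketch (the table and column names are my assumptions, not from any concrete system):

-- static, backward-looking aggregation over a DWH fact table
SELECT COUNT(DISTINCT customer_id) AS buying_customers
FROM f_sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '3' MONTH;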

Nowadays, two facts arise, that are changing our perspective:

  • The standard data user shifts from the enterprise employee, trained to work with data the way the enterprise wants, to the enterprise customer, who is neither trained nor paid to do so. In fact, the customer pays for a smooth and meaningful user experience:)
  • Technology stacks such as data streaming, cloud computing, serverless etc. enable us to let our data flow continuously, so that they are almost everywhere at almost the same time:)

So it is time to concentrate on data as a dynamic fluid to expose rather than a static object to analyze.

It means that the essential thing is to have data in motion.
This leads to the classical Data Dichotomy described by Ben Stopford (https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/):

Data systems are about exposing data. Services are about hiding it.

Ben Stopford

Transitive vs. distributive data

Transitivity is about data travelling at query time. Almost all of today's queries whose complexity goes beyond simply querying a transaction that just happened within the querying app struggle with two things:
1. Some of the data are transferred from a remote source as part of the query. This is typical for integration platforms like an ESB.
2. Some of the data are "on the road" at the moment the query is executed, so your query runs over a database that does not yet hold the most valuable data. This is typical for a DWH.
These two pain points lead us to the next chapter:

Linear data processing vs. data everywhere topology

By linear data processing we mean what is typically represented by a pipeline. It's the prevailing pattern, based on the transportation of data from point A to point B. Moreover, most of the time both A and B are terminal states, in the sense that A is the place where the data originate and B is the place where the value is produced.

This pattern leads to the typical architecture of N data systems connected via up to N*(N-1) pipelines — for N = 10 systems that is already 90 point-to-point pipelines to maintain.

The problems of this way are well known:

  • Interconnecting all systems with each other leads to an enormous amount of change management, and adding a new system is always a major undertaking.
  • Collaboration between teams/systems is possible at a quarterly cadence at best, as nobody can create a new pipeline in less than 3 months.

There was the SOA approach based on the ESB. But the Common Data Model concept was a failure, as it just added another layer of complexity.

The Data Mesh principle, on the other hand, opens the way to produce/consume from the point in the data topology that fits the use case best. It embraces collaboration because finding the best-fit point in the topology is much simpler than creating a new pipeline. Interconnecting systems is no longer as brutally point-to-point as with traditional pipelines. This allows us to think in terms of data everywhere, especially when using technologies like data streaming with its continuous processing.

Real-time use cases vs. Continuous processing

Talking with people who can profit from data, especially in the case of data streaming, they often say that they don't actually need to solve their use cases in real time.

What I think they need to know is that the current technology stack allows us to process the data continuously.

By this we mean that data are not processed in separate time windows (a typical example being the night hours for a DWH).

This gives us (illustrated by the sketch after the list):

  • less complex processing, as we don't need so much glue logic to stitch these time windows together
  • less computing power, as we are processing smaller amounts of data at a time
  • more accurate results, both in timeliness and in data quality, as discrepancies caused by time delays are minimized
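Here is a hedged ksqlDB sketch of the same 3-months-of-purchases question from above, this time maintained continuously as events arrive (the purchases stream and its columns are assumptions):

-- continuous counterpart of the nightly batch query: the aggregate is
-- updated on every new purchase event instead of being recomputed
-- during the night hours; all names are illustrative assumptions
CREATE TABLE buying_customers_90d AS
  SELECT customer_id,
         COUNT(*) AS purchases
  FROM purchases
  WINDOW HOPPING (SIZE 90 DAYS, ADVANCE BY 1 DAY)
  GROUP BY customer_id
  EMIT CHANGES;

Each update is incremental, which is exactly where the lower complexity, the lower computing power and the better accuracy come from.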

∞*∞=∞

For big data, especially in the cloud, we can assume infinite storage. Then, theoretically for now, we can also assume infinitely many data sources. This can be expressed by the slightly courageous sentence:

Every query deserves its own database.
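Taken literally, this is what materialized views in a streaming platform give you. A minimal ksqlDB sketch, with all names being illustrative assumptions:

-- each long-lived query materializes its own queryable state:
-- effectively "its own database"
CREATE TABLE sales_by_region AS
  SELECT region,
         SUM(amount) AS total_sales
  FROM orders
  GROUP BY region
  EMIT CHANGES;

-- a pull query then reads this private materialized state directly
SELECT total_sales FROM sales_by_region WHERE region = 'EU';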

Data uncertainty principle

When we operate with data, we are tackling the same problem as the Heisenberg uncertainty principle. Data have their position given by volume V and their momentum represented by value 𝜈. Volume and value can't be measured without affecting each other, so there is room to state an uncertainty relation between these two quantities:

ΔV · Δ𝜈 ≥ (V/2) · (1/f)

where f is the frequency with which new data are added. This represents the statement:

Extracting data value and data volume from a system containing data affects each of them with an uncertainty that can be bounded by the product of V/2, the expected number of operations needed to find a piece of information within data volume V by linear scan, and 1/f, the inverted frequency of changes, representing how old the latest data in the system are.

Disclaimer №1: The term data value is not well defined. It will be part of the next work upon this concept.

Disclaimer №2: Please note that for a quantum computer using Grover's algorithm, V/2 would be replaced by √V.
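For readers who prefer the relation spelled out, here it is in LaTeX, together with the quantum variant from Disclaimer №2 (a restatement of the formula above, with V/2 read as the expected cost of a linear scan over volume V):

% classical: a linear scan finds an item in about V/2 operations on average
\Delta V \cdot \Delta\nu \;\ge\; \frac{V}{2} \cdot \frac{1}{f}

% quantum (Grover's algorithm): the search cost drops to about sqrt(V)
\Delta V \cdot \Delta\nu \;\ge\; \sqrt{V} \cdot \frac{1}{f}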

Queries are passive and data are active.

To explain the idea of μ Architecture, let’s start with this picture by Zhamak Dehghani.

Credit goes to Zhamak Dehghani, Source: https://martinfowler.com/articles/data-monolith-to-mesh.html#DataAndSelf-servePlatformDesignConvergence

Let's continue with the basic structure of a neuron.

Source: https://en.wikipedia.org/wiki/Neuron

Let's compare this, for example, with a modern streaming technology like Apache Kafka.

Source: https://medium.com/@navdeepsharma/the-ecosystem-of-apache-kafka-6087b621d16f

All 3 diagrams are about the same pattern, consisting of the triplet get, transform, serve. This, by the way, is also the generic pattern behind ETL, ESB etc. What is new here, thanks to today's technology stack, is the ability to build a mesh consisting of many of these triplets operating in a distributed way, continuously, near real-time.

Then, the complexity of a query is spread over many μ-queries, represented in the Kafka ecosystem, for example, by Kafka Streams/KSQL. These queries serve as percolators of the continuous data flow, making the data active. We can imagine it is similar to a hydroelectric power plant cascade or a neural network.

Source: https://www.asb.sk/stavebnictvo/inzinierske-stavby/vodohospodarske-stavby/ako-sa-planuje-prevadzka-vodnych-elektrarni-na-slovensku
Credit: https://denisezannino.wordpress.com/2014/09/30/neural-circuitry/

In this kind of architecture, the data are the active part that propagates the signal through the mesh.

μ Architecture definition

μ Architecture is a recursively repeating pattern of self-similar data micro-processors consisting of the triplet source connect, processor, and sink connect, which we will call data μ-products. These μ-products then create a mesh. Data flow through this mesh while being transformed along the streams. This pattern allows us to spread the complexity of processing across the mesh through distribution and scalability.
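For concreteness, here is a hedged sketch of one such μ-product in ksqlDB terms. The connector classes exist in the Kafka Connect ecosystem, but all topic, stream and column names are my assumptions, not a prescribed implementation:

-- 1) source connect: ingest a database table into a Kafka topic
CREATE SOURCE CONNECTOR orders_source WITH (
  'connector.class' = 'io.confluent.connect.jdbc.JdbcSourceConnector',
  'connection.url' = 'jdbc:postgresql://db:5432/shop',
  'mode' = 'incrementing',
  'incrementing.column.name' = 'id',
  'topic.prefix' = 'src_'
);

-- 2) processor: the μ-query where the transformation happens
CREATE STREAM orders_src (id BIGINT, amount DOUBLE, status VARCHAR)
  WITH (KAFKA_TOPIC = 'src_orders', VALUE_FORMAT = 'JSON');

CREATE STREAM orders_valid AS
  SELECT id, amount
  FROM orders_src
  WHERE status = 'CONFIRMED'
  EMIT CHANGES;

-- 3) sink connect: serve the result to a downstream store
CREATE SINK CONNECTOR orders_sink WITH (
  'connector.class' = 'io.confluent.connect.elasticsearch.ElasticsearchSinkConnector',
  'topics' = 'ORDERS_VALID',
  'connection.url' = 'http://elastic:9200'
);

Many such triplets, wired output to input, then form the mesh.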

The principal difference between the Data Quantum from the Data Mesh principle and the μ-product lies in the scope of the two: the μ-product is much smaller and has a given, simple structure. Thus, a Data Quantum consists of a mesh of μ-products. Moreover, this mesh needs to be neither compact nor continuous; it can consist of one or more separate parts.

Let's return to Queries are passive and data are active. The triplets that μ-products are made of have a processor, where the change of information happens. These processors are now in a passive position: changing them in an arbitrary manner is no longer acceptable once they are part of a complex mesh structure.

It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.

Albert Einstein

But changes are inevitable:) A trivial change, where the output does not change in a way that affects dependent μ-products, doesn't require complex action. When a non-trivial change of a query is needed, we use the pattern of building a parallel μ-product and resolving all affected μ-products, as the sketch below shows.
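A hedged ksqlDB sketch of that pattern, with all names being assumptions: the old μ-product keeps running while the new one is built next to it, dependent μ-products are repointed one by one, and only then is the old one retired.

-- the existing μ-product keeps serving its consumers untouched:
-- CREATE STREAM user_pageviews_v1 AS
--   SELECT userid, pageid FROM pageviews EMIT CHANGES;

-- the non-trivial change ships as a new, parallel μ-product
CREATE STREAM user_pageviews_v2 AS
  SELECT userid,
         pageid,
         ROWTIME AS viewed_at  -- the new output column = the non-trivial change
  FROM pageviews
  EMIT CHANGES;

-- once no downstream μ-product reads v1 any more, retire it
DROP STREAM user_pageviews_v1;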

This seems to be the main tradeoff against dynamic macro-queries, as it looks like many little changes will have to be delivered.

It is fair to say that this is an absolutely valid point of view.

From my point of view, based on practical experience, this can be kept to a minimum by applying the four principles of Data Mesh:

  • Domain Ownership
  • Data as a product
  • Self-serve data platform
  • Federated computational governance

In fact, Data Mesh is the foundation/enabler for taking the next step on the way from ETL through λ and ϰ Architecture towards μ Architecture.

Multi Technology Stack vs. One Technology Rules Them All

The idea of the μ-product, once adopted, leads to initiatives to remove the friction caused by the technological heterogeneity of individual μ-products. Here, the solution is to build at least the core mesh in one technology. To be honest, data streaming solutions such as Apache Kafka are the best fit for this for now.

Creating a simple stream of users and their pageviews
Flow consisting of two μ-products, USER_PAGEVIEWS and USER_PAGEVIEWS_MU_PRODUCT. https://zz85.github.io/kafka-streams-viz/.

Let's note that in this kind of architecture, you can reuse components such as USER_PAGEVIEWS. One of my friends told me that this approach is similar to cascading subselects used to create views in a SQL DWH:

-- create the view for the Data Warehouse on top of cascading subselects
CREATE VIEW vw_product AS
WITH
-- use the t_pageviews table
t1 AS (
  SELECT tp.*
  FROM t_pageviews tp
  WHERE tp.is_deleted != 'True'
),

-- add t_users
t2 AS (
  SELECT t1.*, tu.*
  FROM t1
  JOIN t_users tu
    ON t1.user_Id = tu.Id
  WHERE t1.is_active = 'True'
),

-- apply some rule, for example on the Location column
t3 AS (
  SELECT t2.*,
    CASE WHEN tr.Location = 'US' THEN 'USA add'
         WHEN tr.Location = 'Slovakia' THEN 'Slovakia add'
         WHEN tr.Location = 'Sao Paulo' THEN 'Brazil add'
         ELSE NULL
    END AS "Location rule"
  FROM t2
  JOIN t_rules tr
    ON t2.rule_type_Id = tr.Id
  WHERE tr.is_valid = 'True'
),

-- filter rows that are relevant
t4 AS (
  SELECT t3.*
  FROM t3
  WHERE t3.is_relevant = 'True'
)
SELECT *
FROM t4;

Then, the queries in the diagram above, JOIN:PWT+UT and KSQL-RULE respectively, are passive in the sense that they are quite simple and they sit within the structure. The streams of new data are the active part of this architecture.
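For comparison, a hedged ksqlDB sketch of those two μ-products, assuming PAGEVIEWS is a stream and USERS a table keyed by userid (column names are assumptions):

-- μ-product 1 (JOIN:PWT+UT): enrich pageviews with user data
CREATE STREAM user_pageviews AS
  SELECT p.pageid,
         u.userid,
         u.location
  FROM pageviews p
  JOIN users u ON p.userid = u.userid
  EMIT CHANGES;

-- μ-product 2 (KSQL-RULE): reuse USER_PAGEVIEWS and apply the location rule
CREATE STREAM user_pageviews_mu_product AS
  SELECT pageid,
         userid,
         CASE WHEN location = 'US' THEN 'USA add'
              WHEN location = 'Slovakia' THEN 'Slovakia add'
              WHEN location = 'Sao Paulo' THEN 'Brazil add'
              ELSE NULL
         END AS location_rule
  FROM user_pageviews
  EMIT CHANGES;

USER_PAGEVIEWS here plays the role of the cascading subselects t1-t4 above, and any number of further μ-products can reuse it.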

Create products, don't slice them

As Lars Rönnbäck states in his famous article The Slayers of Layers (https://www.linkedin.com/pulse/slayers-layers-lars-r%C3%B6nnb%C3%A4ck/):

It is time to rethink the layer. When you see a layer, act with suspicion and question its existence. If nothing else, think about the costs involved having to maintain every additional layer. Above all, we need to go back to the root of the problem and create tech that needs few or no layers to be acceptable.

μ Architecture, as an architecture representation derived from Data Mesh by Zhamak Dehghani, is here to stop us from slicing architecture into layers.

Credit: https://www.freepik.com/free-vector/soccer-ball-grass-background_2875611.htm and https://martinfowler.com/articles/data-mesh-principles.html

Now we can create products as a topology representation consisting of data products.

This opens up a completely new perspective on time to market and brings the first truly data-driven approach.

If you are a CEO or a product manager, ask your architect whether it is still the best approach to slice your product by technology or logical parts. Have a fair talk about whether you really have to wait while your product is sliced into stripes, packed into a roadmap, and delivered in about a year.

The main power given to us by Zhamak Dehghani through Data Mesh is the freedom to create products in a meaningful way.

Conclusion

In this article, a new data architecture pattern is presented. I call it μ Architecture, and it has one main principle:

Queries are passive and data are active.

Its definition:

μ Architecture is a recursively repeating pattern of self-similar data micro-processors consisting of the triplet source connect, processor, and sink connect, which we will call data μ-products. These μ-products then create a mesh. This pattern allows us to spread the complexity of processing across the mesh through distribution and scalability.

Its foundation is Data Mesh by Zhamak Dehghani, as both enabler and sociotechnical framework. The second enabler is the application of Data as a Product in the form of repeating, self-similar μ-products organized to create a mesh.

Thanks for reading:)

Story of μ Architecture continues:

μ Architecture goes machine learning

μ Architecture offers Next Best Offer
