The Brutal Cost of Data Mesh

Hannes Rollin
11 min read · Jul 25, 2023

--

Beware of little expenses. A small leak will sink a great ship.

—Benjamin Franklin

Not made for burning (DALL·E)

I’ve written at length about the downsides of data mesh, when not to use data mesh, and about data mesh observability, but I have so far avoided diving into the most uncanny part of it: The cost of data mesh. This will be a self-contained post, but an important one. Indeed, I think it’s so important that I give you the bottom line up front: The initial conceptualization and implementation phases may look expensive, but what breaks your neck is the accumulating cost of maintenance, which will easily grow faster than your value creation and, at some point, surpass it. When your yearly cost outgrows your yearly revenue, you have your back against the wall.

At first glance, it seems to be different in government agencies, where revenue generation plays, as yet, at most a bit part. But think about it for a few seconds and you’ll see that the same forces are at work: If maintenance costs keep rising, you’ll eventually exceed your allocated budget, and then you’re in trouble. Something has to give.

But let me begin at the beginning.

Maintenance, the Ugly Sister of Development

The naive assumption is that the major part of any IT endeavor, be it in software, platform, or data product development, lies in the development phase, while the ongoing upkeep, maintenance, or operational costs are tiny in comparison. This intuition probably comes from our collective experience with our built environment, where—for a few decades at least—the maintenance cost of streets and houses is several orders of magnitude lower than the construction cost.

In the software and data realm, this isn’t so. It’s not just that development is comparatively cheap, since often little or no physical investment is necessary, a trend exacerbated by cloud computing, where you rent virtual machines instead of buying real ones; no, it’s furthermore and crucially that IT maintenance makes up the majority of all IT effort. And the trend is clear.

Maintenance costs as part of the total cost have been rising steadily (source)

There are numerous drivers behind this:

  1. Complexity of Systems: With time, software systems have become more complex and interconnected, involving more lines of code, modules, and dependencies. This has made maintenance tasks more challenging, requiring more time, effort, and resources.
  2. Legacy Systems: Companies often run older software systems, which can be costly to maintain. Over time, the original developers have often left the organization, the software’s documentation might not be up to date, and the programming languages, frameworks, and protocols used have, more often than not, become obsolete, making these systems more difficult and expensive to maintain.
  3. Frequent Updates: Software today needs frequent updates to address security vulnerabilities, fix bugs, and add new features to keep up with user demands and market competition. This constant cycle of updates and patches also contributes to the rising maintenance costs. Software is never really done. Until it’s dead.
  4. Integration and Compatibility: Software needs to be compatible with a variety of operating systems, platforms, and other software. As these external systems evolve, maintaining compatibility becomes a significant expense.
  5. Regulatory Compliance: Software in most sectors, not just in healthcare and finance, must comply with various regulatory standards that are themselves maturing and proliferating. Keeping software compliant with these ever-changing regulations involves ongoing maintenance costs.
  6. User Support: Providing support to users, addressing their concerns, and solving their problems is a crucial part of software maintenance. As the user base grows, these costs increase significantly.
  7. Technical Debt: Unfortunately, way too often, in order to meet deadlines or due to other constraints, developers might take shortcuts that lead to “technical debt” like unwritten documentation, insufficient testing, insecure shortcuts, etc. Over time, these issues must be addressed, contributing to maintenance costs. Otherwise, technical debt has the habit of striking from behind by way of sudden outages, embarrassing security leaks, and unnecessarily prolonged maintenance efforts.

My central hypothesis is this: Data product development, which is at the heart of data mesh, inherits not just the good parts of modern software like DevOps, decentralized architectures, and a variety of helpful tools and technologies but also the bad parts like bugs, depreciation, and—a growing maintenance burden. And it’s not even that maintenance effort is roughly constant, which would make it rather easy to cost-control your data mesh. Maintenance is convex; it bounces back, and this makes maintenance effort antifragile—the more different things you try and the more stress and energy you put into the system, the more your maintenance cost mounts up. And this holds for all systems, not just those built for continuous growth.

Total maintenance load is convex and, at some point, goes up (source)

Total Operational Cost of Data Mesh

While maintenance is “just” about the human effort needed to keep a system running, operational cost counts everything that is billed to your data mesh. A data mesh, to recap quickly, is a distributed data architecture that decentralizes the control and ownership of analytical or research data, distributing it across multiple autonomous teams that each handle a different domain of the data, tied together by a central data platform and distributed governance. This approach has, ahem, unique implications for operational costs. Here’s a rough list:

  • Hardware Costs: In a data mesh, each data domain might run on separate hardware, potentially increasing overall hardware costs but also allowing for more efficient scaling.
  • Software Licenses: Data domains might use different software stacks, which could affect licensing costs.
  • Cloud Service Fees: Data mesh architectures often leverage cloud-native technologies, which may shift costs from hardware acquisition to cloud service fees. It’s a difficult bet whether to opt for owned hardware, which is amortized quickly but suffers from rising energy prices, or for cloud services, which come without initial expenses but can be subject to unexpected and extravagant price rises.
Azure isn’t better (source)
  • Personnel Salaries: Instead of a centralized team, you have multiple cross-functional teams responsible for different data domains. You need data product owners, data product developers, data product testers, etc. This could lead to increased salary costs due to the need for more diverse skill sets across teams.
  • Electricity Costs: This can be affected by the specifics of your implementation — for instance, if you’re leveraging cloud services, this might be bundled into those costs. But pay you will.
  • Networking Costs: With data distributed across different domains, there may be increased network traffic and, therefore, higher networking costs.
  • Security Costs: Security becomes more complex in a data mesh, as each domain needs to handle its own security measures. This could increase overall security costs.
  • Training Costs: With the data mesh approach, training costs might increase due to the need for domain-specific knowledge and skills.
  • Depreciation and Amortization: This might be complex to calculate due to the distributed nature of assets across different data domains.
  • Technical Support Costs: Support might be more complex to provide in a data mesh due to the distribution of responsibilities across different domains. A superb self-serve data portal might look like it can do without support, but that is wishful thinking.
  • Communication Overhead: In a data mesh, communication overhead might increase due to the need for coordination between multiple autonomous teams. This overhead might come in the form of time spent in meetings, the use of collaboration tools, or the development of processes and protocols to ensure effective communication.
  • Data Governance: In a data mesh, each team is responsible for the governance of their own data, including data quality, metadata management, and ensuring regulatory compliance. The decentralized nature of a mesh could increase these costs due to the need for domain-specific governance strategies.
  • Integration and Interoperability: The distributed nature of a data mesh means that you might need to invest more in making sure different parts of the system can effectively share and integrate data. This involves costs related to APIs, data standards, and the integration platform.
  • Decentralized Monitoring and Alerting: Each team within the data mesh needs to monitor their own services and have an alerting mechanism in place to handle incidents. This requires sophisticated monitoring and alerting tools and a superior observability strategy, adding to the operational cost.
  • Domain-specific Tooling: Since each team owns their data domain and operates independently, they may require domain-specific tools for data processing, storage, analysis, and more. This increases the overall cost.
  • Data Discovery and Cataloging: Given that data is distributed across many domains, effective data discovery and cataloging tools are essential. There may be costs associated with implementing and maintaining these tools.
  • Redundancy Costs: To ensure high availability and disaster recovery, some degree of redundancy is needed, which can increase storage costs. And it’s not just redundancy and storage: If you want to go full antifragile, it gets expensive.
  • Infrastructure Management: The complexity of managing a distributed infrastructure leads to increased costs, both in terms of necessary tooling and the manpower required.
  • Data Privacy and Compliance: Compliance with data privacy laws can be more complex and potentially costly in a distributed system, as each domain must ensure it is handling data in a manner consistent with these laws.
  • Vendor Management: If your data domains rely on external services or vendors, the complexity and costs of managing these relationships increase.
  • Cross-domain Coordination: While not a direct cost, the effort and resources required for coordinating actions across domains (like schema changes, system upgrades, or security practices) increase the operational expenditure.
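
To make these line items a little more concrete, here is a purely illustrative sketch of what a single data product in one domain might have to declare (the field names are my invention, not part of any data mesh standard); every field is something the owning team must keep accurate, secure, and compliant for as long as the product lives:

# Hypothetical descriptor of one data product (illustrative only)
checkout_events = {
    "domain": "e-commerce/checkout",
    "owner_team": "checkout-data",                          # salaries, training, support
    "output_ports": ["s3://lake/checkout/events/v3/"],      # consumers depend on this path
    "schema_version": "3.2.0",                              # breaking changes need cross-domain coordination
    "sla": {"freshness_hours": 24, "availability": 0.99},   # monitoring and alerting
    "contains_pii": True,                                    # privacy and compliance obligations
    "retention_days": 365,                                   # storage and redundancy
}

Multiply this by the number of data products and domains, and most of the list above reappears as a recurring line item.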

Still with me? Then you’re brave, and as Taleb rightly remarked, courage is the only virtue that can’t be faked.

You can see that most of these cost items grow linearly with the mesh (like software licensing and personnel salaries), while others grow polynomially due to negative network effects (like networking costs and communication overhead). If you know your complexity theory, you remember that no matter the constants, when you add functions, the term with the highest order of growth always wins eventually.
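
To see this in numbers, here is a throwaway sketch with made-up constants: a steep linear cost (think per-seat licenses) against a tiny quadratic one (think pairwise coordination).

def linear_cost(n):
    return 1000 * n        # expensive but linear, e.g., licenses and salaries

def quadratic_cost(n):
    return 0.01 * n ** 2   # cheap at first but quadratic, e.g., pairwise coordination

# whatever the constants, the higher-order term takes over eventually
crossover = next(n for n in range(1, 10**7) if quadratic_cost(n) > linear_cost(n))
print(crossover)  # 100001 -- from here on, the "cheap" quadratic term dominates

Tweaking the constants only moves the crossover point; it never removes it.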

Simulating Cost and Value in a Data Mesh

Here’s a simple way to simulate cost and value in a data mesh using System Dynamics, which involves constructing causal loop and stock-and-flow diagrams to model the interactions in the system.

First, we define our state variables:

  1. P (Participants): The number of participants in the data mesh.
  2. D (Data Products): The number of data products in the data mesh.
  3. V (Value): The total value of all products in the data mesh.
  4. C (Cost): The total operational cost of maintaining the data mesh.

Next, we tentatively identify the relationships between these variables:

  1. Participant Growth: As P grows linearly over time, it’s likely that more data products are created, increasing D. This can be represented as a positive feedback loop.
  2. Data Product Growth: As D grows, the complexity of the system may increase, leading to higher operational costs. This is another positive feedback loop.
  3. Value Decline: Over time, the value of individual products decreases, similar to radioactive decay. This would be a negative feedback loop.
  4. Cost Increase: As P and D grow, the total operational cost C might increase more than proportionally due to increased complexity, increased maintenance costs, increased communication overhead, etc. This can be viewed as a super-linear growth.

With this model, we can then simulate the data mesh system dynamics using differential equations to represent the relationships between the variables.

  1. dP/dt = k1, where k1 is the linear rate of increase of participants.
  2. dD/dt = k2 * P, where k2 is a constant representing how much the product count grows per participant.
  3. dv_i/dt = -k3 * v_i, where k3 is a constant representing the rate of value decay of individual data products. As a modeling approximation, every new data product provides the same fixed initial value, which then decays over time. The total value V is simply the sum of all individual product values v_i.
  4. dC/dt = k4 * P + k5 * D + k6 * P * D, where k4, k5, and k6 are constants representing the cost growth due to the number of participants (e.g., salaries), the number of products (e.g., cloud fees), and the interaction between them (negative network effects), respectively. Interactions are evilly expensive.
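
A quick back-of-the-envelope on why the last equation is the dangerous one: with P growing like k1 * t and D, in the discrete simulation below, simply tracking k2 * P, the interaction term k6 * P * D grows quadratically in t, so C grows roughly like t^3 (even t^4 if you take dD/dt = k2 * P literally), while total value, as we will see, levels off. No choice of constants changes that ordering; the constants only decide when the curves cross.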

And the code. Feel free to play with the constants, but please note that due to the list-keeping of individual values, the code is rather inefficient for large T or small dt. Let me know what you find out.

import matplotlib.pyplot as plt
import numpy as np


k1 = 0.5 # rate of increase of participants
k2 = 0.4 # rate of increase of product count per participant
k3 = 0.03 # rate of value decay of a single product
k4 = 0.3 # cost growth due to the number of participants
k5 = 0.05 # cost growth due to the number of products
k6 = 0.01 # cost growth due to interaction between participants and products

initial_value = 50

P = 1 # initial number of participants
D = 1 # initial number of products
product_values = [initial_value] # initial value of the first product
C = 1 # initial total cost

# Euler's method: start, stop, step
t = 0
T = 100
dt = 0.01

time = [t]
participants = [P]
products = [D]
value = [sum(product_values)]
cost = [C]

while t < T:

    # continuous rates (Euler's method)
    dP = k1
    dC = k4 * P + k5 * D + k6 * P * D

    P += dP * dt

    # add a new product whenever the count falls behind k2 * P (products are discrete)
    if D < int(P * k2):
        D += 1
        product_values.append(initial_value)  # every new product starts at initial_value

    # decay each individual product value
    product_values = [v - k3 * v * dt for v in product_values]

    C += dC * dt
    t += dt

    # record the time series for plotting
    time.append(t)
    participants.append(P)
    products.append(D)
    value.append(sum(product_values))
    cost.append(C)


plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
plt.plot(time, participants)
plt.title('Participants')

plt.subplot(2, 2, 2)
plt.plot(time, products)
plt.title('Products')

plt.subplot(2, 2, 3)
plt.plot(time, value, color='black')
plt.title('Value')

plt.subplot(2, 2, 4)
plt.plot(time, cost, color='red')
plt.title('Cost')

plt.tight_layout()
plt.show()

Here’s the output:

Total cost beats total value—eventually

Even if we assume a linearly increasing number of participants and data products, which keeps producing new value, individual data products depreciate, which imposes an upper limit on total value, while negative network effects lead to superlinear cost growth. We’re in trouble.
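
For the curious, the plateau can be estimated directly: in the simulation, new products appear at the roughly constant rate k1 * k2 = 0.2 per time unit (D simply tracks k2 * P, and P grows by k1 per unit of time), and each product’s value decays exponentially at rate k3. Summing the surviving value over all past products gives a ceiling of roughly k1 * k2 * initial_value / k3 = 0.5 * 0.4 * 50 / 0.03 ≈ 333, which is the level the value curve approaches while the cost curve keeps accelerating.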

How to Solve It

As far as I can see, there are just three solid ways to handle this problem:

  1. Cover up—the Government Way: Accept the situation and save your data mesh by allocating budgets from other, more profitable parts of your enterprise or by demanding an ever-growing cut from taxpayer money.
  2. Find a Cash Cow—the Google Way: Maybe you hit gold and find one or a few data products that produce so much value and depreciate so slowly that you can continue to finance your otherwise loss-making data mesh.
  3. Set Limits—the Hard Way: Given tight budgets that can be expanded linearly at best, and given the absence of fulminant cash cows, you have only one option left: You must strictly limit the size of your data mesh. This entails imposing hard boundaries on the number of domains, the number of data products, the total data volume; you name it. If you can count it, you must limit it. While this isn’t easy, for most data mesh implementations, this might well be the only way of ensuring survival. Welcome to the age of limits.


Hannes Rollin

Trained mathematician, renegade coder, eclectic philosopher, recreational social critic, and rugged enterprise architect.