Modern Data Stack Discontent (Part 1)

Jon Osborn
7 min read · Jan 3, 2023

Something about the way we handle data has been gnawing at me for a long time. I’ve been trying to reconcile the (so-called) Modern Data Stack with the reality of increasing costs and complexity. I’m a fan of tying investment to outcomes, so I’m also wondering whether the business results really justify what’s being put in. Are you worried about that too?

I’m defining the modern data stack as an assembly of separate cloud-native tools that move data from ingestion to data warehouse. For example, it is quite common to have separate tools for ingestion (ETL/ELT), orchestration, and pipelines. On its own, each of these tools is very good at what it does, and when wired together, they can build some amazing things. We like these tools so much, in fact, that we tend to ignore (or not account for) the cost of integrating and maintaining them. A low-friction, siloed organizational structure inevitably follows, despite our best intent. As day 1 turns into day 2 and data scales up (in both volume and compute demand), all the little bits of glue and tape add up to the underwhelming delivery of a backlog that just keeps getting longer.

Is this your experience?

If this is the modern data stack experience, I’m looking forward to something more. Cloud technology is moving so quickly that game-changing tools are popping up, seemingly, every week. I need a stack that better accommodates the world we need to work with, not the world we’ve built on or own. I’m not sure about you, but I’m ready for a fresh approach.

Complex, aging, costly-to-fix data pipelines

Looking Back

Sometimes, it’s helpful to see where we’ve been so we can contextualize where we’d like to head.

Architecturally, we make good decisions with the information we have at the time we have to make them. There’s no crystal ball, so issues will certainly pop up that get corrected along the way. Over time, the architecture settles down, the organization aligns (or re-aligns), and the budget stabilizes. From a strategic perspective, we’re all looking out for changes or trends that we might need to react to, but for the most part, barring major business requests, we’re in optimization mode. Swapping one ingestion or orchestration tool for another has no compelling business benefit. If you have a modern data stack, things seem good.

Yet, trouble is brewing. Not with the individual tools…they’re very good at what they do. You picked them because they were a good fit. Maybe your business has some market-driven compelling event, and the year-long project to meet the need is causing frustration. There’s no realistic way to cut the schedule down to a couple of months. If you’re a strategic leader and the compelling event hasn’t happened yet, what options do you have to avoid that frustration?

Unfortunately, it is clear that our options are limited. Over our organization’s project history, we’ve built and optimized governance, DevOps, and CI/CD pipelines around separate tools that each require individual skills to operate. Some tools are cloud SaaS products while others operate within our network. Conway’s Law forces the organization into silos. Adding people and more meetings (even ‘effective’ ones) doesn’t help. Cross-training skills with the hope of delivering a matrixed organization is a pipe dream. No amount of optimization can remove the need for the silos to work together (communicate) to deliver even the simplest data product. The few people who understand the complexity and can (magically) deliver the most important changes can’t scale beyond a couple of projects a year.

Now what?

Starting from Scratch

Let’s set aside where our organization and technology sit and take a theoretical approach that might bear some fruit. What would a data stack look like if we started with current cloud technology and laid out an architecture from scratch? Is there a way to use a different mindset and leverage current technology to yield a data stack with more options?

I’m going to refer to the new approach as the ‘Postmodern Data Stack’. On a funny note, I used the term the other day and someone commented that maybe getting “artsy” with data might not be the right approach. I’m not sure about that, but I do think Webster’s definition of postmodern is spot-on for this topic. I hope you agree that elegant, simple solutions to complex problems tend to last a long time and are quite beautiful.

Certainly, the challenges we laid out deserve individual treatment, but we’re after a new approach with measurable outcomes. We’re looking for (what feels like) a radical new approach focused on delivering value, not just an incremental improvement on what we already have. Rather than talking about solving individual problems with this or that tool, I’m going to focus on what we want our world to actually be.

Outcomes

Every business has its own idea of what a positive outcome might be. Whether your outcomes are centered on spend, speed, or growth, a new approach should feel different to the whole organization. If it doesn’t, you are likely still working in an incremental way.

I think it would be fair to call our solution ‘postmodern’ (transformational) if these were some of the major outcomes:

  • Consumers of data are biased toward action (actually doing work) vs. order/ticket creation
  • Training time for new users (especially non-engineers) is short
  • Workloads are directed to the right platform for performance and cost management
  • New database or compute technology is easy to integrate with or migrate to
  • The user community grows (data democratization actually happens)
  • Moving to the new platform provides incremental value during the transition (vs. a large one-time migration cost/risk)

I believe focusing on outcomes is a solid approach, but we need measurable metrics to guide us as the journey begins. We can’t improve what we can’t measure (Peter Drucker). You may have some specific measurements in mind based on your industry vertical, your experience, or other context you’ve accumulated in your career. I have my own specific measures for healthcare and commercial insurance.

Ironically, if measuring your modern data stack is difficult, I’m sure a postmodern stack would help with that.

Measurements

We can control what we measure, so let’s try to be as broad as we can while still being useful:

Time to Value (or Availability)

This is a measurement of two very real data scenarios:

  1. How much time does it take for a new data source to drive an outcome in production?
  2. How much time does it take for data to move from source (loosely, ingestion) to a spot where it is available for production consumption?

These scenarios speak directly to efficiency, automation, and observability. If you can’t measure this directly, a surrogate measurement is useful and relatively easy to produce once you work out how to bucket and aggregate your performance data. Do we need the mythical “real-time” reporting? Likely not, but faster is better, and you probably want to know whether your system is slowing down month to month or quarter to quarter.
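As a minimal sketch of that surrogate measurement, assuming you keep (or can reconstruct) per-dataset timestamps for when data first landed and when it became available for production use, you could bucket time-to-value by month and watch the trend. The record layout and field names here are illustrative assumptions, not a prescribed schema:

```python
from datetime import datetime
from statistics import median

# Hypothetical event records: when a dataset first landed and when it
# became available for production consumption. Field names are assumptions.
events = [
    {"dataset": "claims",  "ingested": datetime(2022, 11, 1), "published": datetime(2022, 11, 18)},
    {"dataset": "members", "ingested": datetime(2022, 11, 7), "published": datetime(2022, 12, 2)},
    {"dataset": "rates",   "ingested": datetime(2022, 12, 5), "published": datetime(2022, 12, 9)},
]

# Time to value per dataset, in days.
ttv_days = [(e["published"] - e["ingested"]).days for e in events]

# Bucket by publish month so the trend is visible quarter to quarter.
by_month = {}
for e, days in zip(events, ttv_days):
    by_month.setdefault(e["published"].strftime("%Y-%m"), []).append(days)

for month, values in sorted(by_month.items()):
    print(month, "median TTV (days):", median(values))
```

Even a crude bucketing like this answers the question that matters: is the number going down, flat, or creeping up?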

Time to Recovery

No data or pipeline is perfect or 100% reliable, so when bad things happen, we want to recover as quickly as possible. In the best case, we fix it before the customer even knows what happened. Everyone in IT dreads the call from the business partner who’s about to tell you what you should already know. Improving time to recovery requires identifying the source of a problem quickly, correcting the issue, and being able to resume processing from the point of failure.
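A rough sketch of the measurement itself, assuming an incident log with detection and resumption timestamps (the field names are assumptions):

```python
from datetime import datetime, timedelta

# Hypothetical incident log: when a failure was detected and when
# processing resumed from the point of failure.
incidents = [
    {"detected": datetime(2022, 12, 3, 2, 10),  "resumed": datetime(2022, 12, 3, 6, 45)},
    {"detected": datetime(2022, 12, 19, 14, 0), "resumed": datetime(2022, 12, 19, 15, 30)},
]

durations = [i["resumed"] - i["detected"] for i in incidents]
mttr = sum(durations, timedelta()) / len(durations)  # mean time to recovery
print("MTTR:", mttr, "worst case:", max(durations))
```

Tracking the worst case alongside the mean is worth it; one ugly multi-day recovery can hide behind a comfortable average.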

Cost to Scale

What would it cost if your business had to start consuming 10x more data? How long would it take to prepare the systems for this impact? How many people would need to be hired? Do you need to do some significant architecture work to survive?

The Postmodern stack should have a very flat cost curve. Adding significant volume to the data you already have will surely increase storage and compute costs, but connecting new data shouldn’t trigger hiring or additional licensing costs.
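A back-of-the-envelope model makes the distinction concrete. Every unit cost below is a made-up illustration, not a benchmark; the point is which terms grow when the data grows 10x:

```python
# Toy annual cost model: all unit costs are made-up illustrations.
def annual_cost(tb_stored, n_sources, engineers):
    storage = tb_stored * 276        # assumed ~$23/TB/month object storage
    licenses = n_sources * 12_000    # assumed per-connector licensing
    people = engineers * 150_000     # assumed fully loaded headcount
    return storage + licenses + people

today = annual_cost(tb_stored=50, n_sources=20, engineers=4)

# Flat cost curve: 10x the data, same sources, same team.
flat_10x = annual_cost(tb_stored=500, n_sources=20, engineers=4)

# Steep cost curve: scale forces more connectors and more people.
steep_10x = annual_cost(tb_stored=500, n_sources=60, engineers=10)

print(f"today: ${today:,}  flat 10x: ${flat_10x:,}  steep 10x: ${steep_10x:,}")
```

In the flat case, the only term that moves is storage; in the steep case, licensing and headcount dominate, which is exactly the curve we want to avoid.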

Skillset Participation

Countless references cite siloing as a leading reason organizations are slow to deliver. Siloing is unavoidable in the modern data stack (are we then destined to be slow?). An un-siloed, or fungible, data stack would reduce the number of required skills and present self-evident data models that reduce the communication overhead between teams. It would meet the needs of data engineers, analysts, and data scientists with a single platform. Let’s measure how many people are using the stack and what they are doing with it.
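One hedged sketch of that measurement, assuming you can pull (user, role) pairs from the platform’s query or job logs (the roles and log shape are assumptions):

```python
from collections import Counter

# Hypothetical activity rows pulled from query/job logs: (user, role).
activity = [
    ("alice", "data engineer"), ("bob", "analyst"), ("bob", "analyst"),
    ("carol", "data scientist"), ("dave", "analyst"),
]

active_users = {user for user, _ in activity}
users_by_role = Counter(role for _, role in set(activity))  # distinct users per role

print("active users:", len(active_users))
print("participation by role:", dict(users_by_role))
```

If participation grows beyond the engineering silo quarter over quarter, data democratization is actually happening; if it doesn’t, the walls are still standing.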

Data Quality Sampling Percentage

We all suffer from a lack of testing and quality measurement in the data space. For those who have managed some sort of solution, it is likely either brittle (expensive to maintain) or so highly customized that moving it from one compute and storage platform to another is difficult or impossible. Let’s hold the stack accountable for helping us by measuring end-to-end (loosely, ingest to delivery) data quality sampling coverage.
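A minimal sketch of the coverage metric, assuming per-stage counters for rows processed and rows covered by a quality check (stage names and counts are illustrative):

```python
# Hypothetical per-stage counters: rows processed vs. rows covered by
# a quality check. Stage names and counts are illustrative.
stages = {
    "ingest":    {"rows": 1_000_000, "sampled": 50_000},
    "transform": {"rows":   950_000, "sampled": 10_000},
    "delivery":  {"rows":   900_000, "sampled": 90_000},
}

for stage, c in stages.items():
    print(f"{stage}: {c['sampled'] / c['rows']:.1%} sampling coverage")
```

A per-stage view like this also exposes where coverage quietly drops to near zero in the middle of the pipeline.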

Other Measurements

There are a whole host of other measures we could dream up and add to the list. From a strategic point of view, however, I’d like to keep the list short and focused on the stuff we really want to change. Performance, cost, etc., are all things we can optimize later.

Envisioning a Postmodern Architecture

Given the above outcomes and measures, what realistic architectures can we propose that will have a positive impact? What does the architecture need to do? What does Postmodern look like?

Can we migrate to postmodern architecture without blowing up our world? Will it be worth it?

What are your ideas?

I’ll dive into some of my thoughts in Part 2.

Jon Osborn

Field CTO | Cloud Executive | Data Professional | Writer, Golfer, Hiker