Domain Driven Design in Data Engineering (Part 1 of 3)

Daniel Somerfield
4 min read · Nov 9, 2023

--

A journey of application, adaptation, and invention

This is one of three posts originally published in 2019 as a companion piece to a talk I gave at Explore DDD. I am migrating that old content to this new account and Medium doesn’t provide any way I can find to back-date. Just know the thinking is a bit older, or “aged” as I like to think of it.

At his keynote at Explore DDD 2018, Eric Evans stated that Domain Driven Design needed to evolve. While there have been successes, he acknowledged that some organizations have struggled to apply the practice successfully. He challenged the community to find ways to refine the training, the application, and even the core principles of DDD. With that in mind, I would like to share some observations about my experience trying to bring the often metaphor-driven approach of DDD into the often very literal world of Data Engineering.

First, I should define Data Engineering for the purposes of this article. The definition is not meant to be exhaustive or definitive, but rather to pre-empt any confusion about the nature of the domains to which I am referring.

Data Engineering in the context of this discussion is the process of building systems with most or all of the following characteristics:

  • Ingestion of data at a very high rate and/or volume, whether via batch or streaming mechanisms. Think gigabytes per minute.
  • Storage of very large volumes of data: terabytes or petabytes on disk
  • Access to the data by very “wide” concurrent query patterns across many instances
  • Transformation and/or decoration on ingest, on query, or often on both
  • Data analytics across very large data sets, involving evaluation and aggregation of data both at rest and in transit

These characteristics create a series of common follow-on characteristics:

  • Use of large amounts of hardware, whether physical or virtualized
  • Highly distributed and concurrent systems whose behavior is widely variable based on scale and workload shape
  • The need to be aware of data locality, demanding solutions that move processing logic to data, rather than the inverse
  • The need to be aware of the location of data relative to other data, requiring ingest and storage pipelines that are informed by the needs of query and evaluation requirements
  • An urgent need for horizontal scale, often through dramatic peaks and troughs in ingest and query traffic
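The data-locality point above can be made concrete with a toy sketch. This is purely illustrative (the “nodes” here are just dictionary keys, not real machines, and all names are hypothetical), but it shows why ingest and storage layout matter so much: a partition-local computation ships only a handful of partial results instead of every row.

```python
# Toy contrast between "move the data to the logic" and
# "move the logic to the data" (the locality principle above).

partitions = {
    "node-a": [3, 9, 4],  # rows resident on node A
    "node-b": [7, 1, 8],  # rows resident on node B
}

def centralized_max(parts):
    """Naive: pull every record to one place, then compute."""
    all_rows = [row for rows in parts.values() for row in rows]  # full transfer
    return max(all_rows)

def distributed_max(parts):
    """Locality-aware: compute a partial result where each partition
    lives, then ship only the small partials for a final merge."""
    partials = {node: max(rows) for node, rows in parts.items()}
    return max(partials.values())

print(centralized_max(partitions))  # 9
print(distributed_max(partitions))  # 9 — same answer, far less data moved
```

Real engines (MapReduce, Spark, and their descendants) apply the same decomposition at scale, which is exactly why the shape of queries has to inform how data is partitioned on ingest.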

These dynamics are all highly technical and, I would argue, outside the conventional realm of DDD focus. They would generally be thought of as implementation details meant to realize the model, as opposed to core to the model itself. Of course, this point of view has practical limitations. Software development is a complex feedback loop in which the model and the technologies that implement it are in constant tension, driving domain-centric engineers to make constant compromises that balance the need for coherent metaphor against the constraints of the technology, or for that matter, physics. Highly specialized technical needs, such as extremely high-volume ingest or very low latency, impose limitations on the kinds of models we allow ourselves to create. Or, put somewhat differently, we can create any kind of model we desire, but we are then limited in how it can be bound to the systems it is meant to represent.

When I described Data Engineering as literal above, this is what I meant. Engineers in this field are reluctant to compromise much for the sake of comprehensibility when there is so much pressure simply to make these systems sufficiently reliable and performant.

Last year, Eric outlined several reasons why he believes DDD has sometimes achieved disappointing results, among them:

  • inhospitable organizational culture
  • weak practitioner skill
  • weakness of DDD techniques

I would add a few additional possible reasons why DDD has not taken the Data Engineering world by storm:

  • not necessarily weak, but ill-fitting techniques and models for data engineering problems, perhaps due to what Martin Fowler calls “anemic” domains, in which DDD’s value is hard to justify; or possibly because the DDD community has simply been focused elsewhere
  • immaturity or, if you prefer, dynamism within Data Engineering, which means the patterns and practices are themselves highly unstable, making it difficult to apply DDD to targets that won’t stand still. Arguably a version of “inhospitable environments.”
  • public relations: the DDD community has failed to articulate the value proposition of the practice well enough to this audience, and busy developers and architects dealing with challenging engineering problems simply aren’t convinced DDD techniques are worth the time to learn.

There are kernels of truth in all of these statements, but as you might have guessed by now, if I believed that DDD didn’t have a place in Data Engineering, I wouldn’t be taking the time to write this article. On the contrary, I am convinced that there is much to be gained from an acute focus on domain, particularly in systems with high-throughput data ingest and query. This kind of technical challenge drives complexity into the heart of our software systems, and that complexity needs to be tackled. For that to occur, we will need to sharpen the tools for the job and improve our public relations with this audience.

In the pages that follow, I will describe my experiences trying to apply DDD thinking in a data engineering-focused domain, what has been effective and where there has been resistance, what experiments I have run and what I have learned. I will discuss my working theories about how DDD might evolve to be more applicable to these types of domains and discuss some mental models and heuristics for applying domain thinking to them right now.

--
