How To Ensure Your Data Architecture Can Evolve

Why engineering with Lego is so much easier than with data.

Carl Follows
Version 1
6 min read · Jan 9, 2024


Data platforms represent multi-year investments, expected to provide insights for over a decade. But business and technology change much more rapidly, so ensuring a platform’s ability to evolve can make the difference in realising a return on your investment.

A data platform comprises many data movement and transformation components. What patterns enable incremental changes whilst preventing chaotic breakages afterwards?

Building in Lego is easy

So much so that any kid can do it, even some adults.

It certainly helps that it usually comes with a clear picture of what you’re trying to achieve, along with instructions for all the steps to get there. But even without such instructions, we can all create something. When starting, we typically wouldn’t have a full idea of what we’re creating, and we’re bound to change our minds regularly once we see how the project is coming together. The reason it’s so easy to evolve our design with Lego is the simple, standard interface between the pieces, which enables us to swap them out for a better option when we see a possibility and our vision shifts.


Engineering Data is less easy

Whilst there may be a vision at the start of a data project, it’s typically a lot more vague than at the start of a Lego build and any instructions received are certainly more abstract. But we can make our lives easier by ensuring interfaces are clear and simple.

Key to building a successful data platform is dealing with the constant change of vision, data sources and technology.

Ensuring that a data platform can evolve as its requirements change requires, like any good software development, the encapsulation of complex functionality into components. As with Lego, each component may need to be replaced, but that’s OK because it connects through a standard interface.

Pattern 1: Data Contracts

In software engineering, a function’s signature is the contract for how to invoke the code and what it’s expected to return on execution.
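For illustration, a type-hinted function makes that contract explicit: callers depend only on the signature, never on the implementation. This is a minimal sketch; the function and its parameters are hypothetical.

```python
# Hypothetical example: the signature (names, types, return value) is the
# contract. The body can be rewritten freely as long as the signature holds.
def convert_currency(amount: float, rate: float) -> float:
    """Return the amount converted at the given exchange rate."""
    return round(amount * rate, 2)
```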

In data engineering, a pipeline is a series of transformations, each one manipulating the data produced by earlier steps. The data contract is the definition of the data state required by and produced by each of these transformations, effectively the interface definition.

  • What attributes does the data have?
  • What’s its granularity (aggregated or event level)?
  • Are there requirements on allowable attribute values: existence of NULLs, length of strings, numeric scale and precision?
  • How can a record be uniquely identified?
  • How frequently is it refreshed?
  • What volume of data is expected?
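As a minimal sketch of how such a contract might be written down, the definition below captures these properties as a small, version-controlled Python object that both the producing and consuming transformations can reference. All table, attribute and cadence values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataContract:
    """Explicit statement of the data state a transformation expects or produces."""
    name: str
    columns: dict        # attribute name -> expected type
    granularity: str     # e.g. "one row per sales order line"
    primary_key: list    # attributes that uniquely identify a record
    nullable: set = field(default_factory=set)   # attributes allowed to be NULL
    refresh: str = "daily"                        # expected refresh cadence
    expected_rows: range = range(0, 10_000_000)   # rough volume expectation


# Hypothetical contract for the output of a "cleanse sales" transformation.
CLEANSED_SALES = DataContract(
    name="cleansed.sales",
    columns={"order_id": "string", "order_date": "date", "amount": "decimal(18,2)"},
    granularity="one row per sales order line",
    primary_key=["order_id"],
    nullable={"amount"},
    refresh="daily",
    expected_rows=range(1_000, 5_000_000),
)
```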

Whether you realise it or not, every transformation will contain implicit assumptions about these data contracts. Explicitly documenting them ensures that when a transformation is revised the interface remains the same, or, if it doesn’t, that an informed impact assessment can be carried out.

Where data is made available to external parties through a published interface, everybody expects to agree on a data contract upfront. Treating internal interfaces in the same way makes the pipelines more robust and allows for demarcation between teams, introducing the ability to scale.

Breaking a data pipeline into discrete, encapsulated transformations, each with an input and output data contract, helps us build robust pipelines that can be executed in multiple permutations.
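One hypothetical way to make those interfaces checkable, independent of any particular orchestration tool, is to have each step declare the contract it consumes and produces, so any re-ordering or substitution of steps can be validated before the pipeline runs.

```python
# Hypothetical sketch: each transformation declares the contracts it consumes
# and produces, so steps can be recombined as long as adjacent contracts match.
def check_chain(steps):
    for upstream, downstream in zip(steps, steps[1:]):
        assert upstream["produces"] == downstream["consumes"], (
            f"{upstream['name']} output does not satisfy {downstream['name']} input"
        )


pipeline = [
    {"name": "cleanse", "consumes": "raw.sales",      "produces": "cleansed.sales"},
    {"name": "enrich",  "consumes": "cleansed.sales", "produces": "enriched.sales"},
    {"name": "model",   "consumes": "enriched.sales", "produces": "modelled.sales"},
]
check_chain(pipeline)  # raises if any interface in the chain is broken
```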

Pattern 2: Abstract through Views

When data warehouses were built in relational SQL databases (many still are), creating a suite of views was a natural opportunity to build an interface between the warehouse and the reporting layer. Admittedly, there was a tendency for engineers to incur technical debt by adding last-minute transformations and layering views on top of views. But when used purely as an interface, with the WITH SCHEMABINDING option, views prevented accidental changes to the schema of the underlying tables and enabled many aspects of a data contract to be physically implemented in a pipeline.

Although data platforms are no longer typically built on relational databases, many lakehouse technologies support the use of views. Reading from views between major stages of a pipeline, such as between the zones of the lake in a medallion architecture, abstracts complex transformations from their successors and allows earlier transformations to be reworked without breaking the interface.
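A minimal sketch of that pattern on a Spark-based lakehouse follows; the database, table and column names are assumptions rather than anything prescribed by a specific platform.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical names: the view is the published interface, so the table behind
# it can be reworked or replaced without breaking downstream readers, provided
# the view keeps the same shape.
spark.sql("""
    CREATE OR REPLACE VIEW cleansed.sales_v AS
    SELECT order_id,
           CAST(order_date AS DATE)          AS order_date,
           CAST(amount AS DECIMAL(18, 2))    AS amount
    FROM   cleansed.sales_raw_parsed
""")

# Downstream transformations read only from the view, never the table.
df = spark.table("cleansed.sales_v")
```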

Pattern 3: Separate Foundations & Interpretations

Understanding which elements of the solution are subject to the most change enables you to separate them from the complex but stable engineering. Typically this means separating the ingestion and cleansing of data from the modelling that enables business users to monitor key metrics.

In modern platforms, this is often implemented by splitting the data lake into separate zones, conceptually thought of as medallion or bronze > silver > gold. However, more than three zones are often required and the layout may evolve, so use physical names that describe the role each zone plays, such as raw > cleansed > enriched > modelled.
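As a small illustration, the zone layout can live in one place as role-based paths rather than medallion colours; the storage account and container names below are placeholders.

```python
# Hypothetical zone layout: paths named for the role each zone plays, so new
# zones can be added (or a zone re-pointed) without renaming the whole lake.
LAKE_ROOT = "abfss://datalake@mystorageaccount.dfs.core.windows.net"

ZONES = {
    "raw":      f"{LAKE_ROOT}/raw",       # data as received from source systems
    "cleansed": f"{LAKE_ROOT}/cleansed",  # parsed, typed, quality-checked
    "enriched": f"{LAKE_ROOT}/enriched",  # joined and conformed
    "modelled": f"{LAKE_ROOT}/modelled",  # shaped for business consumption
}
```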

Within these zones, keep the business logic as late in the pipeline as possible, with the earlier stages dedicated to shredding or parsing information into standard data types, checking for quality problems and ensuring everything is stored in a consistent format. Interpretations and metrics are much more likely to change from year to year, and businesses will often want to see historic data evaluated under new metric definitions. Always consider what will need to change when the business changes.
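The sketch below illustrates the idea with hypothetical column and metric names: the cleansed, enriched data stays stable, while the metric definition is a thin, late-binding function that can be revised and re-run over all historic data without touching the earlier stages.

```python
import pandas as pd


# Hypothetical metric definitions layered on top of stable, enriched data.
def gross_margin_v1(df: pd.DataFrame) -> pd.Series:
    return (df["revenue"] - df["cost_of_sales"]) / df["revenue"]


def gross_margin_v2(df: pd.DataFrame) -> pd.Series:
    # Revised definition: overheads are now included in the cost base.
    return (df["revenue"] - df["cost_of_sales"] - df["overheads"]) / df["revenue"]


history = pd.DataFrame({
    "revenue": [100.0, 250.0],
    "cost_of_sales": [60.0, 150.0],
    "overheads": [10.0, 25.0],
})
history["margin_old"] = gross_margin_v1(history)
history["margin_new"] = gross_margin_v2(history)  # historic data, new definition
```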

Pattern 4: Versioning the Data Lake

In a data lake, we bring in data from lots of systems and worry later about which attributes are interesting. Sometimes this is called “schema on read” as it’s up to each transformation to interpret the data. As our data understanding grows and changes are made to the transformations, we need to remain cognisant of our data contracts and not violate them.

Let’s assume we have a simple three-step pipeline which cleanses, enriches and models some data. We realise there is a better way to interpret the raw data, so we create a new version of the cleanse script; because this changes the data contract, we must store its output in a new location. Next, we create a new version of the enrich script that can use this improved interpretation. Since its output still aligns with the existing data contract required by the model script, no revision of the model is required.

Data pipeline transformations and stores
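A sketch of how the versioned locations and the revised pipeline might line up; the paths and step names are illustrative only.

```python
# Hypothetical versioned interface locations in the lake. The improved cleanse
# writes to a new path because its contract changed; the revised enrich reads
# from it but still writes to the existing enriched location, so the model
# step and its contract are untouched.
LOCATIONS = {
    "cleansed_v1": "cleansed/sales/v1",
    "cleansed_v2": "cleansed/sales/v2",   # new contract, so a new location
    "enriched":    "enriched/sales/v1",   # contract unchanged, same location
    "modelled":    "modelled/sales/v1",
}

PIPELINE_V2 = [
    ("cleanse_v2", "raw/sales",              LOCATIONS["cleansed_v2"]),
    ("enrich_v2",  LOCATIONS["cleansed_v2"], LOCATIONS["enriched"]),
    ("model_v1",   LOCATIONS["enriched"],    LOCATIONS["modelled"]),  # unchanged
]
```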

Whilst the enhanced understanding propagates through the data lake, both versions of the transformations need to run to support the existing pipeline as well as the future one. Through this pattern, we evolve our transformation logic whilst still maintaining the previous version and any historical data.

We can see in this data lake pattern that each interface is also a storage location and as such will be defined within the data catalogue. We can therefore use our data catalogue to document the interfaces and ratify data contracts against them.

Expect Change

The key lesson is that everything changes, often faster than expected.

Key to ensuring that your data platform delivers the value it promised is making sure it can adapt to these changes.

So when building pipelines, minimise the interdependencies between transformations, define clear interfaces and consider how each component could be replaced when it’s superseded.

About the Author:
Carl Follows is a Data Analytics Solution Architect here at Version 1.
