Data Engineering for Autonomy and Rapid Innovation

Jeff Magnusson
Published in 97 Things
Jul 22, 2019

In many organizations, data engineering is treated purely as a specialty. Data pipelines are seen as the complex, arcane domain of data engineers. Often, data engineers are organized into dedicated teams or embedded in vertically oriented, product-based teams. While delegating work to specialists often makes sense, it also implies that a hand-off is required in order to accomplish anything that spans beyond that specialty. Fortunately, with the right frameworks and infrastructure in place, handoffs are unnecessary to accomplish (and, perhaps more importantly, iterate on!) many data flows and tasks.

Data pipelines can generally be decomposed into business or algorithmic logic (metric computation, model training, featurization, etc.) and data flow logic (complex joins, data wrangling, sessionization, etc.). Data engineers specialize in implementing data flow logic, but often must also implement the other logic to spec, based on the needs of the team requesting the work, and without the autonomy to adjust those requirements. This happens because the two types of logic are typically intertwined and implemented hand-in-hand throughout the pipeline.
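For example, in a pandas-style pipeline (a minimal sketch; the column names and functions below are illustrative assumptions, not from any particular codebase), sessionization is data flow logic, while the metric computed on top of it is business logic:

import pandas as pd

# Data flow logic: wrangling and sessionization, typically owned by data engineers.
def sessionize(events: pd.DataFrame, gap_minutes: int = 30) -> pd.DataFrame:
    events = events.sort_values(["user_id", "timestamp"])
    new_session = events.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=gap_minutes)
    events["session_id"] = new_session.astype(int).groupby(events["user_id"]).cumsum()
    return events

# Business/algorithmic logic: the metric the consuming team wants to iterate on.
def avg_session_length(sessions: pd.DataFrame) -> pd.DataFrame:
    per_session = sessions.groupby(["user_id", "session_id"])["timestamp"].agg(lambda ts: ts.max() - ts.min())
    return per_session.groupby("user_id").mean().rename("avg_session_length").reset_index()

When both kinds of logic live in the same job, the consuming team cannot change the metric without going back through the data engineers.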

Instead, look for ways to decouple data flow logic from other forms of logic within the pipeline. Here are some strategies:

Implement reusable patterns into the ETL framework

Rather than templatizing common patterns, implement them as functions within an ETL framework. This minimizes code skew and maintenance burden, and opens data pipelines to contributions from beyond the data engineering team.
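As a rough sketch (assuming a pandas-based framework; the function below is hypothetical), a pattern like "latest record per key" becomes a shared function that any team can call, rather than a template that gets copied and drifts:

import pandas as pd

def latest_record_per_key(df: pd.DataFrame, key_cols: list[str], order_col: str) -> pd.DataFrame:
    """Keep only the most recent row for each key -- a pattern most pipelines need."""
    return (
        df.sort_values(order_col)
          .drop_duplicates(subset=key_cols, keep="last")
          .reset_index(drop=True)
    )

# Any pipeline can now invoke the shared implementation:
# snapshot = latest_record_per_key(raw_accounts, key_cols=["account_id"], order_col="updated_at")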

Choose a framework and toolset that is accessible within the organization

One reason data engineering is often viewed as a specialty is that data pipelines are often implemented in a language or toolset that is not common to the rest of the organization. Consider adopting a framework that supports a language that is widely known and used within your organization (hint: SQL is widely known and understood outside the data engineering specialty).
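One way this might look, sketched here with Python's built-in sqlite3 module (the helper, table names, and query are illustrative assumptions), is a framework hook that executes SQL owned by analysts, so the logic they care about stays in a language they already know:

import sqlite3

def run_sql_transform(conn: sqlite3.Connection, query: str, output_table: str) -> None:
    """Materialize the result of an analyst-owned SQL query as a table."""
    conn.execute(f"DROP TABLE IF EXISTS {output_table}")
    conn.execute(f"CREATE TABLE {output_table} AS {query}")
    conn.commit()

# The query itself can be owned and iterated on outside the data engineering team:
daily_revenue_sql = """
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
"""
# run_sql_transform(conn, daily_revenue_sql, "daily_revenue")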

Move the logic to the edges of the pipelines

Look to move data flow logic as far upstream or downstream as possible. This allows the remainder of the work to happen as a pre- or post-processing step, effectively decoupling the data engineers from data consumers and restoring consumers' autonomy to iterate without further handoffs.
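A minimal sketch of the split (pandas again; the names are hypothetical): the data engineers own the complex join at the upstream edge, and consumers own the downstream metric as a post-processing step they can change without a handoff:

import pandas as pd

# Upstream, data-engineer-owned step: the data flow logic lives at the edge.
def build_enriched_events(events: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    return events.merge(users, on="user_id", how="left")

# Downstream, consumer-owned post-processing: free to evolve independently.
def weekly_active_users(enriched: pd.DataFrame) -> pd.Series:
    enriched = enriched.assign(week=enriched["timestamp"].dt.to_period("W"))
    return enriched.groupby("week")["user_id"].nunique()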

Create and support staging tables

Staging tables are often employed as intermediate checkpoints or outputs between jobs in data pipelines. However, they are often treated as ephemeral datasets used only by the pipeline they run in. If you need to implement a tricky or expensive join or processing step, consider staging out the results and supporting their use by other, less specialized folks within the organization.
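As an illustrative sketch (again using sqlite3; the table and column names are assumptions), the expensive join is computed once into a staging table that anyone in the organization can query directly:

import sqlite3

def stage_enriched_orders(conn: sqlite3.Connection) -> None:
    """Materialize an expensive join once so downstream users don't repeat it."""
    conn.execute("DROP TABLE IF EXISTS stg_enriched_orders")
    conn.execute(
        """
        CREATE TABLE stg_enriched_orders AS
        SELECT o.order_id, o.amount, o.order_date, c.segment, c.region
        FROM orders o
        LEFT JOIN customers c ON c.customer_id = o.customer_id
        """
    )
    conn.commit()

# Downstream consumers write simple queries against stg_enriched_orders
# instead of re-implementing the join in every pipeline.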

Bake data flow logic into tooling and infrastructure

Bake common patterns into frameworks or tooling that is invoked via configuration. Data engineering logic can often be highly leveraged by pushing it into data acquisition, access, or storage code. Rather than expressing the configuration of data flow logic within data pipelines, consider embedding it in the metadata store as metadata on input or output data sources.
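One possible shape for this, as a minimal sketch (the metadata keys and loader are hypothetical), is to declare data flow logic as metadata on the source and apply it in shared loading code, so individual pipelines never spell it out themselves:

import pandas as pd

# Metadata on the input source, e.g. pulled from a metadata store rather than hard-coded.
SOURCE_METADATA = {
    "events": {
        "dedupe_keys": ["event_id"],
        "order_col": "ingested_at",
    },
}

def load_source(name: str, df: pd.DataFrame) -> pd.DataFrame:
    """Apply the data flow logic declared for this source at load time."""
    meta = SOURCE_METADATA[name]
    return (
        df.sort_values(meta["order_col"])
          .drop_duplicates(subset=meta["dedupe_keys"], keep="last")
          .reset_index(drop=True)
    )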
