Aditya Vandhye · Sunday Labs · Dec 26, 2023


Pitfalls of Treating Data Engineering like Software Engineering

Today, data engineering and DevOps have converged, sharing tools and practices such as cloud infrastructure, containerization, CI/CD, and GitOps. Some mistakenly perceive no significant difference between data engineering and software engineering, attributing any perceived roughness in data work to a lag in adopting software development practices.

While the two fields share commonalities, they differ substantially, and managing a data engineering team like a software product team overlooks those distinctions. This post highlights the challenges unique to data engineering.

Data pipelines, unlike applications in software engineering, don't deliver value directly; they exist to process and transform data. Where applications offer users a variety of features, a pipeline's sole aim is to produce valuable datasets for downstream consumers.
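
To make the contrast concrete, below is a minimal sketch of a pipeline stripped to its essence (the file paths, field names, and source shape are all hypothetical): the code is incidental, and the dataset written at the end is the only deliverable.

```python
# A minimal, hypothetical pipeline: extract raw records, transform them,
# and write the one artifact that matters, the finished dataset.
import csv
import json
from pathlib import Path


def extract(source: Path) -> list[dict]:
    """Read raw records from an upstream source (here, a JSON Lines file)."""
    with source.open() as f:
        return [json.loads(line) for line in f]


def transform(records: list[dict]) -> list[dict]:
    """Shape raw records into the dataset downstream consumers asked for."""
    return [
        {"user_id": r["id"], "total_spend": round(sum(r["orders"]), 2)}
        for r in records
        if r.get("orders")  # drop users with no order history
    ]


def load(rows: list[dict], target: Path) -> None:
    """Write the finished dataset; this file, not the code, carries the value."""
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "total_spend"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    load(transform(extract(Path("users.jsonl"))), Path("user_spend.csv"))
```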

Unlike applications, data pipelines face unique challenges:

  • Value Delivery: Applications provide direct value to users through interaction, while a data pipeline's value lies entirely in the datasets it produces for downstream consumers.
  • Feature Relevance: Applications have multiple features that evolve over time, while a data pipeline has a single relevant feature, producing the requested dataset, and therefore a clear point of completion.
  • State Management: Data pipelines manage large amounts of state, processing existing state from external sources and building datasets incrementally in long-running processes (see the watermark sketch after this list).
  • Coupling: Unlike applications, data pipelines are unavoidably tightly coupled to their data sources, which affects their stability and reliability.
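
The state and coupling problems show up in even the smallest incremental pipeline. Here is a hedged sketch of a watermark pattern; the state file, the updated_at field, and the timestamp format are assumptions, not a prescription:

```python
# A sketch of pipeline state: remember how far we got (a "watermark") so
# each run extends the dataset instead of rebuilding it from scratch.
# The state file and the updated_at field are hypothetical.
import json
from pathlib import Path

STATE_FILE = Path("pipeline_state.json")


def read_watermark() -> str:
    """Load the last processed timestamp; fall back to the epoch on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_processed"]
    return "1970-01-01T00:00:00+00:00"


def write_watermark(ts: str) -> None:
    """Persist progress so the next run resumes rather than reprocesses."""
    STATE_FILE.write_text(json.dumps({"last_processed": ts}))


def run_incremental(source_records: list[dict]) -> list[dict]:
    """Process only records newer than the watermark (ISO timestamps compare
    lexicographically). What this run can do is dictated by the state of the
    external source, which is exactly where the tight coupling shows up."""
    watermark = read_watermark()
    batch = [r for r in source_records if r["updated_at"] > watermark]
    if batch:
        write_watermark(max(r["updated_at"] for r in batch))
    return batch
```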

Agile frameworks, whose core philosophy is iterative development for maximum customer value, are a poor fit for data pipeline development. A pipeline is either complete or worthless, which clashes with agile's incremental delivery model. Attempts to fold data engineers into scrum teams tend to devolve into micromanagement as tasks replace user stories, hurting development efficiency.

When management lacks a fundamental understanding of what they oversee, poor decisions often follow.

For those unfamiliar with the intricacies, here are four reasons why this approach is flawed:

  1. Partial datasets lack proportional utility: half a dataset is not half as useful, and tasks like predictive modeling demand completeness for effective experimentation. Pipeline development time isn't directly tied to dataset size, though larger datasets do bring greater time and resource demands. Overwriting everything on each run is computationally expensive, while selective updates increase development time and complexity.
  2. Datasets possess inherent inertia: the larger they grow, the more time, effort, and money changes require. Deploying a partial pipeline to production is wasteful, offering no customer value, burning compute resources, and complicating maintenance. Transposing DevOps and Agile principles onto data pipeline development overlooks this inertia.
  3. Data pipeline feedback loops are slow compared to software, as pipelines largely lack fast unit tests. Frequent pipeline deployments suggest uncertain customer requirements or an unstable data source, unlike stateless applications where frequent updates are routine.
  4. Pipeline development is further slowed by the need for real-world deployment to get reliable feedback. Integration tests may run faster, but they still require deployment, undermining their purpose. Enforcing data contracts for quality data faces challenges of its own (a minimal contract check is sketched after this list), and pipeline development cannot be parallelized due to sequential task dependencies.
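
On the data contract point above, here is a hedged sketch of what a boundary check might look like. The required fields and types are hypothetical, and real teams typically reach for schema tooling, but the fail-fast idea fits in a few lines:

```python
# A hypothetical data contract enforced at the pipeline boundary: validate
# incoming records before spending compute on them, which shortens the slow
# feedback loop described in item 3 above.
REQUIRED_FIELDS = {"user_id": int, "updated_at": str, "orders": list}


def contract_violations(record: dict) -> list[str]:
    """Return a list of violations; an empty list means the record conforms."""
    violations = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            violations.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return violations


def check_batch(records: list[dict]) -> None:
    """Fail fast before any expensive processing runs."""
    bad = [(i, v) for i, r in enumerate(records) if (v := contract_violations(r))]
    if bad:
        raise ValueError(f"{len(bad)} records violate the contract: {bad[:3]}")
```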

Data pipelines are not parallelizable user stories; sequential task dependencies prevent splitting development across developers. Nor is planning an entire pipeline upfront practical before the data sources have been properly characterized. Pipelines are a means to an end: they manage state and bridge disparate systems.

Recognizing the differences between data and software is crucial; enforcing agile processes on data teams without that recognition will backfire.

Here’s what you must keep in mind to ensure data team success:

  • Embrace a lite form of Waterfall for data pipeline projects, emphasizing upfront conversations with customers and data producers to clarify requirements before development begins.
  • Allow time for data engineers to experiment with data sources, acknowledging that time estimates for dataset availability are often inaccurate.
  • Avoid splitting a pipeline among multiple developers, promoting collaboration through pair/extreme/group programming with trunk-based development to enhance productivity and early issue detection.
