Data Engineering is Software Engineering…

Bernd Wessely
Mar 27, 2024


…with a focus on data, although almost everything you read on the Internet claims otherwise.

Photo by Resource Database on Unsplash

Just two examples…
https://betterprogramming.pub/data-engineering-is-not-software-engineering-af81eb8d3949

https://www.multiverse.io/en-US/blog/data-engineer-vs-software-engineer

According to these examples, the two engineering disciplines require completely different skill sets, and the engineers actually work on very different things. That is all true, so let’s look into this and try to understand why I nevertheless claim that Data Engineers should be Software Engineers with a focus on data, and why Software Engineers should care about skills that are currently only attributed to Data Engineers.

Yes, it’s true that we have these two engineering disciplines and that the skill sets differ in the details. However, it’s very unfortunate that this strict separation happened, and I would advocate reversing this development. So why did it happen in the first place?

Back when Data Engineering did not yet exist as a separate discipline, we (Software Engineers) created software products (aka applications) that implemented the requirements we understood from the business people. Of course, we also had to manage the data that was needed as input to our functionality, and the output data we produced needed to be managed as well. Managing data was easy as long as we kept it in memory. But we all know that we need to save data to a permanent storage medium to make it durable. Hence, we implemented this and realised that saving and loading data to and from storage had its own special features and subtleties. Some of us Software Engineers therefore specialised in these features and even invented something called databases, which did the job of storing the ever-growing amount of data for our applications.

As many applications stored their data in databases, we realised that the exchange of data between applications could easily be managed by sharing the same database. Because of this, the amount of data stored in databases grew to an immense size and needed to be managed carefully as the most valuable asset of an application. We realised that the enormous quantity of data and the variety of storage requirements would be too difficult to handle within a single database system, and therefore operated many of them in the enterprise. We also realised that we could use all this data to derive business value directly from the databases, without even using the applications that produced it in the first place. Business Intelligence, Data Warehouses, Data Lake(houses), Data Fabric (try to complete…) emerged, and each of these systems needed to be managed on its own. The Data Engineering discipline was born, and with the ever-growing need for data transformation and management, Data Engineers are now flooded with data transformation requirements.

I think everyone who has ever been involved in a project to create data-driven systems like those mentioned above would confirm that Data Engineers absolutely need to be skilled Software Engineers with a focus on the handling of data. So we are left with the question of why I also advocate that Software Engineers should learn the skills of Data Engineering.

Today’s Data Engineers know all too well that they lack the necessary knowledge of the business domain logic to transform and integrate all the data so that it can be readily consumed by downstream applications. As we remember, these applications need to be built by Software Engineers. Yet Data Engineers still struggle to build and maintain all these super complicated data pipelines to somehow implement the necessary business logic for downstream applications. The “Data Mesh” principle coined by Zhamak Dehghani (https://martinfowler.com/articles/data-monolith-to-mesh.html) addresses that problem and advocates shifting the transformation logic back to the Software Engineers who are responsible for the business domain that owns the source data. In other words, bring the logic back to the people who know how to interpret and transform the data into something useful for downstream applications. Software Engineers should therefore take an interest in the tools and techniques built by Data Engineers, so that they can develop the necessary transformation logic as an extension to their applications.
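To make this a bit more concrete, here is a minimal sketch of what transformation logic maintained next to the application could look like. Everything in it is an assumption made up for illustration: the order domain, the names `Order`, `RevenueRecord` and `to_revenue_records`, and the business rules themselves. It is not taken from the Data Mesh article or from any particular framework.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class Order:
    """Internal domain object of a hypothetical order application."""
    order_id: str
    customer_id: str
    net_amount_cents: int
    tax_rate_percent: int
    status: str  # assumed values: "OPEN", "PAID" or "CANCELLED"
    ordered_on: date


@dataclass(frozen=True)
class RevenueRecord:
    """Hypothetical published record that downstream consumers read."""
    customer_id: str
    month: str  # formatted as "YYYY-MM"
    gross_revenue_cents: int


def to_revenue_records(orders: Iterable[Order]) -> list[RevenueRecord]:
    """Transformation owned by the application team.

    The (assumed) business rules live here, next to the domain model:
    only paid orders count, and revenue is reported gross per customer
    and month.
    """
    totals: dict[tuple[str, str], int] = {}
    for order in orders:
        if order.status != "PAID":
            continue  # open and cancelled orders are not revenue
        gross = order.net_amount_cents * (100 + order.tax_rate_percent) // 100
        key = (order.customer_id, order.ordered_on.strftime("%Y-%m"))
        totals[key] = totals.get(key, 0) + gross
    # Sort by (customer, month) so the published output is deterministic.
    return [RevenueRecord(c, m, total) for (c, m), total in sorted(totals.items())]
```

The point of the sketch is only where the logic lives: the people who know what “paid” and “gross” mean in this domain maintain the transformation and version it together with the application, and downstream consumers read the published records instead of reverse-engineering the internal tables.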

Conclusion and Recommendation

We still need Data Engineers to build all the specialised tooling and databases that can be used in applications, and software engineering know-how is essential for this. But we need to shift the development of data transformation pipelines back to the applications that are maintained by the Software Engineers, and data engineering know-how is essential for that.

Stop delegating the development of data transformation logic to Data Engineers, who build it in data pipelines isolated from the applications where the data originated. Instead, empower the Software Engineers to maintain this transformation logic as part of the application or as an extension component. Software Engineers need the powerful tools and frameworks built by Data Engineers in order to build these transformation pipelines. In this way, we bring “database technology” back to the applications, something that Martin Kleppmann called “Unbundling Databases” in his excellent book “Designing Data-Intensive Applications”.
