Why data engineers should be more like software engineers

Niels Claeys · Published in datamindedbe · 7 min read · Jan 24, 2023

We at Data Minded created our data engineering manifesto more than two years ago. The goal was to highlight our core principles and identify what we find important within data engineering. I think the manifesto is still relevant today, but I also notice that some data engineers are (still) struggling with some of its statements. In this post I want to elaborate on one of them:

Data engineers are Software engineers

Before diving into how you can become a better data engineer, let’s look at what data products are and how they differ from software products.

What are data products

Without starting a semantic discussion, I like the following definition of data products, as stated by Simon O’Regan in one of his blog posts:

Products whose primary objective is to use data to facilitate an end goal

Although very broad, I like this definition because it hints that data products are a specific subset of software products, namely those where data is the primary ingredient (data in and data out).
When you define a data product like this, you can easily see the many similarities with software products: the focus on solving a customer need, a code-based approach, developing in iterations, …

An important aspect is that data engineering teams should switch from focusing on the technical solution towards a product mindset. This way you will focus more on how end users will use your data product, and you will create a long-term vision for it. The customer focus is often lacking in data teams because we tend to think that the data pipeline is the product, but the data pipeline is just a means to an end. When exposing data through an API or a dashboard, the end user does not care whether it is fetched directly from an operational database or whether there is a data pipeline in between. The reasons for opting for a pipeline are purely technical:

  • separation of concerns: do not let analytics queries impact the operational performance
  • different access patterns/requirements between operational and analytics workloads might result in different storage/querying solutions
  • a pipeline can combine data from multiple data sources (a minimal sketch of such a pipeline follows this list)
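
To make this concrete, below is a minimal PySpark sketch of such a pipeline. The connection details, table names, and paths are illustrative assumptions, not something from a real project:

# Minimal sketch of a batch pipeline (all names, URLs and paths are made up):
# read from a replica of the operational database plus a second source,
# combine them, and write to separate analytics storage so that analytics
# queries never hit the operational system.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-analytics").getOrCreate()

# Ingest: read from a read replica instead of the operational database itself
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://replica-host:5432/shop")
    .option("dbtable", "orders")
    .load()
)

# A second source that only lives in object storage
customers = spark.read.parquet("s3://example-bucket/customers/")

# Transform: combine both sources into one analytics-friendly dataset
daily_revenue = (
    orders.join(customers, "customer_id")
    .groupBy("country", F.to_date("created_at").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)

# Egress: write to analytics storage with its own access patterns
daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/analytics/daily_revenue/"
)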

How to become a better data engineer

In many professions, certainly immature ones like data engineering, it is often a good idea to look at related fields for inspiration and copy their best practices. In my opinion, there are several best practices that data engineers can, or even should, copy from software development:

Use CI/CD to automate deployments

I still remember the days when deploying a Spark application to production meant building a jar locally and copying it to our Hadoop cluster. In most companies this practice is no longer acceptable because it caused numerous issues:

  • there is no trace of which code is running in production, which can make it difficult to find out why a certain job suddenly starts failing.
  • building the artifact might not be repeatable as it depends on locally installed tools: Java version, dependency versions, OS, … This might mean that only one user can deploy a working application to production.
  • linting and tests can be bypassed by users, which can impact the quality of the code deployed to production.

To solve most of these issues, CI/CD pipelines have become common practice in software development. Luckily CI/CD pipelines are also becoming standard practice within data teams, certainly in larger organisations.

Writing tests is crucial

I worked as a software engineer for several years before switching to the data field, and I have learned the value of writing tests. I noticed the following benefits when writing tests:

  • it validates your code while building your feature and enables short feedback cycles.
  • it prevents regression issues when you or your colleagues make improvements to your code over time.
  • when using test-driven development (TDD) or writing a lot of tests, your code will improve because you create smaller functions/components and think more about their signatures in order to make them testable. This results in code that is easier to maintain.

Despite these benefits, many data engineers do not believe in or write tests. A popular response I get when asking why they do not write tests is:

Writing tests is (too) hard and not relevant because bugs are only discovered when I run against production data.

While I fully agree that writing good tests is hard and requires both discipline and time, I do not find ‘too hard’ a valid argument because it completely disregards the benefits of writing tests.
I recognize that some issues are only discovered in production and could not have been prevented by writing tests. However, as mentioned before, there are other benefits to writing tests as well.
Also, after discovering an issue in production, you can fix the issue and write a (unit) test for it. This is valuable because at some point in the future someone will make changes to your code and inadvertently reintroduce the already squashed bug.

I always consider the following three categories of tests for my code:

  • Write unit tests for small transformations by separating them into individual functions. This can be a user-defined function (UDF) in Spark, but also multiple transformations on your dataset that accomplish a common goal. Creating a test for an individual function is easy and it can help you to quickly validate specific functionality (a minimal sketch follows this list).
  • Write integration tests to test the integration between your data pipeline and external systems. If possible, I would use an in-memory or local replacement of the service that you depend upon. This way you are in full control of the service and not dependent on outages, having an internet connection, data being available, … Typical examples are using Localstack to mimic AWS services or a Docker container to replace your database. They provide quicker feedback than running your code in development/staging and are also useful to test your assumptions about using the external service.
  • You can also create a couple of component/black-box tests that use a sample dataset and validate the resulting output of a job in your pipeline. These are more sanity checks that the job can correctly process relevant input data than exhaustive tests of all corner cases.
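
As an illustration of the first category, here is a minimal unit-test sketch using pytest and a local Spark session; the function and column names are made up for the example:

# Minimal unit-test sketch (assumes pytest and a local PySpark installation;
# the function and column names are illustrative).
from pyspark.sql import DataFrame, SparkSession, functions as F


def add_net_revenue(df: DataFrame) -> DataFrame:
    # Small, isolated transformation: easy to reason about and to test
    return df.withColumn("net_revenue", F.col("gross_revenue") - F.col("refunds"))


def test_add_net_revenue():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    input_df = spark.createDataFrame(
        [(100.0, 10.0), (50.0, 0.0)],
        ["gross_revenue", "refunds"],
    )

    result = add_net_revenue(input_df)

    assert [row.net_revenue for row in result.collect()] == [90.0, 50.0]

Because the transformation lives in its own function, a test like this runs locally or in CI long before the job ever touches production data.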

These three categories are based on the testing pyramid used in software development, which helps you understand how many tests to create for each category. If this is new to you, I recommend reading the Google testing blog for more practical insights.

Iterative delivery trumps a big bang approach

In my time as a software engineer, I have (almost) never seen a big bang approach work well. I think this is mainly because there are too many moving parts that can go wrong, resulting in a failed release. Improving iteratively and integrating continuously might seem like more work in the short term, but it allows you to reduce risk, incorporate feedback, and collaborate between team members.

Another reason why I think it is important to learn to deliver a product iteratively is that you never get it right the first time. A data product will evolve over time because of changing requirements as well as improvements. If you never learned how to make incremental changes, nor implemented support for them, evolving your product will be difficult and can become very expensive.

A common remark that I often hear against iterative development is:

My data pipeline is impossible to split into smaller tasks as everything needs to be delivered in order to bring value to end users.

This is the same struggle that I, together with many software developers, have faced: splitting up work is not easy and, again, requires practice in order to improve.
Splitting up your data pipeline does not mean creating one task for every column in your output dataset. I have used the following approaches for splitting up my work:

  • You can start by scaffolding your pipeline (ingest, transformation, egress, API, …) and create tasks for the different steps in your pipeline. A good place to start is the integration with external services, both for ingest and egress. This helps to reduce risk/unknowns and forces you to think about the API/data contract. Having such a contract makes it possible for multiple people to collaborate on the same pipeline (a minimal sketch of such a contract follows this list).
  • When creating a machine learning model, you could start with the delivery of a very simple model (e.g. linear regression) for your pipeline and afterwards spend time on improving that model. This again reduces the number of unknowns as it focuses on implementing an end-to-end solution first. Afterwards, you can improve the model and this will immediately be visible to end users.
  • Apart from writing the business logic for your pipeline, there are also many other aspects to consider when creating reliable data products, such as monitoring/alerting, data lineage, scalability, … Any of these can be an additional task and does not need to be delivered immediately.
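
As an example of such an API/data contract (a hypothetical sketch, not something prescribed in the manifesto), the ingest and transformation tasks could agree on an explicit schema up front:

# Hypothetical data contract between the ingest and transformation tasks:
# both sides agree on this schema, so they can be built (and tested) in parallel.
from pyspark.sql.types import (
    DoubleType, StringType, StructField, StructType, TimestampType
)

ORDERS_CONTRACT = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=False),
    StructField("created_at", TimestampType(), nullable=False),
])


def validate_contract(df, contract=ORDERS_CONTRACT):
    # Fail fast when the ingest output drifts away from the agreed contract
    missing = set(contract.fieldNames()) - set(df.columns)
    if missing:
        raise ValueError(f"Dataset violates contract, missing columns: {missing}")

Once the contract exists, one person can build the ingest step while another builds the transformations against the same agreed schema.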

I know not all of these individual tasks are visible to end users, but by splitting off some (technical) parts, the end users will see value more quickly. I try to be pragmatic about this and attempt to make most of my tasks relevant for end users; that is not always possible, and that is fine.

Conclusion

In this article I discussed why we at Data Minded believe that data engineers are software engineers. I started by highlighting the importance of a product mindset instead of solely focusing on the technical implementation (e.g. the data pipeline). Next to that, I mentioned several aspects where data engineers can learn from software engineers:

  • creating CI/CD pipelines
  • writing tests
  • delivering iteratively instead of all at once

The most important takeaway for me personally is that we, the data engineering community, should spend more time and do a better job of explaining the value of these best practices to people who are new in the space. This is the only way we will become more mature as an industry.


Niels Claeys
datamindedbe

Data (platform) engineer @Data Minded with a fondness for distributed systems. Loves: AWS, K8s, Spark, DuckDB