The Keys To Unlock TDD For Data Engineering

In Part 1 I hope I convinced you that automated testing is a worthwhile investment in your long-term sanity. If you agree that putting testing in the front seat is a good idea, and that our jobs are easier when we write the tests first (which is Test Driven Development, or TDD), we now have to give some thought to how to actually do it, which, to be honest, can be quite daunting.

A Deeper Dive into the Data Testing Problem

Before continuing, it helps to understand why working with data adds complexity to the testing process. The main issue is that data is, quite literally, the essence of state. For this reason the programs we produce are always tightly coupled to the data: if the data changes in any significant way, so will your code.

This poses the greatest challenge for testing, because the test data has to be representative of production data. A change in production causes a change in test that must be kept in sync. In a normal “given this input, assert this output” test scenario, propagating such changes across your test suite is time consuming at any decent scale, because every change means modifying three parts: the input data, the output data, and your code.
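
To make the three coupled parts concrete, here is a minimal sketch of a conventional input/output test. The `total_by_customer` function and the order fields are hypothetical, invented purely for illustration.

```python
# Hypothetical transformation under test: total order value per customer.
def total_by_customer(orders):
    totals = {}
    for order in orders:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0) + order["amount"]
    return totals


# Part 1: hand-written input data.
INPUT_ORDERS = [
    {"customer_id": "c1", "amount": 10},
    {"customer_id": "c1", "amount": 5},
    {"customer_id": "c2", "amount": 7},
]

# Part 2: hand-written expected output.
EXPECTED_TOTALS = {"c1": 15, "c2": 7}


def test_total_by_customer():
    # Part 3: the code itself, exercised against parts 1 and 2.
    assert total_by_customer(INPUT_ORDERS) == EXPECTED_TOTALS
```

If production renamed `amount`, the input records, the expected output, and the function would all need to change together, and the same would be true of every other test that touches that field.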

Now let’s consider how this complicates things through three different lenses:

  • Traditional databases
  • Big data
  • Security

Traditional Databases

When working with monolithic data pipelines, every change causes the entire system to be rebuilt. This makes keeping up with state changes across multiple environments especially time consuming and costly. Because replicating state can take a long time (usually hours), the copy can already be out of date by the time work moves from the development stage to the testing stage.

Databases are commonly shared in these environments, which means changes are not made in isolation; this can be a nightmare to manage with more than one developer. Most modern data platforms use EL(T) instead of ETL partly for this reason, since it decouples the extract and transform stages. Pipelines are free to exist in isolation and can therefore be created and updated easily by multiple developers.

Big Data

Big data primarily suffers from a replication problem: it’s generally infeasible to duplicate all of production in test, and even if we did, running the tests would be astronomically expensive and slow. We also have to contend with the fact that it’s “big”, so there are potentially so many test cases that getting decent test coverage is a huge task.

The problem is not intractable though, because at the end of the day, when you’re developing on big data you always distill it down into forms that can be rationalized. For example, you might aggregate a field based on its name and type, which are truths about that data that rarely change. When you encounter a new case in this scenario, you debug it in production and then create the appropriate test. That way you build coverage over time.


Security

Usually a level of isolation is required for production data. This means that you probably have to obfuscate and encrypt the data in some way before copying it across. The process isn’t foolproof, as you run the risk that production data could leak into non-production systems. Often this is a show stopper, because many teams would rather develop in production than go through this process on a regular basis.

Once you’ve copied the data across, you have to deal with the pain of developing with fields that look nothing like they do in production. This is more of a problem for the data scientists of this world, as the exploratory process is one they revel in, but for data engineers it also makes it harder to verify whether certain transformations have been applied correctly.

Test Data Is Code

Unfortunately there is no way around the fact that we need to keep state around to test our workloads. As alluded to before, keeping state in file-based or database-based systems is difficult, mainly because you can’t easily keep track of how the data should change when it’s hidden behind a layer of abstraction (i.e., you need to open the file). But we can avoid the traps of checking files into source control or copying data from one environment to another by generating the test data at runtime instead. This comes with a multitude of benefits:

  1. With this method it is a lot easier to refactor your test suite with any number of changes.
  2. Privacy and security concerns are mostly alleviated because the data isn’t real.
  3. You can now shape the data to scale with the type of test you desire. For example, if you’re testing a streaming pipeline, you can generate one event to test that you handle it properly, and one million events to test that the system processes them in a timely manner.
  4. You can generate new test cases that you may never see in production, therefore buying you more test coverage.
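
Generating test data at runtime can be sketched roughly as follows. The event fields and the `count_by_type` step are hypothetical, and the larger run uses 100,000 events rather than a million only to keep the example quick.

```python
import random
from datetime import datetime, timedelta, timezone


def generate_events(count, seed=0):
    """Yield `count` synthetic events; nothing here comes from production."""
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    start = datetime(2020, 1, 1, tzinfo=timezone.utc)
    for i in range(count):
        yield {
            "event_id": i,
            "user_id": f"user-{rng.randint(1, 100)}",
            "event_type": rng.choice(["view", "click", "purchase"]),
            "timestamp": (start + timedelta(seconds=i)).isoformat(),
        }


def count_by_type(events):
    """Hypothetical pipeline step under test: count events per type."""
    counts = {}
    for event in events:
        counts[event["event_type"]] = counts.get(event["event_type"], 0) + 1
    return counts


# One event: does the step handle a single record properly?
single = list(generate_events(1))
assert sum(count_by_type(single).values()) == 1

# Many events: the same generator, sized for a throughput-style check.
many = generate_events(100_000)
assert sum(count_by_type(many).values()) == 100_000
```

Because the generator is ordinary code, renaming a field is a one-line change that every test picks up, rather than an edit to a set of checked-in files.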

Depending on the type of generation, this does make the job of the test oracle (the part that does the assertion) more difficult. If you are creating random data at runtime, how can you assert the result of something that is unknown? This can be solved by limiting the scope of the tests and only rarely doing full record comparisons. For example, you could verify that the structure of the data is the way you want it to be, or that the row counts are as expected. Another, fancier solution would be to pass the randomly generated data through a function in your test suite that creates the assertion, which should work nicely for testing aggregates like averages or sums.
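
Both approaches can be sketched in a few lines. The row shape, `average_amount` (standing in for the code under test), and `expected_average` (the function in the test suite that creates the assertion) are all hypothetical names chosen for this example.

```python
import math
import random


def generate_rows(count, seed=None):
    """Random rows; the exact values are unknown when the test is written."""
    rng = random.Random(seed)
    return [{"id": i, "amount": round(rng.uniform(0, 500), 2)} for i in range(count)]


def average_amount(rows):
    """Hypothetical code under test: an incremental (running) mean."""
    mean = 0.0
    for n, row in enumerate(rows, start=1):
        mean += (row["amount"] - mean) / n
    return mean


def expected_average(rows):
    """Test-suite oracle: a deliberately naive, independent calculation."""
    return sum(row["amount"] for row in rows) / len(rows)


rows = generate_rows(1_000)

# Narrow checks that hold for any random input: row count and structure.
assert len(rows) == 1_000
assert all(set(row) == {"id", "amount"} for row in rows)

# Derived assertion: the expected value is computed from the same random data.
assert math.isclose(average_amount(rows), expected_average(rows), rel_tol=1e-9)
```

The oracle is only useful if it is simpler than, and written independently of, the code it checks; otherwise it just repeats the same bugs.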

That said, it is okay to explicitly write out the input and output for certain functions that must have it. This is because it’s far easier to refactor consistently in source control than it is to fiddle around with files or databases. The best practice in general is to try to limit the number of fields in the test record to only those that are needed for the test to pass. That way you reduce the chances of any one change breaking more tests than it should.
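
As a small illustration, assuming a hypothetical `add_full_name` transformation, the explicit input and output below carry only the two fields the function actually reads.

```python
def add_full_name(record):
    """Hypothetical function under test: derive a full_name field."""
    full_name = f'{record["first_name"]} {record["last_name"]}'
    return {**record, "full_name": full_name}


def test_add_full_name():
    # Only the fields the function reads; a production record would have many more.
    given = {"first_name": "Ada", "last_name": "Lovelace"}
    expected = {
        "first_name": "Ada",
        "last_name": "Lovelace",
        "full_name": "Ada Lovelace",
    }
    assert add_full_name(given) == expected
```

If an unrelated field in the production record were renamed or dropped, this test would not need to change.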

What runs the tests?

Most programming languages have a variety of test frameworks available. Python has pytest, Scala has ScalaTest, Java has JUnit, and Julia and Go simply have testing built in. They all have similar features: a series of test oracles, the ability to spin up test fixtures (used to initialize the system for the tests), and various ways to produce a report at the end. I recommend using them as the foundation of your testing suite.
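
Those three features can be seen in a very small suite. This sketch uses Python’s standard-library unittest rather than pytest so that it runs without extra dependencies; the `load_orders` function and the `orders` table are hypothetical.

```python
import sqlite3
import unittest


def load_orders(conn, orders):
    """Hypothetical code under test: write (id, amount) rows into a table."""
    conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", orders)
    conn.commit()


class LoadOrdersTest(unittest.TestCase):
    def setUp(self):
        # Fixture: a fresh in-memory database for every test.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

    def tearDown(self):
        self.conn.close()

    def test_row_count(self):
        load_orders(self.conn, [(1, 10.0), (2, 5.5)])
        (count,) = self.conn.execute("SELECT COUNT(*) FROM orders").fetchone()
        self.assertEqual(count, 2)  # oracle

    def test_total(self):
        load_orders(self.conn, [(1, 10.0), (2, 5.5)])
        (total,) = self.conn.execute("SELECT SUM(amount) FROM orders").fetchone()
        self.assertAlmostEqual(total, 15.5)  # oracle


def run_suite():
    """Run the tests and print the framework's report."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LoadOrdersTest)
    return unittest.TextTestRunner(verbosity=2).run(suite)
```

Calling `run_suite()` runs both tests and prints the report. In pytest the same shape would be a fixture function plus plain `assert` statements.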

If you’re stuck for ideas on how to implement a testing suite, fortunately most open source projects have test suites built into them. Delta Lake has test fixtures that create the Spark Session, and it also has the patterns to make assertions on Spark DataFrames available in Python and Scala. It’s relatively simple to use the same patterns for your own purposes.

In my experience the most challenging aspect of setting up the test harness is creating the fixtures in a way that is representative of production. Having a decent understanding of containerization is an obvious benefit here. Unfortunately, if you’re using most cloud-based PaaS tools, the chances are you won’t have any fake implementations on offer to add the power of mock objects to your suite. I have spent countless hours fiddling with containers that interface with these tools, and the result is often a brittle (and slow) test suite that needs constant maintenance. I believe this is mostly by design so that the cloud vendors lock you in, but it does come with the benefit that you’re running your tests on the same metal as production. However, if you are in AWS you are in luck, as there is moto. You can always create your own mocks if you’re working with Azure or GCP.
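
Creating your own mock can be as simple as a hand-rolled in-memory fake behind a narrow interface. The sketch below does not use any real cloud SDK; `ObjectStore`, `FakeObjectStore`, and `compact` are hypothetical names, and the interface is an assumption about what a pipeline might need from an object store.

```python
from typing import Dict, List, Protocol


class ObjectStore(Protocol):
    """Hypothetical narrow interface the pipeline depends on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def list_keys(self, prefix: str) -> List[str]: ...


class FakeObjectStore:
    """In-memory stand-in for a cloud object store, for tests only."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str) -> List[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def compact(store: ObjectStore, prefix: str, target_key: str) -> int:
    """Hypothetical pipeline step: merge every object under `prefix` into one."""
    keys = store.list_keys(prefix)
    store.put(target_key, b"".join(store.get(k) for k in keys))
    return len(keys)


def test_compact():
    store = FakeObjectStore()
    store.put("raw/1.csv", b"a,1\n")
    store.put("raw/2.csv", b"b,2\n")
    assert compact(store, "raw/", "compacted/all.csv") == 2
    assert store.get("compacted/all.csv") == b"a,1\nb,2\n"
```

A fake like this keeps the suite fast and free of containers, but it only verifies the pipeline’s own logic; it says nothing about how the real service behaves, which is the cost of not running on the same metal as production.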

Keep in mind that any test framework you use will essentially lock you into it, because switching frameworks means refactoring all your tests and all your fixtures. This can become a problem because your testing suite should have the same version dependencies as production. Be wary of this if you choose to use a third-party test framework, since you will be pinning most of your production dependencies to theirs. For example, if you use spark-testing-base and your plan is to upgrade to Spark 3.0.1 right now, then you won’t be able to until the change is made in that project or you switch frameworks.

A scratch on the surface

In review, the crux of the issue is that data is stateful, which causes a bunch of problems in test representation, therefore making data testing difficult. We can lower the burden of this by representing our test data as code, because code is easier to change than data.

The backbone of your testing suite will be one of the built-in testing frameworks in your language of choice. You then have to contend with how to set up the test fixtures required for your system to run. If you’re using any PaaS tools for your workloads, you will need to make the trade-off between mocking your PaaS objects and actually using them.

This leads nicely into the next post, Part 3: how to deal with the famous (or rather infamous) test pyramid in the data context.

Senior Consultant at Servian. Specialist in Data Infrastructure & Automation.
