Automate Your Pipeline Tests

Tom White
Published in 97 Things
Jul 11, 2019

By sticking to these guidelines when building data pipelines, and treating data engineering like software engineering, you can write well-factored, reliable and robust pipelines.

1. Build an end-to-end test of the whole pipeline at the start

  • Don’t put any effort into what the pipeline does at this stage. Focus on the infrastructure: how to provide known input, run a simple transform, and check that the output is as expected.
  • Use a regular unit testing framework, like JUnit or pytest (a minimal sketch follows).
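For example, such an end-to-end test might look like the following. This is a sketch only: `run_pipeline` and the data file paths are hypothetical stand-ins for your pipeline’s real entry point and checked-in fixture files.

```python
# Minimal end-to-end test using pytest's tmp_path fixture.
from my_pipeline import run_pipeline  # hypothetical pipeline driver


def test_pipeline_end_to_end(tmp_path):
    input_file = "tests/data/input.csv"   # known, checked-in input
    output_file = tmp_path / "output.csv"

    # Run the whole pipeline, end to end.
    run_pipeline(input_file, str(output_file))

    # Compare against a checked-in expected output; pytest shows a
    # readable diff on failure because both sides are plain text.
    expected = open("tests/data/expected_output.csv").read()
    assert output_file.read_text() == expected
```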

2. Use a small amount of representative data

  • Small enough that the test can run in a few minutes at most.
  • Ideally the data comes from your real (production) system, but make sure it is anonymized.

3. Prefer textual data formats over binary for testing

  • Data files should be diff-able, so you can quickly see what’s happening when a test fails.
  • You can check the input and expected outputs into version control and track changes over time.
  • If the pipeline only accepts or produces binary formats, then consider adding support for text in the pipeline itself, or do the necessary conversion in the test.
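One way to do that conversion in the test, sketched here for Parquet output and assuming pandas with a Parquet engine such as pyarrow is available:

```python
# Convert a binary output (here Parquet) to diff-able JSON lines inside
# the test, so failures show a human-readable diff.
import pandas as pd


def parquet_to_jsonl(path):
    """Render a Parquet file as JSON lines for readable text diffs."""
    df = pd.read_parquet(path)
    return df.to_json(orient="records", lines=True)


# In a test:
#   assert parquet_to_jsonl(output_file) == open("tests/data/expected.jsonl").read()
```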

4. Ensure that tests can be run locally, in order to make debugging test failures as easy as possible.

  • Use in-process versions of the systems you depend on, like Spark’s local mode or HBase’s mini-cluster, to provide a self-contained local environment (a sketch follows this list).
  • Minimize use of cloud services in tests. They can provide a uniform environment, but may add friction in terms of provisioning time, debuggability, and access (e.g. users have to provide their own credentials for open source projects).
  • Run the tests under CI too, of course.
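As an illustration, a shared pytest fixture can provide a self-contained Spark session in local mode (assuming pyspark is installed):

```python
# Self-contained Spark session for tests, using Spark's local mode so no
# cluster is needed.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local[2]")       # in-process, two worker threads
        .appName("pipeline-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```

Scoping the fixture to the test session keeps startup cost to a single Spark launch across all tests.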

5. Make tests deterministic

  • Sometimes the order of output records doesn’t matter in your application. For testing, however, you may want an extra step to sort by a field so the output is stable.
  • Some algorithms use randomness — e.g. a clustering algorithm to choose candidate neighbors. Setting a seed is standard practice, but may not help in a distributed setting where workers perform operations in a non-deterministic order. In this case consider running that part of the test pipeline with a single worker, or seeding per data partition.
  • Keep variable time fields, such as timestamps, out of the output. This should be possible by providing fixed input; otherwise, consider mocking out time, or post-processing the output to strip out time fields.
  • If all else fails, match outputs by a similarity measure rather than strict equality.
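The first two techniques, sketched for a pyspark pipeline; `transform` is a hypothetical per-record step that consumes randomness, and the unique "id" column is illustrative:

```python
# Two determinism techniques: sort output by a unique key before
# comparing, and seed randomness per partition so results don't depend
# on which worker processes which partition, or when.
import random


def normalized_rows(df):
    # Stable ordering: row order no longer affects the comparison.
    return df.orderBy("id").collect()


def seeded_partition(index, rows):
    # Each partition gets its own seed, so the random stream it sees
    # is identical on every run, regardless of scheduling order.
    rng = random.Random(index)
    for row in rows:
        yield transform(row, rng)  # hypothetical per-record step


# Applied with: rdd.mapPartitionsWithIndex(seeded_partition)
```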

6. Make it easy to add more tests

  • Parameterize by input file so you can run the same test on multiple inputs.
  • Consider adding a switch that allows the test to record the output for a new edge case input, so you can eyeball it for correctness and add it as expected output.
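Both ideas can be combined in one parameterized test with an opt-in record mode. A sketch, reusing the hypothetical `run_pipeline` driver from earlier; the RECORD environment variable and the file-naming convention are illustrative:

```python
# Parameterize the end-to-end test by input file, with a record mode
# (RECORD=1) that saves the actual output as the new expected file.
import glob
import os
import shutil

import pytest

from my_pipeline import run_pipeline  # hypothetical pipeline driver


@pytest.mark.parametrize("input_file", glob.glob("tests/data/*-input.csv"))
def test_pipeline(input_file, tmp_path):
    expected_file = input_file.replace("-input.csv", "-expected.csv")
    output_file = tmp_path / "output.csv"

    run_pipeline(input_file, str(output_file))

    if os.environ.get("RECORD") == "1":
        # Record mode: capture the output so it can be eyeballed for
        # correctness and committed as the new expected output.
        shutil.copyfile(output_file, expected_file)
        pytest.skip("recorded new expected output; review and commit it")
    else:
        assert output_file.read_text() == open(expected_file).read()
```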
