Testing Strategies for Incomplete Functionality

Bertjan Broeksema
Published in bigdatarepublic
Jan 4, 2022

When working with code that is not yet fully functional, standard testing approaches in which inputs and expected outputs are defined might not be sufficient. This post outlines testing strategies that can be applied in such cases. As an example, we take extracting text from a large corpus of PDF documents. Extraction might not (yet) be as precise as one would wish, resulting in the expected text interwoven with undesired additional lines. The cause of this might be variation in the structure of the documents, making it hard to write a function that deals with all cases as expected. This calls for a testing approach that reflects the presence of undesired output among the desired output.

Can we make the invisible visible and at what cost? (source)

Choosing different approaches results in different tradeoffs. This post outlines two strategies: dealing with noise explicitly and dealing with noise implicitly. Both strategies are described, along with their pros and cons. TL;DR: use an explicit strategy by default; continue reading to understand why.

Let’s say we have a function that extracts tables from a PDF document, such as the one shown in the figure below. In an early phase of the project, it must extract certain rows of the table, while other rows are considered irrelevant in the output. There can be a variety of reasons for this. The content of particular rows can be irrelevant for the use case at hand. Or, in other cases, the content is relevant but has been observed not to be extracted reliably due to variation in report structure across sources. One such case is how newlines in a cell are dealt with. From a visual inspection, the user might expect “standaardpakket bodem/AS3000” as the outcome, while the current implementation of the function returns only the top or bottom line. While this case is straightforward for the example at hand, in a corpus of 500k documents coming from a variety of sources, it is hard to predict up front how the function should behave.

An example page from the PDFs we’re trying to extract
Figure 1: An example document from the corpus (source, p. 13)
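The post is library-agnostic, but to make the example concrete, here is a minimal sketch of what such an extraction function could look like, assuming pdfplumber; the function name, the page indexing, and the decision to take the first table on the page are illustrative assumptions, not the actual implementation:

import pdfplumber


def extract_result_table(filepath: str, page_number: int) -> list[list[str]]:
    """Extract the analysis-result table from one page of a PDF report.

    Returns a list of rows, where each row is a list of cell strings.
    Multiline cells and source-specific layout quirks are not handled yet,
    which is exactly the kind of incompleteness discussed in this post.
    """
    with pdfplumber.open(filepath) as pdf:
        page = pdf.pages[page_number - 1]  # pdfplumber pages are zero-indexed
        tables = page.extract_tables()

    if not tables:
        return []

    # Naively take the first table on the page and replace missing cells
    # with empty strings; no filtering of irrelevant rows happens here yet.
    return [[cell or "" for cell in row] for row in tables[0]]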

To test this functionality, we specify the following test (leaving out the last column for brevity):

Feature: Successful Parsing
In order to increase the value of our pdf document corpus
I want to extract some of the structured data
for use in other applications.
Scenario:
Given a document <filepath>
When the function extracts a table from page 13
Then we will have the following rows in “result_table”:
| M1_bg | 1 t/m 9 | 0,0 - 0,5 | Standaardpakket |
...
| MA1 | 29 | 0,0 - 1,0 | asbest |
...

The first line, | M1_bg | 1 t/m 9 | 0,0 - 0,5 | Standaardpakket |, actually comes from the PDF document but is deemed irrelevant from a functional perspective: the functionality cannot yet deal properly with the multiline cell in this case. Given that this function works on a corpus of ~500k documents, cases like this happen, and the logic is not yet watertight enough to filter such rows out or to process them properly into complete output. The question now is: what different approaches can we take in building our test suite, and what are the tradeoffs?

Approach 1: Make noise explicit

The first approach is the one demonstrated above: we make it explicit in our expected output that there is data we cannot deal with properly yet. The reason for this approach is that it makes the expected output quality explicit. In the context of the extraction, these additional lines could be considered just noise. However, at some point this data leaves the extraction module and enters other systems not under the team’s control. It cannot be predicted what consequences this “noise” will have. These could range from the need for all kinds of quality checks in systems using the data, to calculations going rogue because assumptions are made about how the data can be used.

Using the extracted tables in other parts of the organization could therefore have highly undesirable effects. Names of substances in these lines might be incomplete, or quantities might be off by several factors, leading to severe mistakes in decisions or calculations. Because the low-quality lines are made explicit in the test suite, the team can at some point remove these specific undesired lines from it. Consequently, the tests will fail, and work on the required filtering functionality can start until all tests are green again.

An advantage of this approach is that, because all test data and assumptions are specified in the spec, business users can add new test scenarios without writing the test implementation. A downside is that any change to the extraction functionality might also touch the “noise” lines in the test suite. As a result, those lines may need to be adapted as well because the resulting “noise” has slightly changed in format.
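To illustrate why a single step implementation can back all such scenarios, here is a minimal sketch of the glue code, assuming behave is used to run the spec; the extract_result_table import refers to the hypothetical sketch above, and the step wording mirrors the spec (with plain quotes):

from behave import given, when, then

from extractor import extract_result_table  # hypothetical module from the sketch above


@given("a document {filepath}")
def step_given_document(context, filepath):
    context.filepath = filepath


@when("the function extracts a table from page {page:d}")
def step_when_extract(context, page):
    context.result_table = extract_result_table(context.filepath, page)


@then('we will have the following rows in "result_table":')
def step_then_rows(context):
    # behave treats the first row of a step table as its headings,
    # so prepend it to get the full list of expected rows.
    expected = [list(context.table.headings)]
    expected += [list(row.cells) for row in context.table]
    # Approach 1: the spec lists *all* rows, noise included,
    # so an exact comparison is appropriate.
    assert context.result_table == expected

With this glue code in place, adding a scenario or adding and removing rows only requires editing the feature file, not the Python code.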

Approach 2: Make noise implicit

A second approach would be to make the “noise” implicit. In this case, the noise is not added to the test data. The spec would change to:

Feature: Successful Parsing
In order to increase the value of our pdf document corpus
I want to extract some of the structured data
for use in other applications.
Scenario:
Given a document <filepath>
When the function extracts a table from page 13
Then we will have the following rows in “result_table”:
...
| MA1 | 29 | 0,0 - 1,0 | asbest |
...

This is just a one-line difference in this simplified example. However, when noise makes up roughly 20% of the current output on a large corpus, the impact on the number of lines in the specs can be significant.

For the tests to pass, the test logic must now change from an equality check between the expected and actual tables to an iteration over the expected table that verifies each of its rows is present in the actual outcome. This seems like a good solution, as it significantly reduces the content of the specs and avoids “pointless” modifications of the spec: changes to the software that also affect the noise no longer cause tests to fail.
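In terms of test logic, the difference between the two approaches boils down to the final assertion. A minimal sketch (the helper names are assumptions, not part of the original post):

def assert_exact_rows(actual: list[list[str]], expected: list[list[str]]) -> None:
    """Approach 1: the spec describes the complete output, noise included."""
    assert actual == expected


def assert_contains_rows(actual: list[list[str]], expected: list[list[str]]) -> None:
    """Approach 2: the spec only lists rows that must be present; anything
    else in the actual output (i.e. the noise) is silently ignored."""
    for row in expected:
        assert row in actual, f"expected row {row} not found in extracted table"

In the step implementation sketched earlier, the exact comparison would simply be replaced by a call like assert_contains_rows(context.result_table, expected).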

There are several downsides to this approach. Although the spec makes sure that certain things are in the output, it does not say anything about what else is in the output. A green test suite might therefore give a false sense of security, as it suggests that the quality of the extraction is high. However, no assumptions should be made about what is done with this data outside the context in which it is extracted. If other parts of the organization start calculating with this data, the only guarantee of correctness concerns the part of the output that was verified. Users of the data cannot distinguish between the part that is the result of verified code and the part that the extraction code considers irrelevant or of too low quality. Another issue is that to change the semantics of the tests from *the actual outcome should contain at least this* back to *the actual outcome should contain exactly this*, developers need to adjust the test logic (instead of the spec). Additionally, this change is binary: either you have a test suite with noise or you have one without. Depending on the context, it might be more manageable to reduce certain kinds of noise instead of all noise at once.

Recommendation

Given these two approaches, which one would I recommend? In my opinion, making current inaccuracies explicit is the best approach. It gives a clearer signal of what is considered important from a business perspective. Solving all issues might be too costly, and by making them visible in the test suite, that cost is to some extent made quantifiable. Additionally, it allows for case-by-case reduction of noise, which is not possible in the second approach.

That being said, if this were an application where the extracted data is used only within the application itself, one could argue that the overhead of noise in the tests is just too much. So, as always, it depends, but defaulting to explicitness will do you a favor.
