The case of internal temporary data structures

Gautier DI FOLCO
Linagora Engineering
Jun 16, 2020

When it comes to designing programs, a frequently cited masterpiece is Algorithms + Data Structures = Programs by Niklaus Wirth.

For practical reasons, few of us (among the developer crowd) have actually read it; after all, few of us enjoy reading 400 pages of Pascal.
This has noticeable effects, as Linus Torvalds pointed out:

I will, in fact, claim that the difference between a bad programmer
and a good one is whether he considers his code or his data structures
more important. Bad programmers worry about the code. Good programmers
worry about data structures and their relationships.

I picked up the habit of creating temporary data structures when I learned Haskell.

The idea is to have two kinds of data structures:
* The long-living ones, which are passed around in your program
* The short-living ones, which are local and algorithm-specific

Let us see the three main benefits of making that distinction.

Respecting CQS

Command–query separation has been formalized by Bertrand Meyer:

Asking a question should not change the answer.

Here is our running example:
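A minimal sketch of such a pair — only the State/StabilizedState split comes from the example; Event, its cases, and the fields are assumptions:

```scala
// Hypothetical events; the real ones in the example may differ.
sealed trait Event
final case class Added(value: Int)   extends Event
final case class Removed(value: Int) extends Event

// Short-living, mutable accumulator: used only while building.
final class State {
  private var values: List[Int] = Nil

  def apply(event: Event): Unit = event match {
    case Added(v)   => values = v :: values
    case Removed(v) => values = values.filterNot(_ == v)
  }

  // Freeze the accumulated data into the long-living, read-only form.
  def stabilize: StabilizedState = StabilizedState(values.reverse)
}

// Long-living, immutable result handed to consumers.
final case class StabilizedState(values: List[Int])
```

The builder is the only place mutation happens; once `stabilize` is called, consumers only ever see an immutable value.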

State is used inside our algorithm; it accumulates information from Events.

StabilizedState, by contrast, is consumed read-only by a large set of consumers.

This makes the dichotomy of concerns explicit: the builder on one side, the consumer on the other.
Consequently, the produced object is intrinsically immutable.

An extension of this would be to declare State as an interface and have a builder-specific class, in order to keep the builder's strategy traceable.
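A minimal sketch of that extension — the replaying builder and its method names are assumptions:

```scala
// State as an interface: consumers only depend on the ability to stabilize.
trait State {
  def stabilize: StabilizedState
}

final case class StabilizedState(values: List[Int])

// One builder-specific implementation per strategy; here, a hypothetical
// event-replaying builder. The class name documents the strategy used.
final class ReplayingState extends State {
  private var values: List[Int] = Nil

  def record(value: Int): Unit = values = value :: values

  def stabilize: StabilizedState = StabilizedState(values.reverse)
}
```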

Lightweight invariant enforcement

In short: an invariant is a property which is always true.

In our code, we never had to test whether a dereference follows a reference, or whether the dereference's generation is greater than or equal to the concerned reference's.

Checking these properties at runtime might be costly, whereas knowing they are inherent to your algorithm helps cut down the complexity of your code.
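As a hedged sketch — the Reference/Dereference types and their fields are assumptions, not the article's actual code — a smart constructor can make such an invariant hold by construction, so it never needs re-checking later:

```scala
final case class Reference(id: String, generation: Int)

// The constructor is private: the only way to obtain a Dereference is via
// `follow`, so every Dereference is guaranteed to follow a Reference and to
// have a generation greater than or equal to that reference's.
final case class Dereference private (id: String, generation: Int)

object Dereference {
  def follow(ref: Reference, generation: Int): Option[Dereference] =
    if (generation >= ref.generation) Some(Dereference(ref.id, generation))
    else None
}
```

Consumers of a `Dereference` can rely on the invariant without any runtime assertion.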

Algorithm optimization

Given that you have two separate concerns at two separate moments in time, you are able to optimize each of them.

This often happens when you use persistent data structures: either write operations or read operations are optimized, rarely both. Here you can pick the right structure for your use case, or use the same data structure in a way that suits each phase.

For example, if you want to (naively) reimplement filter, you could do it this way:
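A minimal sketch of such a filter — the name `myFilter` is mine — using a fold with an accumulator List:

```scala
// Naive reimplementation of filter: the accumulator List is the
// short-living, algorithm-internal structure.
def myFilter[A](p: A => Boolean)(xs: List[A]): List[A] = {
  val acc = xs.foldLeft(List.empty[A]) { (kept, x) =>
    if (p(x)) x :: kept // constant-time prepend: builds in reverse order
    else kept
  }
  acc.reverse // restore the original order before handing over the result
}
```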

Here, we use List, which has a constant-time prepend operation.

As such, since we fold the list from left to right, we prepend the kept elements, which produces them in reverse order. A final reverse yields the List given to the consumer.

Another practical technique is to rely on a data structure to drive the algorithm's execution.

For example, you have a list of XPath paths you want to match against an XML file:
* /node1/subnode1/subsubnode1
* /node1/subnode1/subsubnode3
* /node1/subnode2/subsubnode1
* /node4/subnode1/subsubnode1
* …

You could simply evaluate each path independently, but you would walk through the same XML nodes repeatedly for no reason, making the whole algorithm slow.
Instead, you can compress the path list into a prefix tree (trie) and obtain the distinct set of nodes you have to go through. Thanks to that, performance improves by orders of magnitude.

Moreover, it has the benefit of making the problem structure explicit, instead of relying on a group of mutually recursive methods calling each other.
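A minimal sketch of such a compression, assuming a simple immutable trie keyed by node name:

```scala
// Prefix tree over path segments: shared prefixes are stored once,
// so each distinct node is visited a single time per branch.
final case class Trie(children: Map[String, Trie]) {
  def insert(path: List[String]): Trie = path match {
    case Nil => this
    case segment :: rest =>
      val child = children.getOrElse(segment, Trie(Map.empty))
      Trie(children.updated(segment, child.insert(rest)))
  }
}

object Trie {
  val empty: Trie = Trie(Map.empty)

  // Split "/node1/subnode1/..." into segments and merge into the trie.
  def fromPaths(paths: List[String]): Trie =
    paths.foldLeft(empty) { (trie, path) =>
      trie.insert(path.split('/').filter(_.nonEmpty).toList)
    }
}
```

On the path list above, the trie's root has only two children (node1 and node4), and node1's common prefix with its three paths is stored once, which is exactly the deduplication driving the traversal.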

Conclusion

While a strict rule of application would be great, reality is more complex.
I recommend splitting data structures, or using internal ones, in the following cases:
* Construction and consumption usages are different
* Invariants are costly to enforce outside of the algorithm
* The construction algorithm is complex
