Let’s Get ‘Idempotence’ Right

Published in

SSENSE-TECH

8 min read · Apr 24, 2020

‘Idempotence’ is a word found in nearly every data engineer’s vernacular. When building and maintaining data pipelines, the idempotence of the pipeline must be accounted for. A once-exotic word is quickly losing its aura of flamboyance through frequent use. Sadly, however, the word is commonly misunderstood and thus misused, losing its very potence (ha).

Scour the blogs on data pipelines and they all touch upon the desired property of idempotence. Many, prominent ones included, miss the mark. Commonly, idempotence is confused with reproducibility. Consider “A Beginner’s Guide to Data Engineering — Part II”, which asserts “pipeline[s] should be built so that the same query, when run against the same business logic and time range, returns the same result. This property has a fancy name called idempotence.” “3 Design Principles for Engineering Data” claims “idempotent operations means that the same input will consistently produce the same output (no side effects).” But that is not what idempotence means.

So let’s set the record straight. The idempotent property, which I insist on stating first in its most succinct, mathematical form, is defined as follows: f(f(x)) = f(x). That is, as per Wikipedia, a function f is idempotent if applying it multiple times does not change the result beyond the initial application. Consider, for example, the button that summons an elevator. Contrary to popular belief, pressing the button several dozen times does the same as pressing it just once. This is an idempotent operation. In mathematics, the floor and ceiling operators are idempotent.

Phrases such as “same query, when run against the same business logic and time range, returns the same result” and “same input will consistently produce the same output” actually describe reproducibility. When a function is reproducible, we can be sure that applying it to a given set of parameters produces one and only one output. Those still in touch with their mathematical training will note that all deterministic functions are reproducible. Indeed, the very definition of a function demands reproducibility. Recall the vertical line test: if a vertical line intersects a curve more than once, the curve cannot be the graph of a function, for a function can have only one output for each input. If an operation takes 4 as input and can return both 2 and -2 as output (consider the curve given by y² = x), then this operation is neither a function nor reproducible. It would be the equivalent of a recipe that, when executed on a fixed set of ingredients, could produce at times a cake, and at other times a car.
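To make the distinction concrete, here is a minimal Python sketch. The helper `is_idempotent_at` and the function `increment` are hypothetical names introduced purely for illustration. Note that checking a handful of sample inputs only illustrates the property; it does not prove it.

```python
import math


def is_idempotent_at(f, x):
    """Check the defining equation f(f(x)) == f(x) for a single input x."""
    return f(f(x)) == f(x)


def increment(n):
    # Reproducible (deterministic): the same input always gives the same output.
    # Not idempotent: applying it twice gives n + 2, not n + 1.
    return n + 1


samples = [-2.5, 0.0, 3.7, 10.0]

# floor and abs satisfy f(f(x)) == f(x) on every sample.
floor_ok = all(is_idempotent_at(math.floor, x) for x in samples)
abs_ok = all(is_idempotent_at(abs, x) for x in samples)

# increment is perfectly reproducible, yet fails the idempotence check.
increment_ok = all(is_idempotent_at(increment, x) for x in samples)
```

Here `increment` is the instructive case: it returns one and only one output for each input, so it is reproducible, but re-applying it on top of its own result changes the outcome, so it is not idempotent.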

Let’s try to understand why the two are so often confused with one another. The source of the confusion may be that in both cases the function, operation, or procedure in question seems to be characterized by its behavior when it is re-applied. Indeed, reproducibility is stressed most in (and actually adopted from) the scientific community, where an experiment is considered reproducible if redoing it yields the same result each time. Likewise, idempotence is defined by how a function behaves when re-applied. Note, however, that it is wrong to understand reproducibility by asking yourself what happens when the function is re-applied.

The definition of reproducibility is tightly bound to the definition of a function in mathematics, and has nothing to do with any re-application of the function. A reproducible procedure results in a single outcome for a given set of inputs. It is strictly because just one outcome can be reached for that set of inputs that we say the outcome is reproducible: that outcome will be reached again whenever the procedure is executed on the same inputs. Think about a proof in mathematics. To prove that 2 + 2 = 4, you demonstrate that NO outcome other than 4 is possible. You don’t prove it by running 2 + 2 a dozen times and comparing the answer each time. That’s beyond amateur. This, incidentally, is why we’re moving away from object-oriented programming and towards functional programming. By programming in a functional manner, we can guarantee that the program always evaluates to just one outcome (the very definition of a function). Once we validate that it evaluates to the desired outcome, we can be assured that it will always evaluate to that outcome (reproducible). A program is not validated by comparing outcomes across several executions; that doesn’t count as proof. So, we must divorce our understanding of reproducibility from the question of what happens when the function is re-applied. Instead, we ask ourselves: does this procedure lead to a single outcome? A recipe for a cake must produce just that, a cake. Then we can be assured that if we take the same ingredients and follow the same procedure, a cake can be reproduced.

In the case of idempotence, it is correct to examine a function’s behavior when it is re-applied. Specifically, consider applying a function on top of or upon a previous application. If such compounded applications of a function reach the same state as just the initial application of the function, we can conclude that the function is idempotent.

Disregarding the distinction between these words can prove costly, especially in our realm of data engineering.

Let’s understand idempotence and its not-to-be-confused-with cousin, reproducibility, in the context of data engineering. Idempotence is a characteristic (and a desired one) of a data pipeline. Recall that idempotence pertains to the re-application of a function, in this case the re-execution of a pipeline. Any experienced data engineer knows that there is a long list of reasons why a pipeline may need to be retriggered: faulty source data, a bug in the transformation logic (not necessarily one that errors out), the addition of a new dimension to the data, or the ever-evolving data contract for the datasets the pipeline generates (“agile” development). The end result of running a pipeline is the dataset that it creates, and the question is: should this dataset show signs of all the times the pipeline was run to get there? Should the stakeholder of the dataset worry about filtering out duplicates from each re-execution, or otherwise account for the number of times the pipeline was run to create that dataset? Such a design would make “decisions about what needs to be re-executed require[s] the full context and understanding of all potential side-effects of previous executions”, as stated in Maxime Beauchemin’s article “Functional Data Engineering — a modern paradigm for batch data processing”. For the sake of operability of the pipeline, idempotence is paramount.

So, idempotence applies at the level of the data pipeline: whether the pipeline was run once or re-run several times, the dataset it creates should be the same. A data pipeline can be generalized to the chain of events extract-transform-load, and reproducibility applies specifically to the ‘transform’ segment of that sequence. It is here that the transformation logic must be deterministic and reproducible. To reiterate, reproducibility is not ensured by observing the repeated behavior of the transformation, but by the fact that the transformation, applied to a fixed set of parameters, leads to a single outcome. Only then can we be assured that our transformations are reproducible, and only then can we validate that this single outcome meets the business requirements. If we can’t guarantee a single outcome, there’s no way we can validate the data. Go figure, functional programming is (clearly) the way to go here.
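As a rough sketch of what a reproducible ‘transform’ step might look like, compare the two hypothetical functions below. The row layout, the field names, and both function names are assumptions made up for illustration, not taken from any real pipeline. The first function reads the wall clock, a hidden input, so the same rows can yield different outputs. The second depends only on its explicit parameters.

```python
from datetime import date, datetime, timezone


def transform_with_hidden_input(rows):
    # Not reproducible: the current time is an implicit input, so the same
    # rows can produce a different output on every execution.
    stamp = datetime.now(timezone.utc).isoformat()
    return [{**row, "processed_at": stamp} for row in rows]


def transform(rows, run_date):
    # Reproducible: the output is determined entirely by the explicit
    # parameters (rows, run_date). The input rows are not mutated.
    return [
        {
            **row,
            "amount_cents": round(row["amount"] * 100),
            "run_date": run_date.isoformat(),
        }
        for row in rows
    ]


rows = [{"order_id": 1, "amount": 12.5}, {"order_id": 2, "amount": 3.0}]
result = transform(rows, date(2020, 4, 24))
```

Passing the run date in as a parameter, rather than reading it from the clock, is what lets us reason about `transform` the way the paragraph above describes: for a fixed set of parameters there is a single possible outcome, and that outcome can then be validated against the business requirements.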

Now, it’s quite easy to see why reproducible transformations in a pipeline need not imply that the pipeline is idempotent, and this is why it’s key to respect the difference between these two words. A crucial design principle for an idempotent data pipeline is that every “write” should actually be an “overwrite”. At SSENSE, most of our data pipelines populate our data lake and follow the same structure: the data is first pooled in the data lake’s “Raw” bucket, then moved to the “Interim” layer, where it gets cataloged, and finally refined into the “Business” layer, where the consumable data resides, with each dataset corresponding to a specific business need. All the operations that write into any of the three layers of the data lake are an “overwrite”, either of the entire dataset or of the specific partition in question. We implement this either by setting the “replace” option to ‘True’ when moving files from one layer to another or, in the case of creating a business layer dataset, by explicitly deleting what existed and then writing.

It’s easy to see why the “delete then write” (or overwrite) approach is idempotent. Recall the definition of idempotence as f(f(x)) = f(x).

Let f be the “delete then write” operation and x the data we want to write to location T.
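One way to spell this out, as a sketch: treat the operation as a function on the state of storage. The symbols S (the storage state before the run) and f_x (the operation that writes the data x) are notation assumed here for illustration. Writing S[T ↦ x] for the state S with the contents of location T replaced by x:

```latex
\begin{aligned}
f_x(S) &= S[T \mapsto x] \\
f_x(f_x(S)) &= S[T \mapsto x][T \mapsto x] = S[T \mapsto x] = f_x(S)
\end{aligned}
```

The second application deletes the x that the first application wrote and writes x again, so location T ends up holding exactly x in both cases.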

As can be seen, applying the function multiple times, f(f(x)), does not change the result beyond the initial application, f(x).

Failing to enforce overwrites leads to data pipelines that are not idempotent, even if their transformations are reproducible. If we append as opposed to overwrite, then re-executing a pipeline will result in a different dataset, one that also contains all the data from the previous run. Multiple runs of the pipeline give a different result than the initial run, thus failing to meet the idempotence property.
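The contrast between overwriting and appending can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our actual implementation: local files in a temporary directory stand in for a data lake location, and the function names and the JSON-lines layout are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_overwrite(target, rows):
    """'Delete then write': remove whatever exists at target, then write rows."""
    if target.exists():
        target.unlink()
    target.write_text("".join(json.dumps(row) + "\n" for row in rows))


def write_append(target, rows):
    """Append rows to whatever already exists at target."""
    with target.open("a") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_rows(target):
    return [json.loads(line) for line in target.read_text().splitlines()]


rows = [{"order_id": 1}, {"order_id": 2}]

with tempfile.TemporaryDirectory() as tmp:
    overwritten = Path(tmp) / "overwritten.jsonl"
    appended = Path(tmp) / "appended.jsonl"

    # Simulate the same pipeline run being triggered three times.
    for _ in range(3):
        write_overwrite(overwritten, rows)
        write_append(appended, rows)

    overwrite_result = read_rows(overwritten)  # same as after a single run
    append_result = read_rows(appended)        # one copy of rows per run
```

After three runs, the overwritten file holds the two rows exactly as it would after one run, while the appended file holds six rows and so carries a trace of every execution. The transformation feeding both writers is identical and fully reproducible; only the write strategy decides whether the pipeline is idempotent.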

To conclude, when discussing transformations, think reproducibility (think reproducibility, think functional programming!). When discussing pipelines, think idempotence. Ask yourself whether re-running the pipeline (there are countless reasons for doing so) will give the same result as running the pipeline just once, and whether the final dataset produced by the pipeline will show signs of its previous executions (it shouldn’t). As discussed, an often overlooked design principle for attaining this is a forced overwrite. To many, the concepts of reproducibility and idempotence are clear, and there is no practice I’ve mentioned in this article that they don’t already implement. Yet time and time again, I’ve seen and heard data engineers loosely interchange these two words. When asked to explain idempotence, they describe reproducibility. The difference between the two is stark and must be respected.

*All XKCD comics on this page are published under a Creative Commons Attribution-NonCommercial 2.5 License.

Editorial reviews by Deanna Chow, Liela Touré, & Prateek Sanyal.

Written by Vivek Gidla