Idempotent Data Pipeline

Making your Data Pipeline reliable

Abubakar Alaro
Geek Culture
3 min read · Feb 17, 2022


Idempotent, as defined in the Oxford dictionary, is “an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.”

You are probably asking: how does this relate to my data pipeline? Well, hold my beer, will you?


Outline:

  1. What is an idempotent data pipeline
  2. Advantages of an idempotent data pipeline
  3. How to make a data pipeline idempotent
  4. Conclusion

1. What is an Idempotent data pipeline

Running a pipeline that gets data from a source and loads it into a relational database more than once can result in duplicate values in the database, thereby causing wrong metrics and many other errors. Making a pipeline idempotent prevents this and makes you a better engineer as well.

In other words, an idempotent data pipeline is one that, when run multiple times with the same input, always produces the same output.
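A toy illustration of the difference, with a plain Python list standing in for the warehouse (the names here are hypothetical, for illustration only):

```python
rows = ["order_1", "order_2"]
warehouse = []

def load_append(data):
    # Non-idempotent: every rerun appends again, duplicating rows.
    warehouse.extend(data)

def load_overwrite(data):
    # Idempotent: every rerun replaces the contents, so any number
    # of runs leaves the warehouse in the same final state.
    warehouse[:] = data

load_append(rows)
load_append(rows)
print(len(warehouse))   # 4 -- rerunning duplicated the data

load_overwrite(rows)
load_overwrite(rows)
print(len(warehouse))   # 2 -- same output no matter how many runs
```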

2. Advantages of an Idempotent data pipeline

Listed below are some advantages of an idempotent data pipeline:

  • It ensures that no duplicate data is written to a storage location when backfilling
  • It makes the result of a transformation in a pipeline predictable
  • It helps reduce data storage expenses
  • It also helps remove old/unwanted data

3. How to make a data pipeline Idempotent

The most common steps in a data pipeline are:

  • pull data from one or more sources
  • perform some transformations
  • load the data into a data warehouse

An idempotent pipeline ensures that if an error occurs at any of these steps, the expected result is still produced when the pipeline is rerun.

If an error occurs during the load stage and the data is not fully loaded into the data warehouse, our idempotent pipeline should delete the half-loaded data from the warehouse and write a complete, fresh copy when the pipeline is rerun. This is advised only when the pipeline produces the same data on a rerun. This pattern is known as the delete-write pattern. Technologies like Spark and Snowflake offer other idempotent design patterns, such as Spark's overwrite mode.

One way to implement the delete-write pattern in Python is shown below:

delete_write_idempotent pipeline
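The original snippet was shared as an image; a minimal sketch of the same idea follows. It assumes pandas is available, and it substitutes a local CSV file for the S3 parquet source in the article so the example stays self-contained; the function and column names (run_pipeline, customer_id, and so on) mirror the article's description but are otherwise illustrative.

```python
import os
import shutil

import pandas as pd


def extract(source_path: str) -> pd.DataFrame:
    # The article's pipeline pulls a parquet file from an S3 bucket with
    # pd.read_parquet; a local CSV stands in here for a runnable sketch.
    return pd.read_csv(source_path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Business logic from the article: drop all rows where
    # customer_id equals "A".
    return df[df["customer_id"] != "A"]


def load_dwh(df: pd.DataFrame, output_location: str) -> None:
    # Delete-write pattern: if a previous (possibly partial) run left
    # data at the output location, delete that folder first, then
    # recreate it with the new data. Reruns always end in the same state.
    if os.path.exists(output_location):
        shutil.rmtree(output_location)
    os.makedirs(output_location)
    df.to_csv(os.path.join(output_location, "data.csv"), index=False)


def run_pipeline(source_path: str, output_location: str) -> None:
    load_dwh(transform(extract(source_path)), output_location)
```

Because load_dwh wipes the output location before writing, running run_pipeline once or ten times against the same input leaves the warehouse folder with exactly the same contents.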

In the snippet above, I implemented a simple ETL pipeline that gets a parquet file from an S3 bucket and reads it with pandas' read_parquet. The data is then transformed based on the identified business logic; in my case, that means removing all rows where customer_id equals "A". The delete-write pattern is implemented in the last part of the ETL. Here, the load_dwh function accepts two arguments, the dataframe and the output_location; checks whether the output_location already exists, which will be the case if an error occurred while loading the data; and, if so, deletes the folder specified in the output_location and recreates it with the new data.

4. Conclusion

Having an idempotent data pipeline can prevent a lot of headaches for a data engineer, especially when the pipeline needs to be rerun multiple times because of errors or changed business logic.

Reference and Further Reading

Startdataengineering has an article on making a data pipeline idempotent

Fivetran explains idempotence and how it failure-proofs a data pipeline
