5 Checklist Points for Data Pipeline Code Review

Shani Cohen · Published in Israeli Tech Radar · Jan 25, 2024

Photo by Anne Nygård on Unsplash

Two types of people pass code review without comments: those who write perfect code, and those who write piles of code. Since I don't belong to either group, my reviews often come back with a series of comments. Many of them are insightful, so I keep them in a "pre-push checklist" for self-review.

While refactoring a batch pipeline, I added five principles to that list. They are independent of any particular technology and can be applied to any application.

1. Ensure Idempotency

Idempotency is a bit of a strange word, but the idea is simple: performing a certain action several times produces the same result as performing it once.

X + 0 Graph. Adding zero will never change the result, regardless of how many times you do it

Our pipeline is built from a collection of small tasks, each of which can be re-run if necessary without duplicating data or breaking its consistency.

The ability to re-run tasks is important for operability. If a task fails or its logic changes, we want to be able to re-run or backfill it safely.

For example, using the current date/time value to extract "yesterday's" data indicates poor planning: the value changes from run to run, which violates idempotency. It is better for the task to receive an execution date as input, which preserves the ability to perform historical runs.
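As a minimal sketch, assuming a raw_events table with an event_date column and an orchestrator that passes :execution_date as a bind parameter (both names are illustrative):

-- Non-idempotent: the extracted frame depends on when the query runs
SELECT *
FROM raw_events
WHERE event_date = CURRENT_DATE - 1;

-- Idempotent: the date is an explicit input, so re-runs and backfills
-- produce the same result every time
SELECT *
FROM raw_events
WHERE event_date = :execution_date;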

Establishing idempotency in a data pipeline can be achieved easily by overwriting the data. With this approach, there is no need to worry about data inconsistency:

INSERT OVERWRITE INTO sf_employees
SELECT * FROM employees
WHERE city = 'San Francisco';

However, the downside is its lack of scalability. A better approach is to overwrite only the specific partitions affected by the run: it provides the same idempotency guarantee without processing all the data every time.
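As a sketch of a partition-level overwrite, in Hive/Spark-style syntax (Snowflake's INSERT OVERWRITE replaces the whole table and has no PARTITION clause), assuming the table is partitioned by an illustrative load_date column:

-- Only the partition for the given date is replaced;
-- the rest of the table is left untouched
INSERT OVERWRITE TABLE sf_employees PARTITION (load_date = '2024-01-24')
SELECT id, name, city
FROM employees
WHERE city = 'San Francisco';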

2. Insist on task atomicity

Each task in our pipeline should do exactly one thing. This means there is no such thing as one part of a task failing while another succeeds: the whole task either succeeds or fails. Another (and stricter) definition of "one thing" is something that cannot be broken down further (hence the name: atomic).

Lego blocks also cannot be divided. Photo by Xavi Cabrera on Unsplash
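A minimal sketch of the all-or-nothing behaviour, assuming a warehouse that supports multi-statement transactions and illustrative daily_sales / raw_sales tables:

-- Re-load one day as a single unit of work: if the INSERT fails,
-- the DELETE is rolled back too and no partial state remains
BEGIN;

DELETE FROM daily_sales
WHERE sale_date = :execution_date;

INSERT INTO daily_sales
SELECT sale_date, store_id, SUM(amount) AS total_amount
FROM raw_sales
WHERE sale_date = :execution_date
GROUP BY sale_date, store_id;

COMMIT;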

Defining the right task boundaries is important for enabling parallel processing and for allocating different resources to different tasks. As a bonus, it results in simpler, reusable, and testable code.

If possible, a task should produce exactly one output. This makes mapping tables to tasks (and partitions to task instances) easier. Be strict in this regard, so that each piece of data is mapped to exactly one task.

DBT lineage graph. DBT creates a single output for a single model
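dbt makes this mapping explicit: each .sql file is one model that materializes exactly one table or view. A minimal sketch (the model name and columns are illustrative):

-- models/sf_employees.sql: one model, one output table
SELECT id, name, city
FROM {{ ref('employees') }}
WHERE city = 'San Francisco'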

3. Intermediate storage

The intermediate storage zone is where data is stored after extraction and before it is loaded into the final destination. Data that needs additional quality checks or transformations before it can be loaded into the target data store can use this zone as a staging area.

With intermediate storage, transferring data between two places is split into separate export and load tasks, so each phase can be handled on its own. A common practice for achieving the same kind of decoupling during internal transformations is to use intermediate tables or views for interim results.

A laundry basket as an intermediate storage. Photo by Annie Spratt on Unsplash
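A minimal sketch, assuming an illustrative staging schema, a source_db.employees source, and an analytics.employees target:

-- Export task: land the extracted data in the staging area first
CREATE OR REPLACE TABLE staging.employees_raw AS
SELECT id, name, city, updated_at
FROM source_db.employees;

-- Load task: runs separately, so only validated rows reach the target
INSERT INTO analytics.employees
SELECT id, name, city, updated_at
FROM staging.employees_raw
WHERE id IS NOT NULL;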

4. Prefer Incremental loading

We prefer loading data incrementally whenever we are not dealing with small tables. Instead of processing an entire data set, each run should only process a subset of the data, such as records from a specific date.

If possible, it is best to define the incremental loading frame using a last-update timestamp field in the source table. When the data is append-only and never updated, a sequence ID can be used instead.
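As a sketch of both variants, assuming illustrative source_db and analytics schemas and that the target already holds the previously loaded values:

-- Mutable source: pull only rows changed since the last successful load
SELECT id, name, city, updated_at
FROM source_db.employees
WHERE updated_at > (
    SELECT COALESCE(MAX(updated_at), '1900-01-01'::TIMESTAMP)
    FROM analytics.employees
);

-- Append-only source: use a monotonically increasing sequence ID instead
SELECT id, event_type, event_seq
FROM source_db.employee_events
WHERE event_seq > (
    SELECT COALESCE(MAX(event_seq), 0)
    FROM analytics.employee_events
);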

When ingesting data into an immutable data store, you can only overwrite full partitions or create new ones; modifying existing data in place is not possible and the operation will fail. The loading frame of a task must therefore take the partitioning scheme into account. Even with mutable data stores, it is advisable to keep the same partition-aware logic for optimization reasons, since updates that cut across partitions are inefficient in distributed storage.

5. Self Checks

It is good to add checks to a pipeline to ensure the tasks are producing the expected results. Adding a new check involves many decisions: it is not always clear when it should run and what it should test. I highly recommend a wonderful method that answers those questions in detail in the following article.
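As a simple sketch of such checks, assuming the orchestrator runs them after the load and fails the task when a result does not match the expected value (table and column names are illustrative):

-- Check 1: the partition we just loaded should not be empty
SELECT COUNT(*) AS row_count
FROM analytics.employees
WHERE load_date = :execution_date;

-- Check 2: the primary key should be unique
SELECT COUNT(*) AS duplicate_keys
FROM (
    SELECT id
    FROM analytics.employees
    GROUP BY id
    HAVING COUNT(*) > 1
) AS dupes;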

“Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.”
