Clean Data Pipelines

Amom Mendes
Blog Técnico QuintoAndar
8 min read · Feb 12, 2021

Clean Code is a must-read book for software engineers (SWEs) and is part of the curriculum of several software engineering programs. It is what we could call a classic of the software engineering literature. Although the need to apply best practices and principles to their craft is quite obvious to SWEs, it is not so obvious to people who come from other contexts and backgrounds. In fact, the brand-new data universe has opened up new opportunities for people from distinct areas (e.g., biology, chemistry, physics, geology, and so on) who have no idea what, for example, the SOLID principles are.

At QuintoAndar we have a huge focus on constantly improving our code quality and architecture. I think there are two main drivers behind this focus: 1) Necessity: we need to improve code maintainability to scale our pipelines; and 2) Culture: managers and engineers recognize (and prioritize) the value of applying such principles to guarantee code quality and cleanliness.

A recent example of how this culture translates into practice is that we began to read and discuss relevant books about coding and architecture. Such discussions have been bearing fruit: we are improving our codebase and deepening the conversations about our code and architecture. And, as you would expect, we began with the Clean Code book.

Here we describe some insightful ideas from our Clean Code study group that can be applied to data pipelines.

From the Clean Code to Clean Data Pipelines

Naming things

At first glance, naming things seems quite simple. I mean, you could, for example, just name your Spark data frame transformations df, df2, and so on. What is the problem? The answer seems obvious to some of you, but not to others.

The rule of thumb that your “code will be read more than it is written” applies here, and this read/write ratio is 10:1 according to Uncle Bob. Therefore we need to be intentional in our naming practice. The names df, df1, and df_whatever say nothing about what these data frames represent, or even about the transformations they go through. They force “mental mapping”, and the consequence of such a practice is code that is hard to read and understand. A good practice is to describe what the data frame represents, as in the following Scala example:

Spark data frames naming
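
As a minimal sketch of the idea (the dataset, columns, and paths below are hypothetical), compare how the intentional names read against their df/df2 counterparts:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

object ListingPriceReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("listing-price-report").getOrCreate()

    // Hard to follow: the names say nothing about the data or the transformations.
    // val df  = spark.read.parquet("s3://bucket/listings")
    // val df2 = df.filter(col("is_active"))
    // val df3 = df2.groupBy("city").agg(avg("rent_price"))

    // Intentional names describe what each data frame represents.
    val rawListings: DataFrame = spark.read.parquet("s3://bucket/listings")
    val activeListings: DataFrame = rawListings.filter(col("is_active"))
    val avgRentPriceByCity: DataFrame =
      activeListings.groupBy("city").agg(avg("rent_price").as("avg_rent_price"))

    avgRentPriceByCity.write.mode("overwrite").parquet("s3://bucket/reports/avg_rent_price_by_city")
  }
}
```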

We can use different conventions for naming classes, methods, and variables, but consistency needs to be universal across your data pipelines. It means that a pattern is recognizable when one reads the code. For example, it isn't strictly necessary to name classes using camel case (sure, there are language conventions too), but it is necessary to understand the pattern of your codebase and follow it. If such a pattern does not exist, define/propose it! Everybody wins.

There are several useful tips and rules in the book about naming (chapter 2, “Meaningful Names”), but the overall principle could be stated as: the machine is not the only reader of your code and, as with other forms of communication, we need to be clear, informative, intentional, and consistent. Yeah, naming things is not so easy!

Decoupling things

1 — Avoid single script files

The book does not address this point directly, but the ease of reading, transforming, and writing data with modern frameworks can tempt us to write code in single script files (e.g., Python scripts and notebooks). It looks like a good way to accelerate the delivery of data to the end user, but in the medium/long term (or even in the short term) it will result in code that is not scalable, extensible, or, ultimately, usable. The fate of this super cool one-file script is to rot over time.

One single script with many responsibilities
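
A sketch of what this anti-pattern tends to look like (all names, tables, and paths here are hypothetical): one object that knows about connections, business rules, partitioning, and storage layout all at once:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Anti-pattern: one script with many responsibilities.
object IngestEverything {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ingest-everything").getOrCreate()

    // Reading: connection details live right here in the job.
    val contracts = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db:5432/app")
      .option("dbtable", "contracts")
      .load()

    // Business rule: mixed in with everything else.
    val activeContracts = contracts.filter(col("status") === "ACTIVE")

    // Writing: partitioning and storage layout, also hardcoded here.
    activeContracts
      .repartition(col("signed_at"))
      .write
      .mode("overwrite")
      .partitionBy("signed_at")
      .parquet("s3://lake/contracts")
  }
}
```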

Your script files (and directory structure) also carry semantics and need to be organized in a meaningful way. If you try to save bytes by cramming everything into a few big files, you will waste your time and other developers' time, and it will obviously take a lot of project hours to refactor your code later! 😁

A simple example of refactoring single script pipelines
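
A hedged sketch of one possible refactoring of that script (again, all names are illustrative): each step lives behind its own abstraction, and the entry point only wires them together:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Each piece has a single, explicit responsibility.
class JdbcContractReader(spark: SparkSession, url: String) {
  def read(table: String): DataFrame =
    spark.read.format("jdbc").option("url", url).option("dbtable", table).load()
}

object ContractTransformations {
  def onlyActive(contracts: DataFrame): DataFrame =
    contracts.filter(col("status") === "ACTIVE")
}

class LakeWriter(basePath: String) {
  def write(df: DataFrame, dataset: String, partitionColumn: String): Unit =
    df.write.mode("overwrite").partitionBy(partitionColumn).parquet(s"$basePath/$dataset")
}

// The entry point only orchestrates the steps.
object ContractsPipeline {
  def main(args: Array[String]): Unit = {
    val spark  = SparkSession.builder().appName("contracts-pipeline").getOrCreate()
    val reader = new JdbcContractReader(spark, "jdbc:postgresql://db:5432/app")
    val writer = new LakeWriter("s3://lake")

    val activeContracts = ContractTransformations.onlyActive(reader.read("contracts"))
    writer.write(activeContracts, "contracts", "signed_at")
  }
}
```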

It's worth noting that refactoring is not bad; it is part of the organic development of your code. Indeed, the Boy Scout Rule should be applied in software development too: “always leave the campground cleaner than you found it”.

2 — Organize things in useful abstractions

There is a lot to be said about this. However, the S in SOLID (the Single Responsibility Principle) is very useful here. Our classes/methods do not need to read and aggregate data, optimize partitions, and then write data to storage all at once. All of these steps, from reading to writing data, can be segregated into useful clients, services, and/or wrappers, and, most importantly, each abstraction should be responsible for, as much as possible, one single action.

Code that has useful abstractions with single responsibility is more testable, readable, and consequently, maintainable.
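
Code like this is also easier to verify: a pure transformation can be unit-tested against a tiny in-memory data frame, without touching any real storage. A short sketch, assuming the hypothetical ContractTransformations object above and ScalaTest:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class ContractTransformationsTest extends AnyFunSuite {

  // A local Spark session is enough to exercise a single-responsibility transformation.
  private val spark = SparkSession.builder()
    .master("local[1]")
    .appName("contract-transformations-test")
    .getOrCreate()

  import spark.implicits._

  test("onlyActive keeps only ACTIVE contracts") {
    val contracts = Seq(("c1", "ACTIVE"), ("c2", "CANCELLED")).toDF("id", "status")

    val result = ContractTransformations.onlyActive(contracts)

    assert(result.select("id").as[String].collect().toSeq == Seq("c1"))
  }
}
```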

3 — (Almost) Always try to anticipate potential extensions of your pipeline

Imagine this: a customer asks you to ingest some tables from a relational database into the data lake. You write good code that does a good job. But now the customer asks you to ingest data from a document database (e.g., Elasticsearch) and needs to present the data to the C-level in the next two days. You then add an if to the code to handle ingestion from these two distinct data sources and create the pipeline to process the Elasticsearch data. Well, the next week, guess what: you now need to consume data from external APIs. The cyclomatic complexity of your ingestion script can grow in proportion to the number of available data sources. It's not sustainable.
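
A sketch of how that branching tends to look (source types, options, and paths are hypothetical): every new source means editing the same function yet again:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Every new data source forces another branch in the same function.
object IngestionJob {
  def read(spark: SparkSession, sourceType: String, location: String): DataFrame =
    if (sourceType == "jdbc") {
      spark.read.format("jdbc").option("url", location).option("dbtable", "contracts").load()
    } else if (sourceType == "elasticsearch") {
      // Assumes the elasticsearch-hadoop connector is on the classpath.
      spark.read.format("org.elasticsearch.spark.sql").option("es.resource", location).load()
    } else if (sourceType == "api") {
      // Pretend the API responses were dumped to a path as JSON.
      spark.read.json(location)
    } else {
      throw new IllegalArgumentException(s"Unknown source type: $sourceType")
    }
}
```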

Therefore, whenever possible, you need to anticipate some future needs and understand that the rework to refactor this code can be a great waste of time in the near future. The O in SOLID (the Open-Closed Principle) says that your code should be open for extension and closed for modification. It means that your pipeline should be extensible, for example, to new data sources, clients, and data targets, but closed to modification via new ifs and cases.
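
One way to keep the pipeline open for extension, sketched with hypothetical names: hide each source behind a common trait, so a new source becomes a new implementation instead of another if inside the pipeline:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// The pipeline depends only on this abstraction.
trait SourceReader {
  def read(spark: SparkSession): DataFrame
}

class JdbcSourceReader(url: String, table: String) extends SourceReader {
  override def read(spark: SparkSession): DataFrame =
    spark.read.format("jdbc").option("url", url).option("dbtable", table).load()
}

class ElasticsearchSourceReader(index: String) extends SourceReader {
  override def read(spark: SparkSession): DataFrame =
    spark.read.format("org.elasticsearch.spark.sql").option("es.resource", index).load()
}

// Adding a new source (e.g., an external API) means adding a new class,
// not modifying the pipeline below.
object IngestionPipeline {
  def run(spark: SparkSession, reader: SourceReader, targetPath: String): Unit =
    reader.read(spark).write.mode("overwrite").parquet(targetPath)
}
```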

There are many more principles related to coding itself that can be extracted from the book, but for data pipelines these are especially useful.

4 — Decouple projects

When your pipelines get bigger, with several clients, logging frameworks, different services, and so on, you will probably need to decouple these little Frankensteins into separate projects. This way, you can install these projects in your data pipelines when necessary and reuse them in other projects. It helps to decouple not only the code but also CI/CD.
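
In a Scala/sbt codebase, for example, this could look like a multi-project build (module names and the dependency coordinates are illustrative), where shared concerns become libraries that each pipeline depends on and that can be built and released on their own:

```scala
// build.sbt (sketch): shared concerns live in their own projects
// and are consumed as libraries by the pipelines.
lazy val dbClients = (project in file("db-clients"))
  .settings(
    name := "db-clients",
    // Coordinates shown for illustration only.
    libraryDependencies += "org.postgresql" % "postgresql" % "42.7.3"
  )

lazy val logging = (project in file("logging"))
  .settings(name := "logging")

lazy val contractsPipeline = (project in file("contracts-pipeline"))
  .dependsOn(dbClients, logging)
  .settings(name := "contracts-pipeline")
```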

5 — Decouple dependencies

By decoupling code from distinct contexts into different projects, you can avoid including unused dependencies in your data pipeline projects. For example, suppose you are creating a new data pipeline (called NewPipeline) reusing the codebase of another project you created (called BasePipeline). However, in BasePipeline you created some clients to handle database connections that will not be used in NewPipeline, or you used a data processing framework that will be different in NewPipeline. You end up bringing in dependencies that will be installed in NewPipeline but never used. So, decoupling things that have independent contexts is useful and desirable in order to keep your data pipelines clean.
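
Continuing the sbt sketch (module names and dependency coordinates are illustrative): NewPipeline declares only the modules it actually uses, so the Elasticsearch connector and its transitive dependencies never reach its classpath:

```scala
// build.sbt (sketch): each pipeline pulls in only what it needs.
lazy val jdbcClients = (project in file("jdbc-clients"))
  .settings(name := "jdbc-clients")

lazy val elasticsearchClients = (project in file("elasticsearch-clients"))
  .settings(
    name := "elasticsearch-clients",
    // Coordinates shown for illustration only.
    libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-30" % "8.13.4"
  )

// BasePipeline needs both kinds of clients...
lazy val basePipeline = (project in file("base-pipeline"))
  .dependsOn(jdbcClients, elasticsearchClients)

// ...but NewPipeline only reads from relational sources, so it does not
// drag the Elasticsearch connector along.
lazy val newPipeline = (project in file("new-pipeline"))
  .dependsOn(jdbcClients)
```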

Beyond the code: decoupling data lake monoliths

Although the book does not touch on architecture design, this is a critical point for creating clean data pipelines and data products in general. Recently, the concept of data mesh emerged as a consequence of the microservices architecture in SWE, which decomposes systems into distributed services for specific domains. With this decomposition, applications have clearer responsibilities, as our code should.

On the other hand, in our data platforms we have constructed what Zhamak Dehghani calls platforms in a “failure mode”. That is because our data lakes are, in fact, big monoliths where all data from different domains (e.g., sales, marketing, and operations) needs to rest in a single place (which can be one of the sources of data swamps, that is, data ingested into the data lake without curation, ownership, treatment, and so on). Any change in a pipeline will necessarily cause changes in this monolith and vice versa. We end up with a heavy coupling between our data products, which creates an architecture prone to failure.

The data mesh architecture tries to decouple this monolith. However, the changes required to implement this concept are more about culture than about tooling itself. In this framework, the ownership of the data sits closer to the data domain producer.

Indeed, data engineering teams become more responsible for providing a data platform (e.g., infrastructure, tools, coding patterns, and so on) that business domains will use to handle, curate, and serve their data to other domains. Given the complexity of changing business structure and culture, there are few companies with data meshes implemented (e.g., see here). At QuintoAndar we have initiatives to implement such a framework in our team distribution, data ownership culture, and tools. We hope to see them bearing fruit and to share these initiatives with the community soon.

Conclusion

Uncle Bob’s Clean Code touches on pain points that all developers will face. But, beyond specific rules and practices, the clean coding mindset is the main benefit of this reading. Although the book was written for SWEs in general, data engineering teams should adopt such principles and culture in their data pipeline practices, adapting them when necessary. Moreover, with more people from distinct backgrounds working with data, we also need to be explicit when applying clean coding on a daily basis, and not assume that it is common ground among data engineers.

I would like to thank Lucas Mendes Mota Da Fonseca, Kenji, and Rafael Ribaldo for the rich discussions and contributions to this post!
