Focus on Maintainability and Break Those ETL Tasks Up

Chris Moradi · Published in 97 Things · Jul 11, 2019

As the data science tent widens, practitioners may excel at using prepared data but lack the skills to do this preparation reliably. These responsibilities can be split across multiple roles and teams, but enormous productivity gains can be achieved by taking a full-stack approach in which data scientists own the entire process from ideation through deployment.

Whether you’re a data scientist building your own ETLs or a data engineer helping data scientists do so, making your data pipelines easier to understand, debug, and extend will reduce the support burden on you and your teammates, enabling greater iteration and innovation in the future.

The primary way to make ETLs more maintainable is to follow basic software engineering best practices and break the processing into small, easy-to-understand tasks that can be strung together, preferably with a workflow engine. Small ETL tasks are easier for new contributors and maintainers to understand, they’re easier to debug, and they allow for greater code reuse.
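As a minimal sketch of what this looks like, the example below declares three small tasks and chains them with a workflow engine, here Apache Airflow. Everything in it is hypothetical: the DAG id, the callables, and the schedule are invented for illustration, and it assumes Airflow 2.x.

```python
# Hypothetical sketch of small, chained ETL tasks (assumes Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders():
    """Pull raw orders from the source system into a staging table."""
    ...


def build_flags():
    """Derive flag columns in their own small, reviewable step."""
    ...


def aggregate_daily():
    """Roll the flagged rows up into a daily summary table."""
    ...


with DAG(
    dag_id="orders_daily_etl",  # hypothetical name
    start_date=datetime(2019, 7, 1),
    schedule_interval="@daily",
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    flags = PythonOperator(task_id="build_flags", python_callable=build_flags)
    aggregate = PythonOperator(task_id="aggregate_daily", python_callable=aggregate_daily)

    # Each task stays tiny; the engine wires them together, schedules
    # them, and retries failures at the level of an individual step.
    extract >> flags >> aggregate
```

Because each task is a separate, named unit, a failure surfaces at a specific step rather than somewhere deep inside one monolithic script.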

Doing too much in a single processing step is a common pitfall for both the inexperienced and the highly experienced. With less experience, it can be hard to know how to decompose a large workflow into small, well-defined transformations. If you’re relatively new to building ETLs, start by limiting the number of transformations you perform in each task: separate steps like joining source tables, creating flag columns, and aggregating the results, as in the sketch below. Seek advice and code reviews from more experienced colleagues and from the teammates who will help support your ETLs in production. These reviews should focus on simplicity rather than performance.
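Here is one hedged, hypothetical illustration in pandas of that separation: the join, the flag columns, and the aggregation each live in their own tiny function. The table and column names (`accounts`, `transactions`, `amount`, and so on) are invented for the example.

```python
import pandas as pd


def join_sources(accounts: pd.DataFrame, transactions: pd.DataFrame) -> pd.DataFrame:
    """One task: join the raw source tables on their shared key."""
    return transactions.merge(accounts, on="account_id", how="left")


def add_flag_columns(df: pd.DataFrame) -> pd.DataFrame:
    """One task: derive flag columns, and nothing else."""
    out = df.copy()
    out["is_large"] = out["amount"] > 1_000  # hypothetical business rule
    return out


def aggregate_by_account(df: pd.DataFrame) -> pd.DataFrame:
    """One task: roll flagged rows up to one row per account."""
    return (
        df.groupby("account_id")
          .agg(total_amount=("amount", "sum"), n_large=("is_large", "sum"))
          .reset_index()
    )
```

Each function can be unit-tested and code-reviewed on its own, and the join logic can be reused by another pipeline without dragging the aggregation along with it.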

Highly experienced data engineers can also produce overly dense pipelines, because chains of complex transformations feel commonplace to them. While this is acceptable if they are the only ones maintaining these ETLs, it prevents less experienced data scientists or engineers from supporting, modifying, or extending these pipelines. This can block innovation, because the data scientists become reliant on a small number of experts to implement changes.

If you’re an experienced data engineer, consider how easy it would be for someone with less experience to understand and build upon your work, and refactor where that would make it more accessible. Your work doesn’t need to be universally understood, but weigh how much you and others actually gain from the added complexity.
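To make this concrete, here is a hypothetical pandas example showing the same transformation written first as one dense chain and then refactored into named intermediate steps; the inputs and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical inputs for illustration.
df = pd.DataFrame({"id": [1, 2, 3], "value": [10, -5, 7]})
ref = pd.DataFrame({"id": [1, 2, 3], "region": ["east", "east", "west"]})

# Dense: several transformations fused into a single expression.
result = (
    df.merge(ref, on="id")
      .assign(is_positive=lambda d: d["value"] > 0)
      .groupby("region")["is_positive"]
      .mean()
      .reset_index(name="positive_rate")
)

# Refactored: the same logic, with each step named so the intent is obvious.
joined = df.merge(ref, on="id")
flagged = joined.assign(is_positive=joined["value"] > 0)
positive_rate = (
    flagged.groupby("region")["is_positive"].mean().reset_index(name="positive_rate")
)
```

Both versions compute the same result, but the dense one buries three decisions in a single expression, while the refactored one lets a newcomer inspect `joined` or `flagged` directly when debugging.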

There may be computational costs to breaking pipelines into small tasks, since the work can’t be optimized across task boundaries. However, we sometimes focus too much on run-time performance when we should focus instead on the speed of innovation it enables. There are cases where performance is critical, but optimizing a daily batch job to shave an hour off the run-time may add weeks or even months to the effort of implementing future enhancements.
