Refactoring ETL Flows in The Wild

4 min readDec 14, 2023

Efficiently managing data is crucial for thriving businesses, and ETL flows are central to achieving this goal. Analysts predict a significant surge in the ETL software market, with its size expected to triple between 2022 and 2030. This growth highlights the essential role ETL processes play in managing, refining, and integrating data from various sources into actionable insights. Yet, the widespread adoption of ETL flows brings forth a pressing challenge: how to bound the maintenance cost of an ever-expanding number of flows.

In this blog, we, at IBM Research Israel, describe an end-to-end prototype for ETL flow refactoring using the IBM DataStage data integration tool, aimed at reducing the maintenance cost, which keeps the human in the loop for refactoring decisions. We view a data integration flow as a directed acyclic graph (DAG) and seek out frequent common subgraphs within workloads. These recurring patterns are then extracted as shared subflows. To accomplish this, we leverage and extend a Frequent Subgraph Mining algorithm (FSM) known as gSpan. For more information have a look at our paper ״Refactoring ETL Flows in The Wild״ Published at the IEEE Big Data 2023 conference.

What is ETL?

ETL(Extract, Transform, Load) flows describe the sequence and steps involved in moving data from its source (Extract), performing operations or transformations on that data (Transform), and finally loading it into a destination like a database, data warehouse, or another system (Load).IBM DataStage is a cutting-edge data integration tool that helps you design, develop, and run ETL jobs. For example, this is a simple flow in IBM DataStage.

It takes two sources -DB2 tables(this is the Extract) joins them (Transform) and stores the result in a DB2 table (Load).

What is FSM?

FSM (Frequent subgraph mining) is a data mining technique used to discover recurring patterns (subgraphs) in a collection of graphs. It involves finding subgraphs that occur frequently, above a specified threshold, in a dataset.

The Challenge

Each customer workload can have tens to hundreds of thousands of ETL flows, developed over several years by various authors. These workloads often contain numerous repetitive or similar flows. When the same subflow is scattered across multiple flows within a project, maintenance and updates become more challenging. Similar to code refactoring where repetitive code is consolidated into reusable functions, in this context, the aim is to refactor repeated flows into separate entities called subflows. These subflows can then be reused across different flows, breaking down complex flows into more manageable components. This is an example of a flow with a subflow in IBM DataStage.

Despite the evident advantages of subflows, many users tend to underutilize them for various reasons. Even if they understand their benefits and aspire to use them, manually refactoring large datasets of flows proves to be a difficult and tedious task.

Our Solution

We created a semi-automated tool to refactor ETL flows from IBM DataStage. The tool allows users to run an analysis of their datasets in the background, get a list of potential subflows, and then explore and choose interactively what to refactor. We view a data integration flow as a directed acyclic graph (DAG). Note that DAGs are used to model workloads in a wide variety of tools such as DBT, Airflow, and DVC so our work is applicable beyond DataStage and data integration. Our refactoring workflow consists of three phases: the pre-processing phase, the refactoring analysis phase, and the user interaction phase.

During the pre-processing phase, which is offline and operates in the background, a label is generated for each node using a function called Lifter. This function takes node parameters as input and produces a label. Parameters can range from just the stage type to all node parameters. The next step is to eliminate duplicate graphs, as this will create redundant results in the FSM algorithm.

During the refactoring analysis phase, which also operates in the background, we identify potential subflows for refactoring. The first step in this phase runs the FSM algorithm. Given the significant number of resulting subflows, the subsequent action filters the results. The last step in this phase scores each subflow to form an ordered list of subflows recommended for refactoring. For a good user experience, the offline phases described above should take up to several minutes.

The user interaction phase is an online phase. The initial step requires the user to select the flows for refactoring from the ordered list generated in the refactoring analysis phase. To facilitate this process, we implement an interactive tool that enables users to examine each subflow and view associated statistics on the flows. We proceed with the flow refactoring task. A new flow is generated based on the chosen subflow and redirects the original flows to it. This figure shows an example of two flows before and after refactoring.

Conclusions

We address the challenges posed by the proliferation of ETL flows, which gradually become harder to maintain as their number increases with semi-automated tool for ETL flow refactoring.

For additional information, please refer to our paper.

Learn about our dataset in this blog.