Modularization Using Auto-Generated Pipeline With DataBathing

Jiazhen Zhu
Walmart Global Tech Blog
6 min read · May 24, 2022

A mini-guide about MicroService design for Data Pipeline: Part III

Photo credit: Pixabay

In Part I, Modularization Using Python and Docker for Data Pipeline, and Part II, Modularization Using Graph-Style Serverless Pipeline Framework, we explored how Docker and serverless designs can be adopted for data pipelines. However, in both designs we still need to write Python code to implement the required logic.

Then a voice in my head asked: can we have an autopilot-style framework to support ETL jobs? Can we get both the power of Apache Spark and the simplicity of SQL-centric tools? After creating and using DataBathing, we now have our autopilot-style pipeline framework.

Let’s meet the Next-Generation Auto-Generated Pipeline Framework!!!

Agenda

  • What is autopilot for data pipeline?
  • Auto-Generated pipeline framework
  • HERO: DataBathing !!!
  • Small demo
  • Currently supported features
  • Next roadmap
  • Contribution
  • Summary

What is autopilot for data pipeline?

Let’s first look at what autopilot currently means for a car. Based on my understanding, there are two critical points. The first is that it can steer, accelerate, and brake for the driver, who doesn’t need to take extra action. The second is that it still needs supervision for safety purposes. In short, the car can take care of most of the job with little input from the driver.

Autopilot enables your car to steer, accelerate and brake automatically within its lane. Current Autopilot features require active driver supervision and do not make the vehicle autonomous.

I want to carry some concepts from car autopilot over to the auto-data-pipeline. An auto-data-pipeline is a pipeline that can take care of most ETL jobs with only a little logic input from developers and domain experts.

Auto-Generated pipeline framework

Figure 1. Auto-Generated Pipeline Framework

Our framework is one such auto-data-pipeline. Based on the YAML files for the Extractor, Transformer, and Loader, the framework can parse the files, auto-generate the data pipeline, and trigger the pipeline by itself. In Figure 1, we can see three different areas, or features, in the framework. I will discuss each of them below:

Parsing Engine

There are three different engines for the parsing part. These parsing engines take the YAML inputs as logic and transform that logic into Dataframe calculation flows, which are then sent to the next engine, the Auto Gen Pipeline Engine. We have standardized rules for the three different YAML inputs and will explain more in the demo section.

Auto Gen Pipeline Engine

When the Auto Gen Pipeline Engine gets the Dataframe calculation flows, it auto-generates the data pipeline based on the sequence of flows and logic. The Tasks Running Engine then picks up and runs the whole set of Spark jobs without any coding input from users.

Tasks Running Engine

After getting the sequence of Spark jobs, this running engine triggers all the tasks, again without any engineering input.

With these three engines, users of our auto-data-pipeline can focus on the logic itself (YAML) instead of coding and debugging.
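
To make the idea concrete, here is a minimal, hypothetical sketch (not the framework’s actual implementation) of how steps parsed from YAML could be ordered by their Dataframe dependencies and then executed in sequence:

    # Hypothetical sketch: order parsed steps by their Dataframe dependencies,
    # then run them in that order. Step names and callables are placeholders.
    from graphlib import TopologicalSorter

    # Steps parsed from the Transformer YAML: name -> (dependencies, runner)
    steps = {
        "fact_dim_df": ([], lambda deps: "Spark code generated from SQL"),
        "agg_df": (["fact_dim_df"], lambda deps: "aggregation over fact_dim_df"),
    }

    # "Auto Gen Pipeline Engine": build the execution order from dependencies.
    order = TopologicalSorter({name: dep for name, (dep, _) in steps.items()})

    # "Tasks Running Engine": execute every step in dependency order.
    results = {}
    for name in order.static_order():
        dep, run = steps[name]
        results[name] = run({d: results[d] for d in dep})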

Figure 2: Auto-Generated Pipeline Framework with DataBathing

Figure 2 is a high-level diagram that presents the relationships and dataflow among the Auto-Generated Workflow Service, the DataBathing Service, and Airflow. Because we build all of these services as reusable components, it is easy to plug them into and out of our current process.

HERO: DataBathing !!!

The key enabler is the DataBathing library (Blog, PyPI, and GitHub). Its main idea is to parse a SQL query into Spark code. Please read the blog post and GitHub repo if you are interested; otherwise, we can treat it as a black box inside our framework.
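
As a hedged illustration (not DataBathing’s verbatim output), the translation looks roughly like this: a SQL query written in the YAML becomes equivalent PySpark Dataframe code.

    # Illustration of the kind of SQL-to-Spark translation DataBathing performs.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    orders_df = spark.createDataFrame(
        [(1, 10.0), (1, 5.0), (2, 7.5)], ["order_id", "amount"]
    )

    # Input SQL (what a user would write in the Transformer YAML):
    #   SELECT order_id, SUM(amount) AS total_amount
    #   FROM orders_df GROUP BY order_id
    # Output: Dataframe code roughly equivalent to what DataBathing generates.
    final_df = orders_df.groupBy("order_id").agg(
        F.sum("amount").alias("total_amount")
    )
    final_df.show()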

Small demo

In this demo, I will go through one example, which was shown in Figure 1. This ETL process reads four ORC files, transforms them, and loads the needed result.

0. Prerequisites

Please install DataBathing and all the packages needed for our demo.

1. Extractors

In the extractor YAML (a sketch follows the list below), we have three different keys under sources: name, type, and path (or query).

  • name: this is the Dataframe name that will be loaded.
  • type: the source type, which can be ORC, Excel, MySQL, etc.
  • path or query: the source’s location, or the query we need to run against the database.
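
A hypothetical extractor YAML following these keys might look like the sketch below (source names, types, and paths are illustrative; the exact schema is defined in the framework’s GitHub repo):

    sources:
      - name: transaction_df
        type: orc
        path: gs://my-bucket/transactions/
      - name: customer_df
        type: mysql
        query: SELECT * FROM customers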

2. Transformers

We can express our logic design in two different ways: multiple queries or a “with statement”. We discuss them one by one below; we can also combine them. The transformer YAML uses the following keys:

  • steps: a list of dictionaries, each one describing a step of the logic calculation.
  • query: the query that defines the logic for each step.
  • name: the name of the Dataframe that holds the result of the query.
  • share_df: the list of Dataframes that are shared globally and can be used by other components.

multiple queries

In the example below, the first step calculates some transaction data and saves the result as fact_dim_df. In the second step, using fact_dim_df, we generate the aggregation result.
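
A hypothetical sketch of this transformer YAML (the step names come from the text; the query contents and exact layout are illustrative):

    steps:
      - name: fact_dim_df
        query: >
          SELECT t.order_id, c.region, t.amount
          FROM transaction_df t
          INNER JOIN customer_df c ON t.customer_id = c.customer_id
        share_df:
          - fact_dim_df
      - name: agg_df
        query: >
          SELECT region, SUM(amount) AS total_amount
          FROM fact_dim_df
          GROUP BY region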

with statement

Another way to link the logic is to use a “with statement”, as in the example below. Within the “with statement”, we can define the data lineage. Intermediate results such as step1, step2, etc. are saved in our pipeline for future use; you can check them via the share_df variable.
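
A hypothetical sketch of the same logic written as a single “with statement” step (layout and query text are illustrative); the intermediate names step1 and step2 are kept around via share_df:

    steps:
      - name: agg_df
        query: >
          WITH step1 AS (
            SELECT t.order_id, c.region, t.amount
            FROM transaction_df t
            INNER JOIN customer_df c ON t.customer_id = c.customer_id
          ),
          step2 AS (
            SELECT region, SUM(amount) AS total_amount
            FROM step1
            GROUP BY region
          )
          SELECT * FROM step2
        share_df:
          - step1
          - step2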

3. Loaders

In the last YAML (a sketch follows the list below), we have four different keys under targets, very similar to the extractors: name, type, mode, and path.

  • name: the name of the Dataframe processed in the previous steps.
  • type: the type of file we need to save.
  • mode: the required mode of saving.
  • path: the required file path.
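
A hypothetical loader YAML following these four keys (values are illustrative):

    targets:
      - name: agg_df
        type: orc
        mode: overwrite
        path: gs://my-bucket/output/agg_result/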

4. Running

We have one entry point, workflowRunner, which takes the three YAML files and triggers the auto data pipeline.
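
A hypothetical invocation might look like the sketch below; the module name, parameter names, and method are assumptions for illustration, and the real signature lives in the framework’s GitHub repo:

    # Hypothetical usage of the workflowRunner entry point (names assumed).
    from auto_pipeline import workflowRunner

    runner = workflowRunner(
        extractor_yaml="extractors.yaml",
        transformer_yaml="transformers.yaml",
        loader_yaml="loaders.yaml",
    )
    runner.run()  # parse the YAML, auto-generate the pipeline, and trigger it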

Currently supported features

All features are categorized into three domains: Extractors, Transformers, and Loaders. The transformer features have been implemented in this framework with DataBathing. For Extractors and Loaders, we need to extend the connection features further in the next step.

Extractors:

  • BigQuery features
  • Google Cloud Storage features

Transformers:

  • Auto Detected features
  • Parsing YAML features for Extractors, Transformers, and Loaders
  • Pipeline DAG features

Loaders:

  • BigQuery features
  • Google Cloud Storage features

Next roadmap

All we need to do is put our logic into three YAML files. Given that, we can create a UI linked to the framework as a service. In Figure 3, we can see the high-level design for this PaaS.

Figure 3: Platform as a Service

Apart from the Platform-as-a-Service plan, most of the upcoming features will focus on connecting to different sources and targets. I list some of them below:

Extractors and loaders:

  • MySQL features
  • Azure SQL features
  • MFT features
  • CSV/Excel features

Contribution

If you need more features from the auto-generated pipeline framework, you can help make it better by raising an issue that shows the error, or by creating a PR with new features and tests describing the problem. If you also submit a fix, you have my gratitude.

You can find our auto-generated pipeline framework on the GitHub homepage below:

Summary

By using this Next-Generation Auto-Generated Pipeline Framework, you can focus on the domain logic itself rather than spending time on coding and debugging.
