Modularization using Python and Docker for Data Pipeline

A mini-guide to microservice design for data pipelines

Jiazhen Zhu
Walmart Global Tech Blog
4 min read · Oct 26, 2020


Photo credit: Pixabay

In data engineering, ETL and data pipelines are central. Especially in a big company like Walmart, we have a huge number of pipelines that need to be developed, tested, deployed, and supported. Below, I will describe some of the challenges I face and present one solution.

Challenges

  • How can we split a monolithic pipeline into several small microservice pipelines?
  • How can we reuse our code or microservices across the team and the organization?
  • How can we optimize each pipeline separately?
  • How can we test our pipelines more effectively?
  • How can we deploy our pipelines painlessly?

Solution

Based on the solution architecture, I use functional programming, a directed acyclic graph (DAG), a Python project package, Docker, and crontab to handle all batch and streaming data processing.

  • We have a set of node functions written in a functional programming style.
  • We use a DAG as the container that chains the nodes together.
  • A Python project package is our project base, which can be published to the Python Package Index as a reusable module.
  • Docker is our outer container for each specific data pipeline; it can be published to Docker Hub for deployment or as a reusable container.

Let’s go through them one by one:

1. Functional programming

Functional programming (often abbreviated FP) is the process of building software by composing pure functions, avoiding shared state, mutable data, and side-effects. (reference 1)

  • Because functional programming avoids shared state and each function has only input parameters and a return value, unit testing becomes easier.
  • It becomes easier to split one monolithic program into several small functions.

2. Directed Acyclic Graph (DAG)

A DAG displays assumptions about the relationship between variables (often called nodes in the context of graphs). The assumptions we make take the form of lines (or edges) going from one node to another. These edges are directed, which means to say that they have a single arrowhead indicating their effect. (reference 2)

  • We will have clear data lineage based on the DAG.
  • It becomes easier to split one monolithic data pipeline into several small data pipelines.

3. Python Project Package

We can code, test, package, and publish our reusable module.

  • It is easier to share the reusable module with the whole community.
  • We have one entry point for our module.

4. Docker

Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels. (reference 3)

  • We can optimize each data pipeline in its own isolated environment.
  • It is easier to deploy and monitor each data pipeline painlessly.

5. Yacron

A modern Cron replacement that is Docker-friendly

  • It is easier to schedule batch processes inside of Docker.

Example

In this example, I will present one ETL process: reading a CSV file, cleaning the data, and generating a metric.

1. Functional Programming

In our utils.py file, everything is a function and reusable code. For example, the get_data function takes a path parameter and returns a dataframe. The deduplicate_data function takes a raw dataframe and returns a cleaned dataframe.

utils.py
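
A minimal sketch of what utils.py could contain, matching the description above (the cleaning logic is an assumption, not the original code):

    import pandas as pd


    def get_data(path: str) -> pd.DataFrame:
        # Read the raw CSV file at `path` into a dataframe.
        return pd.read_csv(path)


    def deduplicate_data(raw_df: pd.DataFrame) -> pd.DataFrame:
        # Return a cleaned dataframe with duplicate rows removed.
        return raw_df.drop_duplicates().reset_index(drop=True)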

In the testing example below, we have a test raw dataframe and a test clean dataframe; after installing pytest, we can easily get the test result (pass or fail).

test_utils.py
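
A minimal sketch of test_utils.py under the same assumptions (the import path and column names are illustrative):

    import pandas as pd

    from demo.utils import deduplicate_data


    def test_deduplicate_data():
        # Test raw dataframe with one duplicated row.
        raw_df = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "a", "b"]})
        # Test clean dataframe expected after deduplication.
        clean_df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

        result = deduplicate_data(raw_df)

        # pytest reports pass/fail based on this assertion.
        pd.testing.assert_frame_equal(result, clean_df)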

2. Directed Acyclic Graph (DAG)

The Pipeline class below is the container for the DAG.

DAG_pipeline.py
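
A minimal sketch of such a Pipeline class, assuming tasks are stored as a dictionary of edges and registered through a decorator (the attribute and method names here are illustrative, not the original code):

    from collections import defaultdict


    class Pipeline:
        """A tiny DAG container: tasks are nodes, dependencies are edges."""

        def __init__(self):
            # Maps each task to the list of tasks that depend on it.
            self.tasks = defaultdict(list)

        def task(self, depends_on=None):
            """Decorator that registers a function as a node in the DAG."""
            def register(func):
                self.tasks[func]  # make sure the node exists even without edges
                if depends_on is not None:
                    # Add the edge depends_on -> func.
                    self.tasks[depends_on].append(func)
                return func
            return register

        def run(self, initial_input):
            """Run tasks in dependency order, passing each output downstream."""
            results = {}
            children = {c for deps in self.tasks.values() for c in deps}
            roots = [t for t in self.tasks if t not in children]
            queue = [(task, initial_input) for task in roots]
            while queue:
                task, value = queue.pop(0)
                results[task] = task(value)
                for child in self.tasks[task]:
                    queue.append((child, results[task]))
            return results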

After instantiating the Pipeline class, we can chain our tasks together using a decorator. For example, the get_raw_data function is the first task, and clean_data is triggered after the get_raw_data task completes.

DAG_node.py
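
A sketch of how the tasks could be chained with a decorator like the one above; the module paths (demo.pipeline, demo.utils) and the metric are assumptions for illustration:

    import pandas as pd

    from demo.pipeline import Pipeline
    from demo.utils import get_data, deduplicate_data

    pipeline = Pipeline()


    @pipeline.task()
    def get_raw_data(path: str) -> pd.DataFrame:
        # First task: read the raw CSV file.
        return get_data(path)


    @pipeline.task(depends_on=get_raw_data)
    def clean_data(raw_df: pd.DataFrame) -> pd.DataFrame:
        # Triggered after get_raw_data completes.
        return deduplicate_data(raw_df)


    @pipeline.task(depends_on=clean_data)
    def generate_metric(clean_df: pd.DataFrame) -> int:
        # Triggered after clean_data completes; compute a simple row-count metric.
        return len(clean_df)


    if __name__ == "__main__":
        pipeline.run("data/raw.csv")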

3. Python Project Package

We set the entry point for the whole project to demo_workflow under the demo/workflow folder.

setup.py
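
A minimal setup.py sketch; the package name, dependency list, and entry-point target are assumptions based on the description above:

    from setuptools import find_packages, setup

    setup(
        name="demo",
        version="0.1.0",
        packages=find_packages(),
        install_requires=["pandas"],
        entry_points={
            # One entry point for the whole project: the `demo_workflow`
            # command calls the main() function under demo/workflow.
            "console_scripts": ["demo_workflow = demo.workflow:main"],
        },
    )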

4. Docker

There are six steps inside the Dockerfile:

  • Get the official Python image: python:3.7-slim.
  • Install gcc and cron.
  • Install all Python packages based on requirements.txt.
  • Set the working directory and install the demo package.
  • Copy the crontab.yaml file to the needed location inside the Docker image.
  • Run the yacron command for batch jobs, or just run the demo for streaming/one-time jobs.

Dockerfile
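
A Dockerfile sketch following those steps; the paths and the final command are illustrative:

    FROM python:3.7-slim

    # Install gcc (for building wheels) and cron.
    RUN apt-get update && apt-get install -y gcc cron && rm -rf /var/lib/apt/lists/*

    # Install the Python dependencies.
    COPY requirements.txt .
    RUN pip install -r requirements.txt

    # Set the working directory and install the demo package.
    WORKDIR /app
    COPY . /app
    RUN pip install .

    # Copy the crontab.yaml file to the location yacron will read it from.
    COPY crontab.yaml /etc/crontab.yaml

    # Batch mode: run yacron; for streaming/one-time runs, use the demo
    # entry point (e.g. CMD ["demo_workflow"]) instead.
    CMD ["yacron", "--config", "/etc/crontab.yaml"]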

5. Yacron

This crontab.yaml file contains the cron schedule expression (0 * * * *), which runs the script every hour inside Docker.

crontab.yaml
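
A crontab.yaml sketch in yacron's YAML format; the job name and command are assumptions:

    # Run the demo workflow at the top of every hour inside the container.
    jobs:
      - name: demo_workflow
        command: demo_workflow
        schedule: "0 * * * *"
        captureStderr: true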

How to run the example?
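
Under the assumptions above, building and running the example would look roughly like this (the image name is illustrative):

    # Build the Docker image from the Dockerfile above.
    docker build -t demo-pipeline .

    # Start the container; yacron will trigger the pipeline every hour.
    docker run -d demo-pipeline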
