ETL Pipeline with Pyjanitor

Maxine Attobrah
7 min read · Nov 10, 2022

Extract, clean and store data using Python ETL tools.

Overview

Machine learning algorithms need good-quality training data in order to make good predictions. An ETL process is one way to prepare that data.

What is ETL?

Photo by Maxine Attobrah (Author)

ETL (Extract, Transform, Load) is the foundation of creating efficient machine learning algorithms. It is a three-step data integration process in which data is extracted from one or more data sources, transformed (cleaned, scrubbed, formatted), and loaded into a database. Various ETL tools offer user-friendly interfaces for completing these tasks. However, Python also has many open source ETL tools that can get the job done. One very popular Python tool in the data science community is pandas.
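To make the three steps concrete, here is a minimal sketch of an ETL run using pandas and Python's built-in sqlite3 module. The sample data, the `orders` table name, and the `extract`, `transform`, and `load` helper names are all assumptions made for illustration; a real pipeline would read from actual files or services and would likely write to a persistent database.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical raw data standing in for a real source file or API response.
RAW_CSV = """order_id,customer,amount
1, Alice ,10.50
2,Bob,20.00
3, Carol ,7.25
"""


def extract(source) -> pd.DataFrame:
    """Extract: read raw records from a CSV source into a DataFrame."""
    return pd.read_csv(source)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: tidy text fields and enforce a numeric type."""
    out = df.copy()
    out["customer"] = out["customer"].str.strip()
    out["amount"] = out["amount"].astype(float)
    return out


def load(df: pd.DataFrame, conn: sqlite3.Connection, table: str) -> None:
    """Load: write the cleaned rows into a database table."""
    df.to_sql(table, conn, if_exists="replace", index=False)


# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
load(transform(extract(io.StringIO(RAW_CSV))), conn, "orders")

# Read the table back to confirm what was stored.
loaded = pd.read_sql("SELECT * FROM orders ORDER BY order_id", conn)
print(loaded)
```

Keeping each stage in its own function makes it easier to swap the source or the destination later without touching the cleaning logic.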

In the transformation step of ETL, exploratory data analysis (EDA) and data cleaning are important tasks for identifying and removing incorrect, corrupted, duplicate, and inconsistent data.
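As a rough illustration of those two tasks, the sketch below first counts the problems in a small table and then removes them with plain pandas. The sample rows and the rule that a negative age is invalid are assumptions chosen for the example, not part of any real dataset.

```python
import pandas as pd

# Hypothetical sample with a duplicate row, a missing value, inconsistent
# city labels, and an impossible (negative) age.
df = pd.DataFrame(
    {
        "name": ["Ann", "Ann", "Ben", "Cy", "Dee"],
        "city": ["NYC", "NYC", "nyc ", "Boston", None],
        "age": [34, 34, 27, -5, 29],
    }
)

# Exploratory checks: quantify the problems before changing anything.
n_duplicates = int(df.duplicated().sum())
n_missing = df.isna().sum()
n_bad_age = int((df["age"] < 0).sum())

# Cleaning: apply one fix per problem found above.
cleaned = (
    df.drop_duplicates()                                        # duplicate rows
    .assign(city=lambda d: d["city"].str.strip().str.upper())  # inconsistent labels
    .query("age >= 0")                                          # incorrect values
    .dropna(subset=["city"])                                    # missing values
    .reset_index(drop=True)
)
print(cleaned)
```

Whether a bad row should be dropped or repaired depends on the dataset; dropping is used here only to keep the example short.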

What is Pyjanitor? How Can It Help?

Pyjanitor is an open source Python tool for cleaning datasets. Its advantage is that it provides an easier-to-read, verb-based method-chaining API for common pandas routines. This helps to make writing code…


Maxine Attobrah

MS Electrical & Computer Engineering and MS Engineering & Technology Innovation Management Graduate Student at Carnegie Mellon University