🚀 Mastering the ETL Process: Extract, Transform, Load with Python! 🐍📊

Ashish Pathak
4 min read · Jul 22, 2023

🔎 What is ETL and Why It Matters?

Imagine being the owner of a promising startup, armed with an incredible AI algorithm that predicts the risk of diabetes based on height and body weight. Your data, however, is scattered across CSV and JSON files, with measurements in different units. Enter ETL! 🧲 Extract, Transform, and Load gathers data from various sources, transforms it into a consistent format, and finally loads it into a central database or target file. 🗄️

πŸ“ Let’s Implement ETL with Python

Let's roll up our sleeves and get hands-on with Python! 🤓 We'll start with the extract step, which uses the glob function from the glob module. This nifty function finds files with specific extensions (e.g., .csv, .json, and .xml) so we can read each one into a DataFrame for further processing. 📝
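Before defining the full extract step, here is a small, self-contained sketch of what glob.glob actually returns; the file names below are made-up placeholders, and everything happens in a throwaway temporary directory:

```python
import glob
import os
import tempfile

# Work in a throwaway directory so the demo doesn't touch real files
# (the file names below are made-up placeholders)
workdir = tempfile.mkdtemp()
for name in ("patients1.csv", "patients2.csv", "patients.json", "notes.txt"):
    open(os.path.join(workdir, name), "w").close()

# glob.glob returns every path that matches the wildcard pattern
csv_files = sorted(glob.glob(os.path.join(workdir, "*.csv")))
json_files = glob.glob(os.path.join(workdir, "*.json"))
print(len(csv_files), len(json_files))  # 2 1
```

Note that glob matches by name pattern only; the .txt file is simply never returned, which is exactly how the extract loops below will skip irrelevant files.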

📚🔍 Let's import some important libraries needed to perform the operations! 💻💡

import glob                        # this module helps in selecting files
import pandas as pd                # this module helps in processing CSV files
import xml.etree.ElementTree as ET # this module helps in processing XML files
from datetime import datetime      # this module helps in timestamping log entries

tmpfile    = "temp.tmp"              # file used to store all extracted data
logfile    = "logfile.txt"           # all event logs will be stored in this file
targetfile = "transformed_data.csv"  # file where transformed data is stored
def extract_from_csv(file_to_process):
    dataframe = pd.read_csv(file_to_process)
    return dataframe

def extract_from_json(file_to_process):
    dataframe = pd.read_json(file_to_process, lines=True)
    return dataframe

def extract_from_xml(file_to_process):
    # collect one record per <person> element, then build the frame in one go
    # (DataFrame.append was removed in pandas 2.0, so we gather rows in a list)
    rows = []
    tree = ET.parse(file_to_process)
    root = tree.getroot()
    for person in root:
        name = person.find("name").text
        height = float(person.find("height").text)
        weight = float(person.find("weight").text)
        rows.append({"name": name, "height": height, "weight": weight})
    return pd.DataFrame(rows, columns=["name", "height", "weight"])
def extract():
    # gather one DataFrame per source file, then concatenate them all at once
    # (pd.concat replaces the removed DataFrame.append)
    frames = []

    #process all csv files
    for csvfile in glob.glob("*.csv"):
        frames.append(extract_from_csv(csvfile))

    #process all json files
    for jsonfile in glob.glob("*.json"):
        frames.append(extract_from_json(jsonfile))

    #process all xml files
    for xmlfile in glob.glob("*.xml"):
        frames.append(extract_from_xml(xmlfile))

    if not frames:  # no source files found
        return pd.DataFrame(columns=['name', 'height', 'weight'])
    return pd.concat(frames, ignore_index=True)
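To see the extract logic in miniature without touching the file system, here is a sketch that reads one CSV-shaped and one JSON-lines-shaped "file" from in-memory strings and stacks the per-file frames into a single DataFrame; the names and measurements are invented sample values:

```python
import io
import pandas as pd

# Two tiny "files" held in memory: one CSV-shaped, one JSON-lines-shaped
# (names and measurements are invented sample values)
csv_part = pd.read_csv(io.StringIO("name,height,weight\nalex,65.78,112.99\n"))
json_part = pd.read_json(io.StringIO('{"name": "ajay", "height": 71.52, "weight": 136.49}\n'),
                         lines=True)

# Stack the per-file frames into one combined DataFrame
extracted = pd.concat([csv_part, json_part], ignore_index=True)
print(extracted)
```

Because both frames share the name/height/weight columns, the stacked result lines up cleanly, which is exactly the situation extract relies on.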

🔄 Transforming the Data for Cohesion

In the transform step, we work some data magic! 🎩 We convert height from imperial to metric units (inches to meters) and weight from pounds to kilograms, ensuring consistency across the dataset. We'll witness Python's prowess in handling complex operations with ease, shaping our data to perfection. 📏

def transform(data):
    #Convert height which is in inches to meters
    #Convert the datatype of the column into float
    data.height = data.height.astype(float)
    #Convert inches to meters and round off to two decimals (one inch is 0.0254 meters)
    data['height'] = round(data.height * 0.0254, 2)

    #Convert weight which is in pounds to kilograms
    #Convert the datatype of the column into float
    data.weight = data.weight.astype(float)
    #Convert pounds to kilograms and round off to two decimals (one pound is 0.45359237 kilograms)
    data['weight'] = round(data.weight * 0.45359237, 2)
    return data
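A quick sanity check of the conversions, with the transform restated so the snippet runs standalone; the sample person is invented:

```python
import pandas as pd

def transform(data):
    # inches -> meters (1 in = 0.0254 m), pounds -> kilograms (1 lb = 0.45359237 kg)
    data["height"] = round(data.height.astype(float) * 0.0254, 2)
    data["weight"] = round(data.weight.astype(float) * 0.45359237, 2)
    return data

# One invented person: 65.78 inches tall, 112.99 pounds
sample = pd.DataFrame({"name": ["alex"], "height": [65.78], "weight": [112.99]})
out = transform(sample)
print(out.height.iloc[0], out.weight.iloc[0])  # 1.67 51.25
```

Working the arithmetic by hand: 65.78 × 0.0254 = 1.670812 ≈ 1.67 m, and 112.99 × 0.45359237 ≈ 51.25 kg, matching the rounded output.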

📥 Loading the Transformed Data

With the data transformed, it's time to load it into our desired target file or database. In this case, we'll save our data frame as a CSV file using the pandas library. Our ETL journey is nearing its climax! 📈

def load(targetfile, data_to_load):
    data_to_load.to_csv(targetfile, index=False)  # index=False keeps the row index out of the file
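A short sketch of load in action, writing to a temporary path and reading the file back to confirm the round trip; the index=False keyword keeps pandas from writing the row index as an extra column:

```python
import os
import tempfile
import pandas as pd

def load(targetfile, data_to_load):
    # index=False keeps pandas from writing the row index as an extra column
    data_to_load.to_csv(targetfile, index=False)

# Write one already-transformed row (invented values) to a throwaway path
target = os.path.join(tempfile.mkdtemp(), "transformed_data.csv")
load(target, pd.DataFrame({"name": ["alex"], "height": [1.67], "weight": [51.25]}))

reloaded = pd.read_csv(target)
print(reloaded.shape)  # (1, 3)
```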

📜 Timestamps for Transparency

To add a touch of transparency and accountability, we'll implement a logging function that records the start and completion times of each step. This way, we'll have a clear view of when every part of the ETL process begins and concludes. ⏰

# Import the datetime module to work with timestamps
from datetime import datetime

# Define a function called 'log' that takes a 'message' as input
def log(message):
    # Define the timestamp format to be used in the log entries
    timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second

    # Get the current timestamp
    now = datetime.now()

    # Convert the timestamp to a formatted string using the defined format
    timestamp = now.strftime(timestamp_format)

    # Open the "logfile.txt" in append mode to add log entries
    with open("logfile.txt", "a") as f:
        # Write the timestamp, message, and a newline character to the file
        f.write(timestamp + ',' + message + '\n')

# Log the start of the ETL (Extract, Transform, Load) job
log("ETL Job Started")

# Log the start of the extract phase
log("Extract phase Started")
# Call the 'extract' function to get the extracted data
extracted_data = extract()
# Log the end of the extract phase
log("Extract phase Ended")

# Display the extracted data
print(extracted_data)

# Log the start of the transform phase
log("Transform phase Started")
# Call the 'transform' function with 'extracted_data' as input to get the transformed data
transformed_data = transform(extracted_data)
# Log the end of the transform phase
log("Transform phase Ended")

# Display the transformed data
print(transformed_data)

# Log the start of the load phase
log("Load phase Started")
# Call the 'load' function with 'targetfile' and 'transformed_data' as inputs to load the data
load(targetfile, transformed_data)
# Log the end of the load phase
log("Load phase Ended")

# Log the end of the ETL job
log("ETL Job Ended")
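To confirm the logger appends exactly one line per call, here is the same function pointed at a throwaway path; this sketch uses %b (the portable spelling of the %h month directive used above) so it behaves the same on every platform:

```python
import os
import tempfile
from datetime import datetime

# Log to a throwaway path so the demo doesn't touch the real logfile.txt
logpath = os.path.join(tempfile.mkdtemp(), "logfile.txt")

def log(message):
    # %b is the portable spelling of the %h month abbreviation used above
    timestamp = datetime.now().strftime('%Y-%b-%d-%H:%M:%S')
    with open(logpath, "a") as f:
        f.write(timestamp + ',' + message + '\n')

log("ETL Job Started")
log("Extract phase Started")

with open(logpath) as f:
    lines = f.read().splitlines()
print(len(lines))  # 2
```

Each call opens the file in append mode, so earlier entries are never overwritten and the log grows one timestamped line at a time.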

✨ The ETL Symphony

As we bring all the pieces together, the symphony of Extract, Transform, and Load begins to play in harmony. 🎶 We'll call each function in succession, orchestrating the ETL process seamlessly. The logging function ensures we have timestamps marking the journey's milestones. 🎹

💡 Key Takeaways

  • Understanding the ETL process and its significance in data integration.
  • Practical implementation of ETL using Python, glob, pandas, and datetime.
  • Adding transparency with logging to track process timing.
  • The power of Python in data manipulation and transformation.
