🚀 Mastering the ETL Process: Extract, Transform, Load with Python! 🐍📊

Ashish Pathak
4 min read · Jul 22, 2023

🔎 What is ETL and Why It Matters?

Imagine being the owner of a promising startup, armed with an incredible AI algorithm that predicts the risk of diabetes based on height and body weight. Your data, however, is scattered across CSV and JSON files, with measurements in different units. Enter ETL! 🧲 Extract, Transform, and Load gathers data from various sources, transforms it into a consistent format, and finally loads it into a central database or target file. 🗄️

πŸ“ Let’s Implement ETL with Python

Let's roll up our sleeves and get hands-on with Python! 🤓 We'll start with the extract step, which uses the glob function from the glob module. This nifty function finds files with specific extensions (e.g., .csv, .json, and .xml) so we can read each one into a DataFrame for further processing. 📝
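Before defining the full extract step, here is a small, self-contained sketch of what glob.glob actually returns; the file names below are made-up placeholders, and everything happens in a throwaway temporary directory:

```python
import glob
import os
import tempfile

# Work in a throwaway directory so the demo doesn't touch real files
# (the file names below are made-up placeholders)
workdir = tempfile.mkdtemp()
for name in ("patients1.csv", "patients2.csv", "patients.json", "notes.txt"):
    open(os.path.join(workdir, name), "w").close()

# glob.glob returns every path that matches the wildcard pattern
csv_files = sorted(glob.glob(os.path.join(workdir, "*.csv")))
json_files = glob.glob(os.path.join(workdir, "*.json"))
print(len(csv_files), len(json_files))  # 2 1
```

Note that glob matches by name pattern only; the .txt file is simply never returned, which is exactly how the extract loops below will skip irrelevant files.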

📚🔍 Let's import some important libraries needed to perform the operations! 💻💡

import glob                        # this module helps in selecting files
import pandas as pd                # this module helps in processing CSV files
import xml.etree.ElementTree as ET # this module helps in processing XML files
from datetime import datetime      # this module helps in timestamping log entries

tmpfile    = "temp.tmp"              # file used to store all extracted data
logfile    = "logfile.txt"           # all event logs will be stored in this file
targetfile = "transformed_data.csv"  # file where transformed data is stored
def extract_from_csv(file_to_process):
    dataframe = pd.read_csv(file_to_process)
    return dataframe

def extract_from_json(file_to_process):
    dataframe = pd.read_json(file_to_process, lines=True)
    return dataframe

def extract_from_xml(file_to_process):
    # collect one record per <person> element, then build the frame in one go
    # (DataFrame.append was removed in pandas 2.0, so we gather rows in a list)
    rows = []
    tree = ET.parse(file_to_process)
    root = tree.getroot()
    for person in root:
        name = person.find("name").text
        height = float(person.find("height").text)
        weight = float(person.find("weight").text)
        rows.append({"name": name, "height": height, "weight": weight})
    return pd.DataFrame(rows, columns=["name", "height", "weight"])
def extract():
    # gather one DataFrame per source file, then concatenate them all at once
    # (pd.concat replaces the removed DataFrame.append)
    frames = []

    #process all csv files
    for csvfile in glob.glob("*.csv"):
        frames.append(extract_from_csv(csvfile))

    #process all json files
    for jsonfile in glob.glob("*.json"):
        frames.append(extract_from_json(jsonfile))

    #process all xml files
    for xmlfile in glob.glob("*.xml"):
        frames.append(extract_from_xml(xmlfile))

    if not frames:  # no source files found
        return pd.DataFrame(columns=['name', 'height', 'weight'])
    return pd.concat(frames, ignore_index=True)
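To see the extract logic in miniature without touching the file system, here is a sketch that reads one CSV-shaped and one JSON-lines-shaped "file" from in-memory strings and stacks the per-file frames into a single DataFrame; the names and measurements are invented sample values:

```python
import io
import pandas as pd

# Two tiny "files" held in memory: one CSV-shaped, one JSON-lines-shaped
# (names and measurements are invented sample values)
csv_part = pd.read_csv(io.StringIO("name,height,weight\nalex,65.78,112.99\n"))
json_part = pd.read_json(io.StringIO('{"name": "ajay", "height": 71.52, "weight": 136.49}\n'),
                         lines=True)

# Stack the per-file frames into one combined DataFrame
extracted = pd.concat([csv_part, json_part], ignore_index=True)
print(extracted)
```

Because both frames share the name/height/weight columns, the stacked result lines up cleanly, which is exactly the situation extract relies on.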

🔄 Transforming the Data for Cohesion

In the transform step, we work some data magic! 🎩 We convert height from imperial to metric units (inches to meters) and weight from pounds to kilograms, ensuring consistency across the dataset. We'll witness Python's prowess in handling complex operations with ease, shaping our data to perfection. 📏

def transform(data):
    #Convert height which is in inches to meters
    #Convert the datatype of the column into float
    data.height = data.height.astype(float)
    #Convert inches to meters and round off to two decimals (one inch is 0.0254 meters)
    data['height'] = round(data.height * 0.0254, 2)

    #Convert weight which is in pounds to kilograms
    #Convert the datatype of the column into float
    data.weight = data.weight.astype(float)
    #Convert pounds to kilograms and round off to two decimals (one pound is 0.45359237 kilograms)
    data['weight'] = round(data.weight * 0.45359237, 2)
    return data
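A quick sanity check of the conversions, with the transform restated so the snippet runs standalone; the sample person is invented:

```python
import pandas as pd

def transform(data):
    # inches -> meters (1 in = 0.0254 m), pounds -> kilograms (1 lb = 0.45359237 kg)
    data["height"] = round(data.height.astype(float) * 0.0254, 2)
    data["weight"] = round(data.weight.astype(float) * 0.45359237, 2)
    return data

# One invented person: 65.78 inches tall, 112.99 pounds
sample = pd.DataFrame({"name": ["alex"], "height": [65.78], "weight": [112.99]})
out = transform(sample)
print(out.height.iloc[0], out.weight.iloc[0])  # 1.67 51.25
```

Working the arithmetic by hand: 65.78 × 0.0254 = 1.670812 ≈ 1.67 m, and 112.99 × 0.45359237 ≈ 51.25 kg, matching the rounded output.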

📥 Loading the Transformed Data

With the data transformed, it's time to load it into our desired target file or database. In this case, we'll save our data frame as a CSV file using the pandas library. Our ETL journey is nearing its climax! 📈

def load(targetfile, data_to_load):
    data_to_load.to_csv(targetfile, index=False)  # index=False keeps the row index out of the file
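A short sketch of load in action, writing to a temporary path and reading the file back to confirm the round trip; the index=False keyword keeps pandas from writing the row index as an extra column:

```python
import os
import tempfile
import pandas as pd

def load(targetfile, data_to_load):
    # index=False keeps pandas from writing the row index as an extra column
    data_to_load.to_csv(targetfile, index=False)

# Write one already-transformed row (invented values) to a throwaway path
target = os.path.join(tempfile.mkdtemp(), "transformed_data.csv")
load(target, pd.DataFrame({"name": ["alex"], "height": [1.67], "weight": [51.25]}))

reloaded = pd.read_csv(target)
print(reloaded.shape)  # (1, 3)
```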

📜 Timestamps for Transparency

To add a touch of transparency and accountability, we'll implement a logging function that records the start and completion times of each step. This way, we'll have a clear view of when every part of the ETL process begins and concludes. ⏰

# Import the datetime module to work with timestamps
from datetime import datetime

# Define a function called 'log' that takes a 'message' as input
def log(message):
    # Define the timestamp format to be used in the log entries
    timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second

    # Get the current timestamp
    now = datetime.now()

    # Convert the timestamp to a formatted string using the defined format
    timestamp = now.strftime(timestamp_format)

    # Open the "logfile.txt" in append mode to add log entries
    with open("logfile.txt", "a") as f:
        # Write the timestamp, message, and a newline character to the file
        f.write(timestamp + ',' + message + '\n')

# Log the start of the ETL (Extract, Transform, Load) job
log("ETL Job Started")

# Log the start of the extract phase
log("Extract phase Started")
# Call the 'extract' function to get the extracted data
extracted_data = extract()
# Log the end of the extract phase
log("Extract phase Ended")

# Display the extracted data
print(extracted_data)

# Log the start of the transform phase
log("Transform phase Started")
# Call the 'transform' function with 'extracted_data' as input to get the transformed data
transformed_data = transform(extracted_data)
# Log the end of the transform phase
log("Transform phase Ended")

# Display the transformed data
print(transformed_data)

# Log the start of the load phase
log("Load phase Started")
# Call the 'load' function with 'targetfile' and 'transformed_data' as inputs to load the data
load(targetfile, transformed_data)
# Log the end of the load phase
log("Load phase Ended")

# Log the end of the ETL job
log("ETL Job Ended")
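To confirm the logger appends exactly one line per call, here is the same function pointed at a throwaway path; this sketch uses %b (the portable spelling of the %h month directive used above) so it behaves the same on every platform:

```python
import os
import tempfile
from datetime import datetime

# Log to a throwaway path so the demo doesn't touch the real logfile.txt
logpath = os.path.join(tempfile.mkdtemp(), "logfile.txt")

def log(message):
    # %b is the portable spelling of the %h month abbreviation used above
    timestamp = datetime.now().strftime('%Y-%b-%d-%H:%M:%S')
    with open(logpath, "a") as f:
        f.write(timestamp + ',' + message + '\n')

log("ETL Job Started")
log("Extract phase Started")

with open(logpath) as f:
    lines = f.read().splitlines()
print(len(lines))  # 2
```

Each call opens the file in append mode, so earlier entries are never overwritten and the log grows one timestamped line at a time.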

✨ The ETL Symphony

As we bring all the pieces together, the symphony of Extract, Transform, and Load begins to play in harmony. 🎶 We'll call each function in succession, orchestrating the ETL process seamlessly. The logging function ensures we have timestamps marking the journey's milestones. 🎹

💡 Key Takeaways

  • Understanding the ETL process and its significance in data integration.
  • Practical implementation of ETL using Python, glob, pandas, and datetime.
  • Adding transparency with logging to track process timing.
  • The power of Python in data manipulation and transformation.
