A Beginner’s Guide to ETL with Python

Jesús Cantú
Jul 9, 2023


ETL Process

ETL, which stands for Extract, Transform, Load, is a crucial process in data management. It involves retrieving data from various sources, modifying it to fit business needs or target system requirements, and then loading it into a central location such as a data warehouse. In this article, we will simplify the ETL process for beginners, delve into each step, and illustrate it with a real-world Python example using publicly available data.

What is Data Extraction?

Data extraction is the initial phase of the ETL process. This involves pulling data from a range of source systems, including but not limited to databases, CRM systems, or flat files like CSVs. The objective is to collect as much relevant data as possible to power our analytics systems.
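For example, a minimal extraction sketch in Pandas might pull from both a flat file and a relational database. The file path, connection string, and table name below are placeholders for illustration, not sources used later in this article:

import pandas as pd
from sqlalchemy import create_engine

# Extract from a flat file (hypothetical path)
customers = pd.read_csv("data/customers.csv")

# Extract from a relational database (hypothetical connection string and table)
engine = create_engine("postgresql://user:password@localhost:5432/sales")
orders = pd.read_sql("SELECT * FROM orders", engine)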

The Transformation Process

Once the data is extracted, it enters the transformation stage. Here, the data is cleaned, validated, and prepared to meet the requirements of the business or the target system. This could involve removing duplicates, correcting errors, or converting data into compatible formats. The goal is to maintain data integrity and quality before it is loaded into the data warehouse.
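Several of these transformations are one-liners in Pandas. The sketch below runs on a small made-up DataFrame, purely to illustrate the idea:

import pandas as pd
import numpy as np

# Hypothetical raw data with a duplicate row, an impossible value, and string-typed columns
df = pd.DataFrame({
    'age': [34.0, 34.0, -1.0],
    'order_date': ['2023-01-05', '2023-01-05', '2023-02-10'],
    'amount': ['19.99', '19.99', '5.00']
})

# Remove exact duplicate rows
df = df.drop_duplicates()

# Correct obvious errors: flag impossible ages as missing
df.loc[df['age'] < 0, 'age'] = np.nan

# Convert columns into compatible formats
df['order_date'] = pd.to_datetime(df['order_date'])
df['amount'] = df['amount'].astype(float)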

Loading Data

After the data has been transformed, it is then loaded into a data warehouse or other target system. This is where the processed data will reside for further use, usually for analytics, business intelligence, or reporting purposes.
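In practice, the load step is often just a write to a database table. Here is a minimal sketch using Pandas and SQLAlchemy, with a local SQLite file standing in for a real warehouse (the file and table names are made up for illustration):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical transformed data ready for loading
df = pd.DataFrame({'id': [1, 2], 'score': [0.8, 0.6]})

# Load into a database table; a real warehouse would use its own connection string
engine = create_engine('sqlite:///warehouse.db')
df.to_sql('wine_scores', engine, if_exists='replace', index=False)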

ETL in Python

Python is a great language for performing ETL because of its readability and the wide range of data-centric libraries it offers. Here, we’re going to use Python to perform ETL on two datasets from the UCI Machine Learning Repository: the Wine dataset and the Wine Quality dataset. We’ll use the Pandas and NumPy libraries for data manipulation.

The Wine Dataset

This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. It consists of 178 samples, each with a class label identifying the cultivar plus 13 chemical analysis attributes, which correspond to the following:

  1. Alcohol: The alcohol content in the wine.
  2. Malic Acid: A type of acid found in wine.
  3. Ash: A measure of the wine’s “ash” content, which is the inorganic matter remaining after evaporation and incineration.
  4. Alkalinity of Ash: A measure of the wine’s alkalinity, specifically the ash’s alkalinity.
  5. Magnesium: The magnesium content in the wine.
  6. Total Phenols: A measure of the total phenol content in the wine.
  7. Flavonoids: A subset of phenols, flavonoids are a type of antioxidant.
  8. Non-flavonoid Phenols: This attribute measures the non-flavonoid phenols in the wine.
  9. Proanthocyanidins: Another group of phenols.
  10. Color Intensity: A measure of the wine’s color intensity.
  11. Hue: A wine attribute related to its color.
  12. OD280/OD315 of Diluted Wines: The ratio of the wine’s absorbance (optical density) at 280 nm to its absorbance at 315 nm after dilution.
  13. Proline: A measure of the Proline (an amino acid) content in the wine.
First five rows of the Wine dataset (raw data)

The Wine Quality Dataset

This dataset contains physicochemical attributes from red variants of the Portuguese “Vinho Verde” wine. It includes 1,599 samples and 12 attributes, including a quality score:

  1. Fixed Acidity: The non-volatile acids present in the wine.
  2. Volatile Acidity: The amount of acetic acid in the wine.
  3. Citric Acid: The citric acid content, which can add freshness and flavor to wines.
  4. Residual Sugar: The amount of sugar remaining after fermentation.
  5. Chlorides: The amount of salt in the wine.
  6. Free Sulfur Dioxide: The free form of SO2 present in the wine.
  7. Total Sulfur Dioxide: The total amount of SO2 in the wine.
  8. Density: The density of the wine, which can provide some idea about the alcohol content and sugar content.
  9. pH: Describes how acidic or basic the wine is on a scale from 0 (very acidic) to 14 (very basic).
  10. Sulphates: A wine additive that can contribute to sulfur dioxide (SO2) levels.
  11. Alcohol: The alcohol content in the wine.
  12. Quality: A score between 0 and 10 given by experts.
First five rows of the Wine Quality dataset (raw data)

Coding Example

# Required Libraries
import pandas as pd
import numpy as np

# Extraction
wine_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine_data = pd.read_csv(wine_url, header=None)

wine_quality_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_quality_data = pd.read_csv(wine_quality_url, sep=";")

# Initial look at the data
print(wine_data.head())
print(wine_quality_data.head())

# Transformation
# Assigning meaningful column names
wine_data.columns = ['class', 'alcohol', 'malic acid', 'ash',
                     'alcalinity of ash', 'magnesium', 'total phenols',
                     'flavonoids', 'nonflavonoid phenols', 'proanthocyanidins',
                     'color intensity', 'hue', 'OD280/OD315 of diluted wines',
                     'proline']

# Converting Class column into categorical datatype
wine_data['class'] = wine_data['class'].astype('category')

# Checking for any missing values in both datasets
print(wine_data.isnull().sum())
print(wine_quality_data.isnull().sum())

# Normalizing 'alcohol' column in the wine_data using Min-Max normalization
wine_data['alcohol'] = (wine_data['alcohol'] - wine_data['alcohol'].min()) / (wine_data['alcohol'].max() - wine_data['alcohol'].min())

# Creating an average quality column in wine_quality_data
wine_quality_data['average_quality'] = wine_quality_data[['fixed acidity', 'volatile acidity', 'citric acid',
                                                          'residual sugar', 'chlorides', 'free sulfur dioxide',
                                                          'total sulfur dioxide', 'density', 'pH', 'sulphates',
                                                          'alcohol']].mean(axis = 1)

# Creating a 'quality_label' column based on 'average_quality'
wine_quality_data['quality_label'] = pd.cut(wine_quality_data['average_quality'], bins = [0, 5, 7, np.inf],
                                            labels = ['low', 'medium', 'high'])

# Loading
# Saving the transformed data as a csv file
wine_data.to_csv('wine_dataset.csv', index = False)
wine_quality_data.to_csv('wine_quality_dataset.csv', index = False)

In this Python script, we extract the Wine and Wine Quality datasets from the UCI repository, and then carry out several transformation steps. For the Wine dataset, we assign meaningful column names, convert the ‘class’ column into a categorical datatype, check for missing values, and normalize the ‘alcohol’ column using Min-Max normalization.

Min-Max normalization is a data preprocessing technique that scales numerical data to a common range, typically between 0 and 1. It is useful for ensuring features are on a similar scale, handling outliers, improving interpretability and visualization, and enhancing convergence speed in optimization algorithms. By normalizing the data, it prevents certain features from dominating and facilitates comparisons between variables. However, other normalization methods may be more suitable depending on the data characteristics and analysis requirements.
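If you wanted to apply the same scaling to every numeric column rather than hand-coding the formula, scikit-learn’s MinMaxScaler computes x' = (x − min) / (max − min) for you. The snippet below is an optional extension of the script above, not a step it performs:

from sklearn.preprocessing import MinMaxScaler

# Scale every column except the categorical 'class' into the [0, 1] range
numeric_cols = wine_data.columns.drop('class')
scaler = MinMaxScaler()
wine_data_scaled = wine_data.copy()
wine_data_scaled[numeric_cols] = scaler.fit_transform(wine_data[numeric_cols])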

For the Wine Quality dataset, we create an ‘average_quality’ column by taking the mean of several columns, and then create a ‘quality_label’ column based on this average. Finally, we load the transformed data into two new CSV files.

First five rows of the Wine dataset (processed data).
First five rows of the Wine Quality dataset (processed data).

Once the ETL process is complete and your data has been extracted, transformed, and loaded into a data warehouse or other system, there are a number of different ways this data can be utilized. Data analytics, machine learning, data visualization, and reporting are some common applications. Here are some examples:

Data Analytics

The primary purpose of performing ETL is to prepare your data for analysis. You can use descriptive, predictive, or prescriptive analytics techniques depending on your needs. For instance, with the Wine Quality dataset, you could perform a statistical analysis to understand the factors that influence wine quality. This might involve using correlation matrices or regression analysis to identify the variables most strongly associated with high-quality wine. Here’s a simple example that uses Seaborn and Matplotlib to create and plot a correlation matrix:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation Matrix
corr = wine_quality_data.corr(numeric_only = True)

# Plot heatmap
plt.figure(figsize = (12, 10))
sns.heatmap(corr, annot = True, cmap = 'coolwarm')
plt.title('Correlation Matrix of Wine Quality Data')

# Save the figure
plt.savefig('correlation_matrix.png', dpi = 300, bbox_inches = 'tight')

plt.show()

When reading a correlation matrix and its corresponding heatmap, you can focus on a few key aspects. First, examine the diagonal elements, which represent the correlations of variables with themselves and should always be 1 since a variable is perfectly correlated with itself. Next, observe the off-diagonal elements. Positive values indicate a positive correlation, while negative values indicate a negative correlation. The closer the value is to 1 or -1, the stronger the correlation. Pay attention to the color intensity, with darker shades indicating stronger correlations. Additionally, patterns in the heatmap can reveal clusters of variables with similar correlations. This information helps identify relationships between variables and can guide further analysis or modeling decisions.
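If you want the numbers behind the colors, you can also pull the correlations with the quality score straight out of the matrix computed above; this small follow-up sketch simply reuses the corr variable:

# Absolute correlations with the quality score, strongest first
quality_corr = corr['quality'].drop('quality').abs().sort_values(ascending = False)
print(quality_corr)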

Machine Learning

ETL also paves the way for machine learning. Once the data is cleaned and transformed, it’s possible to train machine learning models on this data to make predictions or classifications. For example, with the Wine dataset, you could develop a model to classify wines into their respective classes based on the provided features. Here’s a simple example using Scikit-Learn’s Random Forest Classifier:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Define predictors and target
X = wine_data.drop('class', axis = 1)
y = wine_data['class']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    random_state = 42)

# Initialize the model
clf = RandomForestClassifier(random_state = 42)

# Fit the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Check accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 1.0

The result `Accuracy: 1.0` signifies that your model correctly classified all samples in your test set. Accuracy is a measure of the model’s performance and is calculated as the ratio of the correctly predicted instances to the total instances in the dataset. An accuracy of 1.0 means that the model made no mistakes when classifying the data in the test set.

However, an accuracy score of 1.0 (or 100%) could indicate a few things:

  1. The model has learned the training data perfectly: This might be the case if the dataset is simple and all features contribute directly to the outcome. However, this is rare in complex real-world datasets.
  2. Overfitting: This is a more common reason for a perfect accuracy score. Overfitting occurs when the model learns the training data too well, capturing not just the general patterns, but also the noise or the outliers. It means the model will perform perfectly on the training data (and, due to the random chance of splitting, potentially on your test set), but it may not generalize well to new, unseen data.
  3. Data leakage: This occurs when information from the test set leaks into the training set. In such cases, the model has access to the data it should not know, thereby leading to overly optimistic results.

If your model is consistently giving you an accuracy of 1.0, it’s worth checking to make sure you haven’t accidentally introduced data leakage or overfitting. You could try cross-validation, introduce regularization, or check your feature selection and data cleaning steps to prevent these issues.

For example, using cross-validation might give a more realistic picture of your model’s performance:

from sklearn.model_selection import cross_val_score

# Initialize the model
clf = RandomForestClassifier(random_state = 42)

# Compute cross-validation score
scores = cross_val_score(clf, X, y, cv = 5)

print("Cross-validation scores:", scores)
print("Average cross-validation score:", scores.mean())
Cross-validation scores: [0.97222222 0.94444444 0.97222222 0.97142857 1.        ]
Average cross-validation score: 0.9720634920634922

This gives you an array of scores from five different runs of the classifier (one per fold, since cv=5), each with a different train/test split, along with the average of those scores. If these scores are all high and relatively close to 1.0, your model is likely performing well. If there is large variation, or the scores are significantly less than 1.0, then the original score of 1.0 on the test set may be due to overfitting or random chance.

Data Visualization

Data visualization is an excellent way to understand the patterns, trends, and correlations in your data. For instance, you could use Seaborn or Matplotlib to create scatter plots, histograms, or bar plots to better understand the Wine and Wine Quality datasets.

import numpy as np
import matplotlib.pyplot as plt

# Plot histogram
plt.figure(figsize = (10, 7))
wine_quality_data['quality'].plot(kind = 'hist', rwidth = 0.95,
                                  bins = np.arange(2.5, 9))

plt.title('Distribution of Wine Quality Ratings')
plt.xlabel('Quality Ratings')
plt.ylabel('Count')

# Adjust x-ticks
plt.xticks(np.arange(3, 9, step=1))

# Save the figure
plt.savefig('histogram_wine_quality_ratings.png', dpi = 300,
            bbox_inches = 'tight')
plt.show()

In this code, the bins parameter is set to np.arange(2.5, 9), which generates bin edges from 2.5 to 8.5 in steps of 1, so each integer quality rating falls in the middle of its own bin. The plt.xticks(np.arange(3, 9, step=1)) line places the x-ticks at 3 through 8, i.e., at the center of each bin, directly under the corresponding bar.

Please note that the starting and ending values for the bins parameter (i.e., 2.5 and 9 in this case) and the values for the np.arange() function inside plt.xticks() (i.e., 3 and 9 in this case) should be adjusted based on the actual range of your 'quality' ratings.
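Histograms are only one option. As an additional sketch (not part of the walkthrough above), a scatter plot can show how a single attribute relates to the quality score, for example alcohol content:

import matplotlib.pyplot as plt

# Scatter plot of alcohol content against the quality rating
plt.figure(figsize = (10, 7))
plt.scatter(wine_quality_data['alcohol'], wine_quality_data['quality'], alpha = 0.3)
plt.title('Alcohol Content vs. Wine Quality')
plt.xlabel('Alcohol (% by volume)')
plt.ylabel('Quality Rating')
plt.show()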

Reporting

ETL also enables you to conduct more effective reporting. Clean, organized data is crucial for generating accurate, up-to-date reports. For example, you could use Python’s Pandas library to generate monthly reports of wine quality or use a BI tool like Tableau or Power BI to create interactive dashboards based on your cleaned and transformed data.
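The wine data has no dates, so a truly monthly report is not possible here, but the idea is the same: aggregate the transformed data and export the summary. A small sketch using the quality_label column created earlier:

# Average chemistry per quality label, exported for a report
report = wine_quality_data.groupby('quality_label', observed = False)[['alcohol', 'pH', 'residual sugar']].mean()
report.to_csv('wine_quality_report.csv')
print(report)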

ETL to Infinity & Beyond

As our world continues to generate vast amounts of data, the demand for efficient data management processes such as ETL grows. As we’ve demonstrated, utilizing ETL best practices can significantly enhance the quality of data and the efficiency of its analysis, making it a powerful asset in today’s data-driven landscape.

Looking into the future, the importance of ETL and data engineering is set to rise. The emergence of Big Data, machine learning, artificial intelligence, and IoT technologies all depend on reliable, clean data, making ETL processes more critical than ever. Businesses that can efficiently extract, transform, and load their data are more likely to leverage these technologies successfully, providing a competitive edge.

The rise of cloud computing is also shaping the ETL landscape, with ETL tasks becoming increasingly integrated with cloud services. Such integration allows for greater scalability and performance, enabling businesses to handle larger datasets faster and more effectively.

For individuals considering a career in data science, business intelligence, or similar fields, learning data engineering and ETL processes is a valuable investment. The demand for professionals proficient in ETL processes and tools is steadily increasing as more companies realize the potential value of their data.

Understanding ETL not only opens up career opportunities but also allows you to contribute to this exciting field. With the ongoing advancement in data processing technologies, there’s no doubt that the importance and complexity of ETL processes will only continue to grow. By learning and implementing ETL best practices, you’ll be well-equipped to navigate and thrive in this evolving landscape.

Embracing ETL and data engineering practices is more than just a technological decision — it is a strategic move that can propel businesses into the next phase of digital innovation and secure a competitive edge in the future. Those who are well-versed in these areas will be leading the way, making the journey from raw data to actionable insight more streamlined and enlightening than ever.

Resources

  1. Data Profiling: Getting Started with Data Profiling (Talend)
  2. Data Validation Rules: Data Quality Rules & Validation (Informatica)
  3. ETL: Translate raw data into actionable data (Databricks)
  4. ETL testing: A comprehensive guide to ensuring data quality and integration (Talend)
  5. Documentation: Overview Software Documentation (Geeksforgeeks)
  6. Continuous Monitoring and Optimization: 5 Ideas to Expand Your ETL Process Monitoring (Paessler)

ETL is a fundamental part of data management. While it might seem complex at first, understanding the basic steps and practicing with real-world examples can simplify the process. Python, with its extensive range of data-focused libraries, can be a powerful tool for ETL. Remember, effective data management is key to any data-driven organization. Happy Learning!!

Did you enjoy this post? Would you like to sponsor my caffeine addiction? If YES, you can BUY ME A COFFEE and help me stay awake to continue creating awesome things!
