How to Create a Powerful ETL Data Pipeline with Python and AWS Services

Amelia Tang
8 min read · Jul 15, 2023


In data science, the ability to harness and extract insights from vast amounts of data is paramount. To accomplish this, data professionals rely on ETL (Extract, Transform, Load) data pipelines — a fundamental framework for streamlining data processing workflows. ETL pipelines facilitate the efficient extraction of data from diverse sources, transformation into a usable format, and loading into designated destinations for analysis.

For aspiring data scientists, mastering the art of building powerful data pipelines using Python and leveraging the capabilities of leading cloud computing platforms, such as Amazon Web Services (AWS), unlocks immense potential for scalable and efficient data processing.

In this blog post, I will demonstrate how I created an ETL data pipeline using Python, AWS EC2, S3, Glue and Athena.

ETL Data Pipeline (Image from https://www.talend.com/resources/what-is-etl/)

Overview

Case: I designed this ETL data pipeline to improve the efficiency and accuracy of pricing used laptops for a second-hand computer and electronics store project. The store's management relies primarily on eBay's completed and sold listings as references for pricing its products. To build machine learning models for price prediction, I collected a large volume of raw used-laptop data from eBay.

Goal: Efficiently acquire and clean a large volume of used laptop data for further analysis

ETL data pipeline diagram: The diagram provides a high-level overview of the flow and stages involved in data extraction, transformation, and loading, allowing stakeholders to understand the overall data processing workflow.

ETL Data Pipeline Diagram (Image by Author)

Extract

The Python script I wrote to scrape public laptop listings from eBay.com and extract the web data can be accessed here. It uses the BeautifulSoup library, a popular Python library for web scraping that simplifies data extraction from HTML and XML documents.
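For context, here is a minimal sketch of the scraping approach. It is not the full script; the search parameters and CSS class names are illustrative placeholders, since eBay's markup changes over time:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Placeholder search: "used laptop" among sold/completed listings.
SEARCH_URL = "https://www.ebay.com/sch/i.html"
params = {"_nkw": "used laptop", "LH_Sold": "1", "LH_Complete": "1"}

response = requests.get(SEARCH_URL, params=params, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

rows = []
# "s-item" class names are commonly seen on eBay result pages but may need
# to be updated to match the current page structure.
for item in soup.select("li.s-item"):
    title = item.select_one(".s-item__title")
    price = item.select_one(".s-item__price")
    if title and price:
        rows.append({"title": title.get_text(strip=True),
                     "price": price.get_text(strip=True)})

# Write the extracted listings to a local .csv file.
with open("ebay_used_laptops.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
```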

I stored the Python script in an AWS S3 bucket and executed it on an AWS EC2 instance. Subsequently, I saved the output .csv file back to the same AWS S3 bucket. I have documented the entire process in another Medium post, which can be accessed here.
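If you prefer to script the upload from the EC2 instance rather than use the console, a short boto3 sketch looks like this (the bucket name and key prefix are placeholders):

```python
import boto3

# Uses the EC2 instance role or otherwise configured AWS credentials.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="ebay_used_laptops.csv",     # local output of the scraper
    Bucket="my-etl-demo-bucket",          # placeholder bucket name
    Key="raw/ebay_used_laptops.csv",      # prefix for the raw data
)
```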

Transform

Prerequisites:

  • AWS account: a new account can be created here. The process is similar to signing up for any web services.

Please note that while some AWS services offer free-tier options, AWS is not free in general; usage may incur costs based on factors such as resource consumption, data transfer, and storage.

  • AWS IAM: a web service that enables secure control and management of user access to AWS resources by providing authentication, authorization, and permission management capabilities.
  • AWS Glue: a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics by discovering, cataloging, and transforming data from various sources into a consistent format.

Step 1: Create an IAM role for AWS Glue

First, navigate to AWS IAM and create a role that provides appropriate access permissions for AWS Glue.

Create a new role in IAM (Image by Author)

For demonstration purposes and simplicity, I created a role with the PowerUserAccess policy, which provides users with full access to AWS services and resources, excluding IAM resource management. To grant more specific permissions to a role, other policies can be attached instead. For example, you can search for “Glue” in the search bar and choose the policy you wish to attach.

Attach PowerUserAccess Policy (Image by Author)
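For those who prefer code over console clicks, the same role can be created with boto3. This is a sketch with a placeholder role name; it attaches the same broad PowerUserAccess policy used in this demo:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the AWS Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="glue-etl-demo-role",                        # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="glue-etl-demo-role",
    PolicyArn="arn:aws:iam::aws:policy/PowerUserAccess",  # broad, demo-only policy
)
```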

Step 2: Create an AWS Glue database

After accessing AWS Glue, navigate to the “Data Catalog” section and click on “Databases”. Follow the web flow to add a new database, where typically you would only need to provide a name for the database and use the default settings for the remaining configuration.

Create a database on AWS Glue (Image by Author)
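The equivalent boto3 call is short; the database name here is a placeholder:

```python
import boto3

glue = boto3.client("glue")
glue.create_database(
    DatabaseInput={
        "Name": "ebay_laptops_raw",                      # placeholder database name
        "Description": "Raw scraped used-laptop listings",
    }
)
```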

Step 3: Import .csv files from S3 to AWS Glue using a Crawler

Once the database is created, click on the name of the database and proceed to create a Crawler. In AWS Glue, a crawler is an automated tool that scans and analyzes data sources to infer their schema and metadata, making it easier to catalog and process the data.

Create an AWS Glue Crawler (Image by Author)

After clicking on “Create crawler,” select “Add a data source” and follow the web flow to complete the configuration of the crawler.

Add a data source for Crawler (Image by Author)
Choose S3 as the data source for Crawler (Image by Author)

Configure security settings for the Crawler and choose the IAM role we have created in Step 1.

Set the target database to the database we have created in Step 2.

Set target database for Crawler (Image by Author)

Follow the web flow to complete the creation of the Crawler. Once the Crawler is created, run it and wait for it to finish processing.

Run the Crawler (Image by Author)
Crawler Run Success (Image by Author)

Once the Crawler has completed running, you can view the table schema by clicking on “Tables” in the sidebar.

View Table Schema (Image by Author)
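For reference, the console clicks in this step can also be scripted with boto3. The crawler name, role, database, and S3 path below are placeholders matching the earlier sketches:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that points at the raw CSV prefix in S3.
glue.create_crawler(
    Name="ebay-laptops-raw-crawler",          # placeholder crawler name
    Role="glue-etl-demo-role",                # IAM role from Step 1
    DatabaseName="ebay_laptops_raw",          # database from Step 2
    Targets={"S3Targets": [{"Path": "s3://my-etl-demo-bucket/raw/"}]},
)

# Run the crawler; its state can be polled until it returns to READY.
glue.start_crawler(Name="ebay-laptops-raw-crawler")
state = glue.get_crawler(Name="ebay-laptops-raw-crawler")["Crawler"]["State"]
print("Crawler state:", state)
```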

Step 4: Transform the data using an AWS Glue job

Next, we proceed to create an AWS Glue job to transform the data, making it ready for further analysis. Click on “ETL jobs” in the sidebar, and then select “Create job from a blank graph.”

Create a Job From a Blank Graph in AWS Glue (Image by Author)

First, we need to create a data source, which is the .csv file the Crawler added to the database.

Add a Data Source for the Glue Job (Image by Author)

To transform the data so that it aligns with the needs of future analysis, AWS Glue jobs offer a variety of built-in transforms. Some popular ones include:

  • ApplyMapping: Maps source columns to target columns.
  • Filter: Filters data based on specified conditions to include or exclude certain records.
  • Join: Combines data from multiple sources based on a common key or condition.

For a comprehensive list of available options, you can refer to the official AWS documentation here.

Here is a quick example showcasing the usage of ApplyMapping to change column names, modify data types, and drop specific columns during the data transformation process:

Transform Data in Glue Job (Image by Author)
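For readers who prefer to see the code, here is a minimal excerpt of what the corresponding PySpark step can look like in a Glue job script. The database, table, and column names are illustrative placeholders, not my exact schema:

```python
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table the Crawler added to the Data Catalog (placeholder names).
source_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="ebay_laptops_raw",
    table_name="raw",
)

# Each mapping is (source column, source type, target column, target type).
# Columns left out of the mappings list are dropped from the output.
mapped_dyf = ApplyMapping.apply(
    frame=source_dyf,
    mappings=[
        ("title", "string", "listing_title", "string"),
        ("price", "string", "price_usd", "double"),
    ],
)
```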

Next, we configure the target location for the transformed data as the S3 bucket, specifying the desired format such as .csv or .tsv.

Please note that the S3 target location path must end with a trailing slash (/). To ensure this, click on “Browse S3,” select the desired bucket, and make sure to add a slash (/) after the automatically filled path.

Add Data Target in AWS Glue Job (Image by Author)

Then, we proceed to configure some basic job properties under “Job details.” We can utilize the same IAM role we created in Step 1 since it already possesses the necessary access. For the remaining settings, I simply relied on the default options, but you can customize them.

Basic Properties under Job Details in AWS Glue (Image by Author)

Under “Job details,” we have the option to configure advanced properties, including saving the auto-generated Python script used for data transformation. This script can be stored in the same S3 bucket or any other preferred location of your choice.

Please note that once again, it is essential to add a trailing slash (/) after selecting the path to avoid encountering an error message.

Save the Auto-generated Script to the S3 Bucket (Image by Author)

Lastly, continue following the remaining steps in the web flow to create the job. Once the job is successfully created, run it. To monitor the progress, you can click on “Job run monitoring” in the sidebar to view the status of the job execution.

Monitor AWS Glue Job Run Status (Image by Author)
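The job can also be started and polled programmatically with boto3 (a sketch; the job name is a placeholder):

```python
import time
import boto3

glue = boto3.client("glue")

# Kick off a run of the Glue job created above.
run_id = glue.start_job_run(JobName="ebay-laptops-transform")["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    run = glue.get_job_run(JobName="ebay-laptops-transform", RunId=run_id)
    state = run["JobRun"]["JobRunState"]
    print("Job state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```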

Step 5: Load the transformed data to Athena for query and analysis

In the final step, we will load the transformed data into AWS Athena for querying and further analysis. To accomplish this, we need to create a new Crawler specifically for the transformed data.

Create a New Crawler for the Transformed Data (Image by Author)

Create a new database for the transformed data.

Create a Database for the Transformed Data (Image by Author)

Next, proceed with the web flow to create the database and follow the same process as Step 3 to add the transformed data, which is saved in the S3 bucket, as a table in the database.

Load Transformed Data into AWS Athena (Image by Author)

Then, you can click on “Table data” to load the transformed data into AWS Athena. Additionally, we will need to configure the settings and add an IAM role for AWS Athena to ensure proper access and permissions.

Configure Settings for AWS Athena (Image by Author)

Finally, we can now query and analyze the transformed data using SQL in AWS Athena.
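As a quick illustration, a query can also be submitted to Athena from Python with boto3; the database, table, and result-location names below are placeholders:

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT listing_title, price_usd FROM transformed LIMIT 10",
    QueryExecutionContext={"Database": "ebay_laptops_transformed"},
    ResultConfiguration={"OutputLocation": "s3://my-etl-demo-bucket/athena-results/"},
)
query_id = response["QueryExecutionId"]

# Once the query reaches the SUCCEEDED state, the rows can be fetched with
# athena.get_query_results(QueryExecutionId=query_id).
```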

Please note that AWS Athena is not entirely free, as it operates on a pay-per-query model: costs are based on the amount of data scanned by your queries. Depending on your account, some free usage may be available.


Congratulations on completing this demo! I hope that sharing my personal learning experience has provided valuable insights. Best wishes as you embark on your Data Science journey!
