Introduction to AWS Glue for ETL: Data Pipeline from S3/MySQL to AWS Redshift

Arlene R
19 min read · Jun 8, 2023


Overview of an ETL Process

ETL

ETL stands for Extract, Transform, Load. It’s a process of extracting data from various sources, transforming it to meet specific requirements, and loading it into a target system, typically a data warehouse. An ETL process is a commonly used approach in data integration and data warehousing.

The frequency of ETL processes varies with the specific requirements of the data integration and data warehousing tasks; it can be daily, weekly, or monthly.

Extract

This step retrieves data from various data sources.

  • Databases such as MySQL, Oracle, MongoDB, etc.
  • Structured and semi-structured data sources in formats such as flat files, XML, JSON, CSV, etc.
  • APIs

Transform

This step manipulates and modifies the extracted data to meet specific requirements, including cleansing, aggregating, or enriching it; a brief PySpark sketch of a few of these operations follows the list below.

  • Date Formatting & Conversion — UTC to Local Time and vice versa
  • Applying Calculations
  • String functions, e.g. concatenations
  • Case normalization — Curbside, CURBSIDE, curbside -> Curbside
  • Use of Joins
  • Filtering
  • Conversion of units of measurement
  • Merging data
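
As a rough illustration, here is a minimal PySpark sketch of a few of these transformations. It is only a sketch: the column names, S3 path, and time zone are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-transform-demo").getOrCreate()

# Assumed source: a CSV file with hypothetical columns
df = spark.read.csv("s3://my-bucket/orders/", header=True, inferSchema=True)

transformed = (
    df
    # Date conversion: UTC timestamp to a local time zone
    .withColumn("order_ts_local", F.from_utc_timestamp("order_ts", "Asia/Colombo"))
    # Applying a calculation
    .withColumn("total", F.col("net_total") - F.col("discount"))
    # String concatenation
    .withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))
    # Case normalization: curbside / CURBSIDE / Curbside -> Curbside
    .withColumn("order_type", F.initcap(F.lower(F.col("order_type"))))
    # Filtering
    .filter(F.col("order_status") != "CANCELLED")
)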

Load

This step inserts the transformed data into a target system, typically a data warehouse, making it available for analysis and reporting purposes.

A data pipeline is the set of processes used to move data from one or more sources into a target repository such as a data warehouse; an ETL pipeline is a particular type of data pipeline.

Introduction to AWS Glue

AWS Glue is a fully managed, serverless data integration and ETL (Extract, Transform, Load) service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, Machine Learning (ML), and application development.

With AWS Glue, users can easily create and run ETL jobs that extract data from various sources, transform the data and load it.

AWS Glue uses ETL jobs to extract data from a combination of other cloud services offered by Amazon Web Services (AWS) and incorporates it into data warehouses and data lakes.

To make the data integration process smoother, Glue offers both visual and code-based tools. These include data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It is also serverless, which means there is no infrastructure to manage. It uses application programming interfaces (APIs) to transform the extracted data set for integration and to help users monitor jobs. Users can run ETL jobs on a schedule or pick events that will trigger a job. Once triggered, AWS Glue extracts the data, transforms it based on code that Glue generates automatically, and loads it into Amazon S3 or Amazon Redshift.

Customers using AWS Glue

Many well-known companies use AWS Glue for their ETL needs. Below are examples of how some of them use it.

  • BMW: BMW uses AWS Glue to process and transform data for the connected car platform. They use Glue to extract data from various sources, transform it, and load it into their data warehouse. This allows them to analyze data from their vehicles and improve their products and services.
  • Upserve: Upserve, a restaurant management platform, uses AWS Glue to process and transform data for their analytics and reporting features. They use Glue to extract data from various sources, transform it, and load it into their data warehouse. This allows them to provide insights to their customers and help them make data-driven decisions.
  • Expedia: Expedia, the online travel booking company, uses AWS Glue to process and transform data for their travel products. They use Glue to extract data from various sources, transform it, and load it into their data warehouse. This allows them to analyze customer behavior and improve their products and services.

These are just a few examples of how companies are using AWS Glue to process and transform data. AWS Glue is a powerful tool that can be used in a wide variety of industries and use cases.

AWS Glue Features

The features of AWS Glue provide users with the capability to discover, transform, and integrate data effectively, ensuring data accuracy, scalability, and operational ease in the ETL process.

  1. Data Extraction and Transformation: AWS Glue enables easy extraction of data from various sources, including databases, files, and streaming data. It offers built-in transforms and custom transformation capabilities using Python or Apache Spark for data manipulation and cleansing.
  2. Automatic Schema Discovery and Metadata Management: AWS Glue automates schema discovery by analyzing the data and inferring schemas, saving time in data preparation. It also provides a centralized data catalog for managing metadata and facilitating data discovery.
  3. Serverless and Scalable: With AWS Glue, users do not need to manage infrastructure or worry about scalability. It operates in a serverless manner, automatically scaling resources based on the volume of data being processed.
  4. Job Scheduling: Glue jobs can run on a schedule, on demand, or in response to an event. You can also use the scheduler to create sophisticated ETL pipelines by establishing dependencies between jobs.
  5. Automatic Code Generation: AWS Glue automatically generates the ETL code; the only input necessary is a location/path where the data is stored. The generated code is in either Scala or Python.
  6. Drag-and-Drop Interface: Using the drag-and-drop job editor, you can compose the ETL process visually, and AWS Glue will instantly build the code to extract, transform, and load the data.

AWS Glue Benefits

The benefits of AWS Glue are as follows:

  1. Performance and Scalability: AWS Glue leverages the power of Apache Spark, a distributed processing framework, to handle large volumes of data efficiently.
  2. Time and Cost Savings: AWS Glue automates the time-consuming tasks of data extraction, transformation, and loading. This reduces the manual effort required for ETL processes, resulting in significant time and cost savings.
  3. Faster Time-to-Insights: By automating the ETL process and providing a scalable infrastructure, AWS Glue enables faster data preparation and transformation. This results in reduced time-to-insights, enabling organizations to make data-driven decisions more quickly.
  4. Integration with AWS Services: AWS Glue seamlessly integrates with other AWS services, such as Amazon S3, Amazon Redshift, Amazon Athena, and AWS Lambda. This integration allows for building end-to-end data pipelines and leveraging the capabilities of these services.

AWS Components

AWS Glue relies on the interaction of several components to design and maintain the ETL workflow. The following are the key components:

Data Catalog

AWS Glue Data Catalog provides a centralized metadata repository that stores table definitions and schema information.

AWS Glue depends on the AWS Glue Data Catalog which maintains metadata about sources and targets and executes ETL jobs based on this metadata to move data from sources to targets.
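
As an aside, the Data Catalog can also be queried programmatically. Below is a minimal boto3 sketch, assuming a hypothetical catalog database named glue_demo_db, that lists the table definitions registered in it:

import boto3

glue = boto3.client("glue")

# List the table definitions stored in the Data Catalog for a given database
response = glue.get_tables(DatabaseName="glue_demo_db")  # hypothetical database name
for table in response["TableList"]:
    columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], columns)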

Classifier

A classifier determines the schema of the data. AWS Glue provides built-in classifiers for common relational database management systems and file types such as CSV, JSON, Avro, and XML.

Connection

An AWS Glue connection is a Data Catalog object that holds the properties needed to connect to a particular data store. AWS Glue supports the following connection types:

  • JDBC
  • Amazon Relational Database Service (Amazon RDS)
  • Amazon Redshift
  • Amazon DocumentDB
  • Kafka
  • MongoDB
  • MongoDB Atlas

Crawler

Crawlers are automated tools that discover data in different sources, infer its schema, and create table definitions in the Glue Data Catalog.

Database & Table

A database is a logical group of related table definitions in the Data Catalog. Table definitions can be added to the Data Catalog manually or by crawlers.

Job

An AWS Glue job encapsulates the business logic required for ETL work. A job consists of a transformation script, data sources, and data targets.

Job Scheduler

The job scheduler lets Glue jobs be set up and invoked on a flexible schedule, by event-based triggers, or on demand. Several jobs can be started in parallel, and users can specify dependencies between jobs.

Trigger

A trigger starts an ETL job or crawler. Triggers can be set to fire at a specific time or in response to an event, letting you automate jobs and design dependencies between jobs and crawlers.

Data Source

A data source is a collection of data that is utilized as input to a process or transformation.

Overview of Supported Data Sources

AWS Glue allows you to read and write data from multiple systems and databases including:

  • Amazon S3
  • Amazon DynamoDB
  • Amazon Redshift
  • Amazon Relational Database Service (Amazon RDS) — MySQL, PostgreSQL, SQL Server, Oracle
  • JDBC accessible databases
  • MongoDB and Amazon DocumentDB (with MongoDB compatibility)

AWS Glue can stream data from the following systems:

  • Amazon Kinesis Data Streams
  • Apache Kafka

Data Target

A data target is data storage where the job writes the transformed data.

Overview of Supported Data Target

AWS Glue supports most of the same sources as targets too, such as Amazon S3, RDS, MongoDB, DocumentDB, and any database that can be accessed through a JDBC connection, including Amazon Redshift.

AWS Redshift as a Data Target

Overview of AWS Redshift as a Data Warehouse

AWS Redshift is a popular data warehousing solution that enables businesses to store and analyze large volumes of data. In this blog, we will provide a brief overview of AWS Redshift as a data warehouse and discuss how to prepare Redshift for ETL (Extract, Transform, Load) using AWS Glue, a powerful data integration service.

AWS Redshift is designed for handling analytical workloads and offers several key features:

  • Scalability: Redshift allows you to scale your cluster up or down to accommodate changing data needs.
  • Columnar Storage: Data is stored in a columnar format, which enables faster query performance and efficient compression.
  • Massively Parallel Processing (MPP): Redshift leverages parallel processing across multiple nodes to execute queries in a highly efficient manner.
  • Advanced Query Optimization: Redshift’s query optimizer generates optimized query plans to improve performance based on data distribution, sort keys, and compression encoding.

AWS Glue and AWS Redshift are complementary services for data warehousing, therefore AWS Glue can be used as the ETL tool for RedShift. AWS Glue simplifies and automates the ETL process, preparing data for analysis and loading it into data warehouses like AWS Redshift. Here’s why they are used together:

  • Data Integration and Transformation: AWS Glue’s ETL capabilities make it easier to integrate and transform data from various sources into a format suitable for analysis in AWS Redshift.
  • Simplified Data Management: AWS Glue’s data catalog and metadata capabilities complement AWS Redshift’s data management capabilities. The centralized catalog provides a comprehensive view of data assets and management in the data warehousing environment.
  • End-to-End Data Pipelines: AWS Glue’s integration with AWS services allows for building end-to-end data pipelines that cover data extraction, transformation, loading, and analysis in AWS Redshift. This enables a streamlined and automated workflow for data warehousing and analytics.

AWS Glue Studio

AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor data integration jobs in AWS Glue. It lets you visually compose data transformation workflows and seamlessly run them on the Apache Spark–based serverless ETL engine in AWS Glue.

With AWS Glue Studio, you can create and manage jobs that gather, transform, and clean data. This makes the service accessible for complex data processing tasks without requiring advanced skills such as writing or editing code.

Setup Prerequisites

Make sure you have an AWS account with the necessary permissions to create and manage Glue and Redshift resources.

  1. An AWS Account

2. Required access to the following:

  • Amazon S3 Access
  • Amazon RDS Access
  • Amazon RedShift Access

3. Create the IAM roles for AWS Glue to access and interact with other AWS services.

  • Go to the Console and select IAM
  • Go to the Roles pane on the left side under Access management
  • Click on Create Role

Setup the Environment

Ensure that you have the following resources set up:

  • An S3 bucket containing the source data
  • A MySQL database with the source data
  • An Amazon Redshift cluster or serverless workgroup to store the transformed data

Refer : https://medium.com/@arlene.r/introduction-to-aws-redshift-for-modern-data-warehousing-f905458db535

Preparing Redshift for ETL using AWS Glue

To prepare AWS Redshift for ETL with AWS Glue, follow these steps:

  • Set Up Redshift : Create a provisioned or serverless Redshift cluster in the AWS Management Console based on the requirement
  • Define the Redshift Database and Tables: Create a database and tables that will hold the transformed data in Redshift, specifying the column names, data types, etc.
  • Configure Redshift Security: Configure the appropriate VPC, security groups, and subnets to allow connectivity between Redshift and AWS Glue, and configure the appropriate permissions and IAM roles

ETL with AWS Glue Service from S3 to AWS RedShift — Hands On

Building ETL jobs with AWS Glue for transferring data from S3 to RedShift.

Here’s a step-by-step approach for using AWS Glue Service and S3 for ETL:

STEP 1 : Create an S3 bucket to store CSV data

Let’s assume that we have a CSV file named “orders.csv” in our S3 bucket.

orders.csv

order_id,restaurant_id,no_of_items,net_total,discount,total,customer_id,order_status,order_type,source,sales_date
79adee2f-8e77-48d8-ba21-8cf3eb1306b8,restaurant2,42,8024,696,7328,93,OPEN,DELIVERY,MOBILE,2022-08-08
fae4d84d-a55f-4053-8b3e-c3ed59dcff1d,restaurant5,51,3522,259,3263,95,CLOSED,PICKUP,WEB,2023-02-04
277e879f-1b66-43e9-9c93-026d3f395401,restaurant3,97,7665,573,7092,,CANCELLED,PICKUP,WEB,2022-06-27
b93a5e10-7b79-4094-82ac-dcaf9d0e0551,restaurant1,55,9491,608,8883,79,CANCELLED,DELIVERY,WEB,2022-07-17
812c1b6e-33d7-457d-b694-6f990b38bfe9,restaurant1,68,9140,989,8151,91,CLOSED,DELIVERY,MOBILE,2022-08-19
e353f30b-ff09-44d9-a759-8416caf6d1db,restaurant4,60,9925,979,8946,,CANCELLED,PICKUP,WEB,2022-09-27
95365fe9-534a-430e-8102-0731f97475f0,restaurant5,17,9715,642,9073,,OPEN,PICKUP,MOBILE,2023-04-27
7b4acdb0-255f-4c5c-9d30-6046e31fdb53,restaurant1,84,1031,817,214,84,CANCELLED,DELIVERY,WEB,2023-05-20
2f67d5dd-7535-4dd2-8302-171709c8c2d5,restaurant5,52,7055,632,6423,4,CLOSED,DELIVERY,WEB,2022-11-17
2b77c424-1c92-4140-88ac-ba2530656ddb,restaurant2,54,4098,368,3730,98,OPEN,PICKUP,WEB,2023-02-21

STEP 2 : Create a Database and a Table in the RedShift Cluster or Serverless

Orders Table

CREATE DATABASE redshift_demo;
CREATE TABLE public.orders (
order_id character varying(256) NOT NULL,
restaurant_id character varying(256),
no_of_items integer,
net_total integer,
discount integer,
total integer,
customer_id integer,
order_status character varying(256),
order_type character varying(256),
source character varying(256),
sales_date date,
PRIMARY KEY (order_id)
)
DISTSTYLE AUTO
SORTKEY(sales_date);

STEP 3 : Create a RedShift Connection

  1. Go to Connectors

2. Click on Create Connection and fill in the connection details.

Once created, it will be available in the list.

STEP 4 : Create a Database

1. Go to Databases and Click on Add Database.

2. Fill the Database name and create the Database.

3. Once created, it will be available in the list.

STEP 5 : Create a Glue crawler and configure it to crawl the data in the S3 bucket.

1. Go to Crawlers list and click on Create Crawler.

2. Fill the Crawler name.

3. To select the Source, click on Add a Data Source.

4. Provide the S3 Path

5. Provide the IAM role

6. Select the Target database and Set the Schedule

For the Frequency, there are several options: On-Demand, Hourly, Daily, Weekly, Monthly, and Custom.

7. Review and Create

8. Once created, it will be available in the list.

STEP 6 : Run the crawler to discover the schema of the data

  1. Click on Run for the relevant crawler

2. Once it starts executing, a green banner will appear and the state will change to Running.

3. Once done, it will be in Ready state.
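
If you prefer to start the crawler programmatically instead of from the console, a minimal boto3 sketch (the crawler name is a hypothetical stand-in for the crawler created above) could look like this:

import time
import boto3

glue = boto3.client("glue")
crawler_name = "orders-s3-crawler"  # hypothetical crawler name

glue.start_crawler(Name=crawler_name)

# Poll until the crawler finishes and returns to the READY state
while True:
    state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
    print("Crawler state:", state)
    if state == "READY":
        break
    time.sleep(30)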

STEP 7 : Check the created Table definitions for the Glue Data Catalog

1. Go to Tables list and click on the table name.

2. Review the Overview and the Schema. If you need to change the data types, click on Edit Schema as JSON.

STEP 8 : Go to AWS Glue Studio and Create a Glue job, specifying the source as S3 and target as Redshift

  1. Select the Source as S3 and Target as RedShift.

2. ETL Job Visual

3. Go to Script

A Glue job is a script that defines the ETL process. You can use Python or Scala to write your Glue job.
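
For orientation, the sketch below shows roughly what the generated PySpark script for this job looks like. It is a simplified, hand-written approximation, not the exact code Glue Studio produces; the catalog database, table, connection name, and temporary S3 path are assumptions.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the Data Catalog table created by the crawler (names assumed)
orders = glueContext.create_dynamic_frame.from_catalog(
    database="glue_demo_db", table_name="orders_csv"
)

# Transform: map the source columns and types to the Redshift orders table
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("restaurant_id", "string", "restaurant_id", "string"),
        ("no_of_items", "long", "no_of_items", "int"),
        ("net_total", "long", "net_total", "int"),
        ("discount", "long", "discount", "int"),
        ("total", "long", "total", "int"),
        ("customer_id", "long", "customer_id", "int"),
        ("order_status", "string", "order_status", "string"),
        ("order_type", "string", "order_type", "string"),
        ("source", "string", "source", "string"),
        ("sales_date", "string", "sales_date", "date"),
    ],
)

# Target: Redshift, written through the Glue connection created earlier (names assumed)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.orders", "database": "redshift_demo"},
    redshift_tmp_dir="s3://my-glue-temp-bucket/temp/",
)

job.commit()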

4. Define the data source

Define the S3 data source as a Data Catalog table. You can change the data types in the Output Schema if required.

5. Define the data target

Define the data target as Redshift with a direct connection, and select the schema and table. You can change the data types in the Output Schema if required.

6. Provide the Job Name and save

STEP 9 : Define the transformation

Perform any necessary data transformations using Glue's built-in transformation functions. Define the transformations that have to be applied to the data. This can include filtering, aggregating, or joining data.

The available transforms:

After selecting any built-in transform, you can configure it accordingly; a sketch of the SQL Query transform's script equivalent follows below.

ApplyMapping

SQL Query
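
As a rough sketch of what a SQL Query transform does inside the job, the equivalent in the generated PySpark script would be to register the data as a temporary view and run Spark SQL over it. The frame name "mapped", the view name, and the filter below are assumptions carried over from the earlier script sketch:

from awsglue.dynamicframe import DynamicFrame

# Convert the DynamicFrame from the previous step to a DataFrame and register a view
orders_df = mapped.toDF()
orders_df.createOrReplaceTempView("orders")

# Apply a SQL transformation, e.g. keep only closed orders
filtered_df = spark.sql("""
    SELECT order_id, restaurant_id, net_total, discount, total, sales_date
    FROM orders
    WHERE order_status = 'CLOSED'
""")

# Convert back to a DynamicFrame before writing it to the target
filtered = DynamicFrame.fromDF(filtered_df, glueContext, "filtered")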

STEP 10 : Run the Glue job

Once the job is defined, run it. AWS Glue will automatically provision the necessary resources to run your job, including Spark clusters.

STEP 11 : Monitor the job

While the job is running, monitor its progress in the AWS Glue monitoring view. View the logs to troubleshoot any issues that arise.

Once completed, the run status will change to Succeeded.

STEP 12 : Verify the results

Once the job is complete, verify that the transformed data has been loaded into the target data store.

Go to the Redshift database and run the query below:

SELECT * FROM "redshift_demo"."public"."orders";

STEP 13 : Schedule the job

The ETL process can be run on a regular basis. The Glue job can be scheduled to run at specific intervals using AWS Glue's scheduling or trigger features.

AWS Glue Schedule

1. Go to Schedules and click on Create schedule

2. Provide the details.

AWS Triggers

  1. Go to Triggers and click on “Add Trigger”

2. Provide the details.

3. Add the Target Resource

4. Select the Resource Type : Crawler or Job

5. Review and Create
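
The same schedule-based trigger can also be created programmatically. Here is a hedged boto3 sketch; the trigger name, job name, and cron expression are assumptions:

import boto3

glue = boto3.client("glue")

# Create a scheduled trigger that runs the ETL job every day at 02:00 UTC
glue.create_trigger(
    Name="daily-orders-etl-trigger",                      # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",                         # Glue cron syntax
    Actions=[{"JobName": "s3-to-redshift-orders-job"}],   # hypothetical job name
    StartOnCreation=True,
)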

ETL with AWS Glue Service from RDS to AWS RedShift — Hands On

Building ETL jobs with AWS Glue for transferring data from RDS to RedShift.

Here’s a step-by-step approach for using AWS Glue Service and RDS for ETL:

STEP 1 : Create a Database and a Table in MySQL

Let's assume that we have a database and a table called “CustomerDemo”.

CREATE DATABASE `ar_db`;
CREATE TABLE `CustomerDemo` (
`customer_id` varchar(100) CHARACTER SET utf8mb3 NOT NULL,
`first_name` varchar(50) CHARACTER SET utf8mb3 DEFAULT NULL,
`last_name` varchar(50) CHARACTER SET utf8mb3 DEFAULT NULL,
`Email` varchar(100) CHARACTER SET utf8mb3 DEFAULT NULL,
`Gender` varchar(10) CHARACTER SET utf8mb3 DEFAULT NULL,
`Phone` varchar(15) CHARACTER SET utf8mb3 DEFAULT NULL,
`date_of_birth` date DEFAULT NULL,
`signed_up_date` datetime DEFAULT NULL,
`active_status` bit(1) DEFAULT NULL,
`updated_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

Insert data:

INSERT INTO `CustomerDemo` (`customer_id`,`first_name`,`last_name`,`Email`,`Gender`,`Phone`,`date_of_birth`,`signed_up_date`,`active_status`,`updated_date`) VALUES ('0009eaf5-ec1c-4465-9419-f41f41651415','Julianne','Gagnon','abc1@yahoo.com','FEMALE','1111111111','1996-06-10','2017-11-20 04:32:06','0','2023-05-30 05:58:40');
INSERT INTO `CustomerDemo` (`customer_id`,`first_name`,`last_name`,`Email`,`Gender`,`Phone`,`date_of_birth`,`signed_up_date`,`active_status`,`updated_date`) VALUES ('000e041f-de09-4c7d-8b38-4d6ca94eb18d','Tamara','Diaz','abc2@yahoo.com','FEMALE','1111111111','2000-11-06','2019-11-06 16:19:19','0','2023-05-30 05:58:40');
INSERT INTO `CustomerDemo` (`customer_id`,`first_name`,`last_name`,`Email`,`Gender`,`Phone`,`date_of_birth`,`signed_up_date`,`active_status`,`updated_date`) VALUES ('0037f941-7299-4470-814b-ee1478733a41','Chuck','Marden','abc3@yahoo.com','MALE','1111111111',NULL,'2020-01-23 02:11:00','0','2023-05-30 05:58:40');
INSERT INTO `CustomerDemo` (`customer_id`,`first_name`,`last_name`,`Email`,`Gender`,`Phone`,`date_of_birth`,`signed_up_date`,`active_status`,`updated_date`) VALUES ('0043fc34-7d37-48d4-89c6-6ec5e05be0b9','Johanna','Marian','abc5@yahoo.com','MALE','6789065432','1900-07-17','2019-05-31 05:38:19','0','2023-05-30 05:58:40');
INSERT INTO `CustomerDemo` (`customer_id`,`first_name`,`last_name`,`Email`,`Gender`,`Phone`,`date_of_birth`,`signed_up_date`,`active_status`,`updated_date`) VALUES ('004427b4-491e-4fd2-b1f4-9fbc03a0ef04','Lori','Jagodowski','abc5@yahoo.com','FEMALE','1111111111','1969-05-17','2018-04-18 02:10:09','0','2023-05-30 05:58:40');

STEP 2 : Create a Database and a Table in the RedShift Cluster or Serverless

CREATE DATABASE redshift_demo;
CREATE TABLE IF NOT EXISTS public.customer (
customer_id character varying(256),
first_name character varying(256),
last_name character varying(256),
email character varying(256),
gender character varying(256),
phone character varying(256),
date_of_birth date,
signed_up_date date,
active_status boolean,
updated_date date,
PRIMARY KEY (customer_id)
)
DISTSTYLE AUTO
SORTKEY(updated_date);

STEP 3 : Create an RDS Connection

  1. Go to Connectors

2. Click on Create Connection and fill in the connection details.

Once created, it will be available in the list.

STEP 4 : Create a Database

1. Go to Databases and Click on Add Database.

2. Fill the Database name and create the Database.

3. Once created, it will be available in the list.

STEP 5 : Create a new Glue crawler and configure it to crawl the data in your MySQL database.

1. Go to Crawlers list and click on Create Crawler.

2. Fill the Crawler name.

3. To select the Source, click on Add a Data Source.

4. Provide the JDBC source

5. Provide the IAM role

6. Select the Target database and Set the Schedule

7. Review and Create

8. Once created, it will be available in the list.

STEP 6 : Run the crawler to discover the schema of the data

  1. Click on Run for the relevant crawler

2. Once it starts running, the state will change to Running.

3. Once done, it will be in Ready state.

STEP 7 : Check the created Table definitions for the Glue Data Catalog

1. Go to Tables list and click on the table name.

2. Review the Overview and the Schema. If you need to change the data types, click on Edit Schema as JSON.

STEP 8 : Go to AWS Glue Studio and Create a new Glue job, specifying the source as MySQL and target as Redshift data stores.

  1. Select the Source as MySQL and Target as RedShift.

2. ETL Job Visual

3. Go to Script

A Glue job is a script that defines the ETL process. You can use Python or Scala to write your Glue job.

4. Define the data source

Define the MySQL data source as a Data Catalog table. You can change the data types in the Output Schema if required.

5. Define the data target

Define the data target as Redshift with a direct connection, and select the schema and table. You can change the data types in the Output Schema if required.

6. Provide the Job Name and save

STEP 9 : Define the transformation

Perform any necessary data transformations using Glue's built-in transformation functions. Define the transformations that have to be applied to the data. This can include filtering, aggregating, or joining data; an ApplyMapping sketch for the customer table follows below.

ApplyMapping

SQL Query
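
As with the S3 example, the core of the transformation is an ApplyMapping step that casts the MySQL column types to the Redshift customer table definition. The sketch below assumes a DynamicFrame named "customers" has already been read from the Data Catalog table for CustomerDemo; the source types shown are what a crawler would typically infer and may differ in practice:

from awsglue.transforms import ApplyMapping

mapped_customers = ApplyMapping.apply(
    frame=customers,
    mappings=[
        ("customer_id", "string", "customer_id", "string"),
        ("first_name", "string", "first_name", "string"),
        ("last_name", "string", "last_name", "string"),
        ("Email", "string", "email", "string"),
        ("Gender", "string", "gender", "string"),
        ("Phone", "string", "phone", "string"),
        ("date_of_birth", "date", "date_of_birth", "date"),
        ("signed_up_date", "timestamp", "signed_up_date", "date"),
        ("active_status", "boolean", "active_status", "boolean"),
        ("updated_date", "timestamp", "updated_date", "date"),
    ],
)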

STEP 10 : Run the Glue job

Once the job is defined, run it. AWS Glue will automatically provision the necessary resources to run your job, including Spark clusters.

STEP 11 : Monitor the job

While the job is running, monitor its progress in the AWS Glue monitoring view. View the logs to troubleshoot any issues that arise.

Once completed, the run status will change to Succeeded.

STEP 12 : Verify the results

Once the job is complete, verify that the transformed data has been loaded into the target data store.

Go to the Redshift database and run the query below:

SELECT * FROM "redshift_demo"."public"."customer";

STEP 13 : Schedule the job

The Glue job can be scheduled to run at specific intervals using AWS Glue’s scheduling or AWS Trigger feature.

Conclusion

In this blog, we explored how AWS Glue can be used to transfer data from S3 and MySQL to Amazon Redshift, demonstrating both scenarios with the use of a crawler. AWS Glue simplifies the ETL process, making it easy to move and transform data between various data stores. By leveraging AWS Glue, you can focus on analyzing and processing your data rather than managing the underlying infrastructure.
