ETL Workflow Guide Using AWS Glue Studio with S3 and Athena
Create, Run, and Monitor ETL with AWS
The ‘dark ages,’ when paper forms and filing cabinets ruled, have passed. Today, it’s the ‘golden age’ in which databases are everywhere, and the technology of tomorrow won’t stop there - it will be the era of decision-making based on data analytics. You’ve probably heard how:
- Tesco reportedly increased beer sales by placing it next to diapers.
- Abraham Wald, a Hungarian-born mathematician, used data-driven insights during WW2 to protect planes from enemy fire.
- Netflix used big data to enhance the user experience and grow its customer base.
To use data for this kind of decision-making, we can't rely on traditional OLTP (Online Transaction Processing) systems; we need to pump our data from our databases into a data warehouse. This is where ETL (Extract, Transform, Load) comes in handy to manage the process. In this article, we'll discuss how to perform ETL on AWS using AWS Glue, a serverless data integration service.
Getting Started with AWS
First, to sign in to the console, we need an AWS account. You'll have to enter details of a valid credit card, debit card, or other payment method, even when you create an account under the free tier. Don't worry: AWS won't charge a penny without letting you know. After creating an account, sign in to the console as the root user.
Step 1: Create a bucket
In the AWS Management Console, we can search for S3 and create a new bucket to store our data. S3 is object storage that can hold individual objects of up to 5 TB, and the bucket we create will act as the data store for this tutorial.
So, here I name my new bucket `medium-bkt`, select the region EU (Ireland) `eu-west-1`, and keep the default options for the rest.
I then create a simple CSV file with some dummy data and upload it to the bucket I just made.
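If you prefer to script this step, here is a minimal boto3 sketch that does the same thing: create the bucket in eu-west-1 and upload a dummy CSV. The file name `employees.csv` and the sample rows are just placeholders for whatever dummy data you use.

```python
import boto3

REGION = "eu-west-1"
BUCKET = "medium-bkt"  # bucket names are globally unique, so yours will differ

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket in EU (Ireland)
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)

# Write a small dummy CSV locally, then upload it to the bucket
with open("employees.csv", "w") as f:
    f.write("id,name,city\n")
    f.write("1,Alice,Dublin\n")
    f.write("2,Bob,Cork\n")

s3.upload_file("employees.csv", BUCKET, "employees.csv")
```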
Step 2: Define the Crawler using AWS Glue
In 2017, AWS introduced Glue, a serverless, fully managed, cloud-optimized ETL service. To use it, search for Glue in your AWS Management Console. There are multiple ways to connect to a data store, but for this tutorial I'm going to use a Crawler, one of the most popular methods among ETL engineers. The Crawler will scan the data in my S3 bucket and infer a table schema from it.
To add a Crawler for my S3 data store:
1. I give the Crawler the name `medium-crawler` and click `Next`.
2. Keeping the Crawler source type on the default settings, I click `Next` again.
3. I select S3 as the data store, specify the path of `medium-bkt`, and click `Next`. (Note: if you want to add a connection here, you have to complete the S3 VPC endpoint setup first.)
4. I select the existing IAM (Identity and Access Management) role radio button. Then, to create an IAM role, I go to the IAM Console, which opens the IAM Management Console:
- Click `Create role`.
- Select `Glue` under the use cases and click `Next`.
- Tick `AdministratorAccess` under the permission policies and click `Next`.
- Adding IAM tags is optional, so for this tutorial, let's continue without adding tags.
- Give the role a preferred name and click the `Create role` button.
5. Now, let’s come back to the Crawler creation. Select the created IAM role from the dropdown and click next
.
6. I set the frequency to `Run on demand` and click `Next`. (If needed, you can schedule your Crawler to run hourly, daily, weekly, monthly, etc.)
7. To store the Crawler output, I create a database called `employee-database`, select it from the dropdown, and click `Next`.
8. Finally, I review all the settings that have been configured and click `Finish` to create the Crawler.
Now that the Crawler has been created, I click `medium-crawler` and run it. If the Crawler's status moves from `Starting` to `Stopping` and then to `Ready`, the Crawler job has succeeded.
The Crawler job automatically creates the table in our database, detecting the columns in our CSV file and their data types.
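If you'd rather create and run the Crawler programmatically, the boto3 calls below are a rough sketch of the same console steps. The role ARN is a placeholder for the IAM role created above; the database and crawler names match the ones used in this tutorial.

```python
import time
import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# Placeholder: the ARN of the IAM role created in the IAM console above
ROLE_ARN = "arn:aws:iam::123456789012:role/medium-glue-role"

# Create the Data Catalog database that will hold the crawled table
glue.create_database(DatabaseInput={"Name": "employee-database"})

# Define a crawler that scans the S3 bucket and writes its schema to the database
glue.create_crawler(
    Name="medium-crawler",
    Role=ROLE_ARN,
    DatabaseName="employee-database",
    Targets={"S3Targets": [{"Path": "s3://medium-bkt/"}]},
)

glue.start_crawler(Name="medium-crawler")

# Poll until the crawler finishes (RUNNING -> STOPPING -> READY)
while True:
    state = glue.get_crawler(Name="medium-crawler")["Crawler"]["State"]
    print("Crawler state:", state)
    if state == "READY":
        break
    time.sleep(15)
```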
If the Crawler job ends with an endpoint error, check that you have an Amazon S3 VPC endpoint set up, which AWS Glue requires. If you haven't set one up before, here's how:
- Go to the Amazon VPC Console and select `Endpoints` under Virtual Private Cloud.
- Click `Create endpoint`, select the S3 gateway service `com.amazonaws.eu-west-1.s3` under the service names, and take the default options for the rest.
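The same gateway endpoint can also be created with boto3; the VPC and route table IDs below are placeholders for your own values.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Placeholders: use the IDs of your own VPC and its route table
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-west-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```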
Step 3: Define the Glue job
Finally, we are done with the environment, and now I define a Glue job to perform the ETL itself. I go back to the AWS Management Console, search for Glue, select `AWS Glue Studio`, and click `Jobs`.
As shown above, I select `Source and target added to the graph`, which directs us to a canvas where we can define the source-to-target mapping. Remember to give the Glue job a name here (in my case, `employee job`); otherwise, it will return an error.
To define the source-to-target mapping:
- I click `Data source` and select the source database and table.
- Then, I click `Transform`, set the transform logic to `Select Fields`, and pick the `id` and `name` fields from the Transform tab.
- Finally, I click `Data target` and specify the target path. (I created a new S3 bucket called `target-medium-bkt`.)
- I click `Job details` and select the IAM role we created in Step 2.
- Now we can save the job and run it.
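Under the hood, Glue Studio generates a PySpark script for this visual job. The sketch below is roughly what such a script looks like, assuming the Crawler named the catalog table `medium_bkt`; check the actual table name in your Data Catalog before running it.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import SelectFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Data source: the table the Crawler created in the Data Catalog
# (the table name "medium_bkt" is an assumption; check yours in the catalog)
source = glue_context.create_dynamic_frame.from_catalog(
    database="employee-database",
    table_name="medium_bkt",
)

# Transform: keep only the id and name fields
selected = SelectFields.apply(frame=source, paths=["id", "name"])

# Data target: write the result to the target bucket as CSV
glue_context.write_dynamic_frame.from_options(
    frame=selected,
    connection_type="s3",
    connection_options={"path": "s3://target-medium-bkt/"},
    format="csv",
)

job.commit()
```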
Step 4: Query the data with Athena
Amazon Athena is the query service AWS offers to analyze data in S3 using standard SQL. In this tutorial, I use Athena to query the ETL output and run some SQL operations. In the AWS Management Console, we can search for Athena, and there you can see the `medium-bkt` table that the Crawler automatically created in `employee-database` during the ETL.
But before running my first query, I need to set up a location in S3 to store the query output. So, I go back to the S3 dashboard and create a new folder called `query-output` inside my `medium-bkt` bucket. I then come back to Athena and specify that path as the query result location, as shown below.
Finally, I can query my source and target tables, see the results, and analyze the data with SQL.
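You can also submit Athena queries programmatically. The sketch below assumes the source table is called `medium_bkt` (use the name shown in your Athena console) and writes results to the `query-output` folder created above.

```python
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# The table name "medium_bkt" is an assumption; use the one shown in your console
query = athena.start_query_execution(
    QueryString="SELECT id, name FROM medium_bkt LIMIT 10",
    QueryExecutionContext={"Database": "employee-database"},
    ResultConfiguration={"OutputLocation": "s3://medium-bkt/query-output/"},
)
query_id = query["QueryExecutionId"]

# Wait for the query to finish, then print the rows
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```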
Summary
Performing ETL is a significant part of a typical data engineer's work. There are many cloud-based ETL tools out there, such as AWS Glue, Azure Data Factory, and Google Cloud Data Fusion. Whichever you choose, all of them help reduce development effort by providing a base to start from and accessible, manageable, serverless infrastructure.
In this tutorial, we have discussed how to perform Extract, Transform, Load with AWS Glue. I hope this helps with your projects; if you find any points worth mentioning that have been missed, please put them in the comments below.
Finally, I should point out that these services are not free, but the best thing about using AWS is that you can pay as you go.