Build your first ETL solution using AWS Glue

Aditya Sahu
Curious Data Catalog
10 min read · May 16, 2020

In this post, I am going to discuss how we can create ETL pipelines using AWS Glue. We will learn what AWS Glue is, how it uses Spark and Python, and how we can create a simple but robust ETL pipeline in the cloud without any hassle.

So, let's get started…!

AWS Glue 🤔

According to AWS

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.

AWS Glue Components

AWS Glue provides several components to build and manage ETL pipelines. There are many of them, but in this post we will use only the components below, which are the most common and the minimum necessary to build any ETL pipeline in AWS Glue.

Data Catalog

The Data Catalog is a persistent metadata store where we keep information about our data stores in the form of databases and tables. It can be used to query data with Amazon Athena and also serves as a metastore for other AWS services such as Redshift Spectrum and EMR.
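Once a table exists in the Data Catalog, its metadata can also be inspected programmatically. Below is a minimal boto3 sketch; the region, database, and table names are placeholders for whatever your crawler creates later in this post.

import boto3

# Placeholder region, database, and table names - replace with your own.
glue = boto3.client('glue', region_name='us-east-1')

response = glue.get_table(
    DatabaseName='aws_glue_lab_database',
    Name='forbes_celebrity_100_csv')

# Print the column names and types the crawler stored in the catalog.
for column in response['Table']['StorageDescriptor']['Columns']:
    print(column['Name'], column['Type'])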

Crawlers

Crawlers capture metadata from almost any kind of data store and record it in the Data Catalog so that it can be used for ETL processing.
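Crawlers can be created from the console (as we do later in this post) or from code. A rough boto3 sketch, assuming an existing IAM role and an S3 path — all names below are placeholders for illustration only:

import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Placeholder names and paths - adjust to your own account and bucket.
glue.create_crawler(
    Name='forbes_celebrity_crawler',
    Role='AWSGlueServiceRole-lab',           # existing IAM role with S3 + Glue access
    DatabaseName='aws_glue_lab_database',    # catalog database to store the table in
    Targets={'S3Targets': [{'Path': 's3://my-bucket/forbes_celebrity_100/'}]})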

Okay, enough theory. Let's start creating our first Glue job.

Go to the AWS Console and search for the AWS Glue service. In the AWS Glue console, click Jobs in the left pane under ETL; you will see an Add job button at the top left for adding a new Glue job.

Next, click Add job and you will land on a page where you can fill in the job details.

Okay, before moving forward, let's cover a little more theory, if it does not bother you 😅

AWS Glue gives us several options to make our jobs more efficient and to handle different use cases. I'll discuss a few of them that are important to know and widely used across the industry.

AWS Glue Job Parameters

Type

This parameter specifies the type of job we want to create. There are three job types to choose from, depending on the use case: Spark, Spark Streaming, and Python shell.

Wait, we just saw Spark… Let's have a quick catch-up on Spark. For those who are not familiar with it, I strongly suggest having a look at the official Spark documentation, as we will be using Spark extensively while building our ETL pipeline with Glue.

SPARK 🤗

It goes without saying that the Spark docs and Wikipedia have formal definitions, but let's discuss it in short.

Apache Spark is a framework for processing, querying and analyzing big data. Since the computation is done in memory, it is much faster than MapReduce. Some of the features of Spark are (a short PySpark sketch follows this list):

  • Easy to use, as we can write our applications in Python, R and Scala.
  • It can be up to 100 times faster than MapReduce.
  • ……
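To give a feel for how approachable the DataFrame API is, here is a tiny, self-contained PySpark sketch; the rows are made up purely for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("spark-quickstart").getOrCreate()

# A couple of made-up rows, just to show a derived column and a filter.
df = spark.createDataFrame(
    [("Celebrity A", 185, 2019), ("Celebrity B", 95, 2019)],
    ["name", "pay_usd_millions", "year"])

df.withColumn("pay_inr_millions", col("pay_usd_millions") * 75.88) \
  .filter(col("pay_usd_millions") > 100) \
  .show()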

AWS Glue gives us the flexibility to use Spark to develop our ETL pipeline.

Python Shell

We can also use the Python shell job type in AWS Glue to build parts of our ETL pipelines. These are very lightweight, pure Python jobs that can be used wherever we don't need Spark. The best example would be running the Redshift COPY command.
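For example, a Python shell job that issues a Redshift COPY could look roughly like the sketch below. It assumes a Postgres driver such as psycopg2 is packaged with (or otherwise available to) the job, and every connection detail, path and IAM role ARN is a placeholder.

import psycopg2  # assumed to be available to the Python shell job

# All connection details and the IAM role ARN below are placeholders.
conn = psycopg2.connect(
    host='my-redshift-cluster.xxxxxx.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='dev', user='awsuser', password='********')

copy_sql = """
    COPY public.s3_redshift_load
    FROM 's3://my-bucket/output/forbes/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-s3-role'
    FORMAT AS PARQUET;
"""

# The connection context manager commits the transaction on success.
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
conn.close()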

Spark Streaming

This is a topic of its own, but just so you know: AWS Glue supports ETL on streaming data from Amazon Kinesis Data Streams, Apache Kafka, and Amazon MSK.

Python Library Path

We can provide the S3 path to the Python libraries used in our ETL code.

Dependent Jars Path

We can also specify the S3 path to any JAR files we want to use in our ETL code. This is mostly needed when developing ETL in Scala, or when the job needs a specific library version that AWS Glue does not ship by default.

Worker Type

We can set the type of worker allocated when our job runs. There are three worker types: Standard, G.1X and G.2X.

Max Concurrency

We can set the concurrency of our Glue job, i.e. how many runs of the job can be triggered in parallel at a time. This is very important when we have a use case that requires parallelism in Glue.

So, that's it for Glue job parameters. There are other parameters as well, but for our use case this knowledge is enough. A consolidated sketch of these parameters follows.
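To tie the parameters together, here is a hedged boto3 sketch of creating a Spark job; every name, path and value is a placeholder, and the options shown map to what we just discussed (job type, library paths, JAR paths, worker type, concurrency).

import boto3

glue = boto3.client('glue', region_name='us-east-1')

glue.create_job(
    Name='forbes-s3-etl-job',                      # placeholder job name
    Role='AWSGlueServiceRole-lab',                 # existing IAM role
    Command={
        'Name': 'glueetl',                         # 'pythonshell' / 'gluestreaming' for the other job types
        'ScriptLocation': 's3://my-bucket/scripts/forbes_etl.py',
        'PythonVersion': '3'},
    DefaultArguments={
        '--extra-py-files': 's3://my-bucket/libs/my_helpers.zip',  # Python library path
        '--extra-jars': 's3://my-bucket/jars/extra-lib.jar'},      # dependent JARs path
    WorkerType='G.1X',                             # Standard | G.1X | G.2X
    NumberOfWorkers=2,
    ExecutionProperty={'MaxConcurrentRuns': 2})    # max concurrency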

Before moving forward and building our first ETL pipeline, let's discuss the pipeline we will create as part of this post.

Overview of Dataset:

All credit to Kaggle

This dataset contains information about the pay of the top 100 celebrities since 2005 (a little old, but it works for our use case). The file is in CSV format and contains 1,547 rows and 4 columns.

What we are going to do…

We will create two Glue jobs, plus a Redshift COPY step, to perform our ETL tasks:

1. Glue Job to load data into S3.

  • [Extract] Read the celebrity dataset file placed on S3
  • [Transform] Add pay in INR corresponding to the USD pay
  • [Load] Write the data to an S3 bucket in Parquet format

2. Glue job to load data into PostgreSQL (an RDS instance). This job is identical to the previous one; the purpose here is just to show how we can load into a database rather than S3.

3. Finally, we will load the data into Redshift using the COPY command.

Let’s get started…

Upload Dataset CSV file to Amazon S3

  • In the Amazon S3 console, click Create bucket to create a bucket where you can store your files (aka data)
  • Enter a bucket name, select a region and click Next
  • Keep all settings as default, for now
  • Create a new folder in the bucket and upload the CSV file (the same steps can be scripted with boto3, as sketched below)
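If you prefer scripting over the console, the same bucket creation and upload can be done with boto3. A minimal sketch with placeholder bucket and key names:

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# Placeholder bucket name. For regions other than us-east-1 you must also pass
# CreateBucketConfiguration={'LocationConstraint': '<region>'}.
s3.create_bucket(Bucket='aws-glue-lab-bucket-demo')

# Upload the local CSV into a folder (prefix) inside the bucket.
s3.upload_file(
    Filename='forbes_celebrity_100.csv',
    Bucket='aws-glue-lab-bucket-demo',
    Key='forbes_celebrity_100/forbes_celebrity_100.csv')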

Create a connection for the database in Amazon RDS

  • Prerequisite: You must have an existing Postgres RDS instance in your AWS account.
  • In the AWS Glue console, click Add connection in the left pane.

In the dialog box, enter the connection name under Connection name and choose JDBC as the connection type. Click Next to move to the next screen.

  • Enter JDBC URL for your Postgres RDS Instance.
  • Enter user name, password and click Next
  • Review the details of the connection and then click Finish.

Your Postgres RDS instance connection is now created and can be verified with Test connection.
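The same connection can also be created with boto3 if you prefer. A rough sketch, with every value a placeholder; note that a real connection usually also needs VPC, subnet and security-group details under PhysicalConnectionRequirements.

import boto3

glue = boto3.client('glue', region_name='us-east-1')

# All values below are placeholders for illustration only.
glue.create_connection(
    ConnectionInput={
        'Name': 'postgres-rds-connection',
        'ConnectionType': 'JDBC',
        'ConnectionProperties': {
            'JDBC_CONNECTION_URL': 'jdbc:postgresql://my-rds-host:5432/mydb',
            'USERNAME': 'postgres',
            'PASSWORD': '********'}})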

Adding a crawler to populate the Data Catalog using Amazon S3 as the data source

  • On the left pane of the AWS Glue console, click Crawlers -> Add crawler
  • Enter the crawler name in the dialog box and click Next
  • Keep the settings as default
  • Keep the data store as S3
  • Specify the full path to your forbes_celebrity_100.csv file in your bucket and click Next
  • Keep the other settings as they are
  • Choose or create an IAM role for the crawler
  • Keep the other settings as they are and click Next
  • Use an existing database or create a new one to store the crawler output
  • Review all settings and click Finish to create the crawler

We have now created a crawler pointing to our forbes_celebrity_100 dataset in S3. When we run the crawler, it will crawl the dataset, capture the metadata and store it in the Data Catalog as a table.

Let’s run the crawler we just created.
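If you want to run it from code instead of the console, a small boto3 sketch looks like this (the crawler name matches the earlier placeholder; the polling loop is deliberately simplistic, with no timeout handling).

import time
import boto3

glue = boto3.client('glue', region_name='us-east-1')

glue.start_crawler(Name='forbes_celebrity_crawler')  # placeholder crawler name

# Poll until the crawler goes back to READY.
while True:
    state = glue.get_crawler(Name='forbes_celebrity_crawler')['Crawler']['State']
    if state == 'READY':
        break
    time.sleep(30)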

You should see a message on the AWS Glue console indicating that the crawler is running.

Once the crawler has finished running and its status is set to Ready, we can see our table created in the Data Catalog.

We can see the schema generated by the crawler by clicking on the table.

While writing our ETL logic, we will use aws_glue_lab_database as the database and forbes_celebrity_100_csv as the table.

Now that we have everything set up, let's write our first Glue script to perform the ETL tasks.

# Necessary imports
import sys

from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col
from pyspark.sql.types import StringType, IntegerType
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

# Any parameters passed to the Glue job can be read with this method
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
# Creating the Spark context
sc = SparkContext()
# Creating the Glue context
glueContext = GlueContext(sc)

# Creating the Glue job object
job = Job(glueContext)
# Initializing the Glue job with the provided arguments
job.init(args['JOB_NAME'], args)

usd_inr_rate = 75.88

# Extract the source S3 file as a Glue DynamicFrame using the Data Catalog.
dynf_forbes_list = glueContext.create_dynamic_frame.from_catalog(
    database='aws_glue_lab_database',
    table_name='forbes_celebrity_100_csv',
    transformation_ctx="s3_src_extract")

# Convert the Glue DynamicFrame to a Spark DataFrame
df_forbes_list = dynf_forbes_list.toDF()

# Simple transformation: convert pay from USD to INR and rename the column
# to match the column name in the Redshift table.
df_transformed_forbes_list = df_forbes_list\
    .withColumn('pay_inr_millions', col('pay (usd millions)') * usd_inr_rate)\
    .withColumnRenamed('pay (usd millions)', 'Pay_USD_millions')

# Make sure the column order in the file matches the column order in the
# Redshift table - a limitation of COPY with Parquet files.
df_transformed_forbes_list = df_transformed_forbes_list.select(
    col('Name').cast(StringType()),
    col('Pay_USD_millions').cast(IntegerType()),
    col('Pay_INR_millions').cast(IntegerType()),
    col('Year').cast(IntegerType()),
    col('Category').cast(StringType()))

# Convert the Spark DataFrame back to a Glue DynamicFrame in order to write to S3 using the Glue API.
dynf_tgt_dataset = DynamicFrame.fromDF(df_transformed_forbes_list, glueContext, 'forbes_transformed_data')

# Load (write) the transformed data to S3 using the Glue API.
glueContext.write_dynamic_frame_from_options(
    frame=dynf_tgt_dataset,
    connection_type="s3",
    connection_options={"path": 's3://*****************'},
    format="parquet",
    format_options={},
    transformation_ctx="")

job.commit()

Now that we have our ETL script ready, we create our Glue job, which will use this script to load data into S3.

Keep all other properties as they are and run the job. Once the job succeeds, we should see our data in S3.
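Running the job and checking its status can also be done with boto3; a minimal sketch with a placeholder job name:

import time
import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Placeholder job name - use whatever you named the Glue job above.
run_id = glue.start_job_run(JobName='forbes-s3-etl-job')['JobRunId']

# Poll the run until it reaches a terminal state (simplified; no timeout handling).
while True:
    status = glue.get_job_run(JobName='forbes-s3-etl-job', RunId=run_id)['JobRun']['JobRunState']
    if status in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
        break
    time.sleep(30)

print('Job finished with state:', status)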

Now that our transformed data is available in the S3 bucket, we can use the Redshift COPY command to load it into Redshift, which is the most efficient and recommended way to load data into Redshift.

Prerequisite: You should have a Redshift cluster already created.

For more details on creating and configuring a Redshift cluster, please refer to the official AWS documentation.

Now let’s create a table in Redshift which will store our data.

create table if not exists public.s3_redshift_load(
    Name varchar(100),
    Pay_USD_millions integer,
    Pay_INR_millions integer,
    Year integer,
    Category varchar(100)
);
-- We are not using any sort key or distribution key (by default it is AUTO) for this tutorial.

Once we run the above DDL, our table is ready in the Redshift cluster.

We can run queries on this table from any SQL workbench, as usual.

select * from public.s3_redshift_load;

Now that we have both the source (S3) and the target (Redshift table) ready, we can simply run the COPY command to load data into the Redshift table.

COPY public.s3_redshift_load 
FROM 'full path to s3 file'
iam_role 'arn:aws:iam::aws_account_number:role/name_of_iam_role'
format as parquet;

Once the COPY command finishes, we can run the SELECT query above again and see our data in the Redshift table.

Kudos! We have now created our first ETL pipeline, which reads data from S3, transforms it and writes it back to S3 (a kind of staging area in our pipeline); then we used the COPY command to load the data into Redshift.

We still have to develop the second pipeline, which will extract, transform and load data into Postgres, and also take a look at Athena (believe me, this is interesting). To keep this tutorial light, I will cover these in my next post, so please keep an eye out.
