Easy Serverless ETL with AWS Glue

A step-by-step guide including partitioning and cost summary

Wallace Zhang
DotsOfThought
Jul 17, 2018


AWS Glue is AWS’ serverless ETL service, introduced in early 2017 to address the problem that “70% of ETL jobs are hand-coded with no use of ETL tools”. In this article I will go through an example to show how simple it is to use AWS Glue to crawl two CSV files into partitions, convert them to Parquet format (a columnar format for query efficiency), and query them with Athena, all without a single server instance. This is a step-by-step guide with screenshots that covers everything you need to start ETL with AWS Glue, including the detailed partitioning steps and a cost summary that are missing from similar articles and talks (listed at the end).

1. Get some public data you are interested in and upload it to S3.

I got the College Scorecard data from here: Zipped CSV

I chose two years of data to show that the Glue crawler can automatically partition your data using Hive-style partitioned paths.
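For example, the upload can be scripted with boto3, writing each year’s file under a schoolyear=&lt;value&gt; prefix. This is only a minimal sketch; the bucket name, local file names, and folder layout are placeholders for whatever you choose.

```python
import boto3

s3 = boto3.client("s3")

# Upload each year's CSV under a Hive-style partition prefix (schoolyear=<value>)
# so that the Glue crawler can discover the partitions automatically.
uploads = {
    "scorecard_2014_15.csv": "rawdata/schoolyear=14-15/scorecard_2014_15.csv",
    "scorecard_2015_16.csv": "rawdata/schoolyear=15-16/scorecard_2015_16.csv",
}

for local_path, key in uploads.items():
    s3.upload_file(local_path, "my-collegedata-bucket", key)
```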

2. Create and run a crawler to crawl through the CSV files and generate a table with a schema.

Run crawler:

Glue creates a table named rawdata, using the folder name as the table name.

Glue also partitions the data using your S3 folder names. In my case, I have two folders named schoolyear=14-15 and schoolyear=15-16, so Glue generates two partitions like below (click on View partitions to see them).
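If you prefer scripting over the console, the same crawler can be created and started with boto3. A minimal sketch, assuming placeholder names for the crawler, IAM role, database, and bucket:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that points at the rawdata folder; Glue infers the schema
# and the schoolyear=... partitions from the folder layout.
glue.create_crawler(
    Name="collegedata-raw",
    Role="AWSGlueServiceRole-collegedata",  # IAM role with access to the bucket
    DatabaseName="collegedata",
    Targets={"S3Targets": [{"Path": "s3://my-collegedata-bucket/rawdata/"}]},
)

glue.start_crawler(Name="collegedata-raw")
```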

3. Create and run an ETL job.

Now that we have the table, we can create a Glue ETL job to convert the CSV files to Parquet. The job will use the table as its data source. We will specify the data target, which is another S3 folder (named transformed), where the Glue job will put the Parquet files.

You can choose which columns will be in the converted Parquet file.

You can also change the data types of the columns in the Parquet file. Click on the data type and you will be able to update it.

Review the job:

After clicking Save job and edit script, the generated Python script is ready.
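The generated script follows the usual Glue boilerplate: read the catalog table into a DynamicFrame, apply the column mappings from the wizard, and write Parquet to the target folder. Roughly, it looks like the sketch below; the database, table, column mappings, and output path are placeholders for whatever you selected.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled CSV table from the Glue Data Catalog.
source = glueContext.create_dynamic_frame.from_catalog(
    database="collegedata", table_name="rawdata"
)

# Keep (and optionally retype) the columns selected in the wizard.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("instnm", "string", "instnm", "string"),
        ("schoolyear", "string", "schoolyear", "string"),
    ],
)

# Write the result to the transformed folder as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-collegedata-bucket/transformed/"},
    format="parquet",
)

job.commit()
```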

Now go back to Jobs view and run the job there.

Click on Run job. You are prompted with this Parameters dialog. You can enable the job bookmark so that if you drop a new file into the source folder, Glue will not reprocess the old files and will only work on the new file.
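If you start the run from the API instead of the console, the same bookmark behavior is enabled through a job argument. A minimal boto3 sketch, with a placeholder job name:

```python
import boto3

glue = boto3.client("glue")

# Start the job with bookmarks enabled so already-processed files are skipped
# on subsequent runs.
response = glue.start_job_run(
    JobName="collegedata-csv-to-parquet",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
run_id = response["JobRunId"]
```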

The job is running now. Wait several minutes; it should finish with a run status of Succeeded or Error.
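When scripting, you can watch for the same outcome by polling the run state. A small sketch continuing from the start_job_run call above:

```python
import time

# Poll the run until it reaches a terminal state (continuing from run_id above).
while True:
    status = glue.get_job_run(JobName="collegedata-csv-to-parquet", RunId=run_id)
    state = status["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print("Run finished with state:", state)
        break
    time.sleep(30)
```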

If the job succeeded, it should have written several Parquet files into the transformed folder.

We have successfully converted the two CSV files into a set of files in Parquet format.

4. Create another crawler for the Parquet files so that we can query the data more efficiently than we could with the CSV files.

A new table is generated from the collegedata-parquet crawler.

5. Use Athena to query the data.

Since Athena and the Glue Data Catalog are integrated, we can query the new table directly from Athena.
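You can run the query from the console, or programmatically through the Athena API. Below is a minimal boto3 sketch; the database, table, column names, and results location are placeholders for your own setup.

```python
import time
import boto3

athena = boto3.client("athena")

# Run a query against the Parquet table.
response = athena.start_query_execution(
    QueryString="SELECT instnm, schoolyear FROM transformed LIMIT 10",
    QueryExecutionContext={"Database": "collegedata"},
    ResultConfiguration={"OutputLocation": "s3://my-collegedata-bucket/athena-results/"},
)
query_id = response["QueryExecutionId"]

# Wait for the query to reach a terminal state before fetching results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```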

6. We have completed an ETL process on some public CSV files using the latest AWS serverless technologies.

A couple of caveats. Make sure the IAM role you created has get and put permissions on the transformed folder.

Every time you run an Athena query, it generates a new CSV file as the query result. For production, you may want to cache the query execution ID in a DynamoDB table.
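One way to do that caching is to key a DynamoDB item on a hash of the query text and store the query execution ID, so a repeated query can reuse the existing result file instead of re-running. A minimal sketch, assuming a hypothetical table named athena_query_cache with a hash key query_hash:

```python
import hashlib
import boto3

dynamodb = boto3.resource("dynamodb")
cache = dynamodb.Table("athena_query_cache")  # hypothetical table, hash key "query_hash"

def cache_query_result(query_string, query_execution_id):
    # Key the cache on a hash of the SQL text so identical queries map to the same item.
    query_hash = hashlib.sha256(query_string.encode("utf-8")).hexdigest()
    cache.put_item(Item={"query_hash": query_hash, "query_execution_id": query_execution_id})

def lookup_cached_result(query_string):
    query_hash = hashlib.sha256(query_string.encode("utf-8")).hexdigest()
    item = cache.get_item(Key={"query_hash": query_hash}).get("Item")
    return item["query_execution_id"] if item else None
```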

7. Cost

You may be curious how much it costs to run this serverless ETL with Glue and Athena. Here are the cost details for the work I prepared for this blog. As you can see, most of the money goes to the Glue ETL job, which is billed by the DPU-hour. I ran the job probably 10 times to try out different things. The charge shown here covers all of those job runs and comes in under $5, which is not bad at all. The other good news is that Athena is super cheap too: it cost only one penny for all the queries I ran for this blog.

Hopefully this step-by-step guide gives you the information you need on AWS Glue and Athena so you can start trying serverless ETL yourself!

Similar articles and talks:

https://medium.com/localz-engineering/serverless-big-data-start-here-aws-glue-athena-quicksite-4c70ecac9fe3

https://github.com/aws-samples/serverless-data-analytics/tree/master/Lab3#query-the-partitioned-data-using-amazon-athena

