Exploring AWS Glue Part 1

This is the first post in a series exploring the AWS Glue service.

Published in

Acing AI

4 min read · Jan 18, 2021

There is a long and growing list of applications you can use AWS Glue for. It wouldn’t be possible to cover them all in one blog post, so we will explore them over the course of this series. I’ll get into the motivation for the series at the end of the article :)

What is it?

From its service page, AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

So… what does that mean exactly? Well, let’s break that down.

Serverless data integration service > AWS manages the underlying infrastructure, so you can focus on the goodies through either the AWS Console or the AWS API/SDK
Easy to discover, prepare, and combine > Your one-stop shop for data management in the AWS world :)
Analytics, machine learning, and application development > Glue can serve as the ‘source of truth’ data provider that multiple teams can rely on for their data needs.
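To make the API/SDK route a bit more concrete, here is a minimal sketch in Python. It assumes you would pass in a `boto3` Glue client; the helper name `list_catalog_databases` is my own invention for illustration. It relies only on the Glue `get_databases` call and its `NextToken` pagination, so it can also be exercised with a stub client.

```python
def list_catalog_databases(glue_client):
    """Return the names of all databases in the Glue Data Catalog.

    `glue_client` is assumed to behave like boto3.client("glue"):
    get_databases() returns a dict with "DatabaseList" and, when more
    pages remain, a "NextToken".
    """
    names = []
    kwargs = {}
    while True:
        page = glue_client.get_databases(**kwargs)
        names.extend(db["Name"] for db in page.get("DatabaseList", []))
        token = page.get("NextToken")
        if not token:
            return names
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token


# Example usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   print(list_catalog_databases(boto3.client("glue")))
```

The client is passed in rather than created inside the function, which keeps the sketch easy to try without touching a real AWS account.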

What can you use it for?

Its service page lists the following use cases:

Build event-driven ETL (extract, transform, and load) pipelines
Create a unified catalog to find data across multiple data stores
Create, run and monitor ETL jobs without coding
Explore data with self-service visual data preparation
Build materialized views to combine and replicate data (in preview)

The last two are the most recent additions, as of late 2020.
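As a rough idea of what the event-driven use case can look like, here is a small Python sketch that starts a Glue job run for each object in an S3 notification event, the kind of thing you might call from a Lambda function. The function name, the job name, and the `--source_path` job argument are all hypothetical; only the Glue `start_job_run` call and the S3 event layout are taken from AWS.

```python
from urllib.parse import unquote_plus


def start_job_for_s3_event(event, glue_client, job_name):
    """Start one Glue job run per S3 object in the notification event.

    `glue_client` is assumed to behave like boto3.client("glue").
    `--source_path` is a made-up job argument for illustration; the job
    script would need to read it itself.
    """
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=job_name,
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Starting one run per object keeps the sketch simple; for buckets that receive many small files you would probably want to batch instead.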

I have used AWS Glue several times. In my first encounter with it, we used it to create a serverless database to analyze our usage of AWS S3 as we worked to reduce our costs. In another case, we used it as part of our reporting pipeline, where it served as our data lake with the underlying data sitting on S3. Most recently, I played around with AWS Glue Studio to create a transformation job without writing any code, which was very neat.

Why does it exist?

To make AWS money of course!

Plus to help make our lives easier!

You can order the above as you see fit…

Machine learning and data analytics projects are very exciting and all the rage these days. At their core, they require the same starting ingredient — data!

Managing data is difficult. Putting data from multiple systems into one coherent place where your team can access it for reporting, analytics, or machine learning isn’t easy.

To better understand the value that AWS Glue provides, we need to look at what you would have done in the pre-Glue days:

Let’s keep it simple: You would have set up a server to extract data from one or more data sources (databases, third-party services via APIs, etc.). Over time, the number of data sources you needed to extract from would increase, along with the volume of data you were extracting. You would have to adapt your extraction code to handle this, while also maintaining your server. This adds up to a lot of work!

Wouldn’t it be so nice if we could get away from managing anything except the core transformations we need to do on the data we are extracting?

How does it work?

Under the hood, AWS Glue uses other AWS services to orchestrate our ETL jobs. This involves taking care of provisioning and managing the resources that are required to run our workloads.

This solves the problem highlighted in the last section. Setting up and managing ETL infrastructure is a pain point and a blocker for that exciting machine learning or data analytics project you have jumping around in your head. Being able to focus on the differentiating and exciting work because someone else is managing the underlying infrastructure is very empowering.

Thanks for reading! The motivation for starting this series was the lack of articles covering AWS Glue together with the AWS CDK. The AWS Cloud Development Kit (CDK) is a more recent open-source project from AWS that lets us write infrastructure as code in the same programming languages we use to write apps (Python, JavaScript, TypeScript, Java, .NET). Under the hood, it synthesizes our code into AWS CloudFormation templates. I wanted to explore the capabilities of AWS Glue while taking advantage of infrastructure as code to create something repeatable and easily shareable. We will get our hands dirty with code and deployments over the course of this series.
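To give a feel for where CDK code ends up, here is a small Python sketch that hand-builds the kind of CloudFormation template a Glue database boils down to. This is not the CDK API, just plain dictionaries; the function name, the logical ID `GlueDatabase`, and the database name `usage_reports` are made up for illustration. A template that CDK actually generates would also carry generated logical IDs and extra metadata.

```python
import json


def glue_database_template(database_name):
    """Build a minimal CloudFormation template declaring one Glue database."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # "GlueDatabase" is a hypothetical logical ID.
            "GlueDatabase": {
                "Type": "AWS::Glue::Database",
                "Properties": {
                    # The Data Catalog lives in the deploying account.
                    "CatalogId": {"Ref": "AWS::AccountId"},
                    "DatabaseInput": {"Name": database_name},
                },
            }
        },
    }


# Print the template as JSON, the format CloudFormation accepts.
print(json.dumps(glue_database_template("usage_reports"), indent=2))
```

Writing this by hand for every resource gets old quickly, which is a big part of the appeal of describing it in a general-purpose language instead.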
