AWS Glue in a Nutshell

Sven Leiß · awsblackbelt · Jan 17, 2023

Data pipelines are a critical component of any data-driven organization. AWS Glue is a serverless ETL (extract, transform, load) service that makes it easy to move data between various data stores. With AWS Glue, you can quickly create data pipelines that ingest, cleanse, transform, and load data from a variety of sources to a variety of destinations.

Serverless Spark Cluster (or not)

If you are coming from the big data world, your first association with Glue could be:

  • Serverless, autoscaling Apache Spark cluster
  • + Hive-like Data Catalog
  • + Crawlers to populate the Catalog

and you would not be wrong. If you already have your Spark jobs in Python or Scala, you can simply run them on AWS Glue. (Your data, of course, should be accessible: "AWS-ideally" in S3, but any JDBC-connectable source should work.)
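As a rough sketch (the bucket paths are placeholders and the argument handling follows the standard Glue job boilerplate), an existing PySpark job mostly just gets wrapped:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue bootstrap: Glue passes --JOB_NAME (plus your own arguments).
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    spark = glue_context.spark_session  # a regular SparkSession
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Your existing Spark code can stay as it is.
    df = spark.read.json("s3://my-source-bucket/raw/")
    df.filter("status = 'active'").write.parquet("s3://my-target-bucket/clean/")

    job.commit()

For many jobs, that thin wrapper is the whole migration story. But …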

…Or not…

When you hear AWS people talking about Glue, they do not emphasize the "Serverless Spark Cluster" part. One might wonder why that is. Apache Spark, with its in-memory data processing approach, is the de facto standard. It works so well, and clearly every data engineer would be happy with a managed solution. Isn't that a good selling point?

It seems that AWS is taking another approach: "Technology changes and evolves, AWS services are there to stay". Yes, they started with Spark and offered wrappers on top of DataFrames, but then they announced a Python engine?! Wait … pandas is awesome, but it is not really for big data, right? Python does not scale well on clusters. But with the announcement of AWS Glue for Ray (Preview) at the last re:Invent, you start to see where they are going. Suddenly your Python code runs on a cluster out of the box.
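To get a feeling for that direction, here is a minimal, generic Ray sketch (not a complete Glue for Ray job; the service was still in preview at the time of writing). Plain Python functions become distributed tasks:

    import ray

    # In a Glue for Ray job the cluster is provided for you;
    # locally you initialize Ray yourself.
    ray.init()

    @ray.remote
    def square(x: int) -> int:
        # Ordinary Python, executed as a distributed task.
        return x * x

    # Fan the work out across the cluster and collect the results.
    results = ray.get([square.remote(i) for i in range(100)])
    print(sum(results))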

And that is just one of the directions. They also want to be friendly to "non-data-engineers" and offer an awesome user experience.

Low-Code Approach

AWS Glue offers a graphical editor (AWS Glue Studio) that generates code in the background. So if you are a low-code lover, the only reason you would need to do any coding is to write special transformations that are not yet available as built-in steps.
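For those remaining custom pieces, the generated script can be extended by hand. A hedged sketch of a record-level custom transformation on a Glue DynamicFrame (the field names are made up, and dyf is assumed to be produced by an earlier step of the generated job):

    from awsglue.transforms import Map

    def add_full_name(record):
        # Custom logic that is not available as a built-in visual transform.
        record["full_name"] = f"{record['first_name']} {record['last_name']}"
        return record

    # Apply the function to every record of the DynamicFrame `dyf`.
    dyf_with_full_name = Map.apply(frame=dyf, f=add_full_name)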

No-Code Approach

AWS Glue DataBrew is a service on top of AWS Glue that lets people create recipes to transform data, generating full data pipelines this way while writing no code at all.
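Even though the recipes themselves are built without code, a DataBrew recipe job can still be started programmatically if needed; a hedged boto3 sketch (the job name is a placeholder):

    import boto3

    databrew = boto3.client("databrew")

    # Start a recipe job that was defined entirely in the DataBrew visual editor.
    response = databrew.start_job_run(Name="my-databrew-recipe-job")
    print(response["RunId"])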

Let's now move on to a more general, non-techy introduction.

Potential use cases for AWS Glue data pipelines

AWS Glue can be used to: …

Data Warehousing

… to prepare data for data warehouses and data marts. This can include ingesting data from various sources, cleaning and transforming the data, and loading the data into a data warehouse or data mart.

Machine Learning

… to process data in machine learning applications. This includes ingesting data, cleaning and transforming the data, and loading the data into a machine learning model.

Data Visualization

… to create pipelines for data visualization applications. This includes ingesting data, cleaning and transforming the data, and loading the data into a visualization platform.

Data Analysis

… as the data engine in data analysis applications. This includes ingesting data, cleaning and transforming the data, and loading the data into a data analysis platform.

Data Security

… as the data engine in security applications. This includes ingesting data, cleaning and transforming the data, and loading the data into a security platform.

What is AWS Glue?

Let's discuss the components of a data pipeline on AWS. An AWS Glue data pipeline consists of several building blocks.

Data Sources

These are the sources of data that will be ingested and processed by the pipeline. Examples of data sources include Amazon S3, Amazon DynamoDB, and Amazon Redshift.

Glue Jobs

An AWS Glue job is a script that performs a specific data transformation or data cleansing task. Examples include transforming data from one format to another, cleaning data, and loading data into a data warehouse.
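A hedged sketch of such a job (database, table, field, and bucket names are placeholders): it reads a table registered in the Data Catalog, drops broken records, and loads the result into S3 as Parquet.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table that a crawler registered in the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="raw_orders",
    )

    # Simple cleansing step: drop records without an order id.
    cleaned = source.filter(lambda rec: rec["order_id"] is not None)

    # Convert to Parquet and load into the curated bucket.
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": "s3://my-curated-bucket/orders/"},
        format="parquet",
    )

    job.commit()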

Glue Crawlers

Crawlers are used to discover and catalog data from various data sources. This includes identifying the structure and metadata of the data.
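Crawlers can be defined in the console or through the API; a hedged boto3 sketch (crawler name, IAM role, database, and S3 path are placeholders):

    import boto3

    glue = boto3.client("glue")

    # A crawler that scans an S3 prefix and writes tables into a Data Catalog database.
    glue.create_crawler(
        Name="orders-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="sales_db",
        Targets={"S3Targets": [{"Path": "s3://my-source-bucket/raw/orders/"}]},
    )

    glue.start_crawler(Name="orders-crawler")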

Glue Triggers

Triggers are used to schedule AWS Glue jobs to run at certain times or based on certain events.
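For example, a scheduled trigger created via boto3 (a hedged sketch; trigger name, job name, and schedule are placeholders):

    import boto3

    glue = boto3.client("glue")

    # Run the ETL job every day at 06:00 UTC.
    glue.create_trigger(
        Name="nightly-orders-trigger",
        Type="SCHEDULED",
        Schedule="cron(0 6 * * ? *)",
        Actions=[{"JobName": "orders-etl-job"}],
        StartOnCreation=True,
    )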

How to set up AWS Glue?

Ready -> Set -> Go || Catalog -> Crawler -> Job

To set up AWS Glue, you'll need to log into the AWS console and create a database in the Glue Data Catalog (a Hive-like database). Once the database exists, you can configure AWS Glue Crawlers to discover and catalog your data sources. Once the Data Catalog is populated, you can create Glue jobs to perform the data transformation and cleansing tasks needed for the pipeline. Glue jobs can be written in Python or Scala and, in the backend, run on Apache Spark or a Python engine.
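The same sequence can also be scripted. A hedged boto3 sketch of the first and last steps (database name, role ARN, job name, and script location are placeholders; creating and starting the crawler is shown above):

    import boto3

    glue = boto3.client("glue")

    # 1. The Hive-like database in the Glue Data Catalog.
    glue.create_database(DatabaseInput={"Name": "sales_db"})

    # 2. Crawlers populate the catalog (see the crawler sketch above).

    # 3. The ETL job itself, pointing at a script uploaded to S3.
    glue.create_job(
        Name="orders-etl-job",
        Role="arn:aws:iam::123456789012:role/GlueJobRole",
        Command={
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": "s3://my-scripts-bucket/orders_etl.py",
            "PythonVersion": "3",
        },
        GlueVersion="4.0",
    )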

Orchestration of Jobs

You can also use the AWS Glue console to create and manage Glue jobs, and you can configure AWS Glue triggers to run those jobs at certain times or based on certain events, for example starting one job only after another has succeeded (see the sketch below). This allows you to create a fully automated data pipeline on AWS. For more complex scenarios, Glue is well integrated into the whole AWS ecosystem, so you can use AWS Step Functions to orchestrate even the most demanding pipelines. Creating and managing data pipelines on AWS can be complex and time-consuming; with the help of AWS Glue and its components, however, you can quickly build pipelines that are reliable and secure.
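A hedged boto3 sketch of such job chaining with a conditional trigger (all trigger and job names are placeholders); for anything more involved, Step Functions takes over:

    import boto3

    glue = boto3.client("glue")

    # Start the "load" job only after the "transform" job has succeeded.
    glue.create_trigger(
        Name="transform-then-load",
        Type="CONDITIONAL",
        Predicate={
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": "orders-transform-job",
                    "State": "SUCCEEDED",
                }
            ]
        },
        Actions=[{"JobName": "orders-load-job"}],
        StartOnCreation=True,
    )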

Benefits of using AWS Glue

Cost Savings

AWS Glue is a serverless service, meaning there is no need to manage or provision servers. This can lead to significant cost savings in comparison to running on-premises ETL services.

Scalability

AWS Glue is highly scalable and can handle large volumes of data with ease.

Security

AWS Glue offers a secure environment for data pipelines. It is HIPAA eligible and compliant with other industry-specific security standards.

Speed and Efficiency

AWS Glue is designed to be fast and efficient. It processes data quickly, allowing you to stand up data pipelines in little time.

Automation

AWS Glue can be used to automate data pipelines. This includes scheduling jobs and triggers to run at certain times or based on certain events.

So what are you waiting for?

Log into AWS Console and try it out!

Sven Leiß — Senior Manager, MHP, Cloud Architecture and Development

Stanko Petkovic — Manager, MHP, Cloud Architecture and Development
