🚀AWS Glue Explained — The ETL Process Key Components

Janani Thesu Vasudevan
5 min read · Jun 19, 2023


What is Glue?

AWS Glue is a fully managed extract, transform, and load (ETL) service provided by Amazon Web Services (AWS).
Glue provides tools and features to automate the process of
🔹Data ingestion
🔹Data cleansing
🔹Schema evolution and
🔹Data transformation.
AWS Glue is also useful to organize, clean, verify, and format data in preparation for storage in a data warehouse or data lake.
Glue focuses on data preparation and ETL processes for making data analytics-ready.

Key Components of AWS Glue :

1. Data Crawler :

A Crawler scans data stores such as Amazon S3, Amazon RDS, other JDBC databases, and DynamoDB, infers the schema of the data it finds, and creates metadata tables in the Glue Data Catalog.
In short, it is responsible for automatically discovering and cataloging data sources.

♦️ How can I use Crawler in AWS Glue:
You can crawl multiple data stores in a single run with one Glue Crawler. Upon completion, the crawler creates or updates one or more tables in your Data Catalog.
You can schedule Crawlers to run periodically or trigger them manually to keep the metadata up to date.
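
To make the crawler setup above a bit more concrete, here is a minimal sketch using boto3, the AWS SDK for Python. The crawler name, IAM role ARN, database name, S3 paths, and schedule are all hypothetical placeholders, and the two helper functions are my own names, not part of AWS Glue. Only `create_crawler` and `start_crawler` are real boto3 calls, and they need AWS credentials to run.

```python
def build_crawler_request(name, role_arn, database, s3_paths, schedule=None):
    """Build the keyword arguments for the boto3 Glue create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # One crawler can cover several data stores in a single run.
        "Targets": {"S3Targets": [{"Path": path} for path in s3_paths]},
    }
    if schedule is not None:
        # Glue schedules use a six-field cron expression, evaluated in UTC.
        request["Schedule"] = schedule
    return request


def create_and_start_crawler(request):
    """Create the crawler, then run it once on demand (needs AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical names and paths, for illustration only.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_paths=["s3://example-bucket/orders/", "s3://example-bucket/customers/"],
    schedule="cron(0 2 * * ? *)",  # every day at 02:00 UTC
)
print(request)
```

Leaving out `schedule` gives you a crawler that only runs when you start it manually.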

Image Source : Amazon

2. Data Catalog :

The Glue Data Catalog is a central metadata repository that stores metadata information about various data sources, including tables, databases, and schemas.
The Data Catalog can be used by other AWS services, such as Amazon Athena and Amazon Redshift Spectrum, to access and analyze the data.

♦️ How can I use Data Catalog in AWS Glue:
The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics of your data. ETL jobs in AWS Glue reference catalog tables as their sources and targets, and you use the information in the Data Catalog to create and monitor your ETL jobs.
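
As a rough sketch of what reading the Data Catalog looks like, the snippet below pulls the location and column types out of a catalog table entry. `summarize_table` and `list_table_summaries` are hypothetical helper names, and `sample_table` is hand-written sample data in the shape that the boto3 `get_tables` call returns, not output from a real account.

```python
def summarize_table(table):
    """Pull name, location, and column types out of a Data Catalog table entry."""
    descriptor = table.get("StorageDescriptor", {})
    return {
        "name": table["Name"],
        "location": descriptor.get("Location"),
        "columns": {c["Name"]: c["Type"] for c in descriptor.get("Columns", [])},
    }


def list_table_summaries(database):
    """Summarize every table in a catalog database (needs AWS credentials)."""
    import boto3  # imported here so summarize_table works without boto3

    glue = boto3.client("glue")
    paginator = glue.get_paginator("get_tables")
    summaries = []
    for page in paginator.paginate(DatabaseName=database):
        summaries.extend(summarize_table(t) for t in page["TableList"])
    return summaries


# Hand-written sample entry (hypothetical table, bucket, and columns).
sample_table = {
    "Name": "orders",
    "StorageDescriptor": {
        "Location": "s3://example-bucket/orders/",
        "Columns": [
            {"Name": "order_id", "Type": "bigint"},
            {"Name": "amount", "Type": "double"},
        ],
    },
}
print(summarize_table(sample_table))
```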

3. Glue Studio

Glue Studio is a visual interface in AWS Glue that simplifies the design and creation of ETL workflows.
It helps ETL developers create repeatable processes to move and transform large-scale, semi-structured datasets and load them into data lakes and data warehouses.

♦️ How can I use Glue Studio in AWS Glue:
You can use AWS Glue Studio to visually build ETL workflows for data cleaning and transformation, and run them on AWS Glue.
Glue Studio provides a drag-and-drop interface for connecting data sources, transformations, and destinations, and it automatically generates Python or Scala code based on the visual workflow. You can monitor Glue Studio jobs with Amazon CloudWatch, which collects and processes the raw data from AWS Glue into metrics.

4. Glue Jobs :

A Glue Job runs a script that extracts data from different sources, transforms it, and loads it into target systems. AWS Glue triggers can start jobs based on a schedule or event, or on demand.
Jobs can also run general-purpose Python scripts (Python shell jobs).
The interesting thing about creating Glue jobs is that it can be an almost entirely GUI-based activity: with just a few button clicks you can auto-generate the necessary Python code.

♦️ How can I use Glue Jobs in AWS Glue:
Glue Jobs provide the flexibility to apply data transformations, perform data cleaning and enrichment, and load data into destinations like Amazon S3, Amazon Redshift, and others.
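
Once a job exists, you can start it on demand from code. Here is a small sketch with boto3, assuming a job has already been created. The parameter names, S3 path, and table name are hypothetical, and `build_job_arguments` and `run_job` are my own helper names. Glue job argument names are written with a leading `--`, which is what the first helper adds.

```python
def build_job_arguments(params):
    """Turn plain parameter names into Glue job arguments, which start with '--'."""
    return {f"--{key}": str(value) for key, value in params.items()}


def run_job(job_name, params):
    """Start one run of an existing Glue job and return its run ID (needs AWS credentials)."""
    import boto3  # imported here so the argument builder works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,
        Arguments=build_job_arguments(params),
    )
    return response["JobRunId"]


# Hypothetical parameters that a job script could read at run time.
arguments = build_job_arguments(
    {"source_path": "s3://example-bucket/orders/", "target_table": "orders_clean"}
)
print(arguments)
```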

5. Triggers :

Glue triggers are used to start one or more crawlers or extract, transform, and load (ETL) jobs. Using triggers, you can design a chain of dependent jobs and crawlers. You can specify constraints, such as the frequency that the jobs or crawlers run, which days of the week they run, and at what time. These constraints are based on cron.

♦️ How can I use Triggers in AWS Glue:
Triggers in AWS Glue allow you to automate the execution of Glue Jobs based on events or schedules. You can create triggers to run jobs based on time-based schedules or in response to events such as data arrival in an S3 bucket.
You can also create a conditional trigger. When you create a conditional trigger, you specify a list of jobs and a list of crawlers to watch. For each watched job or crawler, you specify a status such as succeeded, failed, timed out, and so on.
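
The two trigger styles described above can be sketched as request dictionaries for the boto3 `create_trigger` call. This is only an illustration: the trigger, job, and crawler names are hypothetical, and the helper functions are my own. The conditional trigger here fires only when every watched job and crawler has succeeded.

```python
def scheduled_trigger(name, job_name, cron_expression):
    """Request for a trigger that starts one job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, watched_jobs, watched_crawlers, job_to_start):
    """Request for a trigger that fires when all watched jobs and crawlers succeed."""
    conditions = [
        {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
        for job in watched_jobs
    ]
    conditions += [
        {"LogicalOperator": "EQUALS", "CrawlerName": crawler, "CrawlState": "SUCCEEDED"}
        for crawler in watched_crawlers
    ]
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {"Logical": "AND", "Conditions": conditions},
        "Actions": [{"JobName": job_to_start}],
        "StartOnCreation": True,
    }


def create_trigger(request):
    """Send a trigger request to AWS Glue (needs AWS credentials)."""
    import boto3  # imported here so the request builders work without boto3

    boto3.client("glue").create_trigger(**request)


# Hypothetical names: run a cleaning job nightly, then a load job once it succeeds.
nightly = scheduled_trigger("nightly-clean", "clean-orders", "cron(0 3 * * ? *)")
chained = conditional_trigger(
    "load-after-clean", ["clean-orders"], ["sales-crawler"], "load-orders"
)
print(nightly)
print(chained)
```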

6. Development Endpoints :

AWS Glue Development Endpoints provide an environment for developing, testing, and debugging Glue scripts. Development Endpoints support development tools like Jupyter notebooks and integrated development environments (IDEs).

♦️ How can I use Development Endpoints in AWS Glue:
They allow you to write and execute Glue scripts in Python or Scala interactively, using your preferred development tools, without the need to provision and manage additional infrastructure.
This makes the ETL development process more efficient and productive.

7. Glue DataBrew :

DataBrew is a visual data preparation tool that makes it easy for data scientists and data analysts to clean and normalize data and prepare it for analytics and machine learning.
It offers over 250 pre-built transformations to automate data preparation tasks, which you can choose from without writing any code.

♦️ How can I use DataBrew in AWS Glue:
With the intuitive DataBrew interface, you can interactively discover, visualize, clean, and transform raw data. DataBrew makes smart suggestions to help you identify data quality issues that can be difficult to find and time-consuming to fix.
According to AWS, it reduces the time it takes to prepare data for analytics and machine learning by up to 80% compared to traditional approaches to data preparation.

8. Glue Workflow:

A Glue Workflow is a way of automating ETL processing. It gives you a visual representation of the ETL process and lets you define, schedule, and orchestrate ETL tasks. It provides a graphical interface for creating and managing complex data pipelines, dependencies, and triggers. Each workflow can manage the execution of all components and monitor all added jobs and crawlers.

♦️ How can I use Glue Workflow in AWS Glue:
In AWS Glue, Workflows are used for creating and visualizing complex ETL activities involving multiple crawlers, jobs, and triggers.
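
To show how a workflow ties crawlers, jobs, and triggers together, here is a minimal sketch of a two-step pipeline: an on-demand trigger starts a crawler, and a conditional trigger starts a job once the crawler succeeds. The workflow, crawler, and job names are hypothetical, the helper functions are my own, and the crawler and job are assumed to exist already. The real boto3 calls used are `create_workflow`, `create_trigger`, and `start_workflow_run`.

```python
def build_workflow_plan(workflow, crawler, job):
    """Return the create_workflow and create_trigger requests for a two-step pipeline."""
    start = {
        "Name": f"{workflow}-start",
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        "Actions": [{"CrawlerName": crawler}],
    }
    after_crawl = {
        "Name": f"{workflow}-after-crawl",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        # Run the job only after the crawler has finished successfully.
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler,
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job}],
    }
    return {"workflow": {"Name": workflow}, "triggers": [start, after_crawl]}


def deploy_and_run(plan):
    """Create the workflow and its triggers, then start a run (needs AWS credentials)."""
    import boto3  # imported here so the plan builder works without boto3

    glue = boto3.client("glue")
    glue.create_workflow(**plan["workflow"])
    for trigger in plan["triggers"]:
        glue.create_trigger(**trigger)
    return glue.start_workflow_run(Name=plan["workflow"]["Name"])["RunId"]


# Hypothetical names, for illustration only.
plan = build_workflow_plan("sales-pipeline", "sales-crawler", "clean-orders")
print(plan)
```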

🥳Happy Learning Folks🥳


Hi Everyone, I'm Janani TV 😎 Welcome to My Blog 🥳 For titbits on AWS Cloud, stay connected with me on LinkedIn - Janani Thesu Vasudevan