AWS Glue ETL tool libraries…

May 10, 2024

The libraries below are commonly imported in AWS Glue ETL scripts. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. Alongside ETL jobs, Glue provides crawlers, which automatically discover and catalog metadata about your data assets in various data stores.

Here are some of the key libraries used in Glue ETL scripts:

import os

"""The os module in Python provides functions for creating and removing a 
directory (folder), fetching its contents, changing and identifying the 
current directory, etc."""

import sys
"""The sys module in Python provides various functions and variables that are 
used to manipulate different parts of the Python runtime environment."""

import boto3
"""Boto3 is a Python module that allows developers to interact with 
Amazon Web Services (AWS) resources programmatically. It provides an 
easy-to-use interface to AWS services, making it easier for developers 
to build applications that interact with AWS services."""

import zipfile
"""The zipfile module reads and writes ZIP archives. ZIP is an archive file 
format that supports lossless data compression, meaning the original data 
can be perfectly reconstructed from the compressed data."""

from io import BytesIO
"""BytesIO keeps data as bytes in an in-memory buffer that behaves like a 
file, so you can read and write it without touching the disk."""

from awsglue.job import Job
"""Job is the class used inside a Glue script to initialize the job 
(job.init) and to mark it as complete (job.commit)."""

from awsglue.transforms import *
"""It is the module that provides AWS Glue's built-in transforms, such as 
ApplyMapping, Filter, and Join."""

from awsglue.utils import getResolvedOptions
"""It is a utility function provided by AWS Glue that reads the arguments 
passed to the job (for example, JOB_NAME) from sys.argv and returns them 
as a dictionary."""

from awsglue.context import GlueContext
"""GlueContext wraps the SparkContext and provides methods for interacting 
with AWS Glue resources, such as Data Catalog tables, within your ETL scripts."""

from pyspark.context import SparkContext
"""pyspark.context is the module that defines SparkContext, which is the 
entry point to Spark functionality in PySpark."""

from pyspark.sql.functions import *
"""It is a module in PySpark that provides various built-in functions for 
manipulating and transforming data in Spark DataFrames."""

from datetime import date, time, datetime, timezone, timedelta
"""The datetime module in Python provides classes for manipulating dates 
and times."""

import pandas as pd
"""pandas is used for data manipulation and analysis, particularly for 
smaller datasets that can fit into memory."""
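To see how a couple of these libraries work together, here is a minimal sketch that uses zipfile and BytesIO to build and unpack a ZIP archive entirely in memory. The helper names (build_zip_in_memory, extract_zip_in_memory) and the sample file contents are my own assumptions for illustration; they are not part of AWS Glue or any library listed above.

```python
import zipfile
from io import BytesIO


def build_zip_in_memory(files):
    """Pack a {name: bytes} mapping into a ZIP archive held entirely in memory."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def extract_zip_in_memory(zip_bytes):
    """Return a {name: bytes} mapping for every file inside the ZIP bytes."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if not name.endswith("/")  # skip directory entries
        }


# Hypothetical sample data standing in for a zipped file fetched from a data store.
sample = {"orders.csv": b"id,amount\n1,10\n2,25\n", "readme.txt": b"demo"}
zip_bytes = build_zip_in_memory(sample)
extracted = extract_zip_in_memory(zip_bytes)
print(sorted(extracted))
```

In a real Glue job, the bytes would more likely come from an S3 object read with boto3 than from a local dictionary. Because the archive never touches the disk, this approach suits files small enough to fit in memory.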

Summary:

When using AWS Glue, it’s crucial to consider the scale and performance of your data processing tasks, as well as any specific requirements or constraints imposed by your data sources and destinations.

Happy Learning !!!

About me

I’m available on LinkedIn. For any assistance, book a slot at https://topmate.io/ganesh_r0203. Please stop by if you’d like to say ‘Hi’.

Written by R. Ganesh