AWS Glue ETL tool libraries…

R. Ganesh
2 min read · May 10, 2024


AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. Glue scripts written in Python rely on a small set of libraries, both to run ETL jobs and to create and manage crawlers, which automatically discover and catalog metadata about your data assets in various data stores.
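As a rough illustration of creating and managing a crawler from Python, the sketch below uses the boto3 Glue client's `create_crawler` and `start_crawler` calls. The crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, and the two helper functions are only an assumed way of organizing the code.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Assemble the keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request, glue_client=None):
    """Create the crawler described by `request`, then start it."""
    if glue_client is None:
        # Imported lazily so build_crawler_request works without boto3 installed.
        import boto3
        glue_client = boto3.client("glue")
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
    return request["Name"]


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
# create_and_start_crawler(request)  # needs AWS credentials and Glue permissions
```

Passing the client in as a parameter keeps the request-building logic easy to check locally; the real call is left commented out because it needs an AWS account.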

Here are some of the key libraries used in Glue scripts:

import os
"""The os module in Python provides functions for creating and removing
directories (folders), listing their contents, and getting or changing the
current working directory, among other operating-system tasks."""

import sys
"""The sys module in Python provides various functions and variables that are
used to manipulate different parts of the Python runtime environment."""

import boto3
"""Boto3 is the AWS SDK for Python. It lets developers interact with
Amazon Web Services (AWS) resources programmatically, for example to
create and start Glue crawlers or to read objects from S3."""

import zipfile
"""The zipfile module reads and writes ZIP archives. ZIP is an archive file
format that supports lossless data compression, meaning the original data
can be perfectly reconstructed from the compressed data."""

from io import BytesIO
"""BytesIO keeps binary data in an in-memory buffer that behaves like a
file object, so bytes can be read and written without touching disk."""
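To show how zipfile and BytesIO fit together, here is a minimal sketch that builds a ZIP archive entirely in memory and then reads it back. The file names and contents are made up for illustration; in a real script the bytes might come from an S3 object instead.

```python
import zipfile
from io import BytesIO


def zip_in_memory(files):
    """Write a {name: text} mapping into a ZIP archive held in memory."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def unzip_in_memory(zip_bytes):
    """Read every member of an in-memory ZIP archive back into a dict."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name).decode("utf-8") for name in archive.namelist()}


# Hypothetical sample files.
original = {"orders.csv": "id,amount\n1,9.99\n", "notes.txt": "hello"}
restored = unzip_in_memory(zip_in_memory(original))
print(restored == original)
```

Because the compression is lossless, the restored contents match the originals exactly.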

from awsglue.job import Job
"""The Job class initializes a Glue job run and commits it when the script
finishes; the commit is what lets job bookmarks record progress."""

from awsglue.transforms import *
"""The transforms module contains AWS Glue's built-in transforms for
DynamicFrames, such as ApplyMapping, Filter, and Join."""

from awsglue.utils import getResolvedOptions
"""getResolvedOptions reads the named arguments passed to the job
(for example JOB_NAME) from sys.argv and returns them as a dictionary."""

from awsglue.context import GlueContext
"""GlueContext wraps the SparkContext and provides methods for reading and
writing data as DynamicFrames within your ETL scripts."""

from pyspark.context import SparkContext
"""pyspark.context provides the SparkContext class, which is the entry
point to Spark functionality in PySpark."""

from pyspark.sql.functions import *
"""It is a module in PySpark that provides various built-in functions for
manipulating and transforming data in Spark DataFrames."""

from datetime import date, time, datetime, timezone, timedelta
"""The datetime module in Python provides classes for manipulating dates
and times."""
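One common use of datetime in ETL scripts is working out which day's data to process. The sketch below is only an illustration: the `year=/month=/day=` layout and the function name are assumptions, not something AWS Glue requires.

```python
from datetime import datetime, timedelta, timezone


def previous_day_partition(now):
    """Return a 'year=YYYY/month=MM/day=DD' path for the day before `now`, in UTC."""
    previous = now.astimezone(timezone.utc).date() - timedelta(days=1)
    return f"year={previous:%Y}/month={previous:%m}/day={previous:%d}"


# A fixed, timezone-aware timestamp keeps the example reproducible.
run_time = datetime(2024, 5, 10, 0, 30, tzinfo=timezone.utc)
print(previous_day_partition(run_time))  # year=2024/month=05/day=09
```

Converting to UTC before taking the date avoids off-by-one-day surprises when the script runs in a different time zone.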

import pandas as pd
"""pandas is used for data manipulation and analysis, particularly for
smaller datasets that can fit into memory."""
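Put together, the Glue-specific imports above usually appear in a job script along these lines. This is a minimal sketch rather than a complete job: the database and table names are hypothetical, and the Glue and PySpark imports sit inside the function only so the file can still be loaded on a machine without the Glue libraries.

```python
import sys


def run_job(argv):
    """Minimal Glue job skeleton: resolve arguments, set up contexts, commit."""
    # These imports resolve inside the AWS Glue runtime (or a local
    # environment that has the Glue libraries installed).
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Read the --JOB_NAME argument that Glue passes to the script.
    args = getResolvedOptions(argv, ["JOB_NAME"])

    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Hypothetical catalog names; use a database and table that a crawler has cataloged.
    frame = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders"
    )
    print(f"Row count: {frame.count()}")

    # Commit so that job bookmarks, if enabled, record this run.
    job.commit()


# In a Glue job script, the last line would be:
# run_job(sys.argv)
```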

Summary:

When using AWS Glue, it’s crucial to consider the scale and performance of your data processing tasks, as well as any specific requirements or constraints imposed by your data sources and destinations.

Happy Learning !!!

About me

I’m available on LinkedIn. For any assistance, book a slot at https://topmate.io/ganesh_r0203. Please stop by if you’d like to say ‘Hi’.
