Amazon and Boto: AWS Glue

Hamza Khan
3 min read · Nov 11, 2022


This image was designed by my wife, referencing boto3 and AWS.

The Boto library was named after the boto, a freshwater dolphin native to the Amazon River. The name was chosen by Mitch Garnaat, the author of the original Boto library, as a reference to the company, Amazon. The botocore package went on to become the foundation of the AWS CLI and Boto3.

It was named after the fresh water dolphin native to the Amazon river. I wanted something short, unusual, and with at least some kind of connection to Amazon. Boto seemed to fit the bill 8^).

Mitch Garnaat

Boto3 is the AWS SDK for Python. It lets developers call the public endpoints of various AWS services, one of which is AWS Glue. AWS Glue provides a workflow feature that orchestrates ETL scripts by connecting Glue objects such as triggers, jobs and crawlers. Its visual drag-and-drop components for these objects minimize the coding effort required to create orchestration flows.

The image shows the graph of a very basic workflow on the AWS Glue console.

However, the Boto3 Glue client is also useful for developing orchestration scripts for Glue jobs and their supporting crawlers. Orchestrating the ETL workload programmatically with this client opens up many possibilities because of the flexibility it provides.
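As a minimal sketch of what such an orchestration script might look like, the function below runs a crawler, waits for it to finish, and then starts a job. The helper name `crawl_then_run`, its polling parameters, and the crawler and job names are hypothetical; `glue` is assumed to be a client created with `boto3.client("glue")`, and the crawler is assumed to report the state `READY` once it has finished.

```python
import time


def crawl_then_run(glue, crawler_name, job_name, poll_seconds=30, max_polls=120):
    """Run a crawler to completion, then start a Glue job and return its JobRunId.

    `glue` is assumed to be a boto3 Glue client, i.e. boto3.client("glue").
    """
    glue.start_crawler(Name=crawler_name)

    # Assumption: a finished crawler goes back to the READY state.
    for _ in range(max_polls):
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":
            break
        time.sleep(poll_seconds)
    else:
        # The loop ran out of polls without ever seeing READY.
        raise TimeoutError(f"crawler {crawler_name!r} did not finish in time")

    response = glue.start_job_run(JobName=job_name)
    return response["JobRunId"]
```

With valid AWS credentials, a call such as `crawl_then_run(boto3.client("glue"), "my-crawler", "my-job")` would return the ID of the new job run. Passing the client in as a parameter keeps the function easy to exercise with a stand-in object.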

This client exposes many methods for starting, stopping and monitoring AWS Glue jobs and for tracking their execution history. The following methods are the most basic ones:

A code excerpt wrapping these methods can be found in the script linked from GitHub.
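As a rough illustration only (this is not the linked script), starting, stopping, monitoring and tracking history plausibly map onto the Glue client calls `start_job_run`, `batch_stop_job_run`, `get_job_run` and `get_job_runs`. The wrapper names below are hypothetical, and `glue` is again assumed to be `boto3.client("glue")`.

```python
def start_job(glue, job_name, arguments=None, worker_type=None, number_of_workers=None):
    """Start a run of `job_name` and return its JobRunId."""
    params = {"JobName": job_name}
    if arguments:
        # By convention, Glue job parameters are passed with a leading "--".
        params["Arguments"] = {
            f"--{key.lstrip('-')}": str(value) for key, value in arguments.items()
        }
    if worker_type:
        params["WorkerType"] = worker_type
    if number_of_workers:
        params["NumberOfWorkers"] = number_of_workers
    return glue.start_job_run(**params)["JobRunId"]


def stop_job(glue, job_name, run_ids):
    """Request a stop for the given runs and return the IDs Glue accepted."""
    response = glue.batch_stop_job_run(JobName=job_name, JobRunIds=list(run_ids))
    return [item["JobRunId"] for item in response.get("SuccessfulSubmissions", [])]


def job_state(glue, job_name, run_id):
    """Return the JobRunState of a single run."""
    return glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]


def job_history(glue, job_name):
    """Yield every recorded run of a job, following NextToken pagination."""
    kwargs = {"JobName": job_name}
    while True:
        page = glue.get_job_runs(**kwargs)
        yield from page.get("JobRuns", [])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
```

For example, `start_job(glue, "my-job", {"day": "2022-11-11"})` would send `{"--day": "2022-11-11"}` as the run's arguments, and `list(job_history(glue, "my-job"))` would collect the full execution history as a list of dictionaries.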

The above methods take keyword arguments and return a dictionary as a response. Most of the parameters and response fields are shared across the methods, and understanding them requires some familiarity with AWS Glue itself. The service is a serverless platform in which a PySpark-based ETL flow can be created via a script, a notebook or a visual interface. The most common parameters and fields are:

  • JobName: The name of the script/notebook/job to use.
  • JobRunId: The string ID of an execution of a particular job, uniquely identifying that run, e.g. jr_<alphanumeric string>.
  • Arguments: A dictionary of arguments passed to a particular run.
  • WorkerType: At the time of writing, AWS Glue provides four worker types, including Standard, G.1X and G.2X. Each worker type maps to a specification comprising memory, processing power, disk and other resources.
  • JobRunState: The current state of the job run, which is essential for monitoring multiple jobs and for keeping the job execution history and related metadata. The possible states are ‘STARTING’, ‘RUNNING’, ‘STOPPING’, ‘STOPPED’, ‘SUCCEEDED’, ‘FAILED’, ‘TIMEOUT’, ‘ERROR’, ‘WAITING’.
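To show how JobName, JobRunId and JobRunState fit together when monitoring, here is a small polling sketch. The helper name `wait_for_run`, the polling defaults, and the choice to treat ‘STARTING’, ‘RUNNING’, ‘STOPPING’ and ‘WAITING’ as still in flight are assumptions made for illustration; `glue` is assumed to be `boto3.client("glue")`.

```python
import time

# Assumption: these states mean the run is still in flight; any other state is final.
ACTIVE_STATES = {"STARTING", "RUNNING", "STOPPING", "WAITING"}


def wait_for_run(glue, job_name, run_id, poll_seconds=30, max_polls=240):
    """Poll get_job_run until the run leaves the active states; return the JobRun dict."""
    for _ in range(max_polls):
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if run["JobRunState"] not in ACTIVE_STATES:
            return run
        time.sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} of {job_name!r} still active after {max_polls} polls")
```

The returned dictionary carries the final JobRunState, so a caller can branch on ‘SUCCEEDED’ versus ‘FAILED’, ‘TIMEOUT’ or ‘ERROR’ and store the result as part of the execution history.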

These methods open up more possibilities than can be covered in a single piece of writing, and we will attempt to dive further into this ever-flowing Amazon River. FYI, the Amazonian dolphins, or botos, are born grey and become pinker with age. This is because, as the dolphin grows, its skin becomes more translucent, allowing the blood to show through. In the end, translucency of the Boto3 methods is our main objective.
