Multithreading/Parallel Jobs in AWS Glue

Vikas Singh
Published in Analytics Vidhya
3 min read · Sep 21, 2020

In an AWS-based data lake, AWS Glue and EMR are widely used services for ETL processing. AWS Glue is a specialized ETL service with various components that help us build a robust ETL system.

In this article, I would like to explain a multithreading approach in an AWS Glue job to process data faster.

This article assumes that you are aware of the basics of the following technologies:

1 — AWS Glue

2 — Spark

3 — Python

ETL processing will always have data-wrangling jobs that read staged data, perform validations and lookups, and write the result to the target.

Sometimes a job reads the stage, performs some data enrichment and validation, and then uses this half-cooked data in two different target-processing flows, as depicted below:

Using partially cooked data for two different targets

More often than not this could turn into a performance bottleneck. This could be handled by one of the following approaches:

1 — Create two jobs, one for each target, and perform the repeated common tasks in both. These can run in parallel, but repeating the work is inefficient.

2 — Split the job into three: the first performs all the common tasks and stages the data, and the other two execute the target-specific tasks. This increases I/O and partially takes us back to the MapReduce era.

3 — Use multithreading to submit two jobs in parallel to Spark.

This article discusses the third approach, which I have used, and it works like a charm.
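The third approach can be sketched with Python's standard `concurrent.futures` module. This is a minimal sketch, not the exact code from the job: `run_parallel`, `load_target_a`, and `load_target_b` are hypothetical names, and the two functions stand in for whatever Spark actions write each target.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(tasks):
    """Run each zero-argument callable in its own thread and collect results.

    `tasks` maps a label to a callable. In a Glue job each callable would
    trigger a Spark action (for example a DataFrame write), so both Spark
    jobs are submitted to the same SparkContext at the same time.
    """
    with ThreadPoolExecutor(max_workers=len(tasks)) as executor:
        futures = {name: executor.submit(fn) for name, fn in tasks.items()}
        # .result() re-raises any exception from the worker thread, so a
        # failed target load surfaces instead of passing silently.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for the two target-specific loads.
def load_target_a():
    return "target_a loaded"


def load_target_b():
    return "target_b loaded"


results = run_parallel({"target_a": load_target_a, "target_b": load_target_b})
print(results)
```

Because the common enrichment and validation steps run once before the threads start, only the target-specific work is duplicated across threads.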

The Spark context is thread safe, and setLocalProperty can be used to set thread-local configurations.

For demo purposes, I will use the stage and dim tables below:
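As an illustration of the thread-local setting, the helper below routes the Spark jobs started by one thread to a named scheduler pool. It is a sketch under assumptions: `run_in_pool` is a hypothetical helper, `sc` is expected to be a PySpark `SparkContext`, and the pool name is whatever the job defines.

```python
def run_in_pool(sc, pool_name, job):
    """Run `job` on the current thread with its Spark jobs sent to `pool_name`.

    `sc` is expected to be a pyspark SparkContext. The property set with
    setLocalProperty applies to jobs submitted from the calling thread.
    """
    sc.setLocalProperty("spark.scheduler.pool", pool_name)
    try:
        return job()
    finally:
        # Passing None clears the property, so later jobs on this thread
        # fall back to the default pool.
        sc.setLocalProperty("spark.scheduler.pool", None)
```

Inside a worker thread this could be called as `run_in_pool(sc, "pool_a", write_target_a)`, where `pool_a` and `write_target_a` are placeholder names. How Python threads map to JVM threads varies between PySpark versions, so it is worth confirming in the job logs that each job lands in the intended pool.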

Stage and dim sample data
Target tables

To run Spark in FAIR mode, we need to set spark.scheduler.mode to FAIR and supply fairscheduler.xml, the pool configuration file, through the job's referenced files path.

fairscheduler.xml is a must; otherwise the Glue container will skip the scheduler setting.

I tried without it and got the log below in the job.

Below is the sample code to configure:
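The original listing is not reproduced here, so the following is only a sketch of what such a configuration might look like. `apply_fair_scheduler` and `create_contexts` are hypothetical helpers, and the relative path given for `spark.scheduler.allocation.file` is an assumption about where Glue places referenced files; it should be verified against the job's environment.

```python
FAIR_SCHEDULER_SETTINGS = {
    # Switch the in-application scheduler from the default FIFO to FAIR.
    "spark.scheduler.mode": "FAIR",
    # ASSUMPTION: the referenced file is available under this relative path.
    "spark.scheduler.allocation.file": "fairscheduler.xml",
}


def apply_fair_scheduler(conf, settings=FAIR_SCHEDULER_SETTINGS):
    """Copy the settings onto a SparkConf-like object and return it."""
    for key, value in settings.items():
        conf.set(key, value)
    return conf


def create_contexts():
    """Build the Spark and Glue contexts; only runs inside a Glue runtime."""
    from pyspark import SparkConf, SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate(conf=apply_fair_scheduler(SparkConf()))
    return sc, GlueContext(sc)
```

The settings have to be in place before the Spark context is created, which is why the sketch builds the `SparkConf` first and passes it to `SparkContext.getOrCreate`.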

I used the fairscheduler.xml below. It defines two pools (queues) with different weights.

Glue will create three pools: the two configured ones and one default. This can be seen in the logs below. It assigns the two jobs to different pools, which allows both jobs to be processed in parallel.
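The file itself is not reproduced here. As a rough illustration of the format Spark expects, the sketch below holds a two-pool allocation file in a Python string and reads the weights back. The pool names `pool_a` and `pool_b`, the weights, and the `pool_weights` helper are all hypothetical, not the values used in the original job.

```python
import xml.etree.ElementTree as ET

# Hypothetical pool names and weights in Spark's fair scheduler file format.
FAIR_SCHEDULER_XML = """<?xml version="1.0"?>
<allocations>
  <pool name="pool_a">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>1</minShare>
  </pool>
  <pool name="pool_b">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>1</minShare>
  </pool>
</allocations>
"""


def pool_weights(xml_text):
    """Return a {pool name: weight} mapping parsed from the allocation file."""
    root = ET.fromstring(xml_text)
    return {
        pool.get("name"): int(pool.findtext("weight"))
        for pool in root.findall("pool")
    }


print(pool_weights(FAIR_SCHEDULER_XML))
```

A pool with weight 2 is given roughly twice the share of cluster resources of a pool with weight 1 when both have jobs running.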

“Being a student is easy. Learning requires actual work.”
— William Crawford

Please share your thoughts on this. Happy learning.
