Managing AWS Glue Costs and DPU Capacity with Glue Job Metrics | SailPoint

Author: Manoj Guglani

What is AWS Glue?

According to the AWS Glue Developer Guide, “AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams.” AWS Glue is serverless, so there’s no infrastructure to set up or manage. For more details, refer to the AWS Glue documentation.

What is a Glue Job?

A job is the business logic that performs the extract, transform, and load (ETL) work in AWS Glue. When you start a job, AWS Glue runs a user-provided script that extracts data from sources, transforms the data, and loads it into targets.

We use Glue to ingest cloud activity data from a customer’s governed account for analysis in our Cloud Access Manager. The data goes through transformation and aggregation. The processed data is stored in an S3 bucket for consumption by other applications like our machine learning pipeline.

Managing AWS Glue Costs

With AWS Glue, you only pay for the time your ETL job takes to run. You are charged an hourly rate, with a minimum of 10 minutes, based on the number of Data Processing Units (DPUs) used to run your ETL job. A single DPU provides 4 vCPU and 16 GB of memory. DPU count is a configuration parameter that you specify when you create and run a job. There are two types of jobs in AWS Glue: Apache Spark and Python shell. In this post we will focus on Apache Spark jobs.
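The billing model above is easy to sketch in code. The following is a minimal estimate, assuming an illustrative rate of $0.44 per DPU-hour (check the current AWS pricing page for your region) and the 10-minute minimum described above:

```python
def glue_job_cost(num_dpus, runtime_minutes, rate_per_dpu_hour=0.44,
                  minimum_minutes=10):
    """Estimate the cost of a single Glue Spark job run.

    The $0.44/DPU-hour rate is illustrative, not authoritative;
    billing applies a 10-minute minimum per run.
    """
    billed_minutes = max(runtime_minutes, minimum_minutes)
    return num_dpus * (billed_minutes / 60) * rate_per_dpu_hour

# A 5-minute run with 2 DPUs is billed for the 10-minute minimum:
print(round(glue_job_cost(2, 5), 4))   # 2 * (10/60) * 0.44 ≈ 0.1467
```

Note that a job finishing under the 10-minute minimum pays for capacity it never used, which is why short jobs favor fewer DPUs.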

For SailPoint’s SaaS platform, based on the amount of data we needed to process, we estimated the number of DPUs to configure for the Glue jobs. To monitor cost and plan for DPU capacity in the future, we enabled Glue job metrics. We regularly monitor the job metrics to plan for Glue scale and manage the associated costs.
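Job metrics are enabled per job via the `--enable-metrics` job argument (an empty-string value). A sketch of a job definition with metrics turned on, where the job name, role ARN, and script location are placeholders:

```python
def build_job_definition(name, role_arn, script_location, num_dpus=10):
    """Assemble a Glue Spark job definition with job metrics enabled.

    The --enable-metrics default argument (empty string value) is
    what turns on the per-job CloudWatch metrics discussed here.
    All names passed in are placeholders, not real resources.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {"Name": "glueetl",  # Apache Spark job type
                    "ScriptLocation": script_location},
        "DefaultArguments": {"--enable-metrics": ""},
        "MaxCapacity": num_dpus,  # DPUs allocated to the job
    }

# Typical usage with boto3 (requires AWS credentials):
# glue = boto3.client("glue")
# glue.create_job(**build_job_definition(
#     "cloud-activity-etl", "arn:aws:iam::123456789012:role/GlueJobRole",
#     "s3://my-bucket/scripts/etl.py"))
```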

We have found two key areas where the job metrics have come in handy:

  • Arriving at an ideal number of DPUs for your jobs. If your job takes less than 10 minutes to run, you are better off using a small number of DPUs. Here is a table with a cost comparison for different run times and DPU configurations:

Here is a graph comparing such a run between 2 DPUs and 10 DPUs, where the additional DPUs are not used by the job. Given the higher cost, you are better off staying with 2 DPUs:

The Job Execution metrics give you the insight needed to choose the right number of DPUs.

As the scale of data grew for our customers, the first problem we hit was straggler tasks. After onboarding a new account, we discovered that a job was taking 10+ hours to run to completion. The job was configured with 2 DPUs: one DPU was dedicated to the application container, and the second to executors, with 2 active executors per DPU. The Job Execution metrics give a view of the number of active executors at a given time, the number of completed stages for a job, and the maximum number of needed executors.
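These job metrics land in CloudWatch under the `Glue` namespace, so they can be queried programmatically as well as viewed in the console. Below is a sketch of the query parameters for the active-executors metric; the metric and dimension names follow AWS's published Glue job metric names, while the job name and run id are placeholders:

```python
import datetime

def executor_metric_query(job_name, job_run_id):
    """Build CloudWatch query parameters for the active-executors
    job metric over the last 12 hours, at 1-minute resolution."""
    now = datetime.datetime.utcnow()
    return {
        "Namespace": "Glue",
        "MetricName": "glue.driver.ExecutorAllocationManager"
                      ".executors.numberAllActiveExecutors",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": job_run_id},
            {"Name": "Type", "Value": "gauge"},
        ],
        "StartTime": now - datetime.timedelta(hours=12),
        "EndTime": now,
        "Period": 60,
        "Statistics": ["Maximum"],
    }

# Typical usage (requires AWS credentials):
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.get_metric_statistics(**executor_metric_query("my-job", "jr_..."))
```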

For this job we had 500,000 objects to read. The total size of the data was about 1.7 GB, so on average each object was smaller than 4 KB. With such tiny objects, the overhead of making a separate HTTP request to S3 for each object was significant. As an estimate, if the round-trip latency to fetch a single object is 100 ms, then with 2 executors you can read approximately 20 objects per second. Reading all the objects would therefore take 500,000 / 20 = 25,000 seconds, or about 7 hours, before any further processing. We increased the job to 40 DPUs, and execution time dropped to under an hour: with 40 DPUs, one serves as the application master, leaving 78 executors. This significantly increased the parallelism of our application, which we can visualize through the job metrics. The figures below show five job metrics for an actual run with 40 DPUs; the X-axis shows job execution time and the Y-axis shows the different metrics:
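The back-of-the-envelope arithmetic above generalizes easily. A minimal sketch, assuming each executor issues one GET at a time and every request takes the same round-trip latency (real throughput also depends on connection reuse and S3-side parallelism):

```python
def estimated_read_hours(num_objects, round_trip_ms, num_executors):
    """Rough wall-clock hours to read many tiny S3 objects serially
    per executor, given a fixed per-request round-trip latency."""
    objects_per_second = num_executors * (1000 / round_trip_ms)
    return num_objects / objects_per_second / 3600

# 500,000 objects, 100 ms per GET, 2 executors -> ~7 hours:
print(round(estimated_read_hours(500_000, 100, 2), 1))   # 6.9
# With 78 executors (40 DPUs, one reserved for the driver):
print(round(estimated_read_hours(500_000, 100, 78), 2))  # 0.18
```

The same math also suggests the alternative fix of compacting many tiny objects into fewer, larger ones, since the bottleneck is per-request overhead rather than bytes transferred.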

The second problem we saw at SailPoint appeared as both the number of customers and the size of their cloud environments grew significantly. We had a job that ran every hour and used to finish in less than 10 minutes. Recall from earlier that DPU charges carry a 10-minute minimum. We were running this job with 2 DPUs, as the need for more was simply not there. In Q1 2020 we saw a spike in run time, with some runs taking 2+ hours. Having job metrics enabled came in handy: the ETL data movement metrics and the Job Execution metrics made it clear that we were processing a lot more data, and the Job Execution metrics gave us a recommendation for the maximum number of needed executors. Let’s look at the cost analysis in the table below:
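Translating the "maximum needed executors" metric into a DPU setting follows directly from the mapping above: 2 executors per DPU, plus one DPU reserved for the application master. A sketch, where the example metric value of 18 is illustrative rather than taken from our actual runs:

```python
import math

def suggested_dpus(max_needed_executors, executors_per_dpu=2):
    """Convert the maximum-needed-executors job metric into a DPU
    setting: ceil(executors / 2) worker DPUs plus 1 for the driver."""
    return math.ceil(max_needed_executors / executors_per_dpu) + 1

# If the metric peaks at 18 executors, 10 DPUs covers the demand:
print(suggested_dpus(18))  # 10
```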

Based on this, we chose a 10 DPU configuration as the best tradeoff between job run time and Glue cost. As the table shows, in this case increasing the number of DPUs also brought our cost down.

Cloud providers are giving us great tools that significantly bring down the cost and time required to set up infrastructure. This ease does not remove the need for planning and monitoring on the part of the organization leveraging that infrastructure. At SailPoint, we use the job metrics available with AWS Glue to monitor current cost and scale, as well as to plan for growth.


AWS Glue Developer Guide

AWS Pricing Calculator

Originally published on June 9, 2020.
