The ‘Small Files Problem’ in AWS Glue

Leah Tarbuck
5 min readJul 23, 2020

--

AWS Glue is a pay-as-you-go serverless extract, transform and load (ETL) tool, using Apache Spark under the covers to perform distributed processing.

AWS Glue takes care of the provisioning and management of the resources required to perform the ETL for you, so you only really need to worry about the code… right? That being said… when you see nasty out of memory exceptions thrown by the driver, you need to understand a bit about how Spark works in order to resolve them.

In this very simple scenario, imagine we’ve written some ETL code in a Glue job which reads files from an S3 bucket, performs some file conversion and writes files back out to S3. We execute the job…

The Glue console provides you with running metrics and memory profiles for each job you create and execute, which helps you spot job abnormalities and performance issues like the one below!

You can see the memory of the driver (blue line), exceeds the threshold of 50 percent and once it reaches 100, the job fails with a OOM and is killed. The executors (green line) haven’t started to use any memory yet.

In this instance, the transformation performed in the ETL isn’t the issue here. There is no data movement and the tasks haven’t been distributed to the executors yet. So what’s going on?

Spark Driver

The Spark Driver is the heart of a spark application, it runs the main method, creates the SparkContext, converts the user program into units of physical execution (tasks) and distributes/schedules these across the executors. It does this by initially creating a DAG (directed acyclic graph) or execution plan (job). It also communicates with the Cluster Manager (e.g. YARN) and requests for resources to launch the executors.

The main problem here is that Spark will make many, potentially recursive, calls to S3’s list(). This method is very expensive for directories with a large number of files. The driver keeps track of metadata for each file it reads and keeps this in memory.

In our case, we had millions of small files sitting in our S3 bucket waiting to be read, this resulted in the driver quickly exhausting it’s 10G of memory (which is the maximum you can provide the driver when using Standard Worker nodes, more on this below!).

Solutions?

We had hit the ‘Small Files Problem’ in AWS Glue in a very big way! Here are a few things we investigated to try to rectify this issue.

Glue’s Dynamic Frames

A recommendation by AWS (documented here) was to use Glue’s Dynamic Frame grouping option on the Spark read. This grouping is automatically enabled when you’re reading more than 50,000 files from S3. It allows the driver to keep track of a group of files, rather than recording memory for each individual file. All implementation examples we found online were written in python, we used the following to create Dynamic Frames in Scala:

glueContext
.getSourceWithFormat(
connectionType = "s3",
options = JsonOptions(Map("paths" -> s3Paths, "groupFiles" -> "inPartition", "useS3ListImplementation" -> true)),
format = "xml",
formatOptions = JsonOptions(Map("rowTag" -> "our-row-tag"))
)
.getDynamicFrame()

Where ‘s3Paths’ were a list of S3 Object keys to iterate though.

When you set ‘useS3ListImplementation’ to True, Glue doesn’t cache the list of files in memory all at once, instead it caches in batches. So the driver is less likely to run out of memory.

However, with further reading, we realised this would not be suitable for the majority of our datasets:

“With AWS Glue grouping enabled, the benchmark AWS Glue ETL job could process more than 1 million files using the standard AWS Glue worker type.”

Therefore this would still not provide a long-term solution for us, we were exceeding this for many of our datasets, in some cases we had over 20 million files to process!

Changing the Glue Worker Types

Another option, was to provide the driver with much more memory. Currently, Glue provides three different worker types:

  • Standard — the default worker type, maps to 1 DPU (Data Processing Unit). A Standard worker has 16 GB memory (for the executors), 4 vCPUs of compute capacity and 50GB of attached EBS storage with two Spark executors. The maximum amount of driver memory you can provide is 10GB.
  • G.1X — maps to 1 DPU. This worker consists of 16 GB memory, 4 vCPUs, and 64GB of attached EBS storage with one Spark executor. The maximum amount of driver memory you can provide is 10GB.
  • G.2X — maps to 2 DPUs and allocates twice as much memory, disk space, and CPUs as G.1X worker type with one Spark executor. The maximum amount of driver memory you can provide is 20GB.

In this case, we’re not interested in what memory the executors have, only the driver, so the only option available which could provide some benefit was the G2.X worker.

It is worth noting that running a G.2X worker costs twice as much as a Standard worker (providing you run the jobs for the same amount of time and are using the same number of executors). As AWS charges $0.44 per DPU per hour (with a 10 minute minimum charge) and a G.2X worker uses 2 DPU’s compared to a Standard worker’s 1 DPU. (Documentation here).

Although this provided a short term fix, here we’re configuring vertical scaling, not horizontal, which goes against the principle of distributed processing.

Concatenating the Files

A more optimal solution (and one which we opted for), was to write another process which concatenated the small files into larger ones before they reached the source S3 bucket. To stay consistent with the serverless architecture approach, concatenation was performed via a Lambda before Glue processed the files. The workflow was orchestrated by an AWS Step Function.

There were multiple advantages to using this final approach:

  • Fewer writes/reads to S3 resulting in a substantial reduction in cost.
  • The driver would not need to keep track of so many small files in memory, so no OOM errors!
  • Reduction in ETL job execution times (Spark is much more performant when processing larger files).

This provided a more long-term solution compared to increasing the driver’s memory or using Glue’s Dynamic Frames.

Although AWS Glue hides a lot of Spark’s processing complexity, sometimes you have to go through the pain of understanding what’s hidden in order to reap its full potential.

I hope this has helped others in their development! :)

--

--

Leah Tarbuck

Software Engineer at BlackCat Technology Solutions, likes cats, running and yoga :)