Partitioning Challenge on AWS with S3, Glue, and Athena

Erdem Çulcu
MobileAction Technology
4 min read · Dec 1, 2022

We’ll continue the challenge from our previous article with a different perspective: this time we will talk about the partitioning problem that forced the system to change shape.

What is partitioning?

In S3, we can use folders as partitions to group our data into physical entities. In the Glue Data Catalog or Hive Metastore, however, partitions are logical entities with metadata properties.

S3 Folder Structure

Why is partitioning important?

Optimizing query performance is important when using data warehouses and data lakes built on S3 instead of an RDBMS.

In our environment on AWS, we are using Athena to query our data on S3. By partitioning our data, we can restrict the amount of data scanned by each query, thus improving performance and reducing the cost.

In S3, you can partition your data by any key. A common practice is to partition the data based on time, often leading to a multi-level partitioning scheme. For example, a customer who has data coming in every hour might decide to partition by year, month, date, and hour.
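As an illustrative sketch (the table name, columns, bucket, and file format below are placeholders, not our actual schema), such a multi-level scheme could be declared in Athena like this:

-- Hypothetical table partitioned by year, month, day, and hour.
CREATE EXTERNAL TABLE events (
  event_id STRING,
  payload STRING
)
PARTITIONED BY (
  year STRING,
  month STRING,
  day STRING,
  hour STRING
)
STORED AS PARQUET
LOCATION 's3://DOC-EXAMPLE-BUCKET/events/';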

Athena And Partitions

Athena can use Apache Hive style partitions, whose data paths contain key-value pairs connected by equal signs (for example, country=us/... or year=2021/month=01/day=26/...). Thus, the paths include both the names of the partition keys and the values that each path represents.

To load new Hive partitions into a partitioned table, you can use the MSCK REPAIR TABLE command, which works only with Hive-style partitions.
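For example, assuming a hypothetical table named events whose data lands under Hive-style prefixes, loading the new partitions is a single statement:

-- Scan the table location and register any Hive-style partitions not yet in the catalog.
MSCK REPAIR TABLE events;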

Athena can also use non-Hive style partitioning schemes. For example, CloudTrail logs and Kinesis Data Firehose delivery streams use separate path components for date parts such as data/2021/01/26/us/6fc7845e.json.

For such non-Hive style partitions, you can use ALTER TABLE ADD PARTITION to add the partitions manually.
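As a sketch for the non-Hive style path above (the table name, partition keys, and bucket are placeholders and must match your own table definition):

-- Register one partition manually by mapping partition values to an S3 prefix.
ALTER TABLE cloudtrail_logs ADD IF NOT EXISTS
  PARTITION (year = '2021', month = '01', day = '26', country = 'us')
  LOCATION 's3://DOC-EXAMPLE-BUCKET/data/2021/01/26/us/';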

A Glue Crawler can also be used to add new partition metadata when the Glue Data Catalog is used. It scans all partitions on S3 and adds the metadata of new partitions to the catalog.

Partition Limits

Amazon Web Services supports partitioning your data, but almost all of the tools have limits on the number of partitions.

Kinesis Data Firehose: 500 active partitions in a buffer by default; the limit can be raised to 5,000 through AWS Support. We talked about this in the previous article.

Glue Data Catalog: the maximum number of partitions per table is 10,000,000 in each supported Region.

Our Usage and Challenges with Partitioning

In the first iteration, we set up three Kinesis Data Firehose delivery streams with different partition keys and different S3 locations. The limits were not enough for a bucket partitioned daily over a wide date range. This was the first challenge with partitioning. The partition keys were very important for our queries, so we could not give them up; we had to find a different solution to this problem.

Partitioning with Glue Spark

The solution was a Glue Spark job that, in a second iteration, partitions the data according to our keys and copies it to another S3 bucket. We set the partition keys to date and customer id and run scheduled Spark jobs to prepare the data for our needs.
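The job itself is out of scope here, but a rough Spark SQL sketch of the idea might look like the following, assuming hypothetical raw_events and partitioned_events tables registered in the Data Catalog (the real job may use the DataFrame API with partitionBy instead):

-- Rewrite the raw data into a table partitioned by date and customer id.
-- Dynamic partitioning may need to be enabled in the job configuration,
-- and the partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE partitioned_events
PARTITION (event_date, customer_id)
SELECT
  event_id,
  payload,
  event_date,
  customer_id
FROM raw_events;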

Updating the Glue Data Catalog

After completing the Spark jobs, a second problem appeared: the new partitions written by the Spark job were not recognized by the Data Catalog, so Athena queries could not see them. We have already discussed some solutions to this problem, but:

  • MSCK REPAIR TABLE is very slow for huge tables.
  • A Glue Crawler adds extra cost and needs more time for huge tables.

Partition Projection

In partition projection, partition values and locations are calculated from configuration rather than reading from a repository like the AWS Glue Data Catalog. Since in-memory operations are often faster than remote operations, partition projection can reduce the runtime of queries against highly partitioned tables. Depending on the specific characteristics of the query and underlying data, partition projection can significantly reduce query runtime for queries that are constrained on partition metadata retrieval.

For example, a table definition with partition projection looks like this:

PARTITIONED BY (
  datehour STRING
)
LOCATION "s3://DOC-EXAMPLE-BUCKET/prefix/"
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.datehour.type" = "date",
  "projection.datehour.format" = "yyyy/MM/dd/HH",
  "projection.datehour.range" = "2021/01/01/00,NOW",
  "projection.datehour.interval" = "1",
  "projection.datehour.interval.unit" = "HOURS",
  "storage.location.template" = "s3://DOC-EXAMPLE-BUCKET/prefix/${datehour}/"
)
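With this configuration, a query that filters on datehour (the table and column names below are illustrative) lets Athena compute the matching partitions in memory instead of looking them up in the catalog:

SELECT col1, col2
FROM events
WHERE datehour >= '2021/01/26/00'
  AND datehour < '2021/01/27/00';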

Tataaa!! We had a new problem :) Our customer ids range from 1 to 300,000. If we partition by both customer id and date, the Data Catalog partition limits are exceeded. Since these fields are the main WHERE clause filters in our queries, they directly affect our query performance.

Of course, we are very happy to have 300 thousand registered users, and this number continues to grow.

Partition Injection

Luckily, partition injection came to the rescue. The problematic field has the following characteristics:

  • An extremely high number (potentially billions) of values.
  • Since its values are random strings, it is not projectable using other projection methods.
  • The extremely large number of partitions cannot be stored in commonly used meta stores.

However, if all of your queries include a WHERE clause that filters for only a single id, you can use the following approach in your CREATE TABLE statement.

PARTITIONED BY (
  device_id STRING
)
LOCATION "s3://bucket/prefix/"
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.device_id.type" = "injected",
  "storage.location.template" = "s3://bucket/prefix/${device_id}"
)

A SELECT query on this table would look like the following:

SELECT
  col1,
  col2,...,
  device_id
FROM
  table
WHERE
  device_id = 'b6319dc2-48c1-4cd5-a0a3-a1969f7b48f7'
  AND (
    col1 > 0
    OR col2 < 10
  )

End of the day

After these experiences, we partitioned our buckets with the required partition keys, and our queries now run with high performance. Our cloud adventure stories will continue; if you are curious for more, please keep following us.
