Demystifying the ways of creating partitions in Glue Catalog on partitioned S3 data for faster insights

Subhash Burramsetty
May 4, 2020 · 5 min read


Introduction:

While working on data engineering projects, you might have come across a use case similar to the one below, where real-time streaming data is ingested into S3 in a partitioned format (YYYY/MM/DD/HH) via a Firehose delivery stream and has to be consumed immediately to generate QuickSight dashboards.

A sample end-to-end flow might look as follows:

Sample flow demonstrating generation of QuickSight reports from S3 data

The important point in the above workflow is the ability to get insights on the latest data as soon as it is ingested. We all know that using partitions to scan only the needed data is one of the best ways to improve query performance and reduce costs. But if you use Athena to query the data inside those new partitions immediately after the files are created in S3, the query will return zero records, because new partitions do not get registered automatically in the Glue Data Catalog. There are multiple ways to populate partitions in the Glue Data Catalog. Assuming the schema of the ingested data is fixed, let's walk through the available options and validate them, which will help us reduce the time to create partitions from minutes to seconds and get faster insights.

Method 1 — Glue Crawlers:

AWS Glue Crawlers are one of the best options to crawl the data and generate partitions and schemas automatically. You can run a crawler manually or automate it using triggers. However, when you have a lot of data in S3, running crawlers too frequently gets costly. Also, the time taken to generate partitions will be in minutes, since the crawler needs time to scan the data.

Method 2 — MSCK REPAIR Command:

You can run an MSCK REPAIR TABLE query whenever a new partition is generated in S3, either manually through the AWS console before running the actual query, or automatically using S3 events and a Lambda function. The command itself is completely free of cost. But, similar to the first method, if the data in S3 is huge, the command takes a long time to go through all of it and generate partitions. The query can also time out and end up not creating all the partitions, in which case you need to re-run it to create the rest. Keep in mind that MSCK REPAIR TABLE only discovers partitions that follow the Hive key=value naming convention, so Firehose's default YYYY/MM/DD/HH prefixes need to be mapped to such a layout for it to work.
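As a rough sketch of automating this method (the database, table, and result-bucket names are assumptions, not from the article), the repair can be submitted from a Lambda function using boto3's Athena client:

```python
def build_repair_query(database: str, table: str) -> str:
    # MSCK REPAIR TABLE scans the table's S3 location and registers any
    # partitions that follow the Hive key=value naming convention.
    return f"MSCK REPAIR TABLE `{database}`.`{table}`"

def repair_table(database: str, table: str, output_location: str) -> str:
    """Submit the repair query to Athena; returns the query execution id."""
    import boto3  # available in the Lambda runtime
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=build_repair_query(database, table),
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]

# repair_table("my_database", "my_table", "s3://my-athena-results/")  # assumed names
```

Note that start_query_execution returns as soon as the query is submitted; the repair itself still runs asynchronously inside Athena, which is where the long scan times and timeouts described above come from.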

Method 3 — Alter Table Add Partition Command:

You can run an ALTER TABLE ADD PARTITION SQL command in Athena to add the partition explicitly. You can run this manually or automate the process using S3 events and Lambda functions. Since the query runs through the Athena service, it is subject to Athena service quotas, such as the limit on active DDL/DML queries at any point in time.
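A minimal sketch of this method (table name, partition keys, and S3 location are illustrative assumptions) looks like this:

```python
def build_add_partition_query(table: str, location: str,
                              year: str, month: str, day: str, hour: str) -> str:
    # IF NOT EXISTS makes the statement safe to re-run for the same hour.
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}', hour='{hour}') "
        f"LOCATION '{location}/{year}/{month}/{day}/{hour}/'"
    )

def add_partition(table: str, location: str, output_location: str,
                  year: str, month: str, day: str, hour: str) -> str:
    """Submit the DDL to Athena; returns the query execution id."""
    import boto3  # available in the Lambda runtime
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=build_add_partition_query(table, location, year, month, day, hour),
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]

# add_partition("my_table", "s3://my-bucket/data",
#               "s3://my-athena-results/", "2020", "05", "04", "10")  # assumed names
```

Each such DDL statement counts against the Athena active-query quota mentioned above, which is exactly the constraint that motivates Method 4.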

Method 4 — Add Glue Table Partition using the Boto3 SDK:

We can use the AWS Boto3 SDK to create Glue partitions on the fly. You can create a Lambda function and configure it to watch for S3 object-created events. Whenever a file gets dropped into S3 via a PUT event, the Lambda function gets triggered, extracts the partition values from the S3 object key, and creates a partition in the Glue Data Catalog.

Lambda functions are serverless and cheap, and our custom code finishes execution in under two seconds. Using this method, you can create new Glue partitions on top of an existing table on the fly, almost instantly.

Also, you can always add additional functionalities on top of it like SNS notifications, workflows, triggers and so on based on your needs.

The overall flow will look as follows:

Add Glue Partition using Lambda, Boto3 and S3 Events

Code snippet for automatic partitioning using Lambda & the Boto3 SDK:

Python 3.7 — Code snippet of lambda function to add new partitions into Glue Data Catalog
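A minimal sketch of such a Lambda function is below. The database and table names are placeholder assumptions, and the key layout is the YYYY/MM/DD/HH prefix from the use case above; adapt both to your setup. It copies the table's storage descriptor so the new partition inherits the SerDe, input/output formats, and column schema:

```python
import re

def extract_partition(key: str) -> list:
    """Pull the YYYY/MM/DD/HH values out of a Firehose-style object key."""
    match = re.search(r"(\d{4})/(\d{2})/(\d{2})/(\d{2})/", key)
    if not match:
        raise ValueError(f"No YYYY/MM/DD/HH partition found in key: {key}")
    return list(match.groups())

def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime
    glue = boto3.client("glue")
    database, table = "my_database", "my_table"  # assumed names

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        year, month, day, hour = extract_partition(key)

        # Reuse the table's storage descriptor so the partition inherits
        # the SerDe, formats, and columns; only the location changes.
        descriptor = glue.get_table(
            DatabaseName=database, Name=table
        )["Table"]["StorageDescriptor"]
        prefix = key.rsplit("/", 1)[0]  # drop the file name, keep the hour prefix
        descriptor["Location"] = f"s3://{bucket}/{prefix}/"

        try:
            glue.create_partition(
                DatabaseName=database,
                TableName=table,
                PartitionInput={
                    "Values": [year, month, day, hour],
                    "StorageDescriptor": descriptor,
                },
            )
        except glue.exceptions.AlreadyExistsException:
            pass  # another object in the same hour already created it
```

Catching AlreadyExistsException matters because every file in the same hour fires a separate S3 event, but only the first one needs to create the partition.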

Note:

The above method is mainly suggested when the schema is fixed. I would still suggest scheduling a Glue crawler to run at an interval based on how often your upstream schema might change, to make sure you don't miss any schema changes.

Comparison between above methods for our current use case:

Comparison of the methods described above:

| Method                     | Cost                         | Time to create partitions    | Constraints                          |
|----------------------------|------------------------------|------------------------------|--------------------------------------|
| 1. Glue Crawlers           | Costly at frequent runs      | Minutes                      | Crawl time grows with data volume    |
| 2. MSCK REPAIR TABLE       | Free                         | Minutes; can time out        | Needs Hive-style key=value prefixes  |
| 3. ALTER TABLE (Athena)    | Low                          | Seconds                      | Athena service quotas on DDL/DML     |
| 4. Boto3 SDK + Lambda      | Very low                     | Seconds                      | Schema must be fixed                 |

Coming back to our initial use case, since we need the partitions to be created in the Glue Data Catalog as soon as possible and with little overhead, Methods 1 and 2 don't satisfy the timing constraint, whereas Method 3 is subject to Athena service quota limits. Of all the methods, Method 4 overcomes these constraints to quickly add partitions into the Glue Data Catalog and helps in getting faster insights on ingested data. Give it a try and see for yourself.

Update on 28-October-2021:

AWS recently released the following new features for Kinesis Firehose and the Glue Data Catalog, which replace the manual implementation described in Method 4 above and also take care of schema detection on new files 🎉 🎉 :

1. Dynamic partitioning in Firehose on custom fields (Aug 31, 2021) — https://aws.amazon.com/about-aws/whats-new/2021/08/introducing-dynamic-partitioning-amazon-kinesis-data-firehose/

2. Glue crawlers inbuilt support for S3 events notifications (October 15, 2021) — https://aws.amazon.com/about-aws/whats-new/2021/10/aws-glue-crawlers-amazon-s3-notifications/

Do check out these new features released by AWS, as they simplify the overall use case this article aims to solve.

References:

  1. https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
  2. https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html
  3. https://60devs.com/automatic-creation-of-Athena-partitions-for-Firehose-delivery-streams.html
  4. https://github.com/rewindio/aws-athena-partition-autoloader
  5. https://medium.com/@tobinc/automatically-adding-partitions-to-aws-glue-using-node-lambda-only-a992c124973b
