Demystifying the ways of creating partitions in Glue Catalog on partitioned S3 data for faster insights

Introduction:

While working on data engineering projects, you may have come across a use case similar to the one below: real-time streaming data is ingested into S3 in a partitioned format (YYYY/MM/DD/HH) via a Firehose delivery stream and has to be consumed immediately to generate QuickSight dashboards.

Sample flow demonstrating generation of QuickSight reports from S3 data

Method 1 — Glue Crawlers:

AWS Glue Crawlers are one of the easiest options to crawl the data and generate partitions and schema automatically. You can run a crawler manually or automate it using triggers. However, when you have a lot of data in S3, running the crawlers too frequently becomes costly. Also, the time taken to generate partitions will be in minutes, since the crawler needs time to scan the data.
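For reference, a crawler run can also be kicked off programmatically; a minimal sketch using Boto3, assuming a crawler named `events-crawler` has already been created:

```python
def run_crawler(name="events-crawler"):  # hypothetical crawler name
    """Kick off an on-demand run of an existing Glue crawler."""
    import boto3  # available by default in the AWS Lambda Python runtime
    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
```

The same call can sit inside a Lambda function or a scheduled job; Glue raises `CrawlerRunningException` if a run is already in progress.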

Method 2 — MSCK REPAIR Command:

You can run the MSCK REPAIR TABLE query whenever a new partition lands in S3, either manually through the AWS console before running the actual query, or automatically using S3 events and Lambda functions. It is completely free of cost. However, similar to the first method, if the data in S3 is huge, this command takes a lot of time to go through all the data and generate partitions. There is also a chance of the MSCK query timing out without creating all the partitions, in which case you need to re-run it to create the rest.
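The automated variant can be sketched with Boto3's Athena client; the database, table, and results-bucket names below are placeholders:

```python
def msck_repair_query(table):
    """Build the repair statement Athena will run for the given table."""
    return f"MSCK REPAIR TABLE {table}"


def repair_partitions(database="analytics_db",           # hypothetical database
                      table="firehose_events",           # hypothetical table
                      output="s3://my-athena-results/"): # hypothetical bucket
    """Submit MSCK REPAIR TABLE to Athena; returns the query execution id."""
    import boto3  # available by default in the AWS Lambda Python runtime
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=msck_repair_query(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    return response["QueryExecutionId"]
```

Note that `start_query_execution` only submits the query; Athena runs the repair asynchronously, so a timed-out or failed repair surfaces later via `get_query_execution`.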

Method 3 — Alter Table Add Partition Command:

You can run an ALTER TABLE ... ADD PARTITION SQL command in Athena to add the partition explicitly. You can run this manually or automate the process using S3 events and Lambda functions. Since the query runs through the Athena service, it is subject to Athena service quotas and limits, such as the number of active DDL/DML queries at any point in time.
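A sketch of the DDL this method issues, with a helper that builds the statement from the partition values (the table and bucket names are placeholders); the statement can be submitted through the same `start_query_execution` call shown for Method 2:

```python
def add_partition_query(table, bucket, year, month, day, hour):
    """Build an ALTER TABLE statement registering one YYYY/MM/DD/HH partition."""
    location = f"s3://{bucket}/{year}/{month}/{day}/{hour}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}', hour='{hour}') "
        f"LOCATION '{location}'"
    )
```

The `IF NOT EXISTS` clause keeps the statement idempotent if the Lambda fires more than once for objects landing in the same hour.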

Method 4 — Add Glue Table Partition using Boto3 SDK:

We can use the AWS Boto3 SDK to create Glue partitions on the fly. You can create a Lambda function and configure it to watch for S3 object-created events. Whenever a file gets dropped into S3 via a PUT event, the Lambda function gets triggered, extracts the partition values from the S3 object key, and creates a partition in the Glue Data Catalog.

Add Glue Partition using Lambda, Boto3 and S3 Events

Code snippet for automatic partitioning using Lambda & Boto3 SDK:

Python 3.7 — Code snippet of Lambda function to add new partitions into the Glue Data Catalog
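In case the embedded snippet does not render, below is a minimal sketch of such a handler. It assumes a hypothetical database `analytics_db` and table `firehose_events` whose data lands under the YYYY/MM/DD/HH prefix:

```python
from urllib.parse import unquote_plus  # S3 event keys arrive URL-encoded


def partition_values_from_key(key):
    """Extract [year, month, day, hour] from a YYYY/MM/DD/HH/... object key."""
    return key.split("/")[:4]


def lambda_handler(event, context):
    import boto3  # available by default in the AWS Lambda Python runtime
    glue = boto3.client("glue")

    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    values = partition_values_from_key(unquote_plus(s3_info["object"]["key"]))

    # Reuse the table's storage descriptor so the new partition inherits the
    # same format/SerDe settings, pointing its location at the new prefix.
    table = glue.get_table(DatabaseName="analytics_db",   # hypothetical names
                           Name="firehose_events")["Table"]
    descriptor = dict(table["StorageDescriptor"])
    descriptor["Location"] = f"s3://{bucket}/{'/'.join(values)}/"

    try:
        glue.create_partition(
            DatabaseName="analytics_db",
            TableName="firehose_events",
            PartitionInput={"Values": values, "StorageDescriptor": descriptor},
        )
    except glue.exceptions.AlreadyExistsException:
        pass  # an earlier object in the same hour already created the partition
```

Swallowing `AlreadyExistsException` keeps the function idempotent, since Firehose writes many objects into the same hourly prefix but the partition only needs to be registered once.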

Comparison between the above methods for our current use case:

Comparison of the methods described above

Update on 28-October-2021:

AWS recently released new features for Kinesis Data Firehose and the Glue Data Catalog that remove the need for the manual implementation described in Method 4 above; the new capability takes care of schema detection on new files as well 🎉 🎉

References:

  1. https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
  2. https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html
  3. https://60devs.com/automatic-creation-of-Athena-partitions-for-Firehose-delivery-streams.html
  4. https://github.com/rewindio/aws-athena-partition-autoloader
  5. https://medium.com/@tobinc/automatically-adding-partitions-to-aws-glue-using-node-lambda-only-a992c124973b


Architect — AppDev & Data at Presidio
