Automatically Add Partitions to AWS Glue using Node/Lambda Only

Tobin
Cloud Native Daily
Published in
3 min readAug 24, 2019
Photo by Arnold Francisca on Unsplash

Situation

  1. An AWS Kinesis Firehose has been set up to feed into S3
  2. Convert Record Format is ON into parquet and mapping fields against a user-defined table in AWS Glue.
  3. Write to S3 is using Hive or Firehose partitioning (or any custom partitioning key prefix)
  4. Glue Table is partitioned along the same lines

Problem

Feeding Kinesis Firehose records into S3 as parquet is fantastic for easy querying using Athena; with auto time-series partitions in Firehose the data can be very efficiently queried. As partitioned data volume is incremented with new S3 objects, partitions must be loaded into Athena / Glue before running queries (when not using partitions everything appears under the single table location).

Crawler

This can be achieved using a scheduled Crawler. Crawlers can get expensive: with a lot of data each crawl takes time, and wanting real-time data requires frequent crawling. Operating on a schedule therefore means trading TTL against cost.

Lambda using S3 events

This removes the scheduling issue by invoking the addition method whenever new objects are added. However it still relies on either scanning the S3 data and ADD missing partitions, or running MSCK REPAIRs. Both require running a query on Athena.

How do the partitions work?

Think of Glue Partitions as a map between a combination of known values and a specific directory. The SDK refers to Values as the array of values mapping to a specific Partition entry, which in turn explicitly states which directory in S3 to search / find objects.

Solution

  1. Take advantage of S3 events to only do anything when there is actually relevant new data
  2. Use the event data object to work out where the data has been added
  3. Use the aws-sdk::glue to query instead of Athena
Representative Cloudformation template

Setup Firehose to deliver successfully processed records to a path e.g. processed/ and allow it to add the timepath automatically. In this case, it is not necessary to use the Hive formatting. This results in keys like:

processed/2019/08/01/<filename>.parquet

Personally, I add relevant source_records/ and error/ prefixes too.

Create a Lambda Notification subscription in the S3 bucket, filtering on the processed key prefix. Kinesis adds files to S3 via PUT so use the s3:ObjectCreated:Put event (note that the console uses ObjectCreated:Put).

Bucket:
Type: 'AWS::S3::Bucket'
Properties:
NotificationConfiguration:
LambdaConfigurations:
- Event: 's3:ObjectCreated:Put'
Function: !GetAtt
- AutoAddPartitionFunction
- Arn
Filter:
S3Key:
Rules:
- Name: prefix
Value: processed

Lambda

We should always get Records.length === 1 but just to be sure, loop the Records. The key of the new object can be parsed for some juicy partition Values data. Use glue.getPartition() to find out if this desired partition exists. Using await, if the partition does not exist a 400 EntityNotFoundException is conveniently thrown.

Creating a new partition using glue.createPartition() is quite complex as the StorageDescriptor must be fully populated with e.g. the column data.

Conveniently the StorageDescriptor returned from glue.getTable() is mostly identical; the Table location is at the parent level.

glue.getTable() => StorageDescriptor.Location
=> 's3://<bucket>/processed/'

The new partition location can be determined by concatenating the Table Location with the object key.

For the DatabaseName and TableName environment variables are a good option, but unfortunately referencing them in Cloudformation creates a Circular Dependency :(

Node10.x Lambda index.js

Conclusion

An awesome tool, the only real sticking points I had with Athena was dealing with partitions on time-incrementing data sources. The provided Crawler solution is perhaps more ‘integrated’ however feels a bit clumsy and costs are relatively high. As this lambda only runs intermittently it often requires cold starting and therefore each run can be a bit lengthy (<800ms, $$), but overall it appears to be cheaper and faster than the other methods.

--

--

Tobin
Cloud Native Daily

CTO @ Proofenance - KYC & AML checks at point of sale in-store.