Automatically Add Partitions to AWS Glue using Node/Lambda Only

Tobin

Published in

Cloud Native Daily

3 min readAug 24, 2019

Situation

An AWS Kinesis Firehose has been set up to feed into S3
Convert Record Format is ON into parquet and mapping fields against a user-defined table in AWS Glue.
Write to S3 is using Hive or Firehose partitioning (or any custom partitioning key prefix)
Glue Table is partitioned along the same lines

Problem

Feeding Kinesis Firehose records into S3 as parquet is fantastic for easy querying using Athena; with auto time-series partitions in Firehose the data can be very efficiently queried. As partitioned data volume is incremented with new S3 objects, partitions must be loaded into Athena / Glue before running queries (when not using partitions everything appears under the single table location).

Crawler

This can be achieved using a scheduled Crawler. Crawlers can get expensive: with a lot of data each crawl takes time, and wanting real-time data requires frequent crawling. Operating on a schedule therefore means trading TTL against cost.

Lambda using S3 events

This removes the scheduling issue by invoking the addition method whenever new objects are added. However it still relies on either scanning the S3 data and ADD missing partitions, or running MSCK REPAIRs. Both require running a query on Athena.

How do the partitions work?

Think of Glue Partitions as a map between a combination of known values and a specific directory. The SDK refers to Values as the array of values mapping to a specific Partition entry, which in turn explicitly states which directory in S3 to search / find objects.

Solution

Take advantage of S3 events to only do anything when there is actually relevant new data
Use the event data object to work out where the data has been added
Use the aws-sdk::glue to query instead of Athena

Setup Firehose to deliver successfully processed records to a path e.g. processed/ and allow it to add the timepath automatically. In this case, it is not necessary to use the Hive formatting. This results in keys like:

processed/2019/08/01/<filename>.parquet

Personally, I add relevant source_records/ and error/ prefixes too.

Create a Lambda Notification subscription in the S3 bucket, filtering on the processed key prefix. Kinesis adds files to S3 via PUT so use the s3:ObjectCreated:Put event (note that the console uses ObjectCreated:Put).

Bucket:
    Type: 'AWS::S3::Bucket'
    Properties:
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: 's3:ObjectCreated:Put'
            Function: !GetAtt 
              - AutoAddPartitionFunction
              - Arn
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: processed

Lambda

We should always get Records.length === 1 but just to be sure, loop the Records. The key of the new object can be parsed for some juicy partition Values data. Use glue.getPartition() to find out if this desired partition exists. Using await, if the partition does not exist a 400 EntityNotFoundException is conveniently thrown.

Creating a new partition using glue.createPartition() is quite complex as the StorageDescriptor must be fully populated with e.g. the column data.

Conveniently the StorageDescriptor returned from glue.getTable() is mostly identical; the Table location is at the parent level.

glue.getTable() => StorageDescriptor.Location
=> 's3://<bucket>/processed/'

The new partition location can be determined by concatenating the Table Location with the object key.

For the DatabaseName and TableName environment variables are a good option, but unfortunately referencing them in Cloudformation creates a Circular Dependency :(

Node10.x Lambda index.js

Conclusion

An awesome tool, the only real sticking points I had with Athena was dealing with partitions on time-incrementing data sources. The provided Crawler solution is perhaps more ‘integrated’ however feels a bit clumsy and costs are relatively high. As this lambda only runs intermittently it often requires cold starting and therefore each run can be a bit lengthy (<800ms, $$), but overall it appears to be cheaper and faster than the other methods.

5 Microservices Challenges and Blindspots for Developers

Microservices are loosely coupled services that are organized around business capabilities. In an ideal microservices…

gethelios.dev

Microservices Monitoring: Cutting Engineering Costs and Saving Time

Skip to content We use cookies on our website to give you the most relevant experience by remembering your preferences…

gethelios.dev

Testing Microservices - Trace Based Integration Testing Example

As engineering organizations transitioned from monolith to microservices architectures, they sought to make their…

gethelios.dev

Automatically Add Partitions to AWS Glue using Node/Lambda Only

Situation

Problem

Crawler

Lambda using S3 events

How do the partitions work?

Solution

Lambda

Conclusion

Further Reading

5 Microservices Challenges and Blindspots for Developers

Microservices are loosely coupled services that are organized around business capabilities. In an ideal microservices…

Microservices Monitoring: Cutting Engineering Costs and Saving Time

Skip to content We use cookies on our website to give you the most relevant experience by remembering your preferences…

Testing Microservices - Trace Based Integration Testing Example

As engineering organizations transitioned from monolith to microservices architectures, they sought to make their…

Written by Tobin