Implementing a Glue ETL Job with Job Bookmarks

Anand Prakash
Published in Analytics Vidhya · 3 min read · May 7, 2020

AWS Glue is a fully managed ETL service for loading large datasets from various sources for analytics and data processing with Apache Spark ETL jobs.

In this post I will discuss the use of the AWS Glue Job Bookmarks feature in the following architecture.

AWS Glue Job Bookmarks help Glue maintain state information about the ETL job, so that when the job reruns on a scheduled interval it processes only new data and does not reprocess old data. In a nutshell, job bookmarks let AWS Glue jobs process the incremental data that arrived since the last run, avoiding duplicate processing.
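Bookmarks are opt-in: a Glue job only tracks state if the `--job-bookmark-option` default argument is set to `job-bookmark-enable`. Below is a minimal sketch of the parameters you might pass to boto3's `glue_client.create_job` to turn bookmarks on; the job name, IAM role, and script location are placeholders, not values from this post.

```python
# Sketch only: parameters for boto3's glue_client.create_job with job
# bookmarks enabled. Name, Role, and ScriptLocation are placeholders.
create_job_params = {
    "Name": "events-etl-job",                                   # placeholder
    "Role": "arn:aws:iam::123456789012:role/GlueETLRole",       # placeholder
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://scripts-bucket/events_etl.py",  # placeholder
        "PythonVersion": "3",
    },
    "DefaultArguments": {
        # Bookmarks are disabled by default; this argument enables them.
        "--job-bookmark-option": "job-bookmark-enable",
    },
}

# With AWS credentials configured, you would then call:
# import boto3
# boto3.client("glue").create_job(**create_job_params)
print(create_job_params["DefaultArguments"]["--job-bookmark-option"])
```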

In the above architecture, Kinesis Data Firehose streams event data to an S3 bucket, referred to as the raw data store, based on its buffer size and buffer interval settings, whichever condition is satisfied first. Supposing the buffer interval of 900 seconds is satisfied first, Firehose delivers data to S3 every 15 minutes, writing to the configured S3 destination prefix. In Firehose, the S3 destination prefix is optional and configurable. For example, to write the data in hourly partitions, you can set the following prefix under the Amazon S3 destination:

event/year=!{timestamp:yyyy}/month=!{timestamp:MM}/date=!{timestamp:dd}/hour=!{timestamp:HH}/
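To illustrate how the `!{timestamp:...}` expressions in that prefix expand, here is a small sketch that mirrors the substitution with `strftime`. Firehose evaluates these expressions against the record's approximate arrival timestamp (in UTC by default); the function below is an illustration, not a Firehose API.

```python
# Illustration: how Firehose's !{timestamp:...} prefix expressions map onto
# an arrival time. Firehose itself does this substitution server-side.
from datetime import datetime, timezone

def expand_prefix(arrival: datetime) -> str:
    # Mirrors event/year=!{timestamp:yyyy}/month=!{timestamp:MM}/
    #         date=!{timestamp:dd}/hour=!{timestamp:HH}/
    return arrival.strftime("event/year=%Y/month=%m/date=%d/hour=%H/")

print(expand_prefix(datetime(2020, 5, 7, 13, 30, tzinfo=timezone.utc)))
# event/year=2020/month=05/date=07/hour=13/
```

Records delivered in the same hour therefore land under the same S3 "partition" prefix, which is what lets downstream queries prune by hour.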

The AWS Glue ETL job is triggered using a Glue ETL trigger. As AWS Glue job bookmark…
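Inside the job script, bookmarks hinge on two things: calling `job.init`/`job.commit` so Glue can restore and persist bookmark state, and passing a `transformation_ctx` to each source and sink so Glue can key that state. The sketch below assumes it runs inside AWS Glue (where the `awsglue` libraries are provided); the catalog database, table, and output path are placeholders.

```python
# Sketch of a Glue ETL script using job bookmarks. Runs only inside AWS
# Glue, where the awsglue libraries are available. Database, table, and
# S3 path names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # restores bookmark state for this job

# transformation_ctx keys the bookmark: with bookmarks enabled, only data
# not seen by a previously committed run is read here.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="events_db",              # placeholder catalog database
    table_name="raw_events",           # placeholder catalog table
    transformation_ctx="raw_events_ctx",
)

glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://processed-data-store/events/"},
    format="parquet",
    transformation_ctx="write_events_ctx",
)

job.commit()  # persists the bookmark so the next run skips this data
```

If `job.commit()` is never reached (for example, the job fails), the bookmark is not advanced, so the next run picks up the same data again.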

