Add new partitions in AWS Glue Data Catalog from AWS Glue Job

Anand Prakash
Published in Analytics Vidhya
5 min read, Jan 11, 2021

Given a partitioned table in the AWS Glue Data Catalog, there are a few ways to update the catalog with newly created partitions.

  1. Run MSCK REPAIR TABLE <database>.<table_name> in Amazon Athena.
  2. Rerun the AWS Glue crawler.
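Both options can also be triggered from code. The following is a minimal sketch using boto3, not something taken from the original walkthrough: the function names, the database, table and crawler names, and the S3 output location are placeholders chosen for illustration. The clients are passed in as arguments so the helpers can be tried with stand-ins before pointing them at a real account.

```python
def repair_table(athena, database, table, output_location):
    """Ask Athena to scan the table location and register missing partitions."""
    response = athena.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {table}",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    # The call only submits the query; the id can be used to poll its status.
    return response["QueryExecutionId"]


def rerun_crawler(glue, crawler_name):
    """Start an existing crawler; partitions appear once the crawl completes."""
    glue.start_crawler(Name=crawler_name)


# Usage with real clients (needs AWS credentials; all names are placeholders):
#   import boto3
#   repair_table(boto3.client("athena"), "my_db", "orders_history",
#                "s3://my-bucket/athena-results/")
#   rerun_crawler(boto3.client("glue"), "my_crawler")
```

Both calls return before the work is done, so in either case the new partitions become visible only after the query or the crawl has finished.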

Recently, the AWS Glue service team added a new feature (a parameter for the Glue job) that lets you view the newly created partitions in the Glue Data Catalog immediately.
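The parameter is not named at this point in the text, so the sketch below assumes it refers to the `enableUpdateCatalog` and `partitionKeys` sink options described in the AWS Glue documentation. The helper names and the `write_sink` context string are illustrative, and the Glue context is passed in as an argument so the option-building logic can be checked outside a Glue job.

```python
def catalog_update_options(partition_keys):
    """Build the sink options that ask Glue to register new partitions."""
    return {
        "enableUpdateCatalog": True,            # update the catalog during the write
        "partitionKeys": list(partition_keys),  # columns that define the partitions
    }


def write_and_register_partitions(glue_context, frame, database, table, partition_keys):
    """Write a DynamicFrame to an existing catalog table and add its partitions."""
    return glue_context.write_dynamic_frame_from_catalog(
        frame=frame,
        database=database,
        table_name=table,
        transformation_ctx="write_sink",
        additional_options=catalog_update_options(partition_keys),
    )
```

Inside a job this would be called with the job's `GlueContext`, the transformed frame, and partition keys such as `["year", "month"]`.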

To demo this, I will pre-create an empty partitioned table using Amazon Athena, with its target location in S3. Another S3 location acts as the data source for creating the AWS Glue DynamicFrame. I will enable the AWS Glue job bookmark feature to read and process only the newly added objects from the source S3 location. The AWS Glue ETL job will process the source data, write it to the target S3 location, and update the Glue Data Catalog with the newly created partitions.
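The bookmark part of that flow can be sketched as follows. This is an illustration under assumptions, not the job script from this walkthrough: the source data is assumed to be JSON, `read_source` is an arbitrary context name, and the write step is passed in as a hypothetical callable so the sketch stays focused on the bookmark mechanics. Bookmarks rely on two things shown here: a `transformation_ctx` on the read, and a `job.commit()` at the end of the run.

```python
def read_new_objects(glue_context, source_path, data_format="json"):
    """Read the source S3 path; with bookmarks enabled, only unseen objects."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [source_path], "recurse": True},
        format=data_format,                # assumption: source files are JSON
        transformation_ctx="read_source",  # key under which bookmark state is tracked
    )


def run_incremental(glue_context, job, source_path, write):
    """Read new objects, hand them to the write step, then save bookmark state."""
    frame = read_new_objects(glue_context, source_path)
    write(frame)   # hypothetical callable that writes to the target table
    job.commit()   # persists the bookmark so the next run skips these objects


# In a real Glue script, glue_context and job come from the usual setup:
#   args = getResolvedOptions(sys.argv, ["JOB_NAME"])
#   glue_context = GlueContext(SparkContext())
#   job = Job(glue_context)
#   job.init(args["JOB_NAME"], args)
```

The job also has to be started with the job parameter `--job-bookmark-option` set to `job-bookmark-enable`; without it the read returns every object under the source path on each run.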

  1. As the first step, I create the table orders_history, partitioned by year and month. The LOCATION parameter specifies the target S3 location for the table’s data. Note that the table uses the JSON format.
CREATE EXTERNAL TABLE `orders_history`(
`o_orderkey` bigint COMMENT 'from deserializer',
`o_custkey` bigint COMMENT 'from deserializer',
`o_orderstatus` string COMMENT 'from deserializer',
`o_totalprice` decimal(38,18) COMMENT 'from…

Anand Prakash
Analytics Vidhya

Avid learner of technology solutions around Machine Learning, Big-Data, Databases. 5x AWS Certified | 5x Oracle Certified.