Using Jira to Automate Updates and Additions of Glue Tables

Abhisek Roy
Credit Saison (India)
11 min read · Apr 23, 2023
Fig: The most helpful automation may involve the creation of the simplest workflow.

We have been using AWS Glue Crawlers to create table catalogues (which store a table's schema) for a while now. For the past three years, we had to run a script from time to time to update the Glue Tables. This script not only re-ran the Glue Crawler but also had a number of pre-run and post-run stages to resolve some minor issues that are baked into Glue itself. To solve these issues and ensure that Glue Crawler runs can be automated so that anyone can trigger them at any time, we created a Jira-based tool.

The Basic Architecture

What you see below is a data lake built using AWS infrastructure. You have a Postgres database (it can be any other database too, such as MySQL). The CDC from the database is captured by AWS DMS (Database Migration Service), which writes the data to an S3 bucket. CDC stands for Change Data Capture: any update, create, or delete operation happening in any schema and table of that particular database is captured and streamed via AWS DMS. DMS saves this data in the S3 bucket in a format of your choice. We prefer Parquet since it is lighter, cheaper, and faster to query than CSV.

Fig: A data lake built using AWS services

The folder structure in which the files are saved is: schema_name/table_name/partitions/file.parquet.

Now if you open these parquet files and use a tool to read them, all you will find are entries like this:

time_stamp = 2023-01-06 11:17:35.780687
id = 18
created_at = 1606917194748
updated_at = 1606917194748
city = KOLKATA
country = IN
pin_code = 700054

time_stamp = 2023-01-06 11:17:35.780698
id = 19
created_at = 1606917246208
updated_at = 1606917246208
city = MUMBAI
country = IN
pin_code = 382440
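
If you want to peek at these records yourself, a minimal sketch using pandas is shown below. The bucket name and key are placeholders, not our actual paths, and reading straight from S3 assumes the pyarrow and s3fs packages are installed.

import pandas as pd  # assumes pandas + pyarrow (and s3fs for direct S3 reads)

# Placeholder path: point this at any DMS-written parquet file in your bucket.
df = pd.read_parquet("s3://my-data-lake/partner_data/users/2023/02/22/sample.parquet")
print(df.head())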

While this data is saved in the S3 bucket, to be able to query it you also need to create Glue Tables: one Glue table for each Postgres table, where each Glue table contains the schema and some additional information related to the corresponding Postgres table.

Fig: The Glue Schema for a Postgres table

When a user wants to fetch the data lying in the S3 bucket, he or she will use AWS Athena, which internally connects to the S3 bucket and the Glue tables to get the data and the table schema; only then can it run a query. The query won't run if either the Glue Table or the data in the S3 bucket is missing.

Now, we could create Glue Tables on our own, but in a scenario where we are getting hundreds of tables via AWS DMS into our S3 bucket, we can use AWS Glue Crawler to crawl the bucket and create the Glue Tables for us. However, there are a few problems with the tables created by the Glue Crawler.

What are the problems with AWS Glue Crawler?

While Glue does a good job of ascertaining the schema of most tables, some issues that are commonly faced are:

  1. Tables in the same schema (Postgres schema) that have similar table schemas (structure or columns) may be merged into a single Glue table with multiple partitions.
  2. When the crawler crawls multiple tables with the same name (in different Postgres schemas), such as “loan”, it will name the first one “loan”, the next one “loan_123asdasd123”, the one after that “loan_153jkdfshjk23w4”, and so on: random names that change each time the crawler runs. You can refer to this article by AWS to understand more: How crawlers work — AWS Glue.
  3. The tables created by the crawler ascertain the table schema from the files in the data lake, which is what we want. But there are certain properties we want to change. For example, the tables will capture the partitions that exist (data partitions). Instead of using the captured partitions, we want to use partition projection so that historical, current, and future data can all be queried via Athena. To understand how and why, please refer to: How Partition Projection can Save Your life and Money.

Despite these flaws, AWS Glue Crawler does the main heavy lifting very well: figuring out the table structure of hundreds of tables in our data lake.

What is the Solution?

We added stages before and after the auto-generation of the schema to get our desired results.

  • Before running the crawler, we go through all the folders in each schema, check whether any new tables have been added, and add the specific folders for the tables that need to be parsed. Putting the specific table folders in the crawler's S3 paths (such as root/partners/userdata/) instead of the top-level folder (such as root/) ensures that each table is crawled separately and no two tables are merged into one table with two partitions.
  • Next, we run the crawler and generate the schema for each table. As a side effect, the crawler also adds partitions based on the date folders present in each table folder.
  • Now that we have these auto-generated tables, we copy the table schema of each one, generate the table's name by prepending the schema name (for example, the address table inside the user_data schema gets the name user_data_address), and add certain properties to it. The properties are the same for every table. Given that the data for these tables resides in S3 buckets, the table is of the EXTERNAL type. The other properties are related to partition projection, which we use so that we can query data from any date between 2019/11/01 and NOW, where NOW is dynamically resolved to the current date and the former date is the first date partition of our CDC. Hence, as new data comes in every day, you will be able to query it via Athena.

How did we further refine the solution?

  1. Initially, we were doing all this using a single massive script that we would run from our local system. Given that it would crawl more than a thousand tables, the crawler took 5–6 hours to run, a waste of both time and resources, since we would run the crawler on all the tables even if we had just added or updated one.
  2. Instead, we came up with an internal tool that you can use to update a single Postgres schema. Updating a schema involves adding any tables that are new, as well as updating existing tables in that schema that have changed. Suppose you have a schema called user_data in which you have added 3 tables and added new columns to 5 tables. You can then run the tool on just the user_data schema, and it will add the new columns to the 5 tables and add the 3 new tables to the Glue Catalogue. This way, you run the tool only on the schemas that have had tables added or updated.

So what does our internal tool look like?

We created a custom Jira ticket with just 3 stages: Open, Executed, and Not Executed. As inputs for the Jira ticket, one has to fill in certain fields:

  1. Environment
  2. Region
  3. Schema for which the crawler must run (the crawler will update existing tables for that schema and add any new tables too)

Fig: The Jira Ticket Workflow

Once this Jira ticket is created by anyone in the company (who has access to the Jira board), a Step Function is triggered, and all the fields that the user provided are passed into the Step Function.

Fig: How Jira and AWS Step Function fit into the equation
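
As a rough, hypothetical sketch of that glue code: a small webhook-handling Lambda (fronted by something like API Gateway) can receive the Jira automation payload and start the Step Function with the ticket's fields. The state machine ARN and the customfield IDs below are placeholders; they will differ on your Jira board.

import json
import boto3

sfn_client = boto3.client("stepfunctions")

STATE_MACHINE_ARN = "arn:aws:states:ap-south-1:123456789012:stateMachine:GlueTableUpdate"  # placeholder


def lambda_handler(event, context):
    body = json.loads(event["body"])
    fields = body["issue"]["fields"]

    execution_input = {
        "ticket_key": body["issue"]["key"],
        "environment": fields["customfield_10051"],  # Environment (placeholder field ID)
        "region": fields["customfield_10052"],       # Region (placeholder field ID)
        "schema": fields["customfield_10053"],       # Postgres schema to crawl (placeholder field ID)
    }

    # Kick off the Glue-table-update workflow with the ticket's inputs.
    sfn_client.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps(execution_input),
    )
    return {"statusCode": 200}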

The Step Function workflow that is triggered by Jira events looks somewhat like the next image. I say somewhat and not exactly because you may want to tweak it a bit based on your specific requirements. Also note that all the stages except the choice state in the Step Function below are Python 3.7 Lambdas.

Fig: Step Function for Automated Glue Table Update

So let’s go step by step.

Step 1- CrawlerStateCheck: We check whether the crawler in question already exists and is currently running. If it is, the choice state takes us to the end step, UpdateJiraTicket, in which we mark the ticket as Not Executed and add a comment that the crawler is already in a running state.
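
A minimal sketch of such a check Lambda is shown below. The crawler naming convention and the crawler_running flag used by the choice state are assumptions, not the exact implementation.

import boto3

glue_client = boto3.client("glue")


def lambda_handler(event, context):
    crawler_name = f"{event['schema']}_crawler"  # naming convention is an assumption
    try:
        crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
        running = crawler["State"] in ("RUNNING", "STOPPING")
    except glue_client.exceptions.EntityNotFoundException:
        # Crawler doesn't exist yet, so it certainly isn't running.
        running = False
    # The downstream choice state routes on this flag.
    return {**event, "crawler_running": running}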

Step 2- CreateCrawler: If the output of CrawlerStateCheck was false, we go to this step. Here we check whether the crawler for that particular Postgres schema already exists (using glue_client.get_crawler()); if it does, we go to the next step directly. If the crawler for the schema doesn't exist, we create it (using glue_client.create_crawler()). This crawler will only crawl the folders for the tables of this particular schema in the S3 bucket.
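
A sketch of this step is given below, assuming one Glue database per Postgres schema; the IAM role ARN, bucket name, and crawler naming convention are placeholders. The initial S3 target is just the schema's root folder, since the next step replaces the targets with per-table paths.

import boto3

glue_client = boto3.client("glue")

BUCKET = "my-data-lake"  # placeholder


def lambda_handler(event, context):
    schema = event["schema"]
    crawler_name = f"{schema}_crawler"  # naming convention is an assumption
    try:
        glue_client.get_crawler(Name=crawler_name)
    except glue_client.exceptions.EntityNotFoundException:
        glue_client.create_crawler(
            Name=crawler_name,
            Role="arn:aws:iam::123456789012:role/glue-crawler-role",  # placeholder role
            DatabaseName=schema,  # assumption: Glue database named after the schema
            Targets={"S3Targets": [{"Path": f"s3://{BUCKET}/{schema}/"}]},
        )
    return event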

Step 3- AddTableLocations: In this step, we sift through the S3 folders for the schema in question and find the folder paths of all the tables inside that schema (using s3_client.list_objects()). Once we have all the paths for the tables in that Postgres schema, we add them to the Glue Crawler that we created in the last stage (or that already existed), using glue_client.update_crawler().

Fig: A crawler with S3 locations for all tables added separately
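
Below is a rough sketch of that Lambda. The bucket name and crawler naming convention are assumptions, and a paginator over list_objects_v2 stands in for the list_objects() call mentioned above; each common prefix under the schema folder corresponds to one table.

import boto3

s3_client = boto3.client("s3")
glue_client = boto3.client("glue")

BUCKET = "my-data-lake"  # placeholder


def lambda_handler(event, context):
    schema = event["schema"]
    crawler_name = f"{schema}_crawler"  # naming convention is an assumption

    # Each CommonPrefix under "<schema>/" is one table folder.
    paginator = s3_client.get_paginator("list_objects_v2")
    table_paths = []
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{schema}/", Delimiter="/"):
        for prefix in page.get("CommonPrefixes", []):
            table_paths.append({"Path": f"s3://{BUCKET}/{prefix['Prefix']}"})

    # Point the crawler at each table folder individually so tables never get merged.
    glue_client.update_crawler(Name=crawler_name, Targets={"S3Targets": table_paths})
    return {**event, "table_paths": [p["Path"] for p in table_paths]}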

Step 4- ClearLoadFiles: This stage is optional. If you have partitioning enabled in your DMS task and the task is set to “Full Load and CDC”, you will need this step. If you do not have partitioning enabled, or your task is set to CDC only, you will not.

When you create a DMS task of type “Full Load and CDC” and have partitioning enabled on top of it, the first copy of the existing data in the database is put into one or more LOAD files. These are placed in the root location of the table's S3 folder. So if your table is called “users” and it sits in a schema called “partner_data”, the load file(s) will have S3 keys that look like this:

  • partner_data/users/LOAD00000001.parquet

Fig: Load files in the Bucket

All the data that arrives after this full load of the initial data, i.e. the CDC data, is put into folders based on the partitioning logic that is set. Since we have used the date in YYYY/MM/DD format for partitioning, our CDC files have S3 keys like this:

  • partner_data/users/2023/02/22/20230222-071340817.parquet

Fig: CDC files in the Bucket

The problem with such storage is that when you query the data via Athena, the load file will be ignored and you will not have access to the data in that file (that is, the initial data pull from the database). So this Lambda moves each such load file into a partition of the format YYYY/MM/DD (inside the S3 folder of the table it belongs to), picking the date from the load file's creation date.
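
Below is a minimal sketch of that logic for a single table location, assuming a placeholder bucket name and using the object's LastModified timestamp as the creation date. S3 has no rename, so each load file is copied into the date prefix and then deleted from the root.

import boto3

s3_client = boto3.client("s3")

BUCKET = "my-data-lake"  # placeholder


def clear_load_files(table_prefix):
    """table_prefix looks like 'partner_data/users/'."""
    response = s3_client.list_objects_v2(Bucket=BUCKET, Prefix=f"{table_prefix}LOAD")
    for obj in response.get("Contents", []):
        # Derive a YYYY/MM/DD partition from the load file's timestamp.
        date_prefix = obj["LastModified"].strftime("%Y/%m/%d")
        file_name = obj["Key"].split("/")[-1]
        new_key = f"{table_prefix}{date_prefix}/{file_name}"
        s3_client.copy_object(
            Bucket=BUCKET,
            CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
            Key=new_key,
        )
        s3_client.delete_object(Bucket=BUCKET, Key=obj["Key"])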

One thing you may notice is that I have kept this state as a Map state. The reason is that we group the S3 table locations and run the ClearLoadFiles Lambda on a set number of locations per iteration. This ensures that if a schema has too many tables (say 1000) and we need to clear the load files for all 1000 of them, the Lambda doesn't time out: the Lambda runs 20 times, processing 50 S3 table locations each time.
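
The grouping itself is simple; a sketch of how the table locations might be chunked before being handed to the Map state (for example, inside the AddTableLocations Lambda) is shown below. With 1000 locations and a chunk size of 50, the Map state invokes the Lambda 20 times.

def chunk_locations(table_paths, chunk_size=50):
    # Split the flat list of S3 table locations into fixed-size groups,
    # one group per Map-state iteration.
    return [
        table_paths[i : i + chunk_size]
        for i in range(0, len(table_paths), chunk_size)
    ]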

Step 5- RunCrawler: This step does exactly what its name suggests. Whether the crawler existed before or the CreateCrawler stage just created it, the RunCrawler state runs the crawler (using glue_client.start_crawler(Name=crawler)). That's it, just a one-line Lambda here.

Step 6- CrawlerRunCompleteCheck: This is a vital step where we check whether the crawler that was started in the previous step has finished running. We check, wait for 300 seconds (5 minutes), and then check again. We use the configuration given below in the Step Function to specify this retry logic.

"Retry": [
{
"ErrorEquals": [
"States.TaskFailed"
],
"IntervalSeconds": 300,
"MaxAttempts": 100,
"BackoffRate": 1.0
}
]

Only when glue_client.get_crawler(Name=crawler_name) returns a response whose response['Crawler']['State'] value is "READY" do we move to the next step. Otherwise, the Lambda fails and the retry configuration keeps it waiting and checking.
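
A minimal sketch of that check Lambda: it simply raises if the crawler is not back in the READY state, so the Retry block above (States.TaskFailed, 300-second interval) keeps re-invoking it until the crawl has finished. The crawler naming convention is again an assumption.

import boto3

glue_client = boto3.client("glue")


def lambda_handler(event, context):
    crawler_name = f"{event['schema']}_crawler"  # naming convention is an assumption
    state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
    if state != "READY":
        # Raising triggers States.TaskFailed, which the Retry block catches.
        raise RuntimeError(f"Crawler {crawler_name} is still in state {state}")
    return event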

Step 7- UpdateGlueTables: This step adds the finishing touch to the picture we painted in the previous steps. Here, we update the Glue tables that the crawler created. We want to keep the schemas of the Glue tables, but change the names and add partition projection. Glue does not let you rename an existing table, so what we do is delete the old table and create a new one. Let me explain with an example.

Suppose we have a table called “users” in a schema called “partner”. By default, the name of the Glue table created by the crawler will be either “users” or something like “users_asuidhasiduyg128hdasjdh”, where the value following “users” is a hashed suffix. Quoting from this AWS document: “If duplicate table names are encountered, the crawler adds a hash string suffix to the name.”

So, we copy the schema of this table and create a new table with the name format schema_table_name. So for the table in the example, the name will be “partner_users”. While creating this new table, we also add some partition projection parameters:

{
    'EXTERNAL': 'TRUE',
    'projection.enabled': 'true',
    'projection.partition_date.format': 'yyyy/MM/dd',
    'projection.partition_date.interval': '1',
    'projection.partition_date.interval.unit': 'DAYS',
    'projection.partition_date.range': '2019/11/01,NOW',
    'projection.partition_date.type': 'date'
}

To know more about why these values are set, read: How Partition Projection can Save Your life and Money.
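
A sketch of this delete-and-recreate step for a single crawled table is shown below. It is an outline under a few assumptions: the Glue database name and table names are passed in as parameters, the projected partition column is called partition_date (matching the properties above), and a storage.location.template parameter is added, which is not part of the property list shown above but is generally needed in Athena when the partitions are plain YYYY/MM/DD folders rather than Hive-style ones.

import boto3

glue_client = boto3.client("glue")

PROJECTION_PARAMS = {
    "EXTERNAL": "TRUE",
    "projection.enabled": "true",
    "projection.partition_date.format": "yyyy/MM/dd",
    "projection.partition_date.interval": "1",
    "projection.partition_date.interval.unit": "DAYS",
    "projection.partition_date.range": "2019/11/01,NOW",
    "projection.partition_date.type": "date",
}


def rename_and_project(database, schema, crawled_table_name, table_name):
    # Copy the schema that the crawler worked out.
    old_table = glue_client.get_table(
        DatabaseName=database, Name=crawled_table_name
    )["Table"]
    storage = old_table["StorageDescriptor"]
    location = storage["Location"].rstrip("/")

    new_table = {
        "Name": f"{schema}_{table_name}",        # e.g. partner_users
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": storage,            # keep the crawled column schema as-is
        "PartitionKeys": [{"Name": "partition_date", "Type": "string"}],
        "Parameters": {
            **PROJECTION_PARAMS,
            # Projected partitions resolve to <table location>/YYYY/MM/DD/
            # (assumption: needed because the folders are not Hive-style).
            "storage.location.template": f"{location}/${{partition_date}}/",
        },
    }

    # Glue tables cannot be renamed, so delete the crawler's table and recreate it.
    glue_client.delete_table(DatabaseName=database, Name=crawled_table_name)
    glue_client.create_table(DatabaseName=database, TableInput=new_table)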

At the end of this step, we should have recreated all the tables in the schema for which the crawler ran, with new names and partition projection added.

One interesting fact: just like the ClearLoadFiles stage, this one too is a Map state, and I have ensured that each Lambda run handles only a set number of tables, both to avoid Lambda timeouts and for faster processing (the Step Function can run around 10 Lambdas concurrently for a Map state).

Step 8- UpdateJiraTicket: In this step, we move the Jira ticket to the Executed state, which is the end state of the ticket. We also add comments on the ticket listing the tables that were created or recreated for the given schema. This makes things easier, since whoever created the ticket to run the crawler and add newly created Postgres tables to Glue can verify the table additions by looking at the comments in the Jira ticket.
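
As a hedged sketch of this last Lambda, the Jira Cloud REST API can be called with the requests library. The base URL, credentials, and the transition ID for the Executed state are placeholders; the transition ID has to be looked up on your own board.

import requests

JIRA_BASE_URL = "https://your-domain.atlassian.net"  # placeholder
AUTH = ("automation@example.com", "api-token")       # placeholder credentials
EXECUTED_TRANSITION_ID = "31"                        # placeholder transition ID


def update_jira_ticket(ticket_key, created_tables):
    # Move the ticket to the Executed state.
    requests.post(
        f"{JIRA_BASE_URL}/rest/api/2/issue/{ticket_key}/transitions",
        json={"transition": {"id": EXECUTED_TRANSITION_ID}},
        auth=AUTH,
    )
    # Comment with the list of tables that were created/recreated.
    requests.post(
        f"{JIRA_BASE_URL}/rest/api/2/issue/{ticket_key}/comment",
        json={"body": "Tables created/recreated:\n" + "\n".join(created_tables)},
        auth=AUTH,
    )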

So this tool takes the best of AWS Glue Crawler and builds on it by using the auto-generated schema for each table while changing its name and adding partition projection parameters. It also ensures that all the individual S3 paths of the tables are added to the Glue Crawler, instead of it running at the root level and merging tables that look similar. Load files are also handled for the specific scenario of partitioning combined with Full Load and CDC.

Feel free to recreate this at your end or use a part of it to make Glue Tables easier to work with!
