How Partition Projection can Save your Life and Money

Abhisek Roy
Credit Saison (India)
6 min read · Oct 10, 2022
Fig: Partition Projection can be the only way to maintain partitions for all your Glue tables, giving you a set-it-up-and-forget-it option.

Let’s take a scenario. You get files from a particular source from time to time and drop them into an S3 bucket of yours. These are CSV files with file keys like:

s3://your_bucket/partner_data/file_type=user_data/date=08-09-2022/file_timestamp.csv

As is clear from the file key, two partitions are being used here: “date” and “file_type”. Let’s consider each separately.

For dates, you can face one of the following scenarios:

  1. You get files daily, so you could run a Lambda function that adds the partition for date T on date T-1 (the kind of statement such a job would issue is sketched just after this list). The downside is that unless you actually receive files every day, you will add empty partitions, which will slow down your queries. The Lambda running daily also incurs an avoidable (although minor) cost.
  2. You get files from time to time. So, instead of adding partitions daily, you create an S3 trigger that adds a partition when a file lands in the bucket and that partition doesn’t already exist. The problem: if you get a million files in a day, the trigger fires a million times and the Lambda executes a million times, even though the partition only needs to be created once, when the very first file is dropped. Such a waste of resources and money.
  3. You create a CRON job that runs every 15 minutes, checks for files that may require new partitions, and adds them accordingly. Again a waste of resources, since the cron would often run and find no new partitions. And if a file arrives whose partition doesn’t exist yet, it can take up to 15 minutes before it becomes queryable, adding unnecessary lag.
  4. You create a Glue Crawler that runs periodically and updates tables and their partitions. This again creates a lag between when files are added and when they become queryable. Also, as the number of tables and partitions grows, the crawler takes longer to run and update tables.
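
Whichever of these approaches you pick, the job ends up issuing roughly the same Athena DDL each time. A minimal sketch, assuming the table is named partner_data (an assumed name for illustration) and using the file key from above:

-- Run by the Lambda / cron job: register one day's partition.
-- IF NOT EXISTS makes it safe to run repeatedly for the same day.
ALTER TABLE partner_data ADD IF NOT EXISTS
PARTITION (file_type = 'user_data', `date` = '08-09-2022')
LOCATION 's3://your_bucket/partner_data/file_type=user_data/date=08-09-2022/';

Partition Projection, discussed further below, removes the need to run this statement at all.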

Now let’s look at the “file_type” partition. Given that “file_type” is a string field, it could take any value, depending on the files that are sent. Handling it can be more complicated than the date field.

  1. You can again use an S3 trigger or a Glue Crawler, just like the ones discussed for the date field, which would have the same issues.
  2. If you know which file types are possible, you could add them while adding the date partitions. This would, however, create far too many empty partitions, which in turn slows down Athena queries over time.

Given that I actually faced these issues while working with multiple Glue tables fed from varied sources, Partition Projection was a godsend when I came across it. I’ll explain the basics, the options it provides, how it tackles the issues above, and the problem I ended up using Partition Projection for.

The Basics of Partition Projections

Partition Projection is not magic and it won’t resolve all your Glue partition problems. However, it can handle partition fields such as the following:

  1. Date- This one is a no-brainer. Any Java-supported date format listed in the AWS documentation will work. Along with the format, you specify the range of dates you want partitions for, and the end of the range can be “NOW”. This means that as time passes, you can keep querying data up to the latest date.
  2. Enums- AWS regions, valid data types, config codes: all of these can be set as enums. You provide a comma-separated list of values that will be treated as the possible partitions, for example “A,B,C,D,E,F,G,Unknown”.
  3. Integers- Similar to the date type; you can use any range of integers here: 1 to 10, -500 to 500, and so on. You can also set an interval, just like for dates, which defines the final partitions.
  4. Injected Values- Possibly the most important and critical option offered by Partition Projection, this one comes in handy when you don’t know all the possible partition values at the time you set up the Glue table. For example, we created a Kinesis endpoint where partitions were created based on event_type. This event_type is sent by different APIs and grows over time, so the values cannot be known beforehand. Hence we went for injected partition projection. Injected partitioning can handle any value, but when you query a table with injected partitions, you must provide a WHERE clause with the injected partition’s value; you cannot query data across all injected partitions together. A short DDL sketch covering these four types follows this list.
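
To make the four types concrete, here is a minimal Athena DDL sketch. This is not our production table; the table name, columns, values and ranges are invented for illustration.

-- Hypothetical table showing all four projection types side by side:
--   region     -> enum     (fixed list of known values)
--   batch_id   -> integer  (numeric range)
--   event_type -> injected (values not known beforehand; must be filtered in queries)
--   date       -> date     (format plus a range whose end can be NOW)
CREATE EXTERNAL TABLE projection_demo (payload string)
PARTITIONED BY (region string, batch_id int, event_type string, `date` string)
STORED AS PARQUET
LOCATION 's3://your_bucket/projection_demo/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.region.type' = 'enum',
  'projection.region.values' = 'ap-south-1,us-east-1',
  'projection.batch_id.type' = 'integer',
  'projection.batch_id.range' = '1,100',
  'projection.event_type.type' = 'injected',
  'projection.date.type' = 'date',
  'projection.date.format' = 'dd-MM-yyyy',
  'projection.date.range' = '01-01-2022,NOW',
  'projection.date.interval' = '1',
  'projection.date.interval.unit' = 'DAYS'
);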

An example of Partition Projections in a real-life scenario

We were getting files from a Kinesis endpoint in this format:

s3://bucket_name/data_type=partner_data/event_type=event_a/date=08-09-2022/file_timestamp.csv

As is evident from the file key, there are three possible partitions: “data_type”, “event_type” and “date”.

Of these, data_type can be “partner_data” or “internal_data”, so we can use an enum type partition projection here.

Next, we have event_type, which needs to be an injected partition projection, since we are not aware of the possible values that it may get beforehand.

Finally, we have the date partition, for which we can use a date type partition projection with a range of 01-01-2022 to NOW. We selected the starting date based on when we set up our systems. You can also use ranges like NOW-3YEARS to NOW, 2018-01 to NOW+1MONTH, and more.

Given that we create tables using CloudFormation templates, we will show how to add partition projection within one. If you create the table manually through the AWS console, you can set the same properties by hand.

GlueTable:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref GlueDataBase
    TableInput:
      Name: test_data
      StorageDescriptor:
        Location: !Sub 's3://${bucket_name}/'
        Columns:
          - Name: payload
            Type: string
          - Name: time_stamp
            Type: timestamp
          - Name: user
            Type: string
        InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
        Compressed: false
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
          Parameters:
            serialization.format: '1'
        BucketColumns: []
        SortColumns: []
        StoredAsSubDirectories: false
      PartitionKeys:
        - Name: data_type
          Type: string
        - Name: event_type
          Type: string
        - Name: date
          Type: string
      Parameters:
        "projection.enabled": "true"
        "projection.date.type": "date"
        "projection.date.format": "dd-MM-yyyy"
        "projection.date.range": "01-01-2020,NOW"
        "projection.date.interval": "1"
        "projection.date.interval.unit": "DAYS"
        "projection.event_type.type": "injected"
        "projection.data_type.type": "enum"
        "projection.data_type.values": "partner_data,internal_data"
      TableType: EXTERNAL_TABLE

The YAML shared above is the CloudFormation template we use for the Glue table. If your files are also saved in Parquet format, you can use a similar template; if they are CSV, you will need to change the InputFormat, OutputFormat and SerdeInfo accordingly. To focus on partition projection, look at the Parameters field. Even if you make the change manually via the AWS console, you will need to set these same parameters. So let’s go over each of them and understand what they mean.

Remember to define the partitions under PartitionKeys in the order in which they appear in the file key; any other order, or a missing partition key, will cause errors. Next, look at the first parameter that is set, projection.enabled: this is what decides whether the Glue table uses partition projection or the partitions that were added to it manually. Once it is set to true, any partitions that were added separately no longer count.

Next, we have projection.date.type, projection.date.format, projection.date.range, projection.date.interval and projection.date.interval.unit, all of which define the details of our “date” partition. The fields and values are self-explanatory and in line with what we discussed earlier.

After this, the “event_type” partition is set to the “injected” type, which means we can pass any value in this partition. Lastly, we have the enum type partition “data_type”, which can take the values partner_data or internal_data. You can update this list later if required, as sketched below.
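
Updating that enum list later can be done by editing the template and redeploying, or with a one-off Athena DDL statement. A minimal sketch, where report_data is a made-up new value:

-- Extend the allowed data_type values for the enum projection.
ALTER TABLE test_data SET TBLPROPERTIES (
  'projection.data_type.values' = 'partner_data,internal_data,report_data'
);

If the table is owned by CloudFormation, remember to make the same change in the template so a later deployment does not silently revert it.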

Things to keep in mind

Now that we have discussed the benefits of Partition Projection and how to use it effectively, let’s end with a few important pointers that must be kept in mind when using it:

  1. Table metadata size cannot be more than 1MB when GZIP compressed. Hence, do not add massive enums in partition projections.
  2. All dates mentioned in partition projection will be taken as UTC.
  3. When querying tables with injected partitions, you must include a WHERE clause specifying their value. For example, for the table shown above, you would always need to query it as SELECT * FROM test_data WHERE event_type='dummy_event_value'. Any query without a WHERE clause on event_type will throw an error (see the query sketch after this list).
  4. “Enabling partition projection on a table causes Athena to ignore any partition metadata registered to the table in the AWS Glue Data Catalog or Hive metastore.”
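
To make points 3 and 4 concrete, here is a query sketch against the test_data table defined above; the event and date values are made up for illustration.

-- Works: the injected event_type partition is pinned with an equality filter,
-- and the enum and date partitions are pruned through projection.
SELECT payload, time_stamp
FROM test_data
WHERE event_type = 'dummy_event_value'
  AND data_type = 'partner_data'
  AND "date" = '08-09-2022';

-- Fails: no filter on the injected event_type partition.
-- SELECT * FROM test_data;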

To learn more and apply Partition Projection to your Glue tables, you can also read the official AWS documentation on the topic.
