Glue Partition Indexes: Speed Up Your Queries

Published in

CodeX

3 min readJun 16, 2022

Introduction

I talked in a previous post about migrating from a Hive metastore to a Glue metastore and some of the challenges we faced in doing so. Even after this migration, we noticed that our queries were running longer in Databricks than they were in Qubole. Why would that be the case? It’s a result of datasets with many partitions. Since that’s generally unavoidable, how can we solve it?

Partition Indexes

Glue partition indexes are an additional component you can add to your Glue tables to speed up query time. Essentially, partition indexes should target your most common partition columns that you use in queries so that when they are added to filters going forward, you reap the benefits. For those who use IaC, there’s a Terraform module for partition indexes so that these can be applied to tables (even after creation time, although I was facing issues with encrypted partitions errors when trying this way; I was able to properly create partition indexes by recreating the table and adding the index before the partitions were added).

Caveats

Worth nothing is that if you specify multiple columns in an index, the presence of the first column in that list is critical to take advantage of the partition index. For example, if your partition index is [country, year, month, day], make sure country is present (and make sure your index is structured so the first column is the most critical one).

Also worth noting is that only certain operations are allowed on queries if you want to actually use the index properly. For example, using the IN operator does not actually activate the index compared to other operations. All of these caveats can be found in AWS’s documentation.

Impact

I can definitely update this post with actual comparisons as we continue to do our testing. One thing that I couldn’t find in the documentation that I’m curious about is whether having an index on a table will carry over those benefits to views that are based on said table. That’s how we generally structure our data (as the views have more readable column names than the tables), so this would not be as advantageous if it doesn’t carry forward.

Either way, we look forward to taking advantage of partition indexes. It definitely was disheartening to migrate to a tool like Databricks and see our new metastore implementation result in reduced performance for ad-hoc querying, so hopefully this helps stabilize that area, as we continue to focus on strengthening our best practices elsewhere.

Conclusion

A similar optimization (on the Athena side of things) is partition projection. I have never tried this out myself as we don’t generally use Athena for any analysis, but it seems to be a great improvement for those who do. Having automated partition management and being able to use date ranges that actually work from a partitioning perspective are very essential, and are worth checking out for those who do use Athena in their day-to-day work.

Otherwise, consider using partition indexes for your Glue needs. Hope this works for you!