What table design patterns can I use in Databricks to make queries more efficient?
TL;DR: Pick one of these strategies so that every query reads the least amount of data required to answer it, which translates to faster execution times and lower costs. In the future, with default storage, you may not have to think about this at all. I personally still prefer to pick a strategy myself, since you know your data and how people query it best; tools that optimize automatically can be prone to error, and I prefer to handle optimizations on my own. Still, it is nice to know that Databricks is aiming to take this off the plate for those who prefer not to handle it.
Introduction
For new users who haven’t used Databricks before, this article is crucial because some decisions need to be made while the table is being built/designed. In general, the goal of table design is for every query to read the least amount of data possible to answer the query. This matters because Databricks can handle any volume of data, and if you don’t design the table correctly, you might experience poor performance yourself or, worse, receive complaints from downstream users frustrated with long-running queries. These long-running queries are generally caused by non-optimal table design, which in the worst case leads to full table scans.
The three main ways to design tables for efficient performance are partitioning, Z-ordering, and liquid clustering. The following blog post goes in-depth into how each of these works in a very technical manner. I don’t think everyone needs to understand the exact specifics; a rough understanding of how each works, along with the gotchas of each approach, is enough.
Partitioning
Partitioning is the original table design strategy and is probably still one of the most common to use. Partitioning in Delta tables is like organizing your data into smart, easily accessible folders. It’s a powerful technique that can significantly boost query performance and optimize storage by grouping similar data together based on specific columns. When you partition your Delta table, Databricks creates a directory structure that allows for efficient data skipping during queries, meaning you’ll only read the data you need. This not only speeds up your queries but also reduces costs, especially when working with large datasets in cloud environments.
Analogy: You can think of it like a dictionary. A dictionary is effectively partitioned by letter. If you are looking for the definition of a word that starts with ‘L’, you will never spend any time looking in the words under ‘A’. Partitioning works similarly for whichever column you decide to partition on; common ones include date or user_id.
Tips:
- Best used on tables larger than 1 TB; Z-ordering is more flexible and performant at smaller sizes.
- It’s better to be more generic than too specific, as over-partitioning can degrade performance.
- If partitioning by date, I prefer a single partition column holding the actual date, like '2024-07-29/data', rather than nested year/month/day folders like '2024/07/29/data'.
  - I find it cleaner and much easier to take advantage of the partitions than the alternative.
- You have to create the partition when defining/creating the table.
- Changing the partitioning later requires rewriting the entire table’s data for it to take effect, so think upfront about what these columns should be, and communicate to all users of the table so they filter on the partition columns.
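As a minimal sketch of the above (the table and column names here are hypothetical, not from any real schema), a partitioned Delta table is declared at creation time like this:

```sql
-- Hypothetical example: a Delta table partitioned by a single date column.
-- Queries that filter on event_date only read the matching partition folders.
CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  payload    STRING
)
USING DELTA
PARTITIONED BY (event_date);

-- Downstream queries should filter on the partition column to benefit:
SELECT count(*) FROM events WHERE event_date = '2024-07-29';
```

Note that the partition column must be chosen here, at CREATE TABLE time; changing it later means rewriting the table.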
Z-Ordering
Z-Ordering in Delta tables is a powerful optimization technique that enhances query performance by colocating related data within the same set of files. This method allows Databricks to skip irrelevant data more efficiently during queries, particularly for columns with high cardinality. By organizing data in this manner, Z-Ordering reduces the amount of data read, leading to faster query execution and improved resource utilization. When used correctly, it can significantly accelerate queries that filter on frequently queried columns, making it an essential tool for optimizing large datasets.
Analogy: Z-ordering is like collecting statistics on the columns that you specify. If someone handed you a 100-page book, asked you to guess which page they were thinking of, and told you it was in the 60s, you’d start around page 65 and guess higher or lower from there. You wouldn’t start at page one and work up, because you already know roughly where that page should be. The statistics play the same role: they let you read far fewer pages (files) to find what you’re looking for.
Tips:
- Can be used on its own or in conjunction with partitioning for additional speed.
- By default, statistics for Z-ordering will only be collected on the first 32 columns in a Delta table.
- If you have a Delta table with more than 32 columns, make sure to manually specify the DDL so that the columns you care about are in the first 32 columns.
- Z-ordering effectiveness decreases with each column added; I wouldn’t advise Z-ordering on more than five columns and preferably keep it to about two to three.
- Z-ordering on too many things is like when everything is a priority, then nothing is a priority.
- If you later decide to Z-order on a column that is currently not being Z-ordered and sits outside the first 32 columns, you will need to upgrade the table’s column mapping property to name mode and then specify which columns to collect statistics on.
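A minimal sketch of the above (table and column names hypothetical): unlike partitioning, Z-ordering is applied via the OPTIMIZE command rather than at table creation, and it can sit on top of an existing partitioning scheme:

```sql
-- Hypothetical example: Z-order an existing Delta table on two high-cardinality
-- columns that are frequently used in filters. Keep the column list short --
-- effectiveness drops with each additional column.
OPTIMIZE events
ZORDER BY (user_id, session_id);

-- If the table is also partitioned, Z-ordering can be scoped to a partition:
OPTIMIZE events
WHERE event_date = '2024-07-29'
ZORDER BY (user_id, session_id);
```

The second form combines both strategies: partitioning prunes whole folders, while Z-ordering colocates related rows within each partition’s files.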
Liquid Clustering
Liquid clustering is the latest optimization technique, introduced at the Data + AI Summit in 2023. It replaces traditional partitioning and Z-ordering, simplifying data layout decisions while improving query performance. With liquid clustering, you can redefine your clustering keys on the fly without the headache of rewriting existing data, allowing your data layout to evolve as your analytic needs change.
Analogy: To be honest, I haven’t been able to really think of one, but I personally tend to think of it as a combination of the two approaches above, with extra math that makes it more efficient somehow. If anyone has one, please share!
Tips:
- The optimize operation for a liquid clustered table generally seems to take longer than a table that is partitioned.
- In addition, with a partitioned table you have explicit control over where OPTIMIZE runs by supplying a partition predicate, e.g. OPTIMIZE my_table WHERE partition_column = value.
- With liquid clustering, all you can do is run OPTIMIZE, which claims to be incremental but in my experience takes much longer than an optimize with a partition predicate on a similar-sized table (this is highly anecdotal and may not be anyone else’s reality).
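A minimal sketch (names hypothetical): liquid clustering is declared with CLUSTER BY in place of PARTITIONED BY or ZORDER BY, and the keys can be changed later without rewriting existing data:

```sql
-- Hypothetical example: a Delta table with liquid clustering.
CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  payload    STRING
)
USING DELTA
CLUSTER BY (event_date, user_id);

-- Clustering keys can be redefined later without a full rewrite; newly
-- written and optimized data is laid out by the new keys:
ALTER TABLE events CLUSTER BY (user_id);

-- Maintenance is a plain OPTIMIZE with no partition predicate or ZORDER clause:
OPTIMIZE events;
```

This is what makes liquid clustering flexible: the layout decision is no longer frozen at table creation the way a partition scheme is.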
Conclusion
When designing tables, make sure to select at least one of these strategies. Which one you choose is up to you, based on the business requirements of your data and the needs of your end users. The goal of this article is to make sure you are aware of all the options up front, before you build a table and then suffer from poor performance. Feel free to ask any questions!