Serverless Analytics, Part 5: Query Optimization

JV Roig
6 min read · Nov 7, 2022


This is Part 5 of a 5-part series about getting started with serverless analytics. In case you missed them, you can catch up on the earlier parts of the series through the links at the end of this article.

The basic serverless analytics architecture in this series can help you jumpstart your analytics capability. It gives you the ability to gain insights from terabytes of collected data without a huge opex commitment (i.e., without paying a huge monthly bill for the analytics solution). This is especially useful while you are still experimenting and getting started, and it remains useful even as you have to deal with terabyte-level data sizes.

Optimizing your Athena queries

After the first 4 parts of this series, we now get to Athena optimization — how to make your queries faster and cheaper.

There are far more Athena optimization options than I can explain in this one short article. But since this series is aimed at those just getting into serverless in general (meaning you're just starting to get to know the whole serverless paradigm), or into serverless analytics in particular, we'll focus on what I believe are the most relevant optimization tips. Hopefully, you'll find that these few tips are relatively easy to implement for your own use case, yet high-impact.

Optimization #1: Always partition your data!

This should sound familiar to you. We talked about partitioning data in Part 2.

This is by far the most critical and most impactful optimization step.

Sure, you can use Athena to query an S3 bucket without bothering with partitioning. And if your bucket only contains a few GBs of data, that's probably neither expensive nor slow. So it's seductive, but dangerous. Without any sort of partitioning on your data, your analytics solution will not scale. It'll work fine in your sandbox with limited data, but the moment it gets to production with 10x more users and 100x more data, you will have very unhappy users, and a very unhappy Finance Dept once they get the monthly bill.

For most generic business data, you can start out with a typical year/month/day partition strategy. That’ll provide sufficient partitioning for date-based queries against your data. When your serverless analytics solution gets traction, and you get more information about other non-date-based queries that your users execute, you can then refine your partitioning strategy for that particular set of data.

Remember that your partitioning strategy doesn't have to be one-size-fits-all. For example, if you collect three different sets of data that belong to three different departments, each of those sets of data can have its own partitioning strategy. To illustrate, here are some possible scenarios (with a sample table definition after the list):

  • You have log data from IT, and you partition it by system/year/month/day to allow your IT users to execute date-based analytical queries per independent system that produces those logs.
  • You have order data from Sales, and you partition it by year/month/day/branch to allow Sales executives and analysts to execute date-based queries against historical sales data to look at trends and patterns — and sometimes these are across all branches, while at other times these are queries for specific branches.
  • You collect IoT sensor data from a collection of IoT devices scattered across your smart building, and you partition it by device_id/date so the building administration team can examine IoT sensor data per device over time.
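To make the Sales scenario above concrete, here's a rough sketch of what the table definition and a partition-pruned query might look like in Athena. The table, column, and bucket names are made up for illustration:

CREATE EXTERNAL TABLE sales_orders (
  order_id string,
  customer_id string,
  amount double
)
PARTITIONED BY (year string, month string, day string, branch string)
STORED AS PARQUET
LOCATION 's3://my-analytics-bucket/processed/sales_orders/';

-- After creating the table, register its partitions (e.g., MSCK REPAIR TABLE sales_orders;).
-- Because this query filters on partition columns, Athena only scans the matching
-- S3 prefixes instead of the entire dataset:
SELECT branch, SUM(amount) AS total_sales
FROM sales_orders
WHERE year = '2022' AND month = '11'
GROUP BY branch;

Because the WHERE clause touches only the needed partitions, both query time and per-query scanning cost drop dramatically compared to scanning the whole bucket.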

Optimization #2: Compressed and splittable!

Athena queries are most efficient when they operate on splittable compressed files.

Splittable means Athena can break a file into chunks and process them in parallel across multiple workers, instead of forcing a single reader to crawl through one giant file. On top of that, a columnar format like Parquet lets Athena read only the specific columns a query needs, skipping the rest of the file. You get lower Athena scanning costs and faster processing, because unnecessary reads are skipped.

Compressed means exactly as you would expect — the data files are compressed so as to save disk space, which lessens our data storage costs (S3) and query scanning costs (Athena). We aren’t after maximum compression, of course, otherwise the needed decompression will slow down our query processing. Instead, we want just enough compression with minimal decompression overhead.

For these reasons, if you are still unsure (or have no strong preference or need for any specific format), you can just happily default to Parquet + Snappy compression as the output format of your ETL jobs. The combo of Parquet + Snappy gives you the benefits of compressed and splittable files.
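If your transformation step happens inside Athena itself (rather than in a separate ETL service), a CTAS statement is one simple way to produce Parquet + Snappy output. This is just a sketch with made-up table, column, and bucket names:

-- Convert a raw table into partitioned Parquet files compressed with Snappy.
-- Partition columns must come last in the SELECT, in the same order as partitioned_by.
CREATE TABLE sales_orders_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://my-analytics-bucket/processed/sales_orders_parquet/',
  partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT order_id, customer_id, amount, year, month, day
FROM raw_sales_orders;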

Optimization #3: Query tuning

If your data is partitioned properly and transformed into an efficient splittable+compressed format, you can also consider some easy query optimization tips.

First: Whenever you have queries that sort results with an ORDER BY clause, add a LIMIT clause if you can. For example, if you only really need the top 1M results while your data has tens of millions of rows or more, omitting the LIMIT clause simply wastes time and money.
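A quick sketch, reusing the hypothetical sales_orders table from earlier:

-- The sort still happens, but only the top 1M rows are returned and materialized,
-- instead of the entire sorted result set.
SELECT customer_id, SUM(amount) AS total_spent
FROM sales_orders
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 1000000;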

Second: Optimize your joins by specifying the larger table on the left side, smaller table on the right side (i.e., “…FROM big_table, small_table WHERE big_table.primary_key = small_table.primary_key”). This matters because Athena distributes the table on the right to worker nodes. You want that to be the smaller table, so that less memory is used and the query can run faster. If you are operating on three or more tables, try to join the largest table with the smallest one first (smallest one still on the right), so that you cut down the largest table as much as possible before it gets joined with the rest of your tables. (If you are a Redshift user, then you are already familiar with data distribution to workers during query execution; this is just like that. Data distribution is a key part of accelerating Redshift queries, and understanding how it works in Athena likewise helps you speed it up.)
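Here's a sketch of the three-table case, using the big_table/small_table naming from above plus a hypothetical mid_table:

-- Largest table on the left, smaller tables on the right, so the smaller tables
-- are the ones distributed to the worker nodes. big_table is joined with the
-- smallest table first, keeping the intermediate result as small as possible
-- before the next join.
SELECT b.order_id, s.branch_name, m.product_name
FROM big_table b
JOIN small_table s ON b.branch_id = s.branch_id
JOIN mid_table m ON b.product_id = m.product_id;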

Third: Use approximate aggregation functions whenever you can, instead of the exact aggregation function — for example, approx_distinct(my_data) instead of count(distinct my_data). If you absolutely need the exact count, then of course this won't apply, but when your queries only need ballpark figures and you can tolerate a standard error of around 2.3%, approximate aggregation functions can give you a modest but significant performance speedup. 20–30% faster queries would not be surprising here. For more info on the approximate aggregation functions you can use, see this page from the PrestoDB documentation.
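For example, counting unique visitors in a hypothetical web_logs table:

-- Exact distinct count: accurate, but slower and more memory-hungry.
SELECT count(DISTINCT visitor_id) FROM web_logs;

-- Approximate distinct count: roughly 2.3% standard error, noticeably faster.
SELECT approx_distinct(visitor_id) FROM web_logs;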

Fourth: Don’t do “SELECT * …”; be specific about the columns that you need. Remember that since our processed data files are ideally in a compressed+splittable columnar format, Athena can reduce the amount of data being processed by simply ignoring columns in the data files that aren’t necessary to the query. Avoid defaulting to “SELECT * …” whenever you can, and be very specific about the columns that you want.
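For example, against the hypothetical sales_orders table:

-- Reads every column stored in the Parquet files, even ones the report never uses.
SELECT * FROM sales_orders WHERE year = '2022' AND month = '11';

-- Reads only the three columns actually needed, so Athena scans (and bills) less data.
SELECT order_id, branch, amount FROM sales_orders WHERE year = '2022' AND month = '11';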

Optimization #4: UNLOAD

This last tip doesn’t require changes to your actual query. Instead, you just have to wrap your query inside an UNLOAD statement. For example, instead of executing a query like this:

SELECT col1, col2, col3, col4 FROM my_table LIMIT 20000000

… we can do this:

UNLOAD (SELECT col1, col2, col3, col4 FROM my_table LIMIT 20000000) TO 's3://mybucket/myfolder/' WITH (format = 'PARQUET')

What happens when you use UNLOAD is that Athena splits the result into multiple files in S3, instead of writing one giant file. Writing that one giant file (say, tens of terabytes in size) can take significant time, and UNLOAD can make it faster by effectively parallelizing the writing of the output.

UNLOAD does have some caveats, such as requiring the S3 destination location to be empty (in the example above, this means myfolder must not contain any data); otherwise UNLOAD will not write to it. There's also a maximum of 100 partitions when writing partitioned output. When you are ready to try UNLOAD, take a quick look at its documentation page to be sure you are covered.

Wrap up

Congratulations! If you’ve reached this far, you not only have your own cheap and scalable serverless analytics solution to crunch through terabytes of data, you now also understand how to optimize the queries that you and your users need to make.

If you’ve only been following along so far, but haven’t attempted hands-on trials in your own sandbox, well, now’s the best time to do so. With this series complete, you can easily go and review each part as needed while you are building your own serverless analytics solution in your sandbox.

ICYMI, here are the other parts of the Serverless Analytics series:

If you liked this article, please follow me to receive notifications for new articles, and click the clap button a few times so that the Medium algorithm will know to give you more of this type of content. Thanks!

