JSON on S3-Athena: an optimization experience

Swapna Kategaru
affinityanswers-tech
8 min read · Jun 1, 2022

In Data Engineering, querying your data is about much more than just writing a query. There is almost always a more systematic approach that returns the same results while improving performance and cost.

Query Optimization

Query optimization is the process of selecting an efficient execution plan for evaluating a query. Once the user submits a query, it is parsed and then passed to the query optimizer, which generates different execution plans for the parsed query and selects the plan with the least estimated cost. The goal of query optimization is to reduce the system resources required to execute a query and return the result set faster.

The following diagram shows the query engine’s process, known as the query execution plan, which begins when the user submits a query. Read more.

Query Execution Plan

Having a basic idea of the order in which query statements execute can be helpful.
We can also see detailed information on a query’s execution plan using the EXPLAIN and EXPLAIN ANALYZE statements. So, how do we optimize a query? Here are some ways (a short example follows the list):

  • Choosing the right aggregate and scalar functions for the query’s requirements.
  • Selecting only the required fields instead of * in the SELECT statement.
  • Avoiding unnecessary joins; if that is not enough, changing the table schema so that fewer joins are required.
  • Using (database) partitions to filter to the targeted data, which reduces the amount of data scanned per query.
  • Using subqueries where required; a nested query is a tool for performing operations in multiple steps, and writing a query without nesting can cause a massive intermediate data explosion.
  • Using JOIN, ORDER BY and GROUP BY clauses judiciously.
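
As a small illustration of the last few points, here is a rough sketch of a tuned Athena query; the database, table and column names below are made up for illustration. Prefixing the query with EXPLAIN ANALYZE prints the execution plan along with the actual cost of each stage:

-- Select only the required columns and filter on the partition column (dt),
-- so Athena scans just the targeted partition instead of the whole table.
-- EXPLAIN ANALYZE shows the distributed plan and how much data each stage handled.
EXPLAIN ANALYZE
SELECT event_type, count(*) AS events
FROM analytics.events_json
WHERE dt = '2022-05-01'
GROUP BY event_type;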

Problem Statement

We use Superset (a visualization tool) to analyze a dataset by querying AWS Athena (with the data stored in AWS S3). The query time most often exceeded the timeout set by Superset, so the charts on Superset did not get updated. Notwithstanding the frustration of the visualization tool not updating the charts (which our Data Analysts desperately wanted), the cost was also prohibitive, as Amazon Athena’s pricing is based on the bytes scanned.

We had to step back and think about why query performance was still poor despite the tuning, and this is where I learned that the problem was with how the data was stored in S3.

Data Compression

Our data in S3 is in JSON format and was uncompressed. Hence, our team decided to compress it (with gzip) so as to reduce the cost of queries by storing the S3 data in a compressed format. Gzip does not change the JSON structure itself; it compresses the raw bytes, and since the same keys repeat in every record, JSON compresses very well.
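
For reference, this is roughly what an Athena table over such data can look like; the database, table, columns and S3 path below are hypothetical. Athena’s JSON SerDe reads gzip-compressed (.gz) objects transparently, so the same table definition serves both the uncompressed and the compressed data:

-- Hypothetical external table over the JSON (plain or gzipped) objects in S3.
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events_json (
  event_id   string,
  event_type string,
  payload    string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/events/';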

Uncompressed JSON vs Compressed JSON

The following table shows a comparison of storage for uncompressed and compressed data. These metrics are based on one partition (our data has many partitions).

JSON storage -[ Uncompressed vs Compressed ]

Well, this time we were ecstatic to see that the data scanned by Athena was reduced, along with the cost. So the problem was solved, or so I thought. Well, not completely: surprisingly, Superset was still unable to load the visuals, with queries running for the maximum time available and ending in timed-out errors. What now? The query was tuned and the data was reformatted, so why were the queries still failing to fetch results? Again, the only way was to go back, re-check what more could be done, and understand how the data is laid out in S3 (as S3 is where the data gets queried from).

Optimum File size

Upon a little research, we came to know that there were numerous small files in S3 (more than 7000 files per partition). Could this cause a problem? It can, as an AWS blog states that “optimum file size is crucial for query performance” and also describes various performance tuning techniques for Athena. Coming to the optimum file size, it is now quite clear that our individual S3 files were very small, with little data in each, so the execution engine spends more time listing directories, opening and closing files, managing a huge amount of metadata and setting up to transfer results than actually reading the data.
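
One quick way to gauge how many objects back a single partition is Athena’s "$path" pseudo-column, which exposes the S3 object each row was read from; the table and partition value below are hypothetical:

-- Counting the distinct source objects for one partition reveals the
-- "many small files" problem without leaving Athena.
SELECT count(DISTINCT "$path") AS files_in_partition
FROM analytics.events_json
WHERE dt = '2022-05-01';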

Possible solution

The solution is to avoid storing small S3 files in huge numbers and instead have fewer files, with more data in each individual file. Not clear what this means? We simply need to combine the small files, reducing numerous small files to a few large ones, which makes query execution easier for Athena. Now, it is tempting to assume that ever-larger files keep improving performance, and the answer is NO: files should be neither too small nor too large, but of an optimum size at which they do not degrade performance.

Attempt to merge files

We planned to merge the many S3 files into fewer files. I started by downloading the files locally and merging the JSON from ‘n’ files into a single file. At the same time, it is equally important to check that the merged data returns the same results as the original data. Downloading a bunch of files to a local machine quickly becomes tedious: it is a storage issue, and managing the files ourselves without manual errors is hard.

CTAS Query

At this point, I came across a better alternative that makes this process much easier. S3 files can be combined using a CTAS query, which creates a new table in Athena from the results of a SELECT statement on another table. Athena stores the data files created by the CTAS statement in a specified S3 location.

Combining S3 files using CTAS query

Let’s take a look at how this CTAS query works.
Assume the S3 data is partitioned by the date it was collected, and each date has many small files that we want to combine using a CTAS query. Say we have ’n’ date partitions and we want the merged data to keep the same partitioning. We can run a SELECT statement per partition date and merge that partition’s files. Now, let’s walk through one partition date: this partition contains 2000 files, and there is already a table created for it in Athena.

CTAS Query -[CREATE TABLE AS SELECT]

We use a SELECT statement to select all of the data from one date partition, specifying the condition through a WHERE clause. This retrieves the data from all of the files in that partition. The selected data is combined and stored in a new external location that we specify, in a format of our choice. Do you think we now have a single file in this new location, since the query merged the data? No, it is not one single file: by default the query distributed the merged data across 30 files (the number of files can also differ based on the size of the data).
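
For illustration, a per-partition merge can be expressed with a CTAS statement of roughly this shape; the database, table, column names and S3 path are hypothetical, and write_compression keeps the output as gzip-compressed JSON:

-- Merge one date partition's many small JSON objects into a few larger ones.
-- Athena writes the resulting files to the external_location we specify.
CREATE TABLE analytics.events_merged_2022_05_01
WITH (
  external_location = 's3://my-bucket/events-merged/dt=2022-05-01/',
  format = 'JSON',
  write_compression = 'GZIP'
) AS
SELECT event_id, event_type, payload
FROM analytics.events_json
WHERE dt = '2022-05-01';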

Overview on solution and performance

Our data, previously spread across many small files, got merged by this query, and we fulfilled the important requirement called optimum file size for the best query performance. Voila! The most interesting part is that the merged data consists of very few files, and the storage required for the combined data is reduced by 90%. Hence, as the data shrank enormously, Athena can read the S3 files faster and the data scanned per query is also much smaller.

Let’s test performance with a simple query that counts the number of records in the data:

SQL Query on -[More vs Less objects]
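
For reference, the test query had roughly this shape and was run once against the original table (many small objects) and once against the merged table (few larger objects); the table names are hypothetical:

-- Same query against both tables: only the S3 layout underneath differs.
SELECT count(*) AS record_count
FROM analytics.events_merged;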

The following table shows a comparison of performance metrics for the above queries:

Query performance -[ Numerous vs Few files]

Our data has 16 date-based partitions, and each partition date had >10k files. See the table below for the object counts before and after the files were combined:

Original vs Combined objects count

Everything was now working as expected. Finally, our Data Analysts were no longer frustrated at not being able to see the data on Superset, and we in Data Engineering were equally happy that the AWS cost needed to make things this fast was a fraction of what it had been when it was slow.

Different formats experimented

The data is now in compressed JSON format. As a final check, I experimented with converting this compressed JSON to Parquet format to see the difference. Parquet consumes slightly less space than compressed JSON, though not by a considerable amount. In terms of storage across formats, our data ranks as json > json.gz > parquet.

Storage -[json vs json.gz vs parquet]
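
The conversion experiment can be sketched with the same CTAS approach, this time writing Parquet; the names and S3 path are again hypothetical:

-- Re-materialize the merged data as Parquet to compare storage and data scanned.
CREATE TABLE analytics.events_parquet
WITH (
  external_location = 's3://my-bucket/events-parquet/',
  format = 'PARQUET'
) AS
SELECT event_id, event_type, payload
FROM analytics.events_merged;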

Great, the takeaway from this experience is that there is always a different problem, and it is we who need to find the solution and fix it accordingly, as different aspects may be involved and every case differs. So it is important to follow best practices from the moment we start working on something, whether it is storing data, querying data or anything else for that matter. Otherwise, the problem may grow severe enough that it itself forces us to find a solution and resolve it.

Bonus Read

Well done on getting this far!

Not yet clear how the case discussed in this blog can help you?
I’ll take you through an example case as practice.

Let’s consider a hypothetical case derived from our situation. As you know, in the first place we compressed the JSON files, resulting in gzip-compressed JSON, which reduces the file size from x bytes to (x - y) bytes, where ‘y’ is the number of bytes saved; less data gets scanned, which in turn incurs less cost. This makes us think we can simply follow the same approach anywhere. So, you want to compress your data too and expect a reduced file size.

// Compressing a normal file - [reduces file size]
x bytes >> (x - y) bytes
// Compressing a very small file - [increases file size]
x bytes >> (x + y) bytes

But it turns out that your file size increased from x to (x + y) bytes after compression. Why do you think that is (give it a guess before you continue)? Because the file may be too small to benefit from compression: for tiny files, the compression format’s own overhead (headers and bookkeeping) can outweigh the savings. Maybe it is better to merge the data files together first and then compress, or to do something else that suits your situation. Try compressing a very small file and you will see it results in an increased size.
That’s how the problem turns out differently, in an unexpected way, for each case, and needs some observation to arrive at a problem-specific solution.
