dataxu’s journey from an Enterprise MPP database to a cloud-native data warehouse, Part 2
This is part 2 of a series of blog posts on dataxu’s efforts to build out a cloud-native data warehouse and what we learned in the process. See part 1 here.
Once we determined the architecture of the cloud-native data warehouse as discussed in part 1 of this series, we set out to select a query engine.
dataxu was an early participant in the preview of the Amazon Athena service, and once the service went GA, we became an early adopter. With Athena, our analytical users can run complex, interactive SQL queries over the data stored on S3 with excellent performance and cost efficiency, without having to manage a cluster. After months of testing, we adopted Athena as the default query engine in our cloud-native warehouse.
As with any SQL engine, it is critical to learn a few things about how it works in order to write performant SQL. AWS published an excellent blog post, Top 10 Performance Tuning Tips for Amazon Athena, that discusses this topic. To make things easier, here at dataxu we distilled those tips down to three key recommendations:
- Select only the columns you need, particularly if the underlying data is in a columnar format like Parquet or ORC. Never write “select *”.
- Always include a filter on the partition column in the where clause. A partitioned table is almost certainly a large table; without a filter on the partition column, the query will scan the full dataset (see the short example after this list).
- Mind the join order. In multi-table joins, order the tables from the most rows to the fewest, left to right.
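As a minimal illustration of the first two tips (the events table and its event_date partition column here are hypothetical), a query on a partitioned table could look like this; the join-order tip shows up in the full example further below:

```sql
-- Explicit columns plus a filter on the partition column:
-- Athena scans only the needed partitions and columns, not the whole table.
SELECT user_id, spend
FROM events
WHERE event_date BETWEEN DATE '2017-12-01' AND DATE '2017-12-31';
```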
We would also like to share a SQL rewrite technique that can significantly improve query performance and is most applicable to the star schema, a common approach to data organization.
Let’s start with a typical schema in ad tech, where we have a fact table of impressions and a few dimension tables, such as campaigns, creatives, and algorithms. A sample schema looks like the example below:
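A simplified DDL sketch of such a schema (the table and column names here are illustrative assumptions, not dataxu’s actual definitions) might be:

```sql
-- Fact table: one row per impression event, partitioned by date on S3.
CREATE EXTERNAL TABLE impressions (
    impression_id string,
    campaign_id   bigint,
    creative_id   bigint,
    algorithm_id  bigint,
    spend         double
)
PARTITIONED BY (created_date date)
STORED AS PARQUET
LOCATION 's3://example-bucket/warehouse/impressions/';

-- Small dimension (lookup) tables.
CREATE EXTERNAL TABLE campaigns (
    campaign_id     bigint,
    campaign_name   string,
    advertiser_name string
)
STORED AS PARQUET
LOCATION 's3://example-bucket/warehouse/campaigns/';

CREATE EXTERNAL TABLE creatives (
    creative_id   bigint,
    creative_type string
)
STORED AS PARQUET
LOCATION 's3://example-bucket/warehouse/creatives/';

CREATE EXTERNAL TABLE algorithms (
    algorithm_id   bigint,
    algorithm_type string
)
STORED AS PARQUET
LOCATION 's3://example-bucket/warehouse/algorithms/';
```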
Note that the impressions table records each individual event, amounting to well over hundreds of billions of records annually.
One query asks for a breakdown of the number of impressions and spend during the second half of 2017 by month, creative_type, algorithm_type, and advertiser_name.
Most of us would write SQL like the following:
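A sketch of such a query against the illustrative schema above (the CASE and regexp_extract expressions stand in for whatever classification logic the report needs) might look like this:

```sql
-- Join-GroupBy: join the huge fact table to the dimensions first, then aggregate.
SELECT date_trunc('month', i.created_date) AS month,
       CASE WHEN cr.creative_type IN ('video', 'rich_media')
            THEN 'rich' ELSE 'standard' END        AS creative_type,
       regexp_extract(a.algorithm_type, '^[a-z]+') AS algorithm_type,
       c.advertiser_name,
       count(*)     AS impression_count,
       sum(i.spend) AS total_spend
FROM impressions i
JOIN campaigns  c  ON i.campaign_id  = c.campaign_id
JOIN creatives  cr ON i.creative_id  = cr.creative_id
JOIN algorithms a  ON i.algorithm_id = a.algorithm_id
WHERE i.created_date BETWEEN DATE '2017-07-01' AND DATE '2017-12-31'
GROUP BY 1, 2, 3, 4;
```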
The query above selects only the columns it needs, has a filter on the partition column created_date, and orders the tables by number of rows. This is all well and good; however, with the amount of data in the impressions table, the query timed out on Athena, even after we raised the timeout setting from 30 minutes to 1 hour.
If you look at the query, it joins a massive fact table with three small dimension tables. Due to the sheer amount of data in impressions, the join is extremely expensive. I call this type of SQL the “Join-GroupBy” approach. However, if we rewrite the query to first run a GroupBy on just the fact table impressions over the join keys, then join with the dimensions, followed by another GroupBy if necessary, the amount of data involved in the join is significantly reduced. I call this rewritten approach “GroupBy-Join-GroupBy”.
The rewritten query looks like this:
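One possible form of the rewrite, again against the illustrative schema, is:

```sql
-- GroupBy-Join-GroupBy: aggregate the fact table down to its join keys first,
-- then join the much smaller intermediate result to the dimensions.
WITH fact_agg AS (
    SELECT date_trunc('month', created_date) AS month,
           campaign_id,
           creative_id,
           algorithm_id,
           count(*)   AS impression_count,
           sum(spend) AS total_spend
    FROM impressions
    WHERE created_date BETWEEN DATE '2017-07-01' AND DATE '2017-12-31'
    GROUP BY 1, 2, 3, 4
)
SELECT f.month,
       CASE WHEN cr.creative_type IN ('video', 'rich_media')
            THEN 'rich' ELSE 'standard' END        AS creative_type,
       regexp_extract(a.algorithm_type, '^[a-z]+') AS algorithm_type,
       c.advertiser_name,
       sum(f.impression_count) AS impression_count,
       sum(f.total_spend)      AS total_spend
FROM fact_agg f
JOIN campaigns  c  ON f.campaign_id  = c.campaign_id
JOIN creatives  cr ON f.creative_id  = cr.creative_id
JOIN algorithms a  ON f.algorithm_id = a.algorithm_id
GROUP BY 1, 2, 3, 4;
```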
This query finished on Athena in 15 minutes, a vast improvement over the 1-hour timeout we hit with the first approach.
Why the drastic difference? Simple. Grouping first significantly compresses the data fed into the join. In our specific example, the ratio of total rows in impressions to total rows after the GroupBy is about 3000:1, so the join is far smaller and Athena can complete it much faster.
Is GroupBy-Join-GroupBy always better than Join-GroupBy? Not necessarily.
GroupBy-Join-GroupBy involves an extra GroupBy compared to Join-GroupBy, so there must be a significant reduction in the dataset after the first GroupBy to offset that cost. The payoff will vary widely with your data and how it is organized, but as a rule of thumb, if the compression ratio is at least one order of magnitude, the extra GroupBy is usually worthwhile. It is particularly worthwhile if there is complex logic on the dimension columns, e.g., a case statement or regular expression on creative_type or algorithm_type as in the example above. The compression of the fact table reduces the number of invocations of the case statement and regular expression as well.
We have had similar experiences on PostgreSQL, so the technique is likely applicable to other SQL engines as well and, we believe, worth trying.
Next time, we will do a deep-dive on separation of metadata and data.
Please post your feedback in the comments: how do you handle complex SQL queries, and what are your techniques for improving performance? If you found this post useful, please feel free to give it some “applause” and share!