Enhancing BigQuery Performance with Optimized Queries

Learn how to boost your query performance on BigQuery

Kevin Prawira
Blibli.com Tech Blog
6 min read · May 17, 2024


For Data Analysts, queries are an essential tool for extracting the necessary data for analysis and generating valuable insights. However, we often overlook query performance because we’ve already obtained the correct data. Nevertheless, slow queries can hinder efficiency and increase computing costs. Therefore, a crucial question arises: how can we generate the right data through the most efficient process?

The best solution to overcome slow queries is optimizing how we write them. By evaluating and refining the provided instructions, query performance can be significantly improved. In this article series, we will discuss various techniques and strategies for optimizing your BigQuery queries, enabling you to generate the required data in the fastest and most efficient manner.

Late Aggregation or Early Aggregation?

A fundamental principle before starting any task is to ensure that all necessary resources are readily available. For example, when combining data from two distinct populations, make sure the data being merged is as small as it can be, which means aggregating each population first. Avoid merging the raw data and then aggregating it, since the join would have to compute every possible combination of rows beforehand, inflating the amount of work.

Let’s practice by solving this problem,

You want to calculate the total duration spent per day on two platforms, say Android and iOS, where you have historical tables like this:

What is the most suitable optimization approach?

A. Full outer join these two tables, so that I have one table with event_date , user_id , android_dur , ios_dur. Then I just aggregate the durations by event_date

B. Calculate the sums for each table first, Android total time per day and iOS total time per day, by aggregating by event_date. Then just join those two aggregated tables to get the final output.

Approach A is far more computationally expensive than B, since we have to join all of our records to compute one big table before the aggregation. This can be improved by aggregating the metrics on each platform first and then joining the results; in other words, we prepare our join pieces by aggregating them first. That's why in this case we use early aggregation: to avoid joining big tables.

But early aggregation is not always the answer; sometimes late aggregation is the better choice because it avoids redundant aggregation work. It just depends on your case, so be wise 😃
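Approach B could be sketched like this. The table and column names (android_events, ios_events, duration_sec) are assumptions for illustration, not the actual schema:

```sql
-- Early aggregation: collapse each platform to one row per day first,
-- then join the two small aggregated tables instead of the raw ones.
WITH android_daily AS (
  SELECT event_date, SUM(duration_sec) AS android_dur
  FROM android_events
  GROUP BY event_date
),
ios_daily AS (
  SELECT event_date, SUM(duration_sec) AS ios_dur
  FROM ios_events
  GROUP BY event_date
)
SELECT
  COALESCE(a.event_date, i.event_date) AS event_date,
  COALESCE(a.android_dur, 0) + COALESCE(i.ios_dur, 0) AS total_dur
FROM android_daily a
FULL OUTER JOIN ios_daily i
  ON a.event_date = i.event_date;
```

Note that the FULL OUTER JOIN now only touches one row per day per platform, rather than one row per user event.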

Efficiency on JOIN clause

If you often work with a relational database, you must be familiar with the JOIN clause, which lets us combine data from multiple tables. The problem is that we often ignore the efficiency of our JOIN and focus only on its output. To minimize that, we have to remember two things when using JOIN: (1) consider the join table order and (2) filter before the JOIN.

Considering the JOIN table order means placing your largest table first, as the left side of the JOIN, followed by the smallest, and then the remaining tables by decreasing size. The standard SQL query optimizer can determine which table should be on which side of the join, but it is still recommended to order your joined tables appropriately.

The filter-before-JOIN concept remains the same as in our previous discussion: make sure each table is as small as possible before it gets executed, so the computation on it stays as simple as possible. It's fine to use a common table expression (CTE) as well, to keep your query well documented and efficient.

For example, you want to check the age of your customers who placed an order in January, using these two tables: order and user

How would you write the query?

A. Left join order to user by user_id, then just filter order_date to return only the transactions in January

B. Filter order to get only the transactions in January, then left join that filtered table to user by user_id

Most people would probably pick option A for the sake of simplicity, right? No query to prepare, just a straight left join and then a filter. Computationally, however, this is a greedy approach, since we join all of the order data, including months we don't actually need. Wasteful.

The better option is to prepare a CTE or subquery first that returns only the order transactions that happened in January, which is all we need, and then left join it to get the information we want. A bit more effort, but the result is much neater and computationally cheaper 👍
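Option B could look like the sketch below. The column names (order_id, order_date, user_id, age) and the date literals are assumptions for illustration; order and user are wrapped in backticks because they are reserved words in BigQuery:

```sql
-- Filter before JOIN: shrink `order` down to January rows first,
-- then join the small filtered set to `user`.
WITH january_orders AS (
  SELECT order_id, user_id, order_date
  FROM `order`
  WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
)
SELECT
  o.order_id,
  o.order_date,
  u.age
FROM january_orders o
LEFT JOIN `user` u
  ON o.user_id = u.user_id;
```

The CTE also documents the intent: anyone reading the query can see that only January's orders ever reach the join.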

Expression Orders

It was quite surprising for me to learn this method at first, because it's actually very easy to do, yet I've seen people (including me 😅) take it for granted. It's about the expression order, which BigQuery does care about. BigQuery executes the expressions in your WHERE clause sequentially, assuming the order you wrote is already the best one. You should put the strictest filter first, followed by the less strict ones, and so on. This lets BigQuery find the data faster, since it can discard irrelevant rows early and apply the remaining filters to the subset only.

Imagine you have data about what a user does in their sessions, with the columns datetime, user_id, session_id, event_name, and label.

You want to find all data from that user where they did select_promotion with a label starting with the word “banner”.

What is the better approach to query this?

A. SELECT * FROM event WHERE label LIKE 'banner%' AND event_name = 'select_promotion'

B. SELECT * FROM event WHERE event_name = 'select_promotion' AND label LIKE 'banner%'

Yes, answer B is the better one since it applies the right expression order. The LIKE filter has to evaluate the string contents of every record it sees, so we place it last and put the cheap equality filter first to decrease the number of records scanned by the LIKE clause.

It's better to search for labels starting with “banner” among the select_promotion events only (2 records) than to search for labels starting with “banner” across all records (6 records) and only then select the select_promotion events, isn't it?

This logic is also useful when you use exact-match filters only. Let's say you want to capture records for users who viewed banner_3. Since label is more specific than event_name, you can use the filter label = 'banner_3' AND event_name = 'view_promotion' rather than the opposite: search for the view_promotion event among records that have label = 'banner_3' (1 record) rather than searching for label = 'banner_3' among all view_promotion events (4 records).
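The exact-match version of this ordering could be written as below, using the event table and sample values from the scenario above:

```sql
-- Most selective filter first: far fewer rows carry label = 'banner_3'
-- than carry event_name = 'view_promotion', so the second predicate
-- only runs against the small remainder.
SELECT *
FROM event
WHERE label = 'banner_3'                -- most specific filter first
  AND event_name = 'view_promotion';   -- broader filter last
```

The same data comes back either way; only the amount of evaluation work differs.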

Blibli Implementation

As a fast-paced e-commerce platform, and to ensure continued scalability and cost-efficiency, Blibli embarked on a journey to optimize our queries. By leveraging optimization methods like the ones discussed above, we were able to reduce our query slot-seconds by up to 80%!

This optimization not only saved money but also improved the performance of data analytics, allowing Blibli to gain deeper insights from our data faster with the power of proactive optimization strategies.

If you want to learn more about how to manage and monitor your BigQuery slot usage and consumption, you can refer to our other article as well! Thank you! See you in another article! 😁 👋
