How to improve the performance of BigQuery queries by optimizing the schema of your tables

Three tips to improve BigQuery performance at the storage level: Nested fields, Geography types, and Clustering

Lak Lakshmanan
Google Cloud - Community
6 min read · Sep 21, 2019


In this article, I take a real table and change its schema in a lossless way so as to improve the performance of queries on that table.

Optimize how your data are stored to achieve better query performance. Photo by Annie Spratt on Unsplash

Queries to optimize

To show that a schema change is actually an improvement, we have to measure query performance on realistic queries against a real-world dataset. I will use observations of inhalable Particulate Matter (PM10) from the United States Environmental Protection Agency (EPA). The EPA PM10 hourly dataset is available as part of the BigQuery public dataset program.

This is originally a flattened table: basically, there is a row for every hourly observation. The dataset is relatively small (1.4 GB, 40 million rows), so the queries fit comfortably within the free monthly quota (1 TB). Because it is a small table, though, the improvements won’t be as dramatic as they would be on larger tables.

Let’s say that we want to find how many instruments per county recorded PM10 observations in 2017. The query is:
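A sketch of such a query, assuming the standard columns of the public pm10_hourly_summary table (I’m treating each distinct site + POC combination as one instrument; adjust the key to your own definition):

SELECT
  state_name,
  county_name,
  -- an "instrument" here is a distinct site + parameter-occurrence (POC) combination
  COUNT(DISTINCT CONCAT(CAST(state_code AS STRING), '-', CAST(county_code AS STRING), '-',
                        CAST(site_num AS STRING), '-', CAST(poc AS STRING))) AS num_instruments
FROM `bigquery-public-data.epa_historical_air_quality.pm10_hourly_summary`
WHERE EXTRACT(YEAR FROM date_local) = 2017
GROUP BY state_name, county_name
ORDER BY num_instruments DESC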

This query took 2.4 sec and processed 1.3 GB.

For the second query, let’s say that we want to find the maximum PM10 reading in the City of Columbus, Ohio year-by-year. The city polygons are in another public dataset, and so we will join them:
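A sketch of that join; the city-polygon table used here (bigquery-public-data.utility_us.us_cities_area) and its name, state_code, and city_geom columns are assumptions, so substitute whatever polygon source you prefer:

WITH columbus AS (
  -- city polygon; table and column names are assumptions
  SELECT city_geom AS geom
  FROM `bigquery-public-data.utility_us.us_cities_area`
  WHERE name LIKE 'Columbus%' AND state_code = 'OH'
)
SELECT
  EXTRACT(YEAR FROM date_local) AS year,
  MAX(sample_measurement) AS max_pm10
FROM `bigquery-public-data.epa_historical_air_quality.pm10_hourly_summary`, columbus
WHERE ST_Within(ST_GeogPoint(longitude, latitude), columbus.geom)
GROUP BY year
ORDER BY year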

This took about 4 minutes, processed 1.4 GB, and yielded PM10 readings in Columbus over the years.

These are the two queries I’ll use to demonstrate the optimizations.

Tip #1: Use Nested fields

The EPA hourly data is stored in a table in which each row is a single hourly observation. This means that there is a lot of repeated data about the stations, instruments, and so on. Let’s combine all the observations made on a single day by the same sensor into an array (see the ARRAY_AGG below) and write the result to a new table (create a new dataset named advdata first):
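A sketch of the transformation; the destination name advdata.pm10_nested and the exact set of per-day grouping keys are my own choices:

CREATE OR REPLACE TABLE advdata.pm10_nested AS
SELECT
  state_name, county_name, state_code, county_code, site_num, poc,
  latitude, longitude, date_local,
  -- pack all the hourly readings for one sensor on one day into a single array
  ARRAY_AGG(STRUCT(time_local, sample_measurement, units_of_measure)
            ORDER BY time_local) AS obs
FROM `bigquery-public-data.epa_historical_air_quality.pm10_hourly_summary`
GROUP BY state_name, county_name, state_code, county_code, site_num, poc,
         latitude, longitude, date_local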

The new table has far fewer rows (1.7 million, about 1/24 of the original 40 million) but it is still about 1.41 GB because we have not lost any data! The difference is simply that the hourly observed values are now stored as arrays within each daily row.

Querying for instruments by county is now:
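Against the nested table sketched above, that query might look like this:

SELECT
  state_name,
  county_name,
  COUNT(DISTINCT CONCAT(CAST(state_code AS STRING), '-', CAST(county_code AS STRING), '-',
                        CAST(site_num AS STRING), '-', CAST(poc AS STRING))) AS num_instruments
FROM advdata.pm10_nested   -- the obs array is never referenced, so it is not scanned
WHERE EXTRACT(YEAR FROM date_local) = 2017
GROUP BY state_name, county_name
ORDER BY num_instruments DESC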

The query now takes 0.7 sec (3x faster) and processes only 56 MB (24x cheaper). Why is it less expensive? Because there are 24x fewer rows (remember that we aggregated the hourly measurements into a single daily row) and the query never touches the array of measurements, so the table scan has about 24x less data to read. The query is faster because it needs to process fewer rows.

But what if you really need to process hourly data? Since the transformation to use arrays is not lossy, we can still query for the maximum hourly PM10 observations over the years in Columbus. That query now requires an UNNEST in the FROM clause but is otherwise identical:
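Continuing with the sketched nested table and the assumed city-polygon source:

WITH columbus AS (
  SELECT city_geom AS geom
  FROM `bigquery-public-data.utility_us.us_cities_area`  -- assumed polygon source, as before
  WHERE name LIKE 'Columbus%' AND state_code = 'OH'
)
SELECT
  EXTRACT(YEAR FROM date_local) AS year,
  MAX(o.sample_measurement) AS max_pm10
FROM advdata.pm10_nested AS d, UNNEST(d.obs) AS o, columbus
WHERE ST_Within(ST_GeogPoint(d.longitude, d.latitude), columbus.geom)
GROUP BY year
ORDER BY year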

This query still takes about 4 minutes, but it processes only 537 MB. In other words, storing the data as nested fields (arrays) has made the query 3x less expensive! This is curious: why does the amount of data read go down? Because the per-station columns that the query reads (date, latitude, longitude) are now stored once per day instead of once per hour, so they are about 24x smaller; only the hourly measurements themselves are still read in full. The computations (MAX, extract year, ST_Within), however, make up the bulk of the work in this query, and they are carried out on the same number of observations, so the query speed doesn’t change.

Tip #2: Geography Types

Can we improve the computation through storing the data better? Yes!

Instead of constructing the geographic point from the longitude and latitude each time, it is better to store the location as a GEOGRAPHY type. The reason is that creating a geographic point with ST_GeogPoint() is a somewhat expensive operation that involves finding the S2 cell that contains the point (it is even more expensive if you are creating more complex shapes such as polygons):
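A sketch, again with a placeholder table name:

CREATE OR REPLACE TABLE advdata.pm10_nested_geo AS
SELECT
  * EXCEPT(longitude, latitude),
  -- compute the GEOGRAPHY once at write time rather than in every query
  ST_GeogPoint(longitude, latitude) AS location
FROM advdata.pm10_nested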

The first query is the same because we don’t use latitude and longitude in the query. The second query can now avoid creating a ST_GeogPoint:
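Something along these lines:

WITH columbus AS (
  SELECT city_geom AS geom
  FROM `bigquery-public-data.utility_us.us_cities_area`  -- assumed polygon source
  WHERE name LIKE 'Columbus%' AND state_code = 'OH'
)
SELECT
  EXTRACT(YEAR FROM date_local) AS year,
  MAX(o.sample_measurement) AS max_pm10
FROM advdata.pm10_nested_geo AS d, UNNEST(d.obs) AS o, columbus
WHERE ST_Within(d.location, columbus.geom)  -- no ST_GeogPoint call any more
GROUP BY year
ORDER BY year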

It takes 3.5 minutes and processes 576 MB, i.e., about 6% more data (a GEOGRAPHY point takes more storage than two floats would) in exchange for a 12.5% speedup in query time.

Tip #3: Clustering

Notice that the expensive query filters by location (the ST_Within test). What if we ask BigQuery to store the table in such a way that rows whose locations are near each other are held in adjacent blocks? That way, if a query filters on location, BigQuery doesn’t have to treat the whole table uniformly; it can concentrate its work on the relevant parts of the table. That is what clustering does.

Clustering currently requires a partitioned table, and you can only create about 2000 partitions, so I decided to partition by a dummy date field (this is okay because it is the clustering, not the partitioning, that we actually care about) and cluster by the location:
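A sketch of that DDL, assuming we cluster on the stored geography column (table names are placeholders as before):

CREATE OR REPLACE TABLE advdata.pm10_nested_geo_clustered
PARTITION BY fake_date   -- a single dummy partition, just to enable clustering
CLUSTER BY location      -- keep rows with nearby locations in the same blocks
AS
SELECT
  DATE '2000-01-01' AS fake_date,
  *
FROM advdata.pm10_nested_geo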

Now, let’s take the first query:
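That is, the same instruments-per-county query as before, just pointed at the clustered copy:

SELECT
  state_name,
  county_name,
  COUNT(DISTINCT CONCAT(CAST(state_code AS STRING), '-', CAST(county_code AS STRING), '-',
                        CAST(site_num AS STRING), '-', CAST(poc AS STRING))) AS num_instruments
FROM advdata.pm10_nested_geo_clustered  -- only the table name changes
WHERE EXTRACT(YEAR FROM date_local) = 2017
GROUP BY state_name, county_name
ORDER BY num_instruments DESC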

0.8 seconds, 43 MB! No difference.

How about the second?
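Again the same query as in the previous tip, now reading from the clustered copy:

WITH columbus AS (
  SELECT city_geom AS geom
  FROM `bigquery-public-data.utility_us.us_cities_area`  -- assumed polygon source
  WHERE name LIKE 'Columbus%' AND state_code = 'OH'
)
SELECT
  EXTRACT(YEAR FROM date_local) AS year,
  MAX(o.sample_measurement) AS max_pm10
FROM advdata.pm10_nested_geo_clustered AS d, UNNEST(d.obs) AS o, columbus
WHERE ST_Within(d.location, columbus.geom)
GROUP BY year
ORDER BY year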

The query now takes just 20 seconds and processes 576.4 MB, a 10x speedup. This is because we clustered the table by the location and this query filters by location: observations from the same neighborhood are now stored next to each other, which allows BigQuery to organize the spatial join much more efficiently.

Enjoy!
