How to improve the performance of BigQuery queries by optimizing the schema of your tables

Three tips to improve BigQuery performance at the storage level: Nested fields, Geography types, and Clustering

Published in

Google Cloud - Community

6 min readSep 21, 2019

In this article, I take a real table and change its schema in a lossless way so as to improve the performance of queries on that table.

Optimize how your data are stored to achieve better query performance. Photo by Annie Spratt on Unsplash

Queries to optimize

To illustrate that the table schema is improved, we have to measure the performance on realistic queries on real-world datasets. I will use observations of inhalable Particulate Matter (PM10) from the United States Environmental Protection Agency (EPA). The EPA PM10 hourly dataset is available as part of the BigQuery public dataset program.

This is originally a flattened table — basically, there is a row for every hourly observation. The dataset is relatively small (1.4 GB, 40m rows), so the queries should fit very well within the free monthly quota (1 TB). Because it is a small table, though, the improvements won’t be as dramatic as they would have been on larger tables.

Let’s say that we want to find how many instruments per county we have PM10 observations in 2017. The query is:

SELECT
 pm10.county_name,
 COUNT(DISTINCT pm10.site_num) AS num_instruments
FROM 
  `bigquery-public-data`.epa_historical_air_quality.pm10_hourly_summary as pm10
WHERE 
  EXTRACT(YEAR from pm10.date_local) = 2017 AND
  pm10.state_name = 'Ohio'
GROUP BY pm10.county_name

This query took 2.4 sec and processed 1.3 GB.

For the second query, let’s say that we want to find the maximum PM10 reading in the City of Columbus, Ohio year-by-year. The city polygons are in another public dataset, and so we will join them:

SELECT
  MIN(EXTRACT(YEAR from pm10.date_local)) AS year
  , MAX(pm10.sample_measurement) AS PM10
FROM 
  `bigquery-public-data`.epa_historical_air_quality.pm10_hourly_summary as pm10
CROSS JOIN
  `bigquery-public-data`.utility_us.us_cities_area as city
WHERE
  pm10.state_name = 'Ohio' AND
  city.name = 'Columbus, OH' AND
  ST_Within( ST_GeogPoint(pm10.longitude, pm10.latitude), 
             city.city_geom )
GROUP BY EXTRACT(YEAR from pm10.date_local)
ORDER BY year ASC

This took about 4 minutes, processed 1.4 GB, and yielded PM10 readings in Columbus over the years.

These are the two queries I’ll use to demonstrate the optimizations.

Tip #1: Use Nested fields

The EPA hourly data is in a table each of whose rows is an hourly observation. This means that there’s now a lot of repeated data about stations, etc. Let’s combine all the observations on a single day from the same sensor into an array (see the ARRAY_AGG below) and write this to a new table (make a new dataset named advdata first):

CREATE OR REPLACE TABLE advdata.epa ASSELECT
  state_code
  , county_code
  , site_num
  , parameter_code
  , poc
  , MIN(latitude) as latitude
  , MIN(longitude) as longitude
  , MIN(datum) as datum
  , MIN(parameter_name) as parameter_name
  , date_local
  , ARRAY_AGG(STRUCT(time_local, date_gmt, sample_measurement, uncertainty, qualifier, date_of_last_change) ORDER BY time_local ASC) AS obs
  , STRUCT(MIN(units_of_measure) as units_of_measure
         , MIN(mdl) as mdl
         , MIN(method_type) as method_type
         , MIN(method_code) as method_code
         , MIN(method_name) as method_name) AS method
  , MIN(state_name) as state_name
  , MIN(county_name) as county_name
FROM `bigquery-public-data.epa_historical_air_quality.pm10_hourly_summary`
GROUP BY state_code, county_code, site_num, parameter_code, poc, date_local

The new table has fewer rows (1.7 million) but it is still about 1.41 GB because we have not lost any data! The difference is that we are storing the observed values as arrays within a row. So, the number of rows has been cut down to 1/24 of the original number.

Querying for instruments by county is now:

SELECT
 pm10.county_name,
 COUNT(DISTINCT pm10.site_num) AS num_instruments
FROM 
  advdata.epa as pm10
WHERE 
  EXTRACT(YEAR from pm10.date_local) = 2017 AND
  pm10.state_name = 'Ohio'
GROUP BY pm10.county_name

The query now takes 0.7 sec (3x faster) and processes only 56 MB (24x cheaper). Why is it less expensive? Because there are 24x fewer rows (remember that we aggreggated hourly measurements into a single row), and so a table scan has to process 24x less data. The query is faster because it needs to process fewer rows.

But what if you really need to process hourly data? Since the transformation to use arrays is not lossy, we can still query for the maximum hourly PM10 observations over the years in Columbus. That query now requires an UNNEST in the FROM clause but is otherwise identical:

SELECT
  MIN(EXTRACT(YEAR from pm10.date_local)) AS year
  , MAX(pm10obs.sample_measurement) AS PM10
FROM 
  advdata.epa as pm10,
  UNNEST(obs) as pm10obs
CROSS JOIN
  `bigquery-public-data`.utility_us.us_cities_area as city
WHERE 
  city.name = 'Columbus, OH' AND
  ST_Within( ST_GeogPoint(pm10.longitude, pm10.latitude), 
             city.city_geom )
GROUP BY EXTRACT(YEAR from pm10.date_local)
ORDER BY year ASC

This query still takes 4 minutes, but it processes only 537 MB. In other words, storing the data as nested fields (arrays) has made the query 3x less expensive! This is curious. Why does the data read go down? Because there are rows (those outside Columbus) that we don’t need to read the array data for. But the computations (max, extract year, ST_Within) are the bulk of the overhead for this query and the number of rows those are carried out on is the same, so the query speed doesn’t change.

Tip #2: Geography Types

Can we improve the computation through storing the data better? Yes!

Instead of constructing the geographic point from the longitude and latitude each time, it is better to store the latitude and longitude as a geographic type. The reason is that creating a geographic point with ST_GeogPoint() is actually a somewhat expensive operation that involves finding the S2 cell that holds the point (it’s even more expensive if you are trying to create more complex shapes like polygons):

CREATE OR REPLACE TABLE advdata.epageo ASSELECT 
  * except(latitude, longitude)
  , ST_GeogPoint(longitude, latitude) AS location
FROM advdata.epa

The first query is the same because we don’t use latitude and longitude in the query. The second query can now avoid creating a ST_GeogPoint:

SELECT
  MIN(EXTRACT(YEAR from pm10.date_local)) AS year
  , MAX(pm10obs.sample_measurement) AS PM10
FROM 
  advdata.epageo as pm10,
  UNNEST(obs) as pm10obs
CROSS JOIN
  `bigquery-public-data`.utility_us.us_cities_area as city
WHERE 
  city.name = 'Columbus, OH' AND
  ST_Within( pm10.location, city.city_geom )
GROUP BY EXTRACT(YEAR from pm10.date_local)
ORDER BY year ASC

It takes 3.5 minutes and processes 576 MB, i.e. 6% more data (the geography type for a point uses more than what two floats would take) for a 12.5% speedup in query performance.

Tip #3: Clustering

Notice that we use the time quite extensively. What if we ask BigQuery to store its tables in such a way that all equal values of that field are held in adjacent rows? This way, if our query every filters on some aspect of the time, BigQuery doesn’t have to do a full table scan. Instead, it can read just part of the table.

Because you can only create 2000 partitions at a time, I decided to partition by a dummy date field (this is okay because by clustering, we are forcing queries to use partitions):

CREATE OR REPLACE TABLE advdata.epaclustered 
PARTITION BY dummy_month
CLUSTER BY state_name, date_local
AS 
SELECT 
*, 
CAST(
CONCAT(CAST(EXTRACT(YEAR from date_local) AS STRING), "-", 
       CAST(EXTRACT(MONTH from date_local) AS STRING), "-01") AS DATE) AS dummy_month
FROM advdata.epageo

Now, let’s take the first query:

SELECT
 pm10.county_name,
 COUNT(DISTINCT pm10.site_num) AS num_instruments
FROM 
  advdata.epaclustered as pm10
WHERE 
  pm10.state_name = 'Ohio'
GROUP BY pm10.county_name

0.8 seconds, 43MB! No difference.

How about the second?

SELECT
  MIN(EXTRACT(YEAR from pm10.date_local)) AS year
  , MAX(pm10obs.sample_measurement) AS PM10
FROM 
  advdata.epaclustered as pm10,
  UNNEST(obs) as pm10obs
CROSS JOIN
  `bigquery-public-data`.utility_us.us_cities_area as city
WHERE 
  pm10.state_name = 'Ohio' AND
  city.name = 'Columbus, OH' AND
  ST_Within( pm10.location, city.city_geom )
GROUP BY EXTRACT(YEAR from pm10.date_local)
ORDER BY year ASC

The query now takes just 20 seconds and processes 576.4 MB, a 10x speedup. This is because we clustered the table by month, and we are filtering by month, and this allows BigQuery to organize the data more efficiently.

Enjoy!