BigQuery Partitioning & Clustering

In this blog post, I will explain what the partitioning and clustering features in BigQuery are and how you can use them to supercharge your query performance and reduce your query costs.

Partitioning

Partitioning a table can make your queries run faster while spending less. Until December 2019, BigQuery supported table partitioning only on the date data type; now you can partition on integer ranges too. If you want to know more about partitioning your tables this way, check out this great blog post by Guillaume Blaquiere.
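As a quick taste, an integer-range partitioned table can be declared with a DDL statement like this minimal sketch (dataset, table, and column names are hypothetical):

-- a minimal sketch: integer-range partitioning on a numeric column
-- (all names here are assumptions for illustration)
create table my_dataset.users (
  user_id int64,
  username string
)
partition by range_bucket(user_id, generate_array(0, 1000000, 10000));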

Here, I will focus on date-type partitioning. You can partition your data using two main strategies: on the one hand, you can use a table column; on the other, you can use the ingestion time of the data.

This approach is particularly useful when you have very large datasets that span many years. If you want to run analytics only for specific time periods, partitioning your table by time allows BigQuery to read and process only the rows of that particular time span. Thus, your queries will run faster and, because they read less data, they will also cost less.

Creating a partitioned table is an easy task. At table creation time, you can specify which column is going to be used for partitioning; otherwise, you can set up partitioning on ingestion time. Since you can query a partitioned table in exactly the same way as a non-partitioned one, you won’t have to change a line of your existing queries.
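For instance, here is a minimal DDL sketch of a column-partitioned table; the dataset name and column types are assumptions, chosen to match the query below:

-- a minimal sketch: a table partitioned on the "sampling_date" column
-- (dataset name and column types are assumptions)
create table my_dataset.full_history (
  sampling_date date,
  day string
)
partition by sampling_date;

With such a table in place, a query like the following reads only the partitions it needs: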

select
  day,
  count(*)
from full_history
where sampling_date >= '2019-08-05'
  and sampling_date < '2019-08-06'
group by 1

Assuming that "sampling_date" is the partitioning column, BigQuery can now use the values specified in the where clause to read only the data that belong to the relevant partitions.
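If you want to verify how much data a query will read before paying for it, you can do a dry run from the command line (using the same hypothetical dataset as above):

# dry run: reports the bytes the query would process without running it
bq query --dry_run --nouse_legacy_sql \
'select day, count(*)
from my_dataset.full_history
where sampling_date >= "2019-08-05" and sampling_date < "2019-08-06"
group by 1'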

Bonus nugget

You can use partition decorators to update, delete, and overwrite entire single partitions as in:

# overwrite a single partition loading from a file
bq load --replace \
'project_id:dataset_name.table_name$20190805' \
'gs://my_input_bucket/data/from/20190805/*' ./schema.json

And

# overwrite a single partition from query results
bq query --replace --use_legacy_sql=false \
--destination_table 'project_id:dataset.table$20190805' \
'select * from `project_id.dataset.another_table`'

In the cases above, both the loaded data and the query results have to belong to the referenced partition, otherwise the job will fail.
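Partition decorators also work with bq rm, so you can drop a single partition without touching the rest of the table:

# delete a single partition
bq rm 'project_id:dataset_name.table_name$20190805'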

Clustering

Clustering is another way of organizing data: all the rows that share similar values in the chosen clustering columns are stored next to each other. This improves query efficiency and performance. Note that BigQuery supports this feature only on partitioned tables.

BigQuery can leverage clustered tables to read only the data relevant to the query, so queries become faster and cheaper.

At table creation time, you can provide up to four clustering columns as a comma-separated list, e.g. "wiki", "title". Keep in mind that their order is of paramount importance, as we will see in a moment.
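For example, a partitioned and clustered table can be created with a DDL statement like this sketch (dataset and table names are hypothetical; the columns mirror the dataset used below):

-- a minimal sketch: a table partitioned by day and clustered on two columns
create table my_dataset.pageviews (
  datehour timestamp,
  wiki string,
  title string,
  views int64
)
partition by timestamp_trunc(datehour, day)
cluster by wiki, title;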

In this section we will use "wikipedia_v3" from Felipe Hoffa’s public dataset, which contains yearly tables of Wikipedia page views. These tables are partitioned by the "datehour" column and clustered on the "wiki" and "title" columns. A single row may look like this:

datehour, wiki, title, views
2019-08-10 03:00:00 UTC, en, Pizza, 106
...

The following query counts, broken down by year, all the page views for the Italian wiki since 2015-01-01.

select
  _table_suffix as year,
  wiki,
  sum(views) / pow(10, 9) as views
from `fh-bigquery.wikipedia_v3.pageviews_*`
where wiki = 'it'
  and datehour >= '2015-01-01'
group by 1, 2
order by 1 asc

If you write this query in the BigQuery UI, it will estimate a scan of 4.5 TB. However, if you actually run it, the amount of data scanned will be just 160 GB.

How is this possible?

BigQuery reads only the rows belonging to the clusters that contain the data for the Italian wiki, discarding everything else. The upfront estimate cannot take clustering into account, so it reports the size of the whole partitions, while the actual run benefits from cluster pruning.

Why is the column order so important in clustering?

It is important because BigQuery organises the data hierarchically according to the column order specified when the table is created.

Let’s use the following example:

select
  wiki,
  sum(views) / pow(10, 9) as views
from `fh-bigquery.wikipedia_v3.pageviews_2019`
where title = 'Pizza'
  and datehour >= '2019-01-01'
group by 1
order by 1 asc

This query needs to access all the "wiki" clusters, and only then can it use the "title" value to skip the non-matching clusters.

This results in scanning a lot more data than if the clustering columns were in the opposite order: "title", "wiki".

At the time of writing, the query above estimated a scanning cost of 1.4 TB, but it actually scanned only 875.6 GB of data.

Let’s now invert the clustering column order, putting "title" first and "wiki" second. You can do so using the following command:

bq query --allow_large_results --nouse_legacy_sql \
--destination_table my_project_id:dataset_us.wikipedia_2019 \
--time_partitioning_field datehour \
--clustering_fields=title,wiki \
'select * from `fh-bigquery.wikipedia_v3.pageviews_2019`'

Running the "Pizza" query on our new table "my_project_id:dataset_us.wikipedia_2019" should be much cheaper. In fact, while the estimate was still 1.4 TB, the actual data read was just 26.3 GB, that is 33 times less.
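While the UI estimate ignores clustering, the statistics of a finished job report the bytes actually processed. You can inspect them from the command line (the job ID below is hypothetical):

# inspect a finished job; totalBytesProcessed shows the real amount of data read
bq show --format=prettyjson -j bquxjob_1234abcd_5678ef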

As a final test, let’s try filtering on the "wiki" column:

select
  wiki,
  sum(views) / pow(10, 9) as views_b
from `my_project_id.dataset_us.wikipedia_2019`
where wiki = 'it'
  and title is not null
  and datehour >= '2019-01-01'
group by 1
order by 1 asc

The data read estimate is the same as before, but now the data actually read jumped to 1.4 TB (the entire table) whereas, in the first example, it was just 160 GB.

Note: since BigQuery uses a columnar store, the "title is not null" predicate ensures that we always touch the same set of columns in every query. Without it, this last query would read less data simply because it references fewer columns.

It is evident that choosing the right clustering columns and their order makes a great difference. You should plan them according to your workloads.

Remember: always partition and cluster your tables! It is free, it does not require changing any of your existing queries, and it will make them cheaper and faster.
