BigQuery Slot Squeezes

Stephan Meyn
Google Cloud - Community
4 min readSep 2, 2022

Bigquery (BQ) is a serverless large scale analytical database service. This service allows you to query anything from very small data sets to petabyte size data sets. Behind the scenes, BigQuery will automatically allocate compute, memory and shuffle capacity to service your query request. At the heart of this is the concept of a slot. A slot is a hybrid metric that combines these various resources as a unified consumption value.

When you run a query and look at the query execution details you will see a metric called slotMs. This is the indicator how much resource your query consumed.

Statistics Section of a BigQuery Queryplan

Bigquery has two pricing models, on demand and flat rate. On demand means you have a limited number of slots for your project and you pay for the bytes processed. Flat rate means you purchase and pay for a monthly allocation of slots but don’t pay for the bytes processed. Typically, the former is good for small, incidental use and the latter works well in an enterprise setting with a constant, large, demand.

One issue in an enterprise setting is the question of how many slots you should purchase. Buy too many, and you will have long periods where they are unused. Buy too few and users will experience delays during peak time.

So you are interested in determining how many slots you areusing during the course of a day/week/month to get a profile of your needs and then determine a base purchase of flat rate slots together with dynamic flex slot purchases to cover peak periods.

The simplest way would be to query your system tables to extract and sum all the slot MS values of the jobs timeline using the Informationschema. jobs_timeline table.

and graph a summary over the course of a day(week/month):

This will show you actual slot consumption and you will be able to see when you are consuming the capacity of your slot purchases.

But this information is limited. You know at what time of the day you maxxed out your purchased slot capacity, but you don’t know how badly it affects your users.

How BQ slot allocation affects performance

Queries do not have a constant need for slots, it may go up and down during the query execution. Bigquery attempts to allocate slots to queries depending on their needs in a fair manner. As a query’s slot needs goes down, the slots get reallocated across all other queries according to their needs. This is a process that is very dynamic and slot allocation can change within a few seconds.

When the queries need more slots than are available, then work piles up, waiting for resources to become available — in short, your queries take longer.

You can find out about this by looking at your query execution plan (e.g. using https://bqvisualiser.appspot.com). Each stage has a number of metrics regarding performance. The one of interest here is the metric called avg_wait_ms, which is the average time, in milliseconds, workloads for a stage spend waiting.

Another interesting metric: avg_compute_ms vs max_compute_ms

While avg_compute_ms looks across all workloads for a stage to determine the time to complete, max_compute_ms looks at the maximum value. Usually the max value should not be too different from the average. But under some circumstances these will differ largely.

Typically this occurs when the shards of data to be processed vary in size. If you group data and aggregate them, badly balanced grouping key distributions will cause this. If you experience this, look at your data and consider structuring your query to avoid this situation.

The above metric told us, when we are running out of slots. With this knowledge we can now construct a metric that tells us about how bad the resulting slot squeeze is:

Here we calculate several metrics for stage waiting time. For a general indicator the metric wait_avg does a good job highlighting the impact of a slot squeeze.

The diagram now shows clearly that there is a period from August 29, 6 am to 2 pm, where major stages had to wait more than 10 minutes — meaning that major queries took at least btw10 minutes longer than usual.

From a business perspective, can users afford to wait for their queries to complete during this period? If not then purchasing flex slots for this period is likely to reduce this wait time.

--

--