BigQuery Flex Slots with Cloud Functions

Taming spiking BigQuery workloads using Cloud Functions, metrics and the slot estimation tool.

Alistair Grew
Appsbroker CTS Google Cloud Tech Blog
6 min read · Nov 21, 2022


Source: https://memeshappen.com/meme/86874/YOU-STAY-FLEXIBLE

PLEASE NOTE THIS POST WAS WRITTEN BEFORE BIGQUERY EDITIONS SO IS LARGELY IRRELEVANT

A brief introduction to BigQuery

BigQuery is one of the strongest tools in Google Cloud’s toolbox. For those who aren’t familiar, it is a petabyte-scale, serverless data warehouse that can be queried with SQL, which makes it extremely popular with organisations driving data-driven transformation. Data from BigQuery is often used to drive downstream applications including, but not limited to, Looker, Looker Studio (formerly Data Studio), and AppSheet.

From an architectural standpoint BigQuery isn’t designed like a ‘conventional’ database in that the storage and compute are separated as per this diagram:

Source: https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview

This separation is one of the reasons why, compared to ‘always on’ databases, BigQuery can be cheap: you pay for storage, and then for compute capacity only when you need to run jobs. So what constitutes a job, I hear you ask? Well, to quote the documentation:

Jobs are actions that BigQuery runs on your behalf to load data, export data, query data, or copy data.

These jobs are executed using compute capacity represented as ‘slots’, which can broadly be thought of as CPU cores. For example, when a SQL query is executed, BigQuery works out the amount of compute capacity required to run the query and then executes it (more detail on this is here).

By default when working in BigQuery, ‘interactive mode’ is used, whereby a capacity of up to 2,000 slots is provisioned dynamically to meet demand. However, there is another option: ‘flat-rate pricing’.

BigQuery Slot Reservations

So there are three types of slot commitment:

  • Monthly
  • Annual
  • Flex

Monthly and annual commitments primarily enable you to save money by reserving a level of slot capacity (and thereby helping Google do its own capacity planning). In buying these types of commitments, though, there needs to be a mindset shift from minimising use of BigQuery to maximising use of the capacity purchased, to get the most for your money. This is great if you are running time-insensitive workloads or have a consistent load over 24 hours. These commitments aren’t good for dealing with bursty workloads, however, and for that we have flex, which I am going to talk about in the context of a customer problem I encountered recently.

The Customer Problem

So this customer operates BigQuery as its data warehouse backend, which then serves several frontend business reporting dashboards built in Looker, Looker Studio, and several other applications. The dashboards in particular are accessed interactively by end users at specific peak times, as per these graphs:

Graph showing maximum query count over 3 days
A graph over the same time period showing slot utilisation

As you can see, there are clear spikes in utilisation during the day, especially in the 7–10AM window, where we hit the 2,000-slot limit for interactive mode.

Google’s slot estimation tool has just gone GA; it can help model certain reservations and their potential impact on performance. The customer’s graph for the project containing BigQuery looks like this:

Slot estimation tool output for the customer project (over about a month)

As you can see, Google has also helpfully suggested some monthly reservations with differing slot counts and performance characteristics (and yes, the far-right option would be $1.2m more a month). Below these suggestions, you can also view the performance of different query types under these different slot combinations.

Slot modeling in the estimation tool.

So in addition to the monthly and annual reservation options above, there is a third option: flex slots. Flex is a type of commitment that allows you to reserve resources for as little as 60 seconds. The discounts aren’t quite as good as those for the longer commitments, but flex suits workloads where you may need a little more capacity for a peak period before dropping back to interactive mode in quieter periods. Sounds great, doesn’t it? But this prompts the next question: how do we add and remove this reservation automatically, on the schedules we want?

Creating Flex Reservations with Cloud Functions

Firstly, I want to attribute credit where it is due: Charles Verleyen has already written an excellent post on this topic, and much of the code I produced for my solution was taken from his example.

There are actually a few different ways you can purchase slot capacity, add it to a reservation, and assign it. You can do this straight from the console, with a CREATE CAPACITY DDL statement within BigQuery, with a bq mk command, or by interacting directly with the BigQuery Reservation API.
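As an aside, the DDL route can be sketched roughly as follows; note that the project, region, and object names here are placeholder assumptions, not the customer’s real setup:

```sql
-- Hypothetical example: buy 500 flex slots, create a reservation,
-- and assign it to a project, all via DDL in the admin project.
CREATE CAPACITY `admin-project.region-eu.peak-commitment`
OPTIONS (slot_count = 500, plan = 'FLEX');

CREATE RESERVATION `admin-project.region-eu.peak-flex`
OPTIONS (slot_capacity = 500);

CREATE ASSIGNMENT `admin-project.region-eu.peak-flex.peak-assignment`
OPTIONS (assignee = 'projects/my-bq-project', job_type = 'QUERY');
```

The bq mk commands follow the same three-step shape, which is also the shape the API solution below mirrors.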

For my solution, I opted for the last option. This is mainly due to my familiarity with using Python functions for other scheduled tasks, and I like how simply Cloud Functions can be triggered by Cloud Scheduler.

This is the gen1 Cloud Function code I have written to make the capacity commitment, create the reservation, and make the assignment. To execute correctly, the function will need to run with the appropriate BigQuery permissions (the BigQuery Resource Admin role should have you covered for a given project, though a custom role might suit better if you wish to span multiple projects).
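The original code embed doesn’t survive here, so below is a minimal sketch of what such a function can look like, using the google-cloud-bigquery-reservation client library. The project IDs, reservation name, slot count, and location are all placeholder assumptions, and the GCP client imports are done lazily inside the functions so the module loads without the library installed:

```python
# Sketch of a gen1 HTTP-triggered Cloud Function that creates or tears
# down a flex slot reservation. All names and numbers are placeholders.
import time

ADMIN_PROJECT = "my-admin-project"   # assumption: project holding the commitment
ASSIGNEE_PROJECT = "my-bq-project"   # assumption: project running the queries
LOCATION = "EU"                      # assumption
RESERVATION_ID = "peak-flex"         # assumption
SLOTS = 500                          # assumption


def parent_path(project: str, location: str) -> str:
    """Parent resource path used by the BigQuery Reservation API."""
    return f"projects/{project}/locations/{location}"


def parse_task(request_json) -> str:
    """Validate the 'task' value sent in the Cloud Scheduler payload."""
    task = ((request_json or {}).get("task") or "").lower()
    if task not in ("create", "delete"):
        raise ValueError("task must be 'create' or 'delete'")
    return task


def create_flex(client):
    from google.cloud.bigquery_reservation_v1 import (
        Assignment, CapacityCommitment, Reservation,
    )
    parent = parent_path(ADMIN_PROJECT, LOCATION)
    # 1. Purchase flex slots (60-second minimum commitment).
    client.create_capacity_commitment(
        parent=parent,
        capacity_commitment=CapacityCommitment(
            slot_count=SLOTS,
            plan=CapacityCommitment.CommitmentPlan.FLEX,
        ),
    )
    # 2. Add the slots to a reservation.
    reservation = client.create_reservation(
        parent=parent,
        reservation_id=RESERVATION_ID,
        reservation=Reservation(slot_capacity=SLOTS, ignore_idle_slots=False),
    )
    # 3. Assign the reservation to the project running the queries.
    client.create_assignment(
        parent=reservation.name,
        assignment=Assignment(
            job_type=Assignment.JobType.QUERY,
            assignee=f"projects/{ASSIGNEE_PROJECT}",
        ),
    )
    # 4. Wait for the slots to become ready before the peak hits.
    time.sleep(60)


def delete_flex(client):
    parent = parent_path(ADMIN_PROJECT, LOCATION)
    # Tear down in reverse order: assignments, reservations, commitments.
    for reservation in client.list_reservations(parent=parent):
        for assignment in client.list_assignments(parent=reservation.name):
            client.delete_assignment(name=assignment.name)
        client.delete_reservation(name=reservation.name)
    for commitment in client.list_capacity_commitments(parent=parent):
        client.delete_capacity_commitment(name=commitment.name)


def flex_slots(request):
    """HTTP entry point, triggered by Cloud Scheduler."""
    from google.cloud.bigquery_reservation_v1 import ReservationServiceClient
    task = parse_task(request.get_json(silent=True))
    client = ReservationServiceClient()
    if task == "create":
        create_flex(client)
    else:
        delete_flex(client)
    return f"task '{task}' completed", 200
```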

Talking through the main Python code: first, I take in a JSON request from Cloud Scheduler which contains a ‘task’ value of either create or delete. If the task is ‘create’, the function purchases slots, adds them to a reservation, and then assigns that reservation to the project. Finally, it waits 60 seconds for the slots to become ready.

If the task is a deletion, it lists all the reservations for the project and systematically removes the assignment, then the reservation, and finally the capacity commitment.

To trigger this function, I have a couple of Cloud Scheduler jobs configured to run at set times:
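A pair of jobs along these lines would do it; the job names, times, function URL, and service account are placeholder assumptions chosen to bracket the 7–10AM peak described earlier:

```
# Buy flex slots just before the morning peak, weekdays only...
gcloud scheduler jobs create http flex-slots-create \
  --schedule="45 6 * * 1-5" --time-zone="Europe/London" \
  --uri="https://europe-west2-my-admin-project.cloudfunctions.net/flex_slots" \
  --http-method=POST \
  --message-body='{"task": "create"}' \
  --oidc-service-account-email="scheduler@my-admin-project.iam.gserviceaccount.com"

# ...and tear them down again once the peak has passed.
gcloud scheduler jobs create http flex-slots-delete \
  --schedule="15 10 * * 1-5" --time-zone="Europe/London" \
  --uri="https://europe-west2-my-admin-project.cloudfunctions.net/flex_slots" \
  --http-method=POST \
  --message-body='{"task": "delete"}' \
  --oidc-service-account-email="scheduler@my-admin-project.iam.gserviceaccount.com"
```

Authenticating the scheduler with an OIDC token against the function is one option; however you wire it up, make sure the delete job is robust, as forgotten flex slots keep billing.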

Concluding thoughts

The main thing I like about this solution is that I can quite easily add additional schedules to target other projects, or ramp up flex slots over a period of time before a peak. Using a function also allows a great deal of flexibility: for example, it could be triggered when slot utilisation hits a certain threshold in Cloud Monitoring. What this solution would struggle with in the broader context, however, is handling complicated commitments and reservations spanning multiple projects; if you want to understand some of what is possible there, I suggest this video:

Anyway hopefully the above is helpful for someone, and if you have thoughts on how it could be improved I would love to hear them. But until next time, keep it Googley ;)

About CTS

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full-stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!


GCP Architect based in the Manchester (UK) area. Thoughts here are my own and don’t necessarily represent my employer.