Build a recommender with BigQuery ML

Part 2: Flex slots and Cloud Workflows

Alexander Wijns
g-company
8 min read · Feb 1, 2022



In the previous post, we talked about why you would want to use matrix factorization (and collaborative filtering in general) to build a recommender system, and why BigQuery provides a good place to do it, even if it’s only for a baseline model. In this post, we’ll assume that after reading part 1 you now have an uncontrollable urge to generate recommendations with BigQuery ML, but need a little help getting started. To this end, this post will be a lot more hands-on, using SQL stored procedures to define the necessary tasks in our small pipeline and Cloud Workflows to automate these tasks in a fully serverless way.

As we discussed previously, before you can create and train a matrix factorization model in BigQuery, you need to reserve the slots that will perform the necessary data processing. We’ll kick things off with that.

Flex slots

Until not so long ago, you could only reserve slots on a monthly or yearly basis, which would be overkill for a daily or weekly scheduled batch process, let alone a quick trial run. Luckily, these days you can also reserve what are called Flex slots. In a nutshell, you reserve a number of slots when you need them, run your job, and release them when the job is done. A crucial point is that you keep paying for those slots until you explicitly turn them off. That fact alone should already make the little voice in your head go “automate this sh*t.” Let’s first spend a few words on how to turn a Flex slot reservation on and off, and how to organise these tasks in a way that makes it easier to orchestrate and automate the flow later on.

Setting up a Flex slots reservation comprises three steps:

  1. Capacity commitment: commit to paying for a fixed number of slots for a certain duration. The beauty of Flex slots is that you only need to commit for 60 seconds at the start, and can cancel at any time after that first minute.
  2. Reservation: optionally split the commitment into different parts that fit how you organise your workloads (for example, things like production, test and analytics) so that these different workloads, with potentially very different constraints and priorities, won’t fight each other over resources. If you don’t do this, the whole commitment is linked to one default reservation.
  3. Assignment: to actually use the slots you then need to assign a project, folder or organisation to one of the reservations. Projects that are not assigned to a reservation will keep using the default on-demand pricing (and won’t be able to run matrix factorization, for example).

A first (optional, but useful) step towards automation, is to group these three tasks into one SQL stored procedure in BigQuery, as follows:
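Something like the minimal sketch below. The procedure name get_slots is the one the workflow calls later on; the admin project (admin-project), data project (my-project), dataset (recommender), region (region-us) and the commitment/reservation/assignment names are placeholders, and the exact reservation DDL syntax may differ slightly depending on your BigQuery release, so do check the current reference:

```sql
-- Sketch: wrap the three reservation steps in one stored procedure.
-- All identifiers (admin-project, my-project, recommender, region-us,
-- flex-commitment, flex-reservation, flex-assignment) are placeholders.
CREATE OR REPLACE PROCEDURE `my-project.recommender.get_slots`()
BEGIN
  -- 1. Capacity commitment: 100 Flex slots (the minimum), cancellable after 60 seconds.
  CREATE CAPACITY `admin-project.region-us.flex-commitment`
  OPTIONS (slot_count = 100, plan = 'FLEX');

  -- 2. Reservation: carve the committed slots into a named reservation.
  CREATE RESERVATION `admin-project.region-us.flex-reservation`
  OPTIONS (slot_capacity = 100);

  -- 3. Assignment: let the data project use the reservation for query (and BigQuery ML) jobs.
  CREATE ASSIGNMENT `admin-project.region-us.flex-reservation.flex-assignment`
  OPTIONS (assignee = 'projects/my-project', job_type = 'QUERY');
END;
```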

Note that the “admin project,” where you create the commitment and manage the reservations, does not need to be the same as the assignee in the final statement. It’s in this step that you choose which projects or folders will actually make use of which chunks of the reserved slots by assigning them to a specific reservation. In this example, we set the capacity to 100 slots, which is the minimum commitment size and should be more than enough to get you started.

Just as importantly, you want to be able to turn them back off just as easily. You do this in reverse order: first remove the assignment(s), then the reservation(s), and finally the commitment itself:
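A matching sketch for the teardown, using the same placeholder names (drop_slots is the procedure name the workflow calls later):

```sql
-- Sketch: tear everything down in reverse order.
CREATE OR REPLACE PROCEDURE `my-project.recommender.drop_slots`()
BEGIN
  DROP ASSIGNMENT `admin-project.region-us.flex-reservation.flex-assignment`;
  DROP RESERVATION `admin-project.region-us.flex-reservation`;
  DROP CAPACITY `admin-project.region-us.flex-commitment`;
END;
```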

Creating the model

In order to automate the whole process, it’s also useful to store the creation (and training) of the BigQuery ML matrix factorization model in its own procedure:
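Roughly like this. The procedure name create_model, the model name mf_model and the ratings source table are placeholders; user, item and rating are the column names the model expects by default:

```sql
-- Sketch: create (and train) the matrix factorization model inside a procedure.
-- create_model, mf_model and the ratings table are placeholder names.
CREATE OR REPLACE PROCEDURE `my-project.recommender.create_model`()
BEGIN
  CREATE OR REPLACE MODEL `my-project.recommender.mf_model`
  OPTIONS (model_type = 'matrix_factorization') AS
  SELECT
    user,
    item,
    rating
  FROM `my-project.recommender.ratings`;
END;
```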

Notice how easy it is to train a machine learning model with just a few lines of SQL. Even though it might at first glance look like you are only setting up the model with this statement, BigQuery will also automatically train it on the data defined in the SELECT clause. Needless to say, this clause can involve any data transformations you need (but do also have a look at the TRANSFORM clause), as long as the columns the model expects as input are selected at the end, in this case user, item, and rating.

For pedagogical reasons, we kept this model creation statement as simple as possible, because the defaults are sensible for many use cases. But the OPTIONS clause allows for a lot of customization if you know what you’re doing. For example, explicit feedback (see part 1) is the default, but if you want to train a model based on implicit feedback, you can set FEEDBACK_TYPE="IMPLICIT". Another important hyperparameter to experiment with, in case the default does not suit your needs, is the number of latent factors NUM_FACTORS (again, see part 1).
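For illustration, a hypothetical variation of the same statement with those two options set (the model name mf_model_implicit and the value 32 are arbitrary choices, not recommendations):

```sql
-- Hypothetical variation: implicit feedback and a custom number of latent factors.
CREATE OR REPLACE MODEL `my-project.recommender.mf_model_implicit`
OPTIONS (
  model_type = 'matrix_factorization',
  feedback_type = 'IMPLICIT',
  num_factors = 32
) AS
SELECT user, item, rating
FROM `my-project.recommender.ratings`;
```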

I also hear you thinking “Wouldn’t it be better to parametrize these queries, instead of hardcoding the project and dataset name all over the place?” and you would, of course, as you probably usually are, be absolutely right. Unfortunately, even though there is such a thing as parametrized SQL queries, they can’t be used for identifiers such as table names. In order to do this properly (read: without a lot of bash hacking), I would advise looking into tools like dbt.

Cloud Workflows

Now, as much as you’d surely enjoy running these three procedures by hand every day, let’s turn our attention to automating them with an orchestration tool. It might feel like you’d just about get away with shell scripting this small workflow right now, but especially once it becomes part of a bigger production pipeline, you’ll want to look into more robust options. There are many orchestration solutions out there, several of them managed on, or native to, GCP, but those either require you to manage some infrastructure or can be a bit heavy-handed and costly for your needs, often both. A very nice, fairly recent offering on GCP is Cloud Workflows, a completely serverless way to orchestrate pipelines of services on GCP. Moreover, it lets you start just as small as you would with a shell script, but it will grow with your needs far more willingly than good old (but occasionally grumpy) bash.

Cloud Workflows allows us to run our three tasks one after the other, each time waiting for the previous step to complete, with a simple YAML instruction file. You can find the full code example here (and the full specification here), but let’s break it down step by step.

First, we create the Flex reservation and wait a bit for the slots to become available:
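Sketched roughly as follows (this is an approximation of the full example linked above, not a copy of it; the project and dataset values in init are placeholders):

```yaml
# Sketch of the first part of the workflow; the project and dataset values are placeholders.
main:
  steps:
    - init:
        assign:
          - project: "my-project"
          - dataset: "recommender"
    - create_slots:
        call: googleapis.bigquery.v2.jobs.insert
        args:
          projectId: ${project}
          body:
            configuration:
              query:
                query: ${"CALL `" + project + "." + dataset + ".get_slots`()"}
                useLegacySql: false
        result: create_slots_result
    - log_create_job_id:
        call: sys.log
        args:
          text: ${create_slots_result.jobReference.jobId}
          severity: INFO
    - wait_for_slots:
        call: sys.sleep
        args:
          seconds: 120
```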

You can see here that a workflow is composed of several steps:

  1. The first step, init (you can give these steps any name you want, as long as it follows a few conventions), defines a few variables for easy reference throughout the workflow.
  2. Then, in create_slots, we use the BigQuery API connector to conveniently call the jobs.insert endpoint to invoke our get_slots procedure.
  3. In the log_create_job_id step, we use the handy integration with Cloud Logging to log the id of the created job for later reference or debugging.
  4. Then, it’s crucial to wait for the slots to be available before moving on to the next step. There are a few options here. In this example, we’ve chosen the lazy approach: simply wait for 2 minutes by calling the built-in sys.sleep. This is typically sufficient, but not guaranteed to be enough.

You might be tempted to use the job id that was returned by the create_slots step to check the status of the slot creation (see below for an example of this approach), but it turns out that even if the job status is “DONE”, the slots might not be quite ready for use yet. As an exercise for the adventurous reader, I suggest looking into the INFORMATION_SCHEMA.RESERVATIONS_BY_PROJECT metadata table. You can query this table to make sure the reservation is there before proceeding to the next step. If it’s not, you sys.sleep a bit longer in a conditional loop (see below).
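As a rough starting point for that exercise, the check might look something like the query below, run in the admin project’s region. I’m assuming the view exposes a reservation_name column, so verify against the current schema before relying on it:

```sql
-- Rough exercise sketch: confirm the reservation exists before moving on.
-- Assumes a reservation_name column; check the view's schema in your region.
SELECT reservation_name
FROM `region-us`.INFORMATION_SCHEMA.RESERVATIONS_BY_PROJECT
WHERE reservation_name = 'flex-reservation';
```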

Next, assuming we’ve made sure that the slots are ready for action, we create and train the model:
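Sketched as a continuation of the same steps list (create_model is the placeholder procedure name from above, and the 60-second polling interval is an arbitrary choice):

```yaml
    # Continuation of the same steps list; create_model is a placeholder procedure name.
    - train_model:
        call: googleapis.bigquery.v2.jobs.insert
        args:
          projectId: ${project}
          body:
            configuration:
              query:
                query: ${"CALL `" + project + "." + dataset + ".create_model`()"}
                useLegacySql: false
        result: train_result
    - check_training_status:
        call: googleapis.bigquery.v2.jobs.get
        args:
          projectId: ${project}
          jobId: ${train_result.jobReference.jobId}
        result: train_job
    - training_done:
        switch:
          - condition: ${train_job.status.state == "DONE"}
            next: drop_slots
    - log_training_status:
        call: sys.log
        args:
          text: ${"Training job state is " + train_job.status.state}
          severity: INFO
    - wait_for_training:
        call: sys.sleep
        args:
          seconds: 60
        next: check_training_status
```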

Again, because we’ve nicely prepared our procedures, training the model is once more a simple query passed on to the jobs.insert endpoint. This time, we do check the job id to make sure the model is fully trained before moving on to the next step. To achieve this, we use a switch block to check that the job status equals “DONE”. If not, we push some status info to the logs and sleep for a while. We could consider using exponential backoff here, making the sleep times longer and longer until some threshold is reached, after which we might prefer to abort the workflow altogether and trigger an alert to an administrator account. Do keep in mind here that training can take a while, depending on the amount of training data. If the status is “DONE”, we continue.

The final step is now easy:
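Again as a sketch, continuing the same steps list with the same placeholder names:

```yaml
    # Final part of the steps list: drop the slots and return the job status.
    - drop_slots:
        call: googleapis.bigquery.v2.jobs.insert
        args:
          projectId: ${project}
          body:
            configuration:
              query:
                query: ${"CALL `" + project + "." + dataset + ".drop_slots`()"}
                useLegacySql: false
        result: drop_result
    - get_drop_status:
        call: googleapis.bigquery.v2.jobs.get
        args:
          projectId: ${project}
          jobId: ${drop_result.jobReference.jobId}
        result: drop_job
    - return_status:
        return: ${drop_job.status.state}
```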

We call the drop_slots procedure, get the job status, and use that as the return value of the workflow. But notice that we’re again being a bit sloppy here. We’re not properly handling a possible failure to drop the slots, which means we risk paying for them for much longer than we need to. You could catch the exception and do a few retries before really giving up. Arguably, you’d want to monitor these resources through independent services anyway and take appropriate action in case slots stay on for much longer than expected.

Wrapping up

This example workflow gets you 70% of the way as far as model training goes. As we’ve highlighted throughout, there are a few places in the workflow where you’d want to build in some more robust status checks, exception handling, etcetera. Another useful exercise is to parametrize the workflow in order to remove any explicit mention of the project and dataset, which is not hard to do and will make your workflow sit more snugly inside your Terraform setup.

Then, the next step would, of course, be to use this model to generate some batch predictions and store them in your production database so that you can use the results in your application. This smells like a whole other workflow to me! The whole thing could be kicked off daily (for example) by a scheduler.

If all of this still feels a bit daunting to you, feel free to contact us! And see you next time.

Big shout-out to Bas Leenders for helping out with the final tweaks and edits of parts 1 and 2!
