Operationalizing BigQuery ML through Cloud Build and Looker

Byron Allen
Sep 28 · 6 min read
Photo by Miguel Á. Padriñán

I’ve been working on a machine learning (ML) demo for Datapalooza hosted by Servian in the UK. If you’re reading this before 1 October 2020 — make sure to check out the event!

In my session, I explain the fundamentals of a feature store, how to create and modify one in BigQuery, how to train and deploy ML models in BigQuery, and how that process can be operationalized through Cloud Build and Looker.

BigQuery is unusual among OLAP databases in that BigQuery ML enables training and predicting from within the warehouse itself. The typical pattern is to extract data from the database and train in a separate environment.

Looker has talked about the nexus between its business intelligence capability and BigQuery ML in the past. However, that perspective was more centered around making ML accessible to general analysts and business users stuck in spreadsheets. Whilst that is an important endeavor, it speaks more to those early in their ML journey, not to operationalizing an ML workflow using these tools.

The question then is open — how do you apply MLOps principles when using BigQuery ML?

Photo by Harrison Candlin

Operationalizing ML requires MLOps principles

MLOps, for those new to the term, is in a nutshell, ML + DevOps with an emphasis on reproducibility, accountability, collaboration, and CI/CD+CT of the ML model pipeline in production. In short, operationalizing the ML workflow.

I’m not going to get into a more robust explanation here, so you can learn more about MLOps through other sources. I recommend reading ‘MLOps: Continuous delivery and automation pipelines in machine learning’ and checking out the Slack and YouTube channels of the MLOps.community.

Could it truly be so simple? Mmmeh, kinda. Keep reading…

ML directly from the data warehouse

I’m going to skip any detailed explanation of BigQuery ML training and deployment. The Google documentation I linked to above is sufficient. For this article, the important point is that ML models can be trained directly from BigQuery!

Moreover, deployment, inference, scoring, or predicting is accomplished through the use of the `ML.PREDICT` function. All of this is accessible through BigQuery’s SQL syntax. And there are more functions to access params, metrics, and other metadata required for MLOps.
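As a sketch of what that looks like in practice (the project, dataset, table, and model names here are hypothetical, and the options mirror the k-means demo later in this article), training, scoring, and inspecting a model are all plain SQL:

```sql
-- Train a k-means model directly in BigQuery (hypothetical names).
CREATE OR REPLACE MODEL `my_project.ml_model.customer_segments`
OPTIONS (
  model_type = 'KMEANS',
  num_clusters = 3,
  distance_type = 'COSINE',
  standardize_features = TRUE
) AS
SELECT * EXCEPT (customer_id)
FROM `my_project.feature_store.customer_features`;

-- Score rows with the trained model via ML.PREDICT.
SELECT *
FROM ML.PREDICT(
  MODEL `my_project.ml_model.customer_segments`,
  (SELECT * FROM `my_project.feature_store.customer_features`)
);

-- Inspect evaluation metrics: the kind of metadata MLOps needs to track.
SELECT * FROM ML.EVALUATE(MODEL `my_project.ml_model.customer_segments`);
```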

https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-train

Orchestration through Cloud Build

Photo by Sohel Patel

Cloud Build is Google Cloud’s native CI/CD tool.

To be more specific, Cloud Build jobs are defined and driven through config files that execute a series of steps in a remote environment. Each step can use the same or different builders, which are containers with a specific entry point along with all the required dependencies. For example, one builder might be invoked to run the `gcloud` CLI while another is invoked to run the `docker` CLI. Custom containers can also be used to create custom builders and entry points.

The builder is selected under the `name` field and followed by `args` that define the flags being passed to that command. The `env` field can be used to define environment variables accessible to the builder. Additionally, the `substitutions` section can be used to define new values for those variables each time Cloud Build runs a job.

For details around the structure of these files check out ‘Build configuration overview’.

Below you’ll see an example from my demo. Here I create a custom container to execute subsequent steps that make up my BigQuery ML pipeline.

steps:
- id: build-container
  name: 'gcr.io/cloud-builders/docker'
  args: ['build', '.', '--tag=gcr.io/$PROJECT_ID/ml-workflow-container:3.7', '--tag=gcr.io/$PROJECT_ID/ml-workflow-container:latest']
- id: define-model-name
  name: gcr.io/$PROJECT_ID/ml-workflow-container:3.7
  args:
  - ./define_model_name.py
  dir: workflow
  env:
  - ML_MODEL_DATASET=ml_model
- id: create-model
  name: gcr.io/$PROJECT_ID/ml-workflow-container:3.7
  args:
  - ./create_model.py
  dir: workflow
  env:
  - PROJECT=project_id
  - FEATURE_STORE_DATASET=feature_store
  - ML_MODEL_DATASET=ml_model
  - NUM_CLUSTERS=${_NUM_CLUSTERS}
  - DISTANCE_TYPE=${_DISTANCE_TYPE}
  - STANDARDIZE_FEATURES=${_STANDARDIZE_FEATURES}
  - EXCEPT_FEATURES=${_EXCEPT_FEATURES}
images:
- 'gcr.io/$PROJECT_ID/ml-workflow-container:3.7'
- 'gcr.io/$PROJECT_ID/ml-workflow-container:latest'
substitutions:
  _NUM_CLUSTERS: '3'
  _DISTANCE_TYPE: 'COSINE'
  _STANDARDIZE_FEATURES: 'TRUE'
  _EXCEPT_FEATURES: None

The custom container is generated through the first step and defined in a Dockerfile located in the root directory. The second step then begins the pipeline in earnest. In my demo, I chose to modularise each step, breaking them into separate Python scripts, the first of which is located in the workflow directory.
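A minimal sketch of what that Dockerfile might contain (the base image and dependencies are assumptions consistent with the `:3.7` tag and the Python steps above):

```dockerfile
# Hypothetical Dockerfile for the custom builder: a slim Python 3.7
# image with the BigQuery client library, using python as the entry
# point so each step's args can simply name a script to run.
FROM python:3.7-slim
RUN pip install --no-cache-dir google-cloud-bigquery
ENTRYPOINT ["python"]
```

With `ENTRYPOINT ["python"]`, a step arg like `./create_model.py` executes as `python ./create_model.py` inside the `dir` the step specifies.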

Through this modularised pipeline, I am able to (re)train a model, use new parameters, and store model artifacts/params/metrics all at the same time.
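As a sketch of one such modular step (script structure, env-variable defaults, and table/model names are my assumptions based on the config above, not the demo's actual code), the `create-model` step can read the substitution-fed environment variables and assemble the `CREATE MODEL` statement:

```python
import os

def build_create_model_sql() -> str:
    """Assemble a BigQuery ML CREATE MODEL statement from the same
    environment variables the Cloud Build step injects."""
    project = os.environ.get("PROJECT", "my-project")
    feature_dataset = os.environ.get("FEATURE_STORE_DATASET", "feature_store")
    model_dataset = os.environ.get("ML_MODEL_DATASET", "ml_model")
    num_clusters = os.environ.get("NUM_CLUSTERS", "3")
    distance_type = os.environ.get("DISTANCE_TYPE", "EUCLIDEAN")
    standardize = os.environ.get("STANDARDIZE_FEATURES", "TRUE")
    except_features = os.environ.get("EXCEPT_FEATURES", "None")

    # Drop columns from training via SELECT * EXCEPT (...) when requested.
    select = ("SELECT *" if except_features == "None"
              else f"SELECT * EXCEPT ({except_features})")

    return (
        f"CREATE OR REPLACE MODEL `{project}.{model_dataset}.clustering_model`\n"
        f"OPTIONS (\n"
        f"  model_type = 'KMEANS',\n"
        f"  num_clusters = {num_clusters},\n"
        f"  distance_type = '{distance_type}',\n"
        f"  standardize_features = {standardize}\n"
        f") AS\n"
        f"{select}\n"
        f"FROM `{project}.{feature_dataset}.features`"
    )

if __name__ == "__main__":
    # The real step would submit this via the BigQuery client, e.g.
    # google.cloud.bigquery.Client().query(sql).result()
    print(build_create_model_sql())
```

Because the parameters arrive as substitutions, rerunning the pipeline with new values retrains the model without touching the code.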

Cloud Build also offers integration into Git. I can even trigger the pipeline when I push a new commit to a specific branch or through an HTTP request. The latter becomes critically important when we tie components together.
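Triggering via HTTP boils down to a POST against the Cloud Build API's `triggers:run` method, passing a branch and the substitution values. A sketch (project and trigger IDs are placeholders, and the actual call needs an OAuth token):

```python
import json

CLOUD_BUILD_API = "https://cloudbuild.googleapis.com/v1"

def build_trigger_request(project_id: str, trigger_id: str,
                          branch: str, substitutions: dict):
    """Return the (url, body) pair for a triggers:run call.
    Substitution keys start with an underscore, matching the
    substitutions block in the cloudbuild.yaml above."""
    url = f"{CLOUD_BUILD_API}/projects/{project_id}/triggers/{trigger_id}:run"
    body = {"branchName": branch, "substitutions": substitutions}
    return url, body

# To actually fire the build you would POST this with credentials, e.g.
#   requests.post(url, json=body,
#                 headers={"Authorization": f"Bearer {token}"})
url, body = build_trigger_request("my-project", "my-trigger-id", "main",
                                  {"_NUM_CLUSTERS": "4",
                                   "_DISTANCE_TYPE": "COSINE"})
print(url)
print(json.dumps(body))
```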

Screenshot of the Cloud Build GUI on Google Cloud

Insights with Looker and then go one step further

Looker is a business intelligence tool — you know — visualizations, dashboards, reporting. Sometimes these assets are purely about business insights, sometimes they are more operational in nature.

“Operational” takes on a new meaning when you consider the action functionality that Looker enables. It allows developers to make an HTTP request directly from Looker! What that means is that we can trigger the Cloud Build pipeline we built above from a GUI instead of the CLI.

Example Looker dashboard

Moreover, we can pass ML model parameters through a form that a subsequent Cloud Function takes and uses to populate the substitution variables mentioned above.
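A sketch of how such a Cloud Function might map the form to substitutions (the form field names are assumptions for this demo; Looker delivers submitted fields under `form_params` in the action payload):

```python
def form_to_substitutions(payload: dict) -> dict:
    """Translate a Looker action form payload into Cloud Build
    substitution variables, keeping only the fields the user filled in."""
    mapping = {
        "num_clusters": "_NUM_CLUSTERS",
        "distance_type": "_DISTANCE_TYPE",
        "standardize_features": "_STANDARDIZE_FEATURES",
        "except_features": "_EXCEPT_FEATURES",
    }
    form = payload.get("form_params", {})
    return {sub: str(form[field])
            for field, sub in mapping.items() if field in form}

# Example payload as a Looker action might send it:
payload = {"form_params": {"num_clusters": "5",
                           "distance_type": "EUCLIDEAN"}}
print(form_to_substitutions(payload))
```

The function's output plugs straight into the `substitutions` field of the `triggers:run` request, closing the loop from form to retrained model.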

Example Looker action form

This pattern is far more convenient and easy to use, allowing data scientists to experiment, retrain, monitor, and deploy ML pipelines (not just models) from a GUI.

Click, fill the form, retrain.

Applying MLOps principles to BigQuery ML

So, back to that question — “how do you apply MLOps principles when using BigQuery ML?” Well, with the above pattern I’ve walked through, a simple interpretation of MLOps principles has been applied thanks to the addition of Cloud Build and Looker.

Easy peasy, lemon squeezy.

At this moment in the industry, or at least amongst Google Cloud practitioners, the knee-jerk reaction is to jump to AI Platform or Kubeflow as the tool of choice for MLOps. Both are outstanding tools. However, both assume a level of skill and development effort, and Kubeflow adds management overhead, that not every team can immediately take on.

The pattern I’ve discussed here is more approachable to a broader audience and removes many steps that would take place when developing on AI Platform or Kubeflow. While BigQuery ML won’t work for every use case it can work for many.

The typical thinking is that BigQuery ML offers a way to quickly prototype ML use cases. While that’s true, we already see production use cases leveraging BigQuery ML and more models have been added this year to boot. This will only increase in the future.

If you’re already on Google Cloud, there’s a good chance that most of your analytics pipeline is already in BigQuery. So why not utilize that serverless, OLAP database storage and compute for ML as well?

But what about online prediction?

For those thinking I’ve left that stone unturned, consider exporting BigQuery ML models and deploying them in AI Platform, Kubeflow, or a third-party tool like Seldon for online inference and monitoring.
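Exporting is itself a one-line SQL statement (the model name and bucket path here are hypothetical); the resulting artifact in Cloud Storage can then be served by an online prediction platform:

```sql
-- Export a trained model to Cloud Storage for deployment elsewhere.
EXPORT MODEL `my_project.ml_model.customer_segments`
OPTIONS (URI = 'gs://my-bucket/models/customer_segments');
```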

Servian

The Cloud & Data Professionals

Written by Byron Allen

Texan transplant to Australia turned Australian transplant to the UK | ML Engineer | Senior Consultant at Servian

At Servian, we design, deliver and manage innovative data & analytics, digital, customer engagement and cloud solutions that help you sustain competitive advantage.