BigQuery Machine Learning Cheat Sheet

Published in

fifty-five | Data Science

4 min readDec 20, 2021

As an ML engineer, working with both Data Scientists and Data Analysts, I’ve struggled for a while minimizing the time for machine learning model testing and deployment, while maintaining business-required performances. BigQuery ML stands in an in-between place, avoiding data export while facilitating infrastructure scalability and code sharing. Below is the SQL-Command cheat sheet I wrote for creating, evaluating BQML models, and predicting outcomes. The high-resolution version of the cheat sheet is available at the end of the article.

Business Intelligence (BI) descriptive approach has shifted toward a more predictive and prescriptive analysis. Based on these changes, the analytic framework has been revised to include a data science layer.

The combination of traditional business intelligence and data science is seen as the future of the field. Consequently, emerging cloud-based services are now presented as one integrated service. They merge different technologies, including data warehouse, machine learning framework, and visualization tool in order to facilitate access to both data analysts and data scientists.[3]

BigQuery Machine Learning (BQML) comes within this scope of opening machine learning models to a wider audience of users while facilitating the model building, evaluation, and deployment of machine learning models.

SQL language only

SQL is still one of the most widely used languages in data analytics. As a declarative language, SQL lets you define the tasks without worrying about the specifics on how to complete them.[1]

BQML allows you to specify and run machine learning models directly in SQL. It makes machine learning accessible to data analysts who may not have substantial experience with it.

In other words, you don’t need to be an expert machine learning engineer in order to execute ML models.

Iterate and deliver faster

Improving the time to value as well as minimizing the cost of implementation and infrastructure are the primary drivers for organizations to deliver data science solutions.

Yet, the delivery involves collaboration between small teams, business subject experts, and technical resources, which can become highly iterative and time-consuming.

When it comes to collaboration, simplified cloud environment infrastructures integrate everything needed in a unified service. BQML supports streamlined data science processes, cutting off required third-party services usually used to train models on extracted data. It shortens the setup and maintenance of machine learning environments while offering a quick and easy way to train and deploy machine learning models.

Model versatility

BQML supports supervised, unsupervised, time series, and custom imported models of several types, including regression, binary classification, k-means clustering, forecasting, and neural networks.

Each of these comes with a similar procedure involving data preprocessing, feature selection, model creation, performances evaluation, and model deployment for inference purposes.

Procedure

The machine learning flow is described in the BQML cheat sheet, and can be summarized in four main steps:

Prepare the data

For every ML project, you’ll start preparing your input data by selecting a subset of data that is salient to answering the business question. Depending on your dataset, you might require several steps, including outliers treatment, and feature selection.

When it comes to feature selection, BQML will handle all the following preprocessing steps for you:

Eliminate missing values
Features encoding
Scaling
Feature extraction

ML.TRANSFORM clause can be used to adapt the aforementioned preprocessing steps, adapting the feature extraction and the preprocessing functions like scaling and bucketing.

Create the model

Next, CREATE MODEL statement handles the creation and training of the model using supplied features. All you need to do is specify the model type in the options.

CREATE MODEL `mydataset.mymodel`
  TRANSFORM(
    ML.FEATURE_CROSS(STRUCT(f1, f2)) as cross_f,
    ML.QUANTILE_BUCKETIZE(f3) OVER() as buckets,
    label_col
  )
  OPTIONS(
    model_type=’linear_reg’, input_label_cols=['label_col']
  ) AS
  SELECT
    label,
    input_column1
  FROM
    `mydataset.mytable`

This will store the model as a BigQuery object, which can be retrieved for evaluation and prediction.

Evaluate the model

ML.EVALUATE displays the performance of your trained model. This will return different statistics depending on the type of model. These metrics may indicate the need to adapt the model options and hyperparameters, as well as to refine the features selection and extraction steps.

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.mymodel`,
    (
    SELECT
      custom_label,
      column1,
      column2
    FROM
      `mydataset.mytable`)

Predict

Prediction is accomplished through the use of the ML.PREDICT keyword. This function applies a stored model to a given source of data in order to predict the outcome. You don’t need to worry about any of the feature encoding, scaling, or missing values, as they are already stored within the model object.

SELECT
  *
FROM
  ML.PREDICT(MODEL `mydataset.mymodel`,
    (
    SELECT
      label,
      column1,
      column2
    FROM
      `mydataset.mytable`))

Conclusion

BigQuery ML, or BQML, is intended to democratize machine learning and facilitate discovery, design, and development phases of data science processes.

It enforces communication across business experts, data scientists, and engineers, prioritizing the information value chain over prolonged development.

Using SQL directly in the BigQuery Console, it is possible to train and deploy machine learning models to be reused upon prediction. BQML already supports a spectrum of available models (and probably more to come) which share a similar creation, evaluation, and prediction process.
This process is summarized in the shared cheat sheet.

Sources and Further Readings

[1] Tang, C., Wang, B., Wu, H., Wang, Z., Li, Y., Channapattan, V., … & Mishra, P. (2021). Hybrid-cloud SQL federation system at Twitter. In European Conference on Software Architecture.*

[2] Mucchetti, M. (2020). BigQuery ML. In BigQuery for Data Warehousing (pp. 419–468). Apress, Berkeley, CA.

[3] Larson, D., & Chang, V. (2016). A review and future direction of agile, business

Google. (n.d.). Bigquery ML Documentation. Google, from https://cloud.google.com/bigquery-ml/docs