An End to End Machine Learning Model Development Guide Using BigQuery ML

Yogesh kothari
Google Cloud - Community
5 min readMar 16, 2023

This Guide will help you create and execute ML models in BigQuery using GoogleSQL queries. We will be creating ML models using the Google Cloud Console.

Step 1 : — Create Your Dataset in BigQuery

The dataset you create, will store your ML model. To create a dataset, please follow the below instructions.

  1. In the Google Cloud console, go to the BigQuery page.
  2. In the navigation panel, in the Resources section, click your project name.
  3. On the right side, in the details panel, click Create dataset.

4. Fill in the dataset ID in the Create Dataset window.

You are all set with your new dataset.

Step 2 : — Create your ML model using GoogleSQL

Let us first talk about the different types of models which are supported in BigQueryML. BQML supports wide variety ML models, right from Linear regression to Timeseries forecasting to Deep Neural Networks. Below is the list of supported models provided in the GCP BigQueryML documentation.

  • Linear regression for forecasting; for example, the sales of an item on a given day. Labels are real-valued (they cannot be +/- infinity or NaN).
  • Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values.
  • Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is “low-value,” “medium-value,” or “high-value.” Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross-entropy loss function.
  • K-means clustering for data segmentation; for example, identifying customer segments. K-means is an unsupervised learning technique, so model training does not require labels nor split data for training or evaluation.
  • Matrix Factorization for creating product recommendation systems. You can create product recommendations using historical customer behavior, transactions, and product ratings and then use those recommendations for personalized customer experiences.
  • Time series for performing time-series forecasts. You can use this feature to create millions of time series models and use them for forecasting. The model automatically handles anomalies, seasonality, and holidays.
  • Boosted Tree for creating XGBoost based classification and regression models.
  • Deep Neural Network (DNN) for creating TensorFlow-based Deep Neural Networks for classification and regression models.
  • Vertex AI AutoML Tables to perform machine learning with tabular data using simple processes and interfaces.
  • TensorFlow model importing. This feature lets you create BigQuery ML models from previously trained TensorFlow models, then perform prediction in BigQuery ML.
  • Autoencoder for creating Tensorflow-based BigQuery ML models with the support of sparse data representations. The models can be used in BigQuery ML for tasks such as unsupervised anomaly detection and non-linear dimensionality reduction.

Now, depending on your use case, You can identify the type of machine learning model you need (We will not be talking about the core ML concepts and how to choose the ML model for the use case in this article.). Once we have the type of ML model that we need, we can use the GoogleSQL queries to create the model. The syntax for creating the model can be found here.

For example, to create a logistic regression model using the Google Analytics sample dataset for BigQuery. The following GoogleSQL query is used to create the model you use to predict whether a website visitor will make a transaction.

#standardSQL
CREATE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

You can find the the full details about the above model and the query that is being used, in the quickstart guide on the GCP documentation.

To create the model, you need to run the above query in the BigQuery query editor. Paste the query in the editor and click on the Run button.

This will create the model ‘sample_model’ in your dataset ‘bqml_tutorial’.

Step 3 : — Evaluate your Model

Model evaluation is a process of assessing model’s performance. This is the important step in model development lifecycle. There are multiple ways of doing the evaluation, we will not be covering that as part of this article. GCP documentation describes the uses of model evaluation metrics are:-

  • Assess quality of fit between the model and the data
  • To compare different models
  • In the context of model selection, and to predict how accurate each model can be expected to perform on a specific data set

In BigQuery ML, ML.EVALUATE function evaluates the predicted values against the actual data. Below is the Snippet of the SQL that can be used.

#standardSQL
SELECT
*
FROM
ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))

Complete guide for model evaluation methods available on BigQuery ML can be found here.

Step 4: — Model Predictions

The final step of your ML model is to provide inference on the previously unseen data and performing at forecasting, recommendation and anomaly detection tasks. Depending on the kind of ML model that you trained below are the few BigQuery ML methods can be used to do inference:-

  • ML.PREDICT
  • ML.FORECAST
  • ML.RECOMMEND
  • ML.DETECT_ANOMALIES
  • ML.RECONSTRUCTION_LOSS

For the above problem statement of whether a website visitor will make a transaction, we will use ML.PREDICT. Run below SQL in the query editor.

#standardSQL
SELECT
fullVisitorId,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country,
fullVisitorId
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10

Complete guide for model inference on BigQuery ML can be found here.

Other Useful BigQuery ML Functionalities

Below is the list of out-of-the box available BigQuery ML features along with their detailed document links.

I hope this guide will help you feel more confident in creating ML models and using the BigQuery ML with lot more ease.

--

--