Introduction to BigQuery ML

A quick introduction to using BigQuery ML to rapidly test and pivot machine learning forecasts. By Alan Kohler

Photo by h heyerlein on Unsplash

In this blog post, I will introduce BigQuery ML. Readers should already have a basic understanding of machine learning to fully follow all the concepts. If you need a refresher, you can find a good ML crash course at this link.

BigQuery ML is a built-in tool in BigQuery that lets Data Analysts quickly test and execute machine learning models on data from any structured dataset already available in the data warehouse, comfortably using SQL-like commands. The big advantage of this tool is that there is no need to lose valuable time exporting the dataset to a different location, and no need to know a more advanced programming language like Python or Java to set up the model.

The built-in model types currently supported are linear and logistic regression, k-means clustering for data segmentation, matrix factorization for building recommendation engines, time-series forecasting, and XGBoost (an open-source regularized gradient boosting framework; for more information about it, check this link). Furthermore, it is also possible to use an external connection to the Vertex AI AutoML Tables tool and to import custom TensorFlow models.

In the next sections of this blog post, I'll introduce the process to set up a model, evaluate it, and use it for predictions on the fly. Furthermore, I'll also include an overview of the pricing structure of this toolset.

Photo by Pietro Jeng on Unsplash

Part A — Create a BigQuery ML model

To use BigQuery ML, you need to create a training table with all the columns that you intend to use for your target and for your features. After that, you can use the “CREATE OR REPLACE MODEL” statement to create a model in a target destination dataset.
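
A minimal sketch of the statement looks like the following (all project, dataset, table, and column names here are hypothetical placeholders):

```sql
-- Create (or overwrite) a linear regression model in the target dataset
CREATE OR REPLACE MODEL `my_project.my_dataset.my_model`
OPTIONS(model_type = 'linear_reg') AS
SELECT
  feature_1,
  feature_2,
  target_column AS label  -- BigQuery ML expects the target to be aliased as "label"
FROM `my_project.my_dataset.training_table`;
```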

With the “TRANSFORM” statement you can manually preprocess your features with standard SQL functions like “LOG”, “ABS” or “AVG”. You can apply “ML.QUANTILE_BUCKETIZE” to divide a feature into intervals, “ML.FEATURE_CROSS” to create an interaction feature variable, or “ML.POLYNOMIAL_EXPAND” to create higher-order polynomial features. You can also scale your features with “ML.MIN_MAX_SCALER” or with the standardizing “ML.STANDARD_SCALER”. The complete guide to the manual preprocessing ML functions can be found here.
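
As an illustration, a “TRANSFORM” clause inside a “CREATE MODEL” statement might look like this (feature names are again hypothetical; note that the bucketizing and scaling functions are analytic functions and need an empty OVER() clause):

```sql
TRANSFORM(
  ML.QUANTILE_BUCKETIZE(age, 4) OVER() AS age_bucket,           -- split into quartile buckets
  ML.STANDARD_SCALER(income) OVER() AS income_scaled,           -- standardize to mean 0, stddev 1
  ML.FEATURE_CROSS(STRUCT(country, device)) AS country_device,  -- interaction feature
  label                                                         -- pass the target through unchanged
)
```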

With the “OPTIONS” statement you have to specify the type of model to be used and, optionally, other settings like the number of iterations to be performed and hyperparameter settings like the optimization algorithm, the learning rate, and the strength of the L1 and L2 regularization. The complete list of parameters that can be used inside the “OPTIONS” statement can be found here.
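
A sketch of an “OPTIONS” clause for a logistic regression with a handful of such settings (the values are arbitrary examples, not recommendations):

```sql
OPTIONS(
  model_type = 'logistic_reg',
  max_iterations = 20,                          -- upper bound on training iterations
  optimize_strategy = 'batch_gradient_descent', -- optimization algorithm
  learn_rate_strategy = 'constant',             -- use a fixed learning rate
  learn_rate = 0.1,
  l1_reg = 0.1,                                 -- L1 regularization strength
  l2_reg = 0.1                                  -- L2 regularization strength
)
```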

Then you need to specify your training dataset with standard SQL. It is important to note that the target variable needs to be aliased “AS label”, so that BigQuery ML can recognize it as the target column.

Below you can find a summary of the SQL commands to create an ML model. When you run the query, the model will be trained. Depending on the size of your dataset, training can take anywhere from a few minutes to several hours.
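
Putting the pieces together, a minimal end-to-end sketch (all names hypothetical) could look like this:

```sql
CREATE OR REPLACE MODEL `my_project.my_dataset.churn_model`
TRANSFORM(
  ML.STANDARD_SCALER(monthly_spend) OVER() AS spend_scaled,
  tenure_months,
  label
)
OPTIONS(
  model_type = 'logistic_reg',
  max_iterations = 20
) AS
SELECT
  monthly_spend,
  tenure_months,
  churned AS label  -- target column renamed to "label"
FROM `my_project.my_dataset.customers`;
```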

Part B — Evaluate a BigQuery ML model and use it for predictions

To get some statistics about the training you can use the statement “ML.TRAINING_INFO”. In particular, it will retrieve a table with the number of iterations, the improvement of the loss function, and the processing duration of each step in milliseconds. You can find more information on this statement here.
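
For example, reusing the hypothetical model from above:

```sql
-- One row per training iteration, with loss and duration per step
SELECT *
FROM ML.TRAINING_INFO(MODEL `my_project.my_dataset.churn_model`)
ORDER BY iteration;
```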

With the “ML.FEATURE_INFO” statement you can see the descriptive statistics of the features in your dataset. It includes a count of the data points and the min, max, mean value and standard deviation for each feature. The “ML.WEIGHTS” statement can be used to assess the weight of each feature after the training process has been completed.
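
Both can be queried as ordinary table functions, for example:

```sql
-- Descriptive statistics (count, min, max, mean, stddev) per feature
SELECT * FROM ML.FEATURE_INFO(MODEL `my_project.my_dataset.churn_model`);

-- Learned weight of each feature after training
SELECT * FROM ML.WEIGHTS(MODEL `my_project.my_dataset.churn_model`);
```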

With “ML.EVALUATE” you can evaluate your model. In the case of linear regression, it will retrieve the MAE, the MSE, the RMSE, the R-squared score and the explained variance. In the case of logistic regression or multiclass classification, it will retrieve the precision, the recall, the accuracy, the F1 score, the logarithmic loss and the ROC AUC metric. More information on evaluating a model can be found by clicking on this link.
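
A sketch of an evaluation query against a held-out table (names again hypothetical):

```sql
SELECT *
FROM ML.EVALUATE(MODEL `my_project.my_dataset.churn_model`,
  (
    SELECT monthly_spend, tenure_months, churned AS label
    FROM `my_project.my_dataset.customers_eval`  -- held-out rows, same columns
  ));
```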

You don’t need to worry about deploying your model in a separate process, as it is automatically available to serve predictions directly in BigQuery by using the “ML.PREDICT” statement. The query syntax to make predictions on test datasets (with the same feature columns as in the training dataset, but different rows of data) is summarized below.
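
A minimal prediction sketch along those lines (table name hypothetical):

```sql
-- Returns the input rows plus predicted_label (and class probabilities
-- for classification models)
SELECT *
FROM ML.PREDICT(MODEL `my_project.my_dataset.churn_model`,
  (
    SELECT monthly_spend, tenure_months  -- same features, no label needed
    FROM `my_project.my_dataset.customers_new`
  ));
```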

Pricing

The first 10 GB of data processed to train and create a built-in model are free of charge. This free quota does not apply to externally imported models. After that, you will pay $250 per TB of data processed to create built-in models and $5 per TB for evaluating, inspecting, and predicting results. For externally generated models on Vertex AI, you will pay $5 per TB. High-volume users and big corporations can also opt for a monthly flat-rate BigQuery subscription. You can find more information about the BigQuery ML pricing structure here.
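
As a rough worked example at these list prices: training a built-in model on a 100 GB table processes about 0.1 TB, which would cost around 0.1 × $250 = $25 once the free quota is exhausted, while evaluating or predicting over the same 100 GB would cost about 0.1 × $5 = $0.50.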

Conclusion

In this blog post, BigQuery ML was briefly introduced. To use it in practice you can find the summarized procedure on the cheat sheet in the screenshot below.

Photo by Coursera

As previously stated, BigQuery ML is a very good tool to quickly test and pivot ML approaches on datasets already stored in BigQuery. Nevertheless, it is usually hard to get very accurate predictions and to tune the models appropriately. If more advanced ML solutions are required, it is usually better to assign the work to expert Machine Learning Engineers or to get hands-on with the development of custom models, for example with Python packages like TensorFlow, scikit-learn or PyCaret. Additional Machine Learning strategies might be published in future blog posts from datadice.

Further Links

This post is part of our new Google Cloud series in the Data School of datadice. We will be sharing the best insights that we’ve learned through years of creating data solutions.

If you want to learn more about how to use Google Data Studio and take it to the next level in combination with BigQuery, check our Udemy course here.

If you are looking for help to set up a modern and cost-efficient data warehouse or analytical dashboards, send us an email to hello@datadice.io and we will schedule a call.
