Create Models with SQL

A Quick Introduction to Using Boosted Trees on Google Cloud BigQuery ML

Cansu Ergün
HYPATAI
5 min read · Feb 15, 2021


My Barney Stinson way of inventing new words :) Of course, the idea already exists: BQML stands for BigQuery ML, an abbreviation frequently used throughout the standard GCP BigQuery ML docs.

I have started an online GCP ML course that introduces Google Cloud Platform products related to Machine Learning and Big Data. One module of the MOOC covers BigQuery ML, a tool that eases the use of ML algorithms for SQL practitioners and data analysts by avoiding the complexity of juggling multiple products for data storage, preprocessing, and model building. BigQuery offers the user a platform for both storing and querying data, and a great interactive environment for experimenting with ML models.

It is accessible via several options such as:

  • the Google Cloud Console
  • the bq command-line tool
  • the BigQuery REST API
  • an external tool such as a Jupyter notebook or business intelligence platform

I have decided to make a quick implementation of what we saw in the story Implementation of XGBoost on Income data on Hypatai, on BigQuery ML, since it supports the CREATE MODEL SQL statement for boosted tree models using XGBoost. My intention was to reproduce the same model we created in that story using the console option from the list above. I could not quite do so, though I came very close, either because the functionality has not yet reached that point or because I still need time to fully adopt and get familiar with BQML.

The data and the scripts I used during the experimentation phase can be found on my GitHub page, under the BigQueryML folder. The input data is the same preprocessed data used for training in the Implementation of XGBoost on Income data.

Let’s now go over the steps I followed:

Open the BigQuery platform.
Select the new project button.
Create a new project.
Select CREATE DATASET to add a new dataset under the project.
Create the dataset with your preferred properties.
Select Create table to add a table to the dataset.
Let’s look inside the original XGBOOST_y_train.csv file to recall what it looks like.
Create the table using the XGBOOST_y_train.csv file. (The same step applies to XGBOOST_X_train.csv as well.)
Let’s look inside the uploaded table. Notice that the original row order has changed: BigQuery placed the records with income_level = 1 toward the top. Since this will happen for XGBOOST_X_train.csv as well, we will get help from the row_num column in the table and use it to join the feature (X) and label (y) tables later.
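Since the two uploaded tables can only be matched reliably through row_num, the recombination boils down to a simple join. Here is a sketch, with hypothetical project and dataset names — substitute whatever you created in the steps above:

```sql
-- Recombine features and labels on the row_num key.
-- `my_project.income_dataset` is a placeholder, not a real path.
SELECT
  x.*,              -- all feature columns, plus row_num
  y.income_level    -- the label column
FROM `my_project.income_dataset.XGBOOST_X_train` AS x
JOIN `my_project.income_dataset.XGBOOST_y_train` AS y
  ON x.row_num = y.row_num;
```

Because the join key is the original row order, this also restores the ordering we need later for the sequential train/evaluation split.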

The step where I executed the CREATE OR REPLACE MODEL script on the merged feature (X) and label (y) tables:

The part where I built the XGBoost model using SQL queries.

Let’s pause at this step for a while, go over the above query, and try to understand it better.

  1. The reason for merging the feature (X) and label (y) tables (the query block starting from line 16) is that the OPTIONS section already takes INPUT_LABEL_COLS to specify the target column, so we cannot provide two separate X and y sets to the training script here.
  2. Remember, my goal was to reproduce the model generated in the Implementation of XGBoost on Income data. I found a way to keep MAX_TREE_DEPTH, LEARN_RATE, MIN_TREE_CHILD_WEIGHT, and MAX_ITERATIONS as they were selected in that story.
  3. I also succeeded in splitting the training and evaluation sets the same way I did in the Implementation of XGBoost on Income data: the first 80% of the data entering the CREATE OR REPLACE MODEL step is used for training and the last 20% for evaluation. I did this by using row_num as my DATA_SPLIT_COL, assigning SEQ to the DATA_SPLIT_METHOD parameter (meaning the split uses the ordered row_num column, taking its sequential values), and setting DATA_SPLIT_EVAL_FRACTION to 0.2, meaning the last 20% of the data is used for evaluation.
  4. I could not find a way to specify an EARLY_STOPPING_ROUNDS-like parameter and set it to 20. However, after setting EARLY_STOP to TRUE and MIN_REL_PROGRESS to 0.001, I forced each iteration to reduce the loss by at least 0.1% for training to continue. This is not the same as setting a number of early stopping rounds, so I realized I was not going to obtain the same model, but continued anyway.
  5. After learning about this loss-decrease parameter, I noticed that I could not provide an EVAL_METRIC-like parameter and set it to AUC; it seemed the ML query would use log loss as the evaluation metric for its iterations by default.
  6. I could not find a SEED-like parameter to pin 📌 the results each time we train the model.
  7. Although I would not be able to reproduce the model in my other story on Hypatai, I was curious about the outputs and ran the above create model script. Let’s see the results in the following screenshots.
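Putting the points above together, here is a sketch of what such a CREATE OR REPLACE MODEL statement can look like. The project, dataset, and model names are hypothetical, and the hyperparameter values are placeholders rather than the exact ones from the original story:

```sql
CREATE OR REPLACE MODEL `my_project.income_dataset.xgb_income_model`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
  INPUT_LABEL_COLS = ['income_level'],  -- target column (point 1)
  MAX_TREE_DEPTH = 3,                   -- placeholder values (point 2)
  LEARN_RATE = 0.1,
  MIN_TREE_CHILD_WEIGHT = 1,
  MAX_ITERATIONS = 100,
  DATA_SPLIT_METHOD = 'SEQ',            -- sequential split (point 3)
  DATA_SPLIT_COL = 'row_num',
  DATA_SPLIT_EVAL_FRACTION = 0.2,       -- last 20% held out for evaluation
  EARLY_STOP = TRUE,                    -- stop when progress stalls (point 4)
  MIN_REL_PROGRESS = 0.001              -- require a 0.1% loss reduction per iteration
) AS
SELECT
  x.*,
  y.income_level
FROM `my_project.income_dataset.XGBOOST_X_train` AS x
JOIN `my_project.income_dataset.XGBOOST_y_train` AS y
  ON x.row_num = y.row_num;
```

A helpful detail from the docs: the column named in DATA_SPLIT_COL is excluded from the features automatically, which is why row_num does not leak into the model.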
Execution details showing the time elapsed in each stage and the number of completed iterations. (Notice we came close to the completed iterations in Implementation of XGBoost on Income data.)
Model details.
The results also list the temp tables used for training and evaluating the model, along with the hyperparameters used during the training phase.
A screenshot of the temporary training data. The row_num column is shown, but don’t worry, it was not included as a feature in the training :)
Let’s double-check that the temporary training data is the first 80% of the table we provided to our model.
Results show nothing unexpected here. The overall data entering the create model script had 32.5K rows.
Loss evolution during the training phase
By quick trial and error, I obtained 0.3039 as a valid threshold for assigning classes. (Notice that it is far from the 0.21 we got in the story Interpreting Performance Evaluation Metrics for Classification Problems.)
Evaluation curves. The obtained AUC of 0.928 seems to be a good result.
Let’s check the confusion matrix obtained with the threshold value we have at hand (0.3039).
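BigQuery ML’s ML.CONFUSION_MATRIX function accepts a custom threshold, so this check can be done directly in SQL. A sketch, with the same hypothetical model name as above; when no table argument is given, the function uses the evaluation data saved at training time:

```sql
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `my_project.income_dataset.xgb_income_model`,
  STRUCT(0.3039 AS threshold)  -- the threshold found by trial and error
);
```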
The query for listing the top features in our model.
Top features used in the model.
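For boosted tree models, ML.FEATURE_IMPORTANCE exposes the familiar XGBoost-style importance scores (weight, gain, cover). A sketch with the hypothetical model name:

```sql
SELECT *
FROM ML.FEATURE_IMPORTANCE(
  MODEL `my_project.income_dataset.xgb_income_model`
)
ORDER BY importance_gain DESC  -- rank by gain, as XGBoost plots usually do
LIMIT 10;
```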
Query to get the performance metric on the temporary evaluation data set using the selected threshold.
Notice we get the same results as in the screenshots above, where we selected 0.3039 as our threshold.
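The evaluation query itself can be as simple as a call to ML.EVALUATE with the chosen threshold (model name hypothetical; with no table argument, it evaluates on the held-out split from training):

```sql
SELECT *
FROM ML.EVALUATE(
  MODEL `my_project.income_dataset.xgb_income_model`,
  STRUCT(0.3039 AS threshold)
);
```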
It is also possible to get the raw data required to construct our performance graphs such as ROC curves.
By plotting, for example, recall vs. false_positive_rate, we can get the ROC curve from this data.
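ML.ROC_CURVE returns per-threshold metrics, so the raw ROC data can be pulled with a query like this (hypothetical model name again):

```sql
SELECT
  threshold,
  recall,              -- the true positive rate
  false_positive_rate
FROM ML.ROC_CURVE(
  MODEL `my_project.income_dataset.xgb_income_model`
);
```

Exporting this result and plotting recall against false_positive_rate reproduces the ROC curve shown in the console.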

Please leave a comment under this story and tell me about your experience with the BigQuery ML tool. Learning is a mutual process, so your recommendations on using this and other GCP tools are very welcome. 👩🏻 🌝

Happy SqMLs! 😁
