Create Models with SQL
A Quick Introduction to Using Boosted Trees on Google Cloud BigQuery ML
I have started an online GCP ML course that introduces Google Cloud Platform products related to Machine Learning and Big Data. One module of the MOOC covers BigQuery ML, a tool that eases the use of ML algorithms for SQL practitioners and data analysts by removing the complexity of juggling multiple products for data storage, preprocessing, and model building. BigQuery offers a single platform for both storing and querying data, and a great interactive environment to experiment with ML models.
It is accessible via several options such as:
- the Google Cloud Console
- the `bq` command-line tool
- the BigQuery REST API
- an external tool such as a Jupyter notebook or a business intelligence platform
I decided to make a quick implementation of what we saw in the story Implementation of XGBoost on Income data on Hypatai, this time on BigQuery ML, since it supports the CREATE MODEL SQL statement for boosted tree models using XGBoost. My intention was to reproduce the same model we created in that story using the console option from the list above. I came very close but could not fully do so, either because the functionality has not yet reached that point or because I still need time to fully adopt and get familiar with BQML.
The data and the scripts I used during the experimentation phase can be found on my GitHub page, under the BigQueryML folder. The input data is the same preprocessed data used for training in Implementation of XGBoost on Income data.
Let’s now go over the steps I followed:
The step where I executed the CREATE OR REPLACE MODEL script on the merged train and test tables:
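Since the query itself only appears as a screenshot, here is a rough sketch of its shape. The project, dataset, table, and column names, as well as the hyperparameter values, are placeholders, not the exact ones from my project; the option names are the actual BigQuery ML ones discussed below:

```sql
-- Sketch of the CREATE OR REPLACE MODEL step (placeholder names and values)
CREATE OR REPLACE MODEL `my_project.income_ds.xgb_income_model`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',  -- boosted trees via XGBoost
  INPUT_LABEL_COLS = ['income_class'],     -- the target column
  MAX_TREE_DEPTH = 5,                      -- placeholder hyperparameter values;
  LEARN_RATE = 0.1,                        -- in practice, set these to the ones
  MIN_TREE_CHILD_WEIGHT = 1,               -- selected in the Hypatai story
  MAX_ITERATIONS = 500,
  DATA_SPLIT_METHOD = 'SEQ',               -- sequential split over DATA_SPLIT_COL
  DATA_SPLIT_COL = 'row_num',
  DATA_SPLIT_EVAL_FRACTION = 0.2,          -- last 20% held out for evaluation
  EARLY_STOP = TRUE,
  MIN_REL_PROGRESS = 0.001                 -- each iteration must cut loss by 0.1%
) AS
SELECT *
FROM (
  -- merge the preprocessed train and test tables and number the rows,
  -- so the sequential split can peel off the last 20% for evaluation
  SELECT ROW_NUMBER() OVER () AS row_num, *
  FROM (
    SELECT * FROM `my_project.income_ds.train`
    UNION ALL
    SELECT * FROM `my_project.income_ds.test`
  )
);
```

Note that `DATA_SPLIT_COL` is only used for splitting; BigQuery ML does not feed it to the model as a feature.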
Let’s park at this step for a while and go over the above query and try to understand it more.
- The reason for merging the train and test tables (the query block starting from line 16) is that the OPTIONS section already takes INPUT_LABEL_COLS to specify the target column, so we cannot provide separate X and y sets to the training script here.
- Remember, my goal was to reproduce the model generated in Implementation of XGBoost on Income data. I did find a way to keep MAX_TREE_DEPTH, LEARN_RATE, MIN_TREE_CHILD_WEIGHT, and MAX_ITERATIONS as they were selected in that story.
- I also succeeded in splitting the training and evaluation sets the same way as in Implementation of XGBoost on Income data: the first 80% of the data coming into the CREATE OR REPLACE MODEL step is used for training and the last 20% for evaluation. I did this by using row_num as my DATA_SPLIT_COL, assigning SEQ to the DATA_SPLIT_METHOD parameter (meaning the split is done over the sequential values of the ordered row_num column), and setting DATA_SPLIT_EVAL_FRACTION to 0.2, so the last 20% of the data is used for evaluation.
- I could not find a way to specify an EARLY_STOPPING_ROUNDS-like parameter and set it to 20. However, by setting EARLY_STOP to TRUE and MIN_REL_PROGRESS to 0.001, I forced each iteration to reduce the loss by at least 0.1% for training to continue. This is not the same as setting a number of early stopping rounds, so I realized I was not going to obtain the same model, but continued anyway.
- After learning about this loss-decrease parameter, I noticed that I also could not provide an EVAL_METRIC-like parameter and set it to AUC; it seems the ML query uses log loss as the evaluation metric for its iterations by default.
- I could not find a SEED-like parameter to pin 📌 the results each time we train the model.
- Although I would not be able to reproduce the model in my other story on Hypatai, I was curious about the outputs and ran the above create model script. Let’s see the results in the following screenshots.
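For readers who want to poke at these outputs themselves rather than read them off screenshots, BigQuery ML also exposes them via SQL. A minimal sketch, assuming a model was created under the placeholder name `my_project.income_ds.xgb_income_model`:

```sql
-- Per-iteration training and evaluation loss (log loss by default for
-- classifiers), useful for seeing where MIN_REL_PROGRESS stopped training
SELECT iteration, loss, eval_loss, duration_ms
FROM ML.TRAINING_INFO(MODEL `my_project.income_ds.xgb_income_model`)
ORDER BY iteration;

-- Metrics on the held-out last 20%; note that roc_auc is reported here
-- even though it cannot be chosen as the training evaluation metric
SELECT *
FROM ML.EVALUATE(MODEL `my_project.income_ds.xgb_income_model`);
```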
Please leave a comment under this story and tell me about your experience with the BigQuery ML tool. Learning is a mutual process, so your recommendations on using this and other GCP tools are very welcome. 👩🏻 🌝
Happy SqMLs! 😁