Part 1: Google BigQuery ML predicting cardiovascular disease risk using binary classification

James Matson
9 min read · Apr 2, 2023


Experimenting with the built-in ML abilities of Google BigQuery has been on my personal to-do list for a while now, so I finally decided to take the plunge. In doing so, I figured I’d bring everyone else along for the ride in case you were interested in — but hadn’t yet started — a similar journey.

The problem statement

To give myself something to work with, I decided to create a model that can help predict the risk of cardiovascular disease given a number of features like age, height, weight, blood pressure, smoker yes/no and so on. This is a useful test run because I already have a reasonable intuition about the risk factors here, so when I get my model up and running as a prediction endpoint, I can ramp up those ‘known’ risk factors and hopefully see the right prediction as a result.

The data set

To find a suitable dataset to work with, I turned to everyone’s favorite repository of data (and great ML tutorials!), Kaggle, where I found a cardiovascular disease dataset that fit the bill.

This dataset is great because everything is already numerical (ML training loves numbers, strings not so much), but I still wanted to put it into a notebook to look at the data in more detail. This is where Google Colab (https://colab.research.google.com/) is a fantastic resource. Within moments you can have a Jupyter notebook up and running with various useful ML packages pre-installed.

So I uploaded my data to a new Colab notebook and started exploring.

Data wrangling

The first thing I did was convert the age column from days to years (just easier for me to work with), and then I ran describe to get a summary of the data.
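In the notebook, that step looked roughly like this (a sketch assuming the Kaggle CSV has been uploaded to the Colab session as cardio_train.csv, a hypothetical filename; the Kaggle file is semicolon-delimited, hence the sep argument):

import pandas as pd

# Load the uploaded CSV (the Kaggle file uses semicolons as delimiters).
df = pd.read_csv('cardio_train.csv', sep=';')

# Convert age from days to years - much easier to reason about.
df['age'] = (df['age'] / 365).round().astype(int)

# Summary statistics for every numeric column.
df.describe()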

Straight away I could see some issues with the data; some of the features just seemed a bit off. A height of 55 cm, for instance, is surely a data error, as is a blood pressure reading of 11500 (although I’ve spent some time trying to debug Cloud Run deployments, and I’m pretty sure my blood pressure got up around there somewhere…).

I decided to take a graphical look at the distribution of values for some of these features by creating some histograms.
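Something like the following, assuming the dataframe df from the previous step (the exact feature list mirrors the columns discussed below):

import matplotlib.pyplot as plt

# One histogram per feature of interest, side by side.
features = ['age', 'height', 'weight', 'ap_hi', 'ap_lo']
fig, axes = plt.subplots(1, len(features), figsize=(20, 4))
for ax, feature in zip(axes, features):
    ax.hist(df[feature], bins=50)
    ax.set_title(feature)
plt.show()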

Age isn’t too bad, but the rest you can see have outliers — in some cases at both ends. Outliers aren’t great for most ML models, because they can negatively affect the statistical analysis and therefore, the accuracy of your model. So I’ve decided to remove them from specific features in the dataset. To do this, I’m using the ‘Z-score method’, a popular way of determining the probability of a value occurring within a normal distribution of values:

Once I’ve run this against my features, I can see via the histogram(s) that the distributions look much better (to be honest, the blood pressure high/low distribution still doesn’t look great, but I’ll keep it for the purpose of moving on with the demo)

A quick check of the summary stats shows most of the features now look much better (just ignore blood pressure for now; I’ve realised I probably needed to do more work there :D).

I think now we’re ready to move our ‘wrangled’ data set into BigQuery to turn it into a machine learning model!

Model building in BigQuery

The first thing to do is get our data from our final notebook CSV into a BigQuery table. To do this, you can use the google.colab files helper to download your CSV from the notebook, then create a BigQuery table based on the data. First I created a new dataset called cardiovascular_disease, then a new table called raw_data_wrangled based on the uploaded CSV file.
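You could also do the whole thing programmatically from the notebook with the BigQuery client library. A sketch, assuming the wrangled dataframe df from earlier and a GCP project you can write to (the project id here is a placeholder):

from google.cloud import bigquery

client = bigquery.Client(project='your-gcp-project')  # placeholder project id

# Create the dataset if it doesn't already exist.
client.create_dataset('cardiovascular_disease', exists_ok=True)

# Load the wrangled dataframe straight into a new table.
job = client.load_table_from_dataframe(df, 'cardiovascular_disease.raw_data_wrangled')
job.result()  # wait for the load job to complete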

Before I create the model, I need to divide my data up into 3 sets that I’ll use at different stages during my ML journey. I’ll need the training data itself, then I’ll need some data set aside for evaluating the model, and finally a set of data for testing predictions. There are different ideas online as to the right ratio of this split, but I’m going to work with 80% training data, 10% evaluation and 10% test. I’ll paste the BigQuery SQL below, then go over what I’m doing:

# get 10% of the table data for validation
CREATE OR REPLACE TABLE cardiovascular_disease.validation_data
AS
SELECT * FROM cardiovascular_disease.raw_data_wrangled WHERE rand() < 0.1;


# get 10% of the table data for test while avoiding rows already in validation data
CREATE OR REPLACE TABLE cardiovascular_disease.test_data
AS
SELECT * FROM cardiovascular_disease.raw_data_wrangled WHERE rand() < 0.1
AND id NOT IN (SELECT id FROM `gcp-xxx-494d.cardiovascular_disease.validation_data`);


# remove other data sets from the main table
DELETE FROM `gcp-xxx-494d.cardiovascular_disease.raw_data_wrangled`
WHERE id IN (SELECT id from `gcp-xxx-494d.cardiovascular_disease.test_data`);


DELETE FROM `gcp-xxx-494d.cardiovascular_disease.raw_data_wrangled`
WHERE id IN (SELECT id from `gcp-xxx-494d.cardiovascular_disease.validation_data`);

So I’ve created 2 new tables, validation_data and test_data. Each is created by selecting out 10% of the existing data set randomly (although I’ve since realised the 10% test data isn’t exactly that percentage, as I selected it from the ‘remaining’ 90% of rows in the main table. It’ll do for my purposes though).

Finally I’ve removed the rows from the main table that now exist in my validation and test tables. Okay, time to create my ML model!

Now, the choice of model in this instance is binary classification, which lets us predict a yes/no, on/off type value from our available features. That suits our question: does the patient have a risk of cardiovascular disease, yes or no? Below is my CREATE MODEL statement:

# Train model (revision 1: age only, other features commented out for now)
CREATE OR REPLACE MODEL `cardiovascular_disease.binary_model_01`
OPTIONS
  (model_type='logistic_reg', input_label_cols=['risk']) AS
SELECT
  age,
  -- weight, height, gender, ap_hi, ap_lo, cholesterol, gluc, smoke, alco, active,
  cardio AS risk
FROM `cardiovascular_disease.raw_data_wrangled`;

I’ve named the model xxx_’01' because I expect to go through a few iterations before the final product. The model type is logistic_reg, as this will allow us to predict a binary outcome, and my predicted value is ‘risk’ (which in our data sets is labelled as cardio 0/1 (0 meaning the patient didn’t have cardiovascular disease, 1 meaning they did). While I’ve got a few features to work with, I’m including only age at this point, because I want to introduce features gradually to see what affect they have.

Okay, so I’ve run my model through a training iteration, let’s have a look at the result!

Okay, there’s a bit going on there, but the figure I’m most interested in is the ROC (receiver operating characteristic) curve. It’s 0.6, out of a range of 0 to 1. Higher numbers here are good, because they indicate we’re getting a greater number of correct matches against the data when training. An ROC of 0.6 isn’t that great. It’s easier to put in more context when you also take a look at the confusion matrix:

So we correctly predicted 100% of the patients that did have cardiovascular disease, but we also misidentified every patient that didn’t have the disease as having it. Hmmm. This has something to do with the threshold above which we consider a prediction positive (1). If you look at the previous screenshot, you’ll see the slider showing the positive class threshold at 0.4602, and the recall at 1 (100%). So our recall is fantastic, but because anything with a score of 0.46 or higher is considered positive, we’re basically marking everything positive.

BigQuery lets you adjust the positive class threshold to see its effect on the confusion matrix. I’ve pushed the threshold up to 0.5019, and now I’m getting correct predictions of both 0 and 1 (yay!), but my recall has dropped quite a bit because I’m also getting false predictions in both classes.
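Incidentally, you don’t have to rely on the console slider: the ML.CONFUSION_MATRIX function accepts an optional threshold, so you can query the matrix at any cut-off. A sketch, assuming the revision 1 model and training table above:

from google.cloud import bigquery

client = bigquery.Client()

# Confusion matrix for the age-only model at a custom positive class threshold.
query = '''
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `cardiovascular_disease.binary_model_01`,
  (SELECT age, cardio AS risk
   FROM `cardiovascular_disease.raw_data_wrangled`),
  STRUCT(0.5019 AS threshold))
'''
for row in client.query(query).result():
    print(dict(row))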

My next steps after this were to keep creating new revisions: adding features to the model, training and reviewing. I ended up at revision 5 with all my features included (some gave a greater ‘bump’ to my ROC AUC than others, and that’s part of understanding which features matter more to your model).

So where did I end up? Let’s look at the final model output:

Just shy of 0.8! Not too bad at all. Not perfect by any means, but a decent first crack. Setting our positive class threshold to 0.5, let’s check the confusion matrix:

Okay, now we’re looking better. Above 70% correct prediction of both 0 and 1 values. If I were looking to put this model into production, I’d spent a lot more time tuning, reviewing and data wrangling etc. but for now I’m going ahead to evaluation.

# Evaluate the model against the validation data
SELECT
  roc_auc,
  CASE
    WHEN roc_auc > .9 THEN 'excellent'
    WHEN roc_auc > .8 THEN 'very_good'
    WHEN roc_auc > .7 THEN 'good'
    WHEN roc_auc > .6 THEN 'fair'
    WHEN roc_auc > .5 THEN 'improve_model'
    ELSE 'poor'
  END AS model_quality
FROM ML.EVALUATE(MODEL `cardiovascular_disease.binary_model_05`,
  (SELECT age, weight, height, gender, ap_hi, ap_lo, cholesterol,
          gluc, smoke, alco, active, cardio AS risk
   FROM `cardiovascular_disease.validation_data`));

Here I’m using the ML.EVALUATE function in BigQuery to calculate the ROC curve value, but this time against our validation_data table. I’ve also put a case statement to convert the numeric value into a (slightly subjective) text indication of how well my model is doing.

Okay! Our model is good. Not great, but good. I can live with good. Good on a Sunday afternoon is wonderful :).

As a final step, I’ll run the ML.PREDICT function against the 10% of test data I set aside early in the process. I’m wrapping the prediction in a SELECT COUNT(*) so I can count the records where the predicted value matches the actual value. (It’s worth reviewing how ML.PREDICT names its output columns: the predicted value is exposed as predicted_<label>, where <label> is the column you’re predicting, in this case ‘risk’.)

# Test: out of 4944 rows, 72.9975728% predicted correctly
SELECT COUNT(*) FROM (
  SELECT predicted_risk, risk AS actual_risk
  FROM ML.PREDICT(MODEL `cardiovascular_disease.binary_model_05`,
    (SELECT age, weight, height, gender, ap_hi, ap_lo, cholesterol,
            gluc, smoke, alco, active, cardio AS risk
     FROM `cardiovascular_disease.test_data`)))
WHERE predicted_risk = actual_risk;

Alright, so out of 4944 records in my test table, 3609 were predicted correctly, a ratio of 72.9%. Again, I wouldn’t use this model in production, but for a demo I think that’s not too shabby.
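For what it’s worth, you can also get that ratio straight from a single query with COUNTIF rather than counting matches and dividing by hand; a sketch against the same model and test table:

from google.cloud import bigquery

client = bigquery.Client()

# Accuracy = matching predictions / total rows, computed in BigQuery itself.
query = '''
SELECT COUNTIF(predicted_risk = risk) / COUNT(*) AS accuracy
FROM ML.PREDICT(MODEL `cardiovascular_disease.binary_model_05`,
  (SELECT age, weight, height, gender, ap_hi, ap_lo, cholesterol,
          gluc, smoke, alco, active, cardio AS risk
   FROM `cardiovascular_disease.test_data`))
'''
print(list(client.query(query).result())[0].accuracy)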

Hopefully you found this as interesting as I did. I think there’s plenty of evidence here that BigQuery is a useful ML workbench, and it’s something I’ll certainly be looking into further, along with competing workbenches like AWS SageMaker and Azure ML Studio.

I’ll be doing a follow-up article on taking my best model, dockerising it and running it as a prediction endpoint via an API, so please keep an eye out for that, and subscribe so you don’t miss out! :)

Until then!
