Build a Machine Learning model using Google BigQuery (COVID-19 Dashboard)

Ryder Nguyen
4 min read · Sep 8, 2020


Continuation of the ETL Data Pipeline Workflow

Recommended Prerequisites: Understanding of Python (pandas & working with dataframes), Jupyter Notebook, SQL table operations and GitHub

Code can be found here.

This is a continuation of the COVID-19 dashboard I posted two weeks ago. This tutorial shows how to use the built-in machine learning models provided by BigQuery to forecast the number of cases (confirmed & deaths) over the next 30 days using ARIMA (AutoRegressive Integrated Moving Average). Read more about ARIMA here. We’ll be walking through the model_update.py file in the repository.

  1. SETUP
  2. CREATE MODELS & GET PARAMETERS
  3. FORECAST FOR THE NEXT 30 DAYS
  4. ARIMA MODEL DASHBOARD
  5. NEXT STEPS

SETUP

Similar to the setup for the ETL project, follow the steps to obtain API keys for local deployment and instantiate the bigquery_client using the code below:

import os
from google.cloud import bigquery, storage

deployment = 'local'  # 'local' or 'cloud'

if deployment == 'cloud':
    from pyspark.sql import SparkSession  # ONLY FOR CLOUD DEPLOYMENT
    # Start Spark session
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.0") \
        .master('yarn') \
        .appName('spark-bigquery-ryder') \
        .getOrCreate()

    # Instantiate BigQuery client
    bigquery_client = bigquery.Client()
    # Instantiate Storage client
    storage_client = storage.Client()
else:
    # Set credentials for BigQuery & Cloud Storage !FOR LOCAL ONLY, DON'T COPY TO PYSPARK
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "covid-jul25-**************.json"
    bigquery_client = bigquery.Client()
    storage_client = storage.Client()
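
Once the clients are instantiated, a quick sanity check (a hypothetical snippet, not from the repository) confirms the credentials work before launching anything heavier:

query_job = bigquery_client.query("SELECT 1 AS ok")  # trivial query to verify authentication
print([row.ok for row in query_job.result()])  # expected output: [1]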

CREATE MODELS & GET PARAMETERS

CREATE OR REPLACE MODEL & ML.EVALUATE

Google’s documentation for BigQuery ML models can be found here. First, put all the values gathered from the last project (table temp_cases) into the table format we want: each row is a daily observation of confirmed cases or deaths for a single county.
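
That reshape happens upstream in the repository; purely as illustration, a minimal sketch of producing arimaformat from temp_cases might look like the code below (the temp_cases dataset and column names are assumptions, not the actual schema):

reshape_sql = """
CREATE OR REPLACE TABLE `covid-jul25.usprojections.arimaformat` AS
SELECT
  DATE(date) AS date1,                       -- observation date
  case_type,                                 -- 'confirmed' or 'deaths'
  state,                                     -- full state name
  CONCAT(state, '-', county) AS statecounty, -- series ID in state_name-county_name form
  value                                      -- daily count
FROM `covid-jul25.usprojections.temp_cases`  -- assumed location of temp_cases
"""
bigquery_client.query(reshape_sql).result()

With arimaformat in place, create one model per state per case type: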

CREATE OR REPLACE MODEL `covid-jul25.arimamodels.{case_type}_{state[0]}`
OPTIONS
(model_type = 'ARIMA',
time_series_timestamp_col = 'date1',
time_series_data_col = 'value',
time_series_id_col = 'statecounty'
) AS
SELECT date1, value, statecounty
FROM
`covid-jul25.usprojections.arimaformat`
WHERE state = '{state[1]}'
AND case_type = '{case_type}'

The code block above creates an ARIMA model for each state and each case type (deaths vs. confirmed). Note that each model fits the time series in ‘value’, timestamped by ‘date1’, with ‘statecounty’ as the time-series ID column (statecounty takes the form state_name-county_name), so within each state’s model there is effectively one fitted series per county. I opted to split the run by state (see the driver-loop sketch after this list) because:

  1. Running over the whole US at once would take hours on BigQuery because of the size of the data, even though you could still run a single county-level model
  2. It is easier to debug: you can see whether the code breaks or times out for any individual state
  3. It is easier to view the models by state instead of for the whole US
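
For context, here is a hypothetical sketch of the driver loop that fills in the {case_type} and {state[0]} placeholders and submits one training job per state (the structure of sorted_statedict is an assumption; the actual loop lives in model_update.py):

for case_type in ['confirmed', 'deaths']:
    # sorted_statedict is assumed to map abbreviation -> full name, e.g. {'AK': 'Alaska', ...}
    for state in sorted_statedict.items():  # state[0] = 'AK', state[1] = 'Alaska'
        create_sql = f"""
        CREATE OR REPLACE MODEL `covid-jul25.arimamodels.{case_type}_{state[0]}`
        OPTIONS (model_type = 'ARIMA',
                 time_series_timestamp_col = 'date1',
                 time_series_data_col = 'value',
                 time_series_id_col = 'statecounty') AS
        SELECT date1, value, statecounty
        FROM `covid-jul25.usprojections.arimaformat`
        WHERE state = '{state[1]}' AND case_type = '{case_type}'
        """
        bigquery_client.query(create_sql).result()  # blocks until training finishes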

By doing this, we have the following ARIMA models:

arimamodels structure

Please note that confirmed_AK is the model, while confirmed_AK_tbl is a table storing the fitted parameters for each county’s series (such as the non-seasonal p, d, q orders, AIC, and variance), obtained with ML.EVALUATE:

DROP TABLE IF EXISTS `covid-jul25.arimamodels.{case_type}_{state[0]}_tbl`;
CREATE TABLE `covid-jul25.arimamodels.{case_type}_{state[0]}_tbl` AS
SELECT * FROM ML.EVALUATE(MODEL `covid-jul25.arimamodels.{case_type}_{state[0]}`);

Combine the parameters for all counties into a single table confirmed_US, using the wildcard * to SELECT from all of the per-state tables at once (BigQuery wildcard tables match on the table-name prefix):

DELETE FROM `covid-jul25.arimamodels.{case_type}_US` WHERE True;
INSERT INTO `covid-jul25.arimamodels.{case_type}_US`
SELECT * FROM
`covid-jul25.arimamodels.{case_type}_*`
WHERE statecounty is not null;

FORECAST FOR THE NEXT 30 DAYS

ML.FORECAST

The ML.FORECAST call is embedded in a user-defined function arima_forecast, which requests a 30-day forecast horizon with a confidence level of 90% for each case type per state:

query_job = bigquery_client.query(arima_forecast('deaths', 30, 0.9, sorted_statedict))

# INSIDE function arima_forecast:
ML.FORECAST(MODEL `covid-jul25.arimamodels.{case_type}_{state[1][0]}`,
            STRUCT({horizon} AS horizon, {confidence_level} AS confidence_level))
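
The full function lives in model_update.py; a hypothetical reconstruction of how it might assemble a single statement across states (the UNION ALL structure and the exact shape of sorted_statedict are assumptions) looks like this:

def arima_forecast(case_type, horizon, confidence_level, sorted_statedict):
    """Build one query that forecasts every state's model and stacks the results."""
    selects = []
    for state in enumerate(sorted_statedict.items()):  # state[1][0] = 'AK' (assumed shape)
        selects.append(f"""
        SELECT * FROM ML.FORECAST(
          MODEL `covid-jul25.arimamodels.{case_type}_{state[1][0]}`,
          STRUCT({horizon} AS horizon, {confidence_level} AS confidence_level))""")
    return "\nUNION ALL\n".join(selects)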

Results for each state then get combined into a single US table, arimamodels.confirmed_US_forecast (and likewise arimamodels.deaths_US_forecast). The rest of the notebook deals with slicing the forecast down to a 7-day horizon and updating the landing page for the dashboard.
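
For instance, the 7-day slice could be pulled with a query along these lines (forecast_timestamp and forecast_value are standard ML.FORECAST output columns, but the exact query in the notebook may differ):

seven_day_sql = """
SELECT statecounty, forecast_timestamp, forecast_value
FROM `covid-jul25.arimamodels.confirmed_US_forecast`
WHERE DATE(forecast_timestamp) <= DATE_ADD(CURRENT_DATE(), INTERVAL 7 DAY)
ORDER BY statecounty, forecast_timestamp
"""
seven_day = bigquery_client.query(seven_day_sql).to_dataframe()  # pandas DataFrame for the landing page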

ARIMA MODEL DASHBOARD

The final dashboard provides:

  1. Projections of both confirmed cases and deaths for the next 30 days; depending on the use case, the forecast horizon can be adjusted to different needs
  2. Parameters at the county level
  3. A toggle for confirmed vs. deaths and filters for states and counties for more convenient analysis

NEXT STEPS

  1. BigQuery has more to offer in terms of built-in models, so it is worth building a model using all of the data collected (mobility & Rt)
  2. Discover the strengths and weaknesses of the ARIMA model (time constraints, weekly seasonality, goodness of fit)
