Build a Machine Learning model using Google BigQuery (COVID-19 Dashboard)
Continuation of the ETL Data Pipeline Workflow
Recommended Prerequisites: Understanding of Python (pandas & working with dataframes), Jupyter Notebook, SQL table operations and GitHub
The code can be found here.
This is a continuation of the COVID-19 dashboard I posted two weeks ago. This tutorial shows how to use the built-in machine learning models provided by BigQuery to forecast the number of cases (confirmed & deaths) over the next 30 days using ARIMA (AutoRegressive Integrated Moving Average). Read more about ARIMA here. We’ll be walking through the model_update.py file in the repository.
- SETUP
- CREATE MODELS & GET PARAMETERS
- FORECAST FOR THE NEXT 30 DAYS
- ARIMA MODEL DASHBOARD
- NEXT STEPS
SETUP
Similar to the setup for the ETL project, follow the steps to obtain API keys for local deployment and instantiate the bigquery_client using the code below:
import os
from google.cloud import bigquery, storage

deployment = 'local' # 'local' or 'cloud'

if deployment == 'cloud':
    from pyspark.sql import SparkSession # ONLY FOR CLOUD DEPLOYMENT

    # Start Spark session
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.0") \
        .master('yarn') \
        .appName('spark-bigquery-ryder') \
        .getOrCreate()

    # Instantiate BigQuery client
    bigquery_client = bigquery.Client()
    # Instantiate Storage client
    storage_client = storage.Client()
else:
    # Set credentials for BigQuery and Cloud Storage !FOR LOCAL ONLY, DON'T COPY TO PYSPARK
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "covid-jul25-**************.json"

    bigquery_client = bigquery.Client() # Instantiates a client
    storage_client = storage.Client() # Instantiates a client
CREATE MODELS & GET PARAMETERS
CREATE OR REPLACE MODEL & ML.EVALUATE
Google’s documentation for BigQuery ML models can be found HERE. First, put all the values gathered from the last project (table temp_cases) into the table format we want: each row is a daily observation of deaths or confirmed cases.
CREATE OR REPLACE MODEL `covid-jul25.arimamodels.{case_type}_{state[0]}`
OPTIONS
(model_type = 'ARIMA',
time_series_timestamp_col = 'date1',
time_series_data_col = 'value',
time_series_id_col = 'statecounty'
) AS
SELECT date1, value, statecounty
FROM
`covid-jul25.usprojections.arimaformat`
WHERE state = '{state[1]}'
AND case_type = '{case_type}'
The code block above creates an ARIMA model for each state and each case type (deaths vs. confirmed). Each model fits a time series (‘value’) indexed by time (‘date1’) and split by ID column (‘statecounty’, in the form state_name-county_name). Running ARIMA this way yields a separate fit for every county within each state’s model. I opted to split the models by state because:
- Running a single job for the whole of the US would take hours on BigQuery because of the size of the data, even though a single county-level model is technically possible
- It is better for debugging: you can see which state, if any, breaks the code or times out
- It is easier to view the models by state instead of for the whole of the US
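To make the per-state looping concrete, here is a minimal sketch of how the CREATE OR REPLACE MODEL statement can be templated and submitted once per state and case type. The helper name and the shape of the state list are assumptions for illustration; model_update.py in the repository may structure this differently.

```python
def create_model_query(case_type, state):
    """Build the CREATE MODEL statement for one state/case type.
    `state` is an (abbreviation, full_name) pair, e.g. ('AK', 'Alaska')."""
    return f"""
    CREATE OR REPLACE MODEL `covid-jul25.arimamodels.{case_type}_{state[0]}`
    OPTIONS
      (model_type = 'ARIMA',
       time_series_timestamp_col = 'date1',
       time_series_data_col = 'value',
       time_series_id_col = 'statecounty'
      ) AS
    SELECT date1, value, statecounty
    FROM `covid-jul25.usprojections.arimaformat`
    WHERE state = '{state[1]}'
      AND case_type = '{case_type}'
    """

# One BigQuery job per state per case type
states = [('AK', 'Alaska'), ('AL', 'Alabama')]  # ...and so on for all states
for case_type in ('confirmed', 'deaths'):
    for state in states:
        query = create_model_query(case_type, state)
        # bigquery_client.query(query).result()  # uncomment to submit the job
```

Looping in Python (rather than one giant query) is what makes per-state debugging possible: a failed or timed-out state shows up as an individual job.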
By doing this, we have the following ARIMA models:
Please note that confirmed_AK is the model itself, while confirmed_AK_tbl is a table storing the fitted parameters of each county’s model, obtained with ML.EVALUATE:
DROP TABLE IF EXISTS `covid-jul25.arimamodels.{case_type}_{state[0]}_tbl`;
CREATE TABLE `covid-jul25.arimamodels.{case_type}_{state[0]}_tbl` AS
SELECT * FROM ML.EVALUATE(MODEL `covid-jul25.arimamodels.{case_type}_{state[0]}`);
COMBINE the parameters for all counties into a single table confirmed_US, using the table wildcard * to SELECT from all the confirmed_* tables:
DELETE FROM `covid-jul25.arimamodels.{case_type}_US` WHERE True;
INSERT INTO `covid-jul25.arimamodels.{case_type}_US`
SELECT * FROM
`covid-jul25.arimamodels.{case_type}_*`
WHERE statecounty is not null;
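The evaluate-and-combine step can be scripted the same way. This is a sketch under the article's naming conventions; the two helper functions are hypothetical, not the repository's actual API.

```python
def evaluate_table_query(case_type, state_abbr):
    """Materialize the ML.EVALUATE output of one state's model into a `_tbl` table."""
    model = f"covid-jul25.arimamodels.{case_type}_{state_abbr}"
    return (
        f"DROP TABLE IF EXISTS `{model}_tbl`;\n"
        f"CREATE TABLE `{model}_tbl` AS\n"
        f"SELECT * FROM ML.EVALUATE(MODEL `{model}`);"
    )

def combine_query(case_type):
    """Union every state's parameter table into one `_US` table via the table wildcard."""
    return (
        f"DELETE FROM `covid-jul25.arimamodels.{case_type}_US` WHERE True;\n"
        f"INSERT INTO `covid-jul25.arimamodels.{case_type}_US`\n"
        f"SELECT * FROM `covid-jul25.arimamodels.{case_type}_*`\n"
        f"WHERE statecounty IS NOT NULL;"
    )
```

Note the `WHERE statecounty IS NOT NULL` filter: the table wildcard `{case_type}_*` would otherwise also match the `_US` destination table itself and any table without that column.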
FORECAST FOR THE NEXT 30 DAYS
ML.FORECAST
The ML.FORECAST code is embedded in a user-defined function arima_forecast that makes a 30-day forecast (horizon STRUCT(30)) with a confidence level of 90% for each case type per state:
query_job = bigquery_client.query(arima_forecast('deaths', 30, 0.9, sorted_statedict))

#INSIDE function arima_forecast:
ML.FORECAST(MODEL `covid-jul25.arimamodels.{case_type}_{state[1][0]}`, STRUCT({horizon} AS horizon, {confidence_level} AS confidence_level))
Results for each state then get combined into a single US table arimamodels.confirmed_US_forecast, and likewise for arimamodels.deaths_US_forecast. The rest of the notebook deals with slicing the forecast down to a 7-day horizon and updating the landing page for the dashboard.
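Slicing the 30-day forecast down to a 7-day window is straightforward in pandas once the forecast table is pulled into a dataframe. A minimal sketch, assuming the table carries the standard ML.FORECAST output columns (forecast_timestamp, forecast_value) plus the statecounty ID; the toy dataframe below stands in for the real query result:

```python
import pandas as pd

# Toy stand-in for `arimamodels.confirmed_US_forecast`, which would normally
# come from bigquery_client.query(...).to_dataframe()
forecast = pd.DataFrame({
    "statecounty": ["Alaska-Anchorage"] * 30,
    "forecast_timestamp": pd.date_range("2020-08-01", periods=30, freq="D"),
    "forecast_value": range(100, 130),
})

# Keep only the first 7 forecast days for each county
seven_day = (
    forecast.sort_values("forecast_timestamp")
            .groupby("statecounty")
            .head(7)
)
```

Grouping by statecounty before taking the head means the 7-day cut is applied per county, not to the table as a whole.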
ARIMA MODEL DASHBOARD
The final dashboard provides:
- Projections for both confirmed cases and deaths for the next 30 days. Depending on your use case, the forecast horizon can be adjusted to different needs.
- Parameters on the county level
- A toggle for confirmed vs deaths and filter for states and counties for more convenient analysis
NEXT STEPS
- BigQuery has more to offer in terms of built-in models, so it’s worth building a model using all the data collected (mobility & Rt)
- Discover strengths and weaknesses of the ARIMA model (time constraints, weekly seasonality, goodness of fit)