Analyzing COVID-19 with BigQuery
Using the free BigQuery public dataset for analysis and planning
Johns Hopkins maintains a dataset of COVID-19 confirmed cases and have made it free in the form of a CSV file for academic and research use. In an effort to make the data easier to query and analyze, Google Cloud is making it publicly available in BigQuery. BigQuery has a sandbox through which you can try it out without having to sign up for Google Cloud (or having to provide a credit card).
Note: JHU changed the licensing terms on their dataset and so this data is no longer available, so the queries in this blog post has been modified to use the newer open dataset at
bigquery-public-data.covid19_open_data.compatibility_view
I’ve not updated the output snapshots, etc.
As a special case, this BigQuery dataset is free to query even outside the free tier (until Sep 2020). If you join the COVID-19 data against any other datasets, the first 1 TB of querying per month of those other datasets is free and included in the sandbox program.
Querying and plotting the data
Go to https://console.cloud.google.com/bigquery and type in:
SELECT
date, SUM(confirmed) num_reports
FROM `bigquery-public-data.covid19_open_data.compatibility_view`
WHERE country_region = 'Italy'
GROUP BY date
HAVING num_reports IS NOT NULL
ORDER BY date ASC
The first few rows of the query returned:
while the last few rows were:
showing the rapid growth of confirmed cases in Italy. Click on the “Explore with Data Studio” button and in Data Studio, do the following two steps:
- Select the icon for “Smoothed Time Series Chart” (see red oval in the graph below)
- Change the metric to “num_reports”
You will get a graph like this:
The data is global, so we can do the same thing for Singapore, but here let’s use the geospatial capability of BigQuery:
SELECT
date, SUM(confirmed) num_reports
FROM `bigquery-public-data.covid19_open_data.compatibility_view`
WHERE ST_Distance(ST_GeogPoint(longitude, latitude),
ST_GeogPoint(103.8, 1.4)) < 200*1000 -- 200km
GROUP BY date
HAVING num_reports IS NOT NULL AND num_reports > 0
ORDER BY date ASC
This shows a much slower growth (note the Y-axis):
Mapping the data with BigQuery GeoViz
To get county-level data (US-only), you can use the fips code in the data. Go to BigQuery GeoViz at https://bigquerygeoviz.appspot.com/ and type in the following query to get the latest confirmed numbers from each county:
WITH cases_by_county AS (
SELECT
fips,
ARRAY_AGG(confirmed ORDER BY date DESC LIMIT 1)[OFFSET(0)] as num_cases
FROM `bigquery-public-data.covid19_open_data.compatibility_view`
WHERE confirmed IS NOT NULL
GROUP BY fips
)
SELECT
num_cases,
ST_CENTROID(county_geom) AS map_marker
FROM
cases_by_county
JOIN
`bigquery-public-data`.geo_us_boundaries.counties
ON fips = county_fips_code
The map starts out looking like this (i.e, just a small red point at each reporting county’s center point):
Go into the Style section and select these:
- Change fillColor to be data-driven and change the dialog as follows (the domain is logarithmic, and the range is hue-based):
- In the fillOpacity dialog, change the value to be 0.4.
- Change the circleRadius to also be logarithmic and data-driven:
to get a better looking plot where the radius depends on the number of confirmed cases.
Extrapolating the trend
BigQuery incorporates machine learning algorithms and time-series prediction methods. These are not epidemiological models, just extrapolations of current trends. However, even extrapolations can be useful for planning purposes. So, let’s see how to extrapolate current trends.
I’ll first illustrate this for Japan (see this article for a primer on how to do time series prediction in BigQuery). First create a dataset named advdata to hold the output model. Then, type in:
CREATE OR REPLACE MODEL advdata.numreports_forecast
OPTIONS(model_type='ARIMA',
time_series_data_col='num_reports',
time_series_timestamp_col='date') AS
SELECT
date, SUM(confirmed) num_reports
FROM `bigquery-public-data.covid19_open_data.compatibility_view`
WHERE country_region = 'Japan' AND confirmed IS NOT NULL
GROUP BY date
ORDER BY date ASC
Note that at the time I’m writing this, the ARIMA model is in alpha. Contact your Google Cloud sales rep to get your project whitelisted.
This creates a time series extrapolation model. We can now get a 14-day forecast from it by doing:
SELECT * FROM
ML.FORECAST(MODEL advdata.numreports_forecast,
STRUCT(14 AS horizon, 0.9 AS confidence_level))
This yields (again, please note that this is simply an extrapolation of the current trend — for true predictions, please consult a epidemiologist!):
We picked Japan for a reason — because we’re doing time-series predictions using the ARIMA model, we needed a stable/periodic trend. The US is exhibiting exponential growth, and is not a good candidate for ARIMA if we use the raw values.
To use the ARIMA model for places like the US, we should take the log of the number of reports and fit a trend to it. Here’s a query that will do that:
CREATE OR REPLACE MODEL advdata.numreports_forecast
OPTIONS(model_type='ARIMA',
time_series_data_col='log_num_reports',
time_series_timestamp_col='date') AS
SELECT
date, LOG(SUM(confirmed)) log_num_reports
FROM `bigquery-public-data.covid19_open_data.compatibility_view`
WHERE country_region = 'US' AND confirmed IS NOT NULL
GROUP BY date
ORDER BY date ASC
;
SELECT
* , EXP(forecast_value) AS forecast_numreports
FROM ML.FORECAST(MODEL advdata.numreports_forecast,
STRUCT(14 AS horizon, 0.9 AS confidence_level))
This yields:
Again, we wish to reiterate that this curve is simply an extrapolation of the current trend — for true predictions that take into account preventative measures, hospital capacity, community transmission, etc., you should consult an epidemiologist!
This is an extrapolation of historical trends, and does not take into account all the other factors that affect the number of confirmed COVID cases. It’s not meant to predict actual outcomes. Instead, it’s meant to extrapolate what will happen if things stay the same (and they won’t).
What happens if we go ahead and compare to actual outcomes anyway? The following analysis was done by Felipe Hoffa. Let’s check what happens when we augment the query to compare the ARIMA predictions vs the actual numbers for the last 6 days:
Note that these are not good as predictions. The predicted numbers quickly exhibit a 30% overestimation and the confidence bounds are quite wide. This makes sense because we are dealing with exponential growth (the standard deviation of an exponential is proportional to the final value), and the conditions on the ground are changing (hopefully, this is because shelter-in-place is starting to flatten the curve).
Bottom line: do not treat the predictions as predicting actual outcomes. They aren’t. These are extrapolations.
Getting a daily forecast
To get a daily forecast, we can create a script out of the two queries. This is as simple as writing the two SQL statements one after the other, making sure to end the first one with a semicolon. Then, click on “Schedule query” to run this every day:
Stay safe!
Closing comments and action items
- Please follow the guidance of public health authorities. In the US, that is the CDC. At the time this article was written, the primary guidance was to flatten the curve. Please do your part and stay home and away from large gatherings.
- Having ready access to the data supports a lot of analytical needs, but realize that it is quite possible that your interpretation of the data is wrong. Please consult experts in public health before making any life-or-death decisions or recommending any decision based on your analysis.
- The extrapolation trends are just that — extrapolations. The number of reports can be affected by more testing, and also by interventions such as “flatten the curve”. So, use the extrapolations for planning purposes (“what happens if the current trend persists, and nothing changes”) but realize that the world will change and the forecast will be wrong.
- BigQuery is free without a credit card (within the free tier). If you add a credit card make sure to set cost controls.
I don’t speak for my employer. This is not official Google work. Thanks to Felipe Hoffa for many insightful comments, and Amir Hormati for the GeoViz plot settings. Any errors that remain are mine, of course.