When are the bugs coming? ML modelling in BigQuery to predict the next bug outbreak
Firstly, Happy New Year 🎉 As summer in Australia kicks into gear, we’ve started 2020 off with a bit of fun, attempting to answer the question everyone in Australia is thinking this time of year (especially my wife and kids!) — when are the bugs coming?! That is, predicting cockroach and mosquito (aka mozzie) outbreaks.
The bug prediction project actually came together back in October, giving us plenty of time to research, iterate, and test before the summer season (and accordingly the bugs) got into gear. In this blog post we’ll share the technical solution we followed and what a high-level solution looks like.
All using Google Cloud BigQuery, taking advantage of the BigQuery Geography, Analytic and ML Functions.
Brief introductions, for those of you who don’t know us
I’m Drew. I’ve been working at Google for sometime now. My current role is as a Customer Solutions Engineer, helping our Advertisers get the best out of their websites, apps, and data. All from my desk in beautiful Sydney.
I’m Dennis. I’ve been an Analyst and Consultant at Google in Sydney for the last six-ish years, working on insights, strategy, and solutions to advertising clients’ problems using data. In this role I’ve developed a real passion for ML.
Weather you say?
Yep, weather! Weather patterns became the main features in our model, as they have been shown to affect insect populations. Using historical weather data we can look for correlations between patterns over time and bug outbreaks, considering temperature, precipitation / rainfall, humidity… (and potentially much more).
Historical weather data
As this is a fun experiment we’ll play with the Global Historical Climatology Network (GHCN) BigQuery public dataset. The GHCN weather data contains reports from over 80,000 stations in 180 countries, going back decades. It’s even been quality-checked, and it’s available in BigQuery ready to go. Perfect.
Also, as we only care about weather trends in Australia, we’ll massively reduce this set by keeping only local weather stations (although it would be fun to look at this globally one day).
…and voilà, an australia_historical_weather data table, size of approx 1.36 MB / 17,088 rows, ready to play with.
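As a rough sketch, here is one way such a table could be built from the public dataset. The `bigquery-public-data.ghcn_d` tables are real; the `ASN` station-id prefix (GHCN’s prefix for Australian stations), the single-year table, and the output schema are our assumptions — adjust for the years and elements you need (avgTd, i.e. dew point, isn’t covered by the core GHCN elements sketched here).

```sql
-- Sketch: build australia_historical_weather from the GHCN public dataset,
-- keeping only Australian stations and pivoting daily elements into columns.
CREATE OR REPLACE TABLE mydataset.australia_historical_weather AS
SELECT
  s.id AS stationId,
  s.latitude,
  s.longitude,
  w.date AS day,
  -- GHCN stores temperatures and rainfall in tenths of units
  MAX(IF(w.element = 'TAVG', w.value / 10, NULL)) AS avgTemperature,
  MAX(IF(w.element = 'TMAX', w.value / 10, NULL)) AS maxTemperature,
  MAX(IF(w.element = 'TMIN', w.value / 10, NULL)) AS minTemperature,
  MAX(IF(w.element = 'PRCP', w.value / 10, NULL)) AS precipitation
FROM `bigquery-public-data.ghcn_d.ghcnd_2019` AS w
JOIN `bigquery-public-data.ghcn_d.ghcnd_stations` AS s
  ON s.id = w.id
WHERE s.id LIKE 'ASN%'  -- Australian station ids
GROUP BY stationId, s.latitude, s.longitude, day
```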
Australia state boundaries
Typically weather data is associated with a weather station’s lat / long location, which is the case for the GHCN data. So the first real challenge is mapping and rolling these lat / long locations up to more meaningful geographic areas, like a town, city or state.
In this post we’ll choose state level (given the trends information soon to be discussed), to make bug outbreak predictions.
The Australian Bureau of Statistics (abs.gov.au) website maintains “Statistical Area” locations. Using the ABS Maps we can download the “State and Territory (STE) ASGS Ed 2016 Digital Boundaries in ESRI Shapefile Format” .shp file, which details the boundaries for all the Australian states (similar to this screenshot of a state boundary in Google Maps).
This data also includes the Area Size (km2) of each state. Nice! Let’s be sure to keep this as well, it’ll be a useful feature later on 👍.
Once downloaded we can convert the Shapefile (.shp) into a .csv that BigQuery can consume. One tool for this is ogr2ogr, e.g…
ogr2ogr -f CSV -dialect sqlite -sql "select AsGeoJSON(geometry) AS geom, * from SA4_2016_AUST" SA4_2016_AUST.csv SA4_2016_AUST.shp
The awesome medium post by Lak Lakshmanan on how to load geographic data like shapefiles into BigQuery covers this well.
Once the CSV is available it can be uploaded into BigQuery, creating an australia_state_boundaries table to use, the size of 87.9 MB / 107 rows.
Tip: When you upload the CSV, don’t use schema auto-detect. Instead, specify a schema so that the geometry column can be a GEOGRAPHY type. It’ll avoid us needing to use ST_GeogFromGeoJson in the query below.
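For example, with the bq CLI (the dataset name and the abbreviated column list here are illustrative — match the full schema of your exported CSV):

```shell
# Load the converted CSV, declaring geom as GEOGRAPHY up front.
bq load --source_format=CSV --skip_leading_rows=1 \
  mydataset.australia_state_boundaries \
  SA4_2016_AUST.csv \
  geom:GEOGRAPHY,STATE_NAME_2016:STRING,AREASQKM16:FLOAT
```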
Mapping weather data to a state
Now that we have the australia_state_boundaries table to play with, this is the easy bit. The Geography Functions of BigQuery simplify things, allowing us to aggregate weather data that is ST_WITHIN a state boundary.
WITH australia_state_boundaries AS (
  -- field names as exported from the ABS shapefile
  SELECT geom AS polygon, STATE_NAME_2016 AS sa4Location, AREASQKM16 AS areaSize
  FROM mydataset.australia_state_boundaries
)
SELECT
  b.sa4Location,
  b.areaSize,
  w.day,
  AVG(avgTemperature) AS avgTemperature,
  AVG(avgTd) AS avgTd,
  MAX(maxTemperature) AS maxTemperature,
  MIN(minTemperature) AS minTemperature,
  AVG(precipitation) AS precipitation
FROM mydataset.australia_historical_weather AS w
JOIN australia_state_boundaries AS b
  ON ST_WITHIN(ST_GEOGPOINT(w.longitude, w.latitude), b.polygon)
GROUP BY sa4Location, areaSize, day
At the same time creating max, min and avg values in the SQL. Let’s call this new table australia_state_historical_weather.
Google Trends for bug outbreaks
Google Trends is a tool to analyze Google Search queries over time. Looking at the term “Cockroach”, it’s clear the trends data is a useful indicator of outbreaks: you can see (in this screenshot taken from Google Trends) the searches for the topic build up as Australia approaches summer.
Accordingly, this is the value we’ll try to predict, i.e. the label used in our ML model. The trends tool allows you to filter by state and download the information as a CSV.
Join this with the australia_state_historical_weather table, and we now have a data set containing the last few years of weather and bug trends.
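A minimal sketch of that join, assuming the downloaded trends CSV was uploaded as a google_trends table keyed by state and day (all table and column names here are illustrative):

```sql
-- Sketch: combine trends search volumes with daily state-level weather.
CREATE OR REPLACE TABLE mydataset.australia_weather_and_trends AS
SELECT
  w.*,
  t.mosquitoSearchVolume,
  t.cockroachSearchVolume
FROM mydataset.australia_state_historical_weather AS w
JOIN mydataset.google_trends AS t
  ON t.state = w.sa4Location
 AND t.day = w.day
```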
On a quick side note. The trends tool also has a “Related queries” section, showing what users also searched for (an example screenshot)…
but whatever you do don’t look in this section when on the Cockroach trends page… and when you don’t look at it, definitely don’t see the related query for Cockroach Milk! and whatever you do, don’t let curiosity get the better of you and search to find out it’s a new health craze 🙈 ! I’ll never look at the world the same way 🤮.
Training our “buggy” ML Model
We’ll break the dataset into a number of windowed features, aggregating behaviour patterns over time. To understand this concept more check out my structuring data to predict the future post.
As the data is reasonably simple we can do it all in BigQuery. To do this we’ll make use of BigQuery’s Analytic Functions, specifically the OVER (PARTITION BY …) clause, to aggregate data.
As an example, the SQL below creates a dataset for mosquitos.
- It aggregates 1 week of data each day, creating features of past patterns: PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING.
- It also looks 4 weeks (29 days) forward, to create a label that trains the model to look into the future: PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 29 FOLLOWING AND 29 FOLLOWING.
SELECT
  sa4Location,
  day,
  EXTRACT(MONTH FROM day) AS month,
  EXTRACT(DAYOFYEAR FROM day) AS dayOfYear,
  EXTRACT(WEEK FROM day) AS weekOfYear,
  -- Label
  SUM(mosquitoSearchVolume) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 29 FOLLOWING AND 29 FOLLOWING) AS mosquitoSearchVolume4WeekLabel,
  -- 1 Week Volume Features
  mosquitoSearchVolume / AVG(mosquitoSearchVolume) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING) AS percentageChangeFromAvgMosquitoSearchVolume1Week,
  SUM(mosquitoSearchVolume) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING) / areaSize AS mosquitoSearchVolumePerAreaSize1Week,
  -- 1 Week Weather Features
  AVG(avgTemperature) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING) AS avgTemperature1Week,
  AVG(avgTd) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING) AS avgTd1Week,
  MAX(maxTemperature) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING) AS maxTemperature1Week,
  MIN(minTemperature) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING) AS minTemperature1Week,
  AVG(precipitation) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING) AS avgPrecipitation1Week,
  -- 1 Week Area Size Features (remember earlier we decided to keep this!)
  AVG(avgTemperature) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING) / areaSize AS avgTemperaturePerAreaSize1Week,
  AVG(avgTd) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING) / areaSize AS avgTdPerAreaSize1Week,
  AVG(precipitation) OVER (PARTITION BY sa4Location ORDER BY day ROWS BETWEEN 7 PRECEDING AND 0 FOLLOWING) / areaSize AS avgPrecipitationPerAreaSize1Week
FROM …
A few tips…
- Depending on how the Google Trends data looks, it may be useful to cap values above the 99th percentile to avoid any crazy spikes that will confuse the ML model (APPROX_QUANTILES is great for this).
- Don’t just stop at aggregating back 1 week at a time (as per the SQL). The ML model may find patterns in weather aggregations over other time periods, e.g. 12 weeks.
- In the SQL you’ll see there is opportunity to create and play with new features, e.g. MAX and MIN precipitation. See what you can find to improve the ML model’s accuracy.
- Remember you are aggregating past data, so the oldest date being considered needs data before it to aggregate. For 1 week windows, ensure the WHERE clause includes day > (SELECT DATE_ADD(MIN(day), INTERVAL 7 DAY) FROM australia_state_historical_weather).
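The capping tip above could be sketched like this (table and column names are illustrative): APPROX_QUANTILES(x, 100) returns 101 boundary values, so OFFSET(99) picks the 99th percentile.

```sql
-- Sketch: cap mosquito search volume at the 99th percentile.
WITH stats AS (
  SELECT APPROX_QUANTILES(mosquitoSearchVolume, 100)[OFFSET(99)] AS p99
  FROM mydataset.australia_state_historical_weather_dataset
)
SELECT
  t.* EXCEPT (mosquitoSearchVolume),
  LEAST(t.mosquitoSearchVolume, s.p99) AS mosquitoSearchVolume
FROM mydataset.australia_state_historical_weather_dataset AS t
CROSS JOIN stats AS s
```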
That’s the ML dataset done: a new table, australia_state_historical_weather_dataset, ready to train an ML model. Yay.
BigQuery ML can be used on it to build a model, and start making predictions. With a few lines of SQL (starting with CREATE MODEL) you’ll have a model to play with.
CREATE OR REPLACE MODEL `[project].when_are_the_bugs_coming`
OPTIONS(model_type='linear_reg', -- an example model type; others work too
        input_label_cols=['mosquitoSearchVolume4WeekLabel']) AS
SELECT * FROM `[project].australia_state_historical_weather_dataset`
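Before predicting, it’s worth checking the model’s error with ML.EVALUATE. A minimal sketch (for brevity this evaluates against the training table itself; a held-out slice of dates would be more honest):

```sql
-- Sketch: inspect mean_absolute_error and friends for the trained model.
SELECT *
FROM ML.EVALUATE(MODEL `[project].when_are_the_bugs_coming`,
  (SELECT * FROM `[project].australia_state_historical_weather_dataset`))
```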
That’s the model created. Now you can create a snapshot of australia_state_historical_weather_dataset containing fresh data in a week, and start making predictions.
SELECT predicted_mosquitoSearchVolume4WeekLabel AS prediction, sa4Location, day
FROM ML.PREDICT(MODEL `[project].when_are_the_bugs_coming`,
  (SELECT * EXCEPT (mosquitoSearchVolume4WeekLabel)
   FROM `[project].australia_state_historical_weather_dataset_snapshot`))
Let us know how you get on.
Thanks for reading.