BigQuery GIS + ML on government open data

Use BigQuery to visualize, test, and predict citizen hypotheses using government open data.

Willis Zhang
8 min read · May 30, 2019

I’ve been having too much fun mashing together whatever public datasets I can get my hands on, with BigQuery being my one-stop shop for super quick analytics, machine learning, and GIS visualization.

Example BQ Geo Viz query on NYC bike shares

Putting on my citizen data scientist hat, the hypothesis I wanted to test was: “Does higher rent lead to more housing code violations?”

Here, I used rent as a proxy for cost of living, where a higher cost of living may overburden owners and prevent them from keeping their properties up to code. I focused my initial study area on Montgomery County, MD in the DC suburbs. Here were the steps I took:

1. Prep and load the data (boring, but Dataprep makes it fun)

Rental info here and code violation data here. Just export and download the CSVs.

Now set up your GCP environment: go to the console and create a project with a billing account if you haven’t already (we should be well under the $300 in free trial credits). You may be prompted to enable APIs along the way.

We’re actually going to clean our data in Dataprep first. Find Dataprep on the console left-side menu and click through the agreements.

Let’s start by wrangling the DHCA Rental Survey data. Import the CSV from your local machine or Cloud Storage, then create a new recipe.

We’ll end up creating two flows, one for each CSV
Dataprep: point-and-click ETL please because I don’t want to code

Here’s the recipe I used for the DHCA Rental Survey, essentially extracting the zip code, converting it into a string, and deleting rows with missing or malformed zip codes:

Wrangle recipe in natural language, with wrangle file on the right
extractpositions col: {Facility Address} type: last count: 6
rename type: manual mapping: [{right_Facility Address},'zipcode']
textformat col: zipcode type: trimwhitespace
textformat col: {Facility Name} type: trimwhitespace
textformat col: {Facility Name} type: trimquotes
filter type: custom rowType: single row: ISMISMATCHED(zipcode, ['Zipcode']) action: Delete
drop col: {Facility Address} action: Drop
drop col: {Average Rent 2015} action: Drop
drop col: {Average Rent 2016} action: Drop
drop col: {Average Rent 2017} action: Drop
drop col: {Average Rent 2018} action: Drop
drop col: {Bedroom Types} action: Drop
textformat col: {Facility Name} type: trimwhitespace
settype col: zipcode type: String

Click “Run Job” and specify the following settings to keep it simple: “no compression, single file, with headers, with quotes, with delimiter: ,”.

Apache Beam without typing any JavaScript

You can actually look at the Dataflow (Apache Beam) job if you’re curious about the underlying MapReduce-style steps.

Follow the same steps for the Code Violations CSV. Wrangle recipe below. MoCo will acknowledge its violations data is a little messy, so I deleted a lot of columns where there were missing values.

drop col: Latitude action: Drop
drop col: Longitude action: Drop
drop col: {:@computed_region_a9cs_3ed7} action: Drop
drop col: {:@computed_region_r648_kzwt} action: Drop
settype col: {Zip Code} type: String
drop col: {Date Closed} action: Drop
drop col: {Zip Codes} action: Drop
drop col: {Council Districts} action: Drop
drop col: Councils action: Drop
drop col: Communities action: Drop
drop col: Municipalities action: Drop
drop col: {Unit Number} action: Drop
drop col: {Service Request Number} action: Drop
drop col: {Service Request Created Date} action: Drop
drop col: {Service Request Closed Time} action: Drop
drop col: {Service Request Status} action: Drop
drop col: Corrected action: Drop
drop col: {Inspection Date} action: Drop
drop col: {Location Description} action: Drop
drop col: {Code Reference} action: Drop
drop col: Disposition action: Drop
drop col: Location action: Drop
drop col: {Date Assigned} action: Drop

Now for the main attraction: from the console, go to BigQuery to start loading our tables.

Create a dataset where you’ll store tables. I’m calling mine “moco_rent_violations.”

Then create a new table.

I’ll name this table “code_violations” and point the source at the Dataprep job output CSV in Google Cloud Storage, stored under:

gs://dataprep-staging-*/<USER>/jobrun/

Instead of auto detecting the schema, I’ll actually be prescriptive and “Edit as text,” making all my fields type STRING:
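The “Edit as text” box takes a comma-separated list of name:type pairs. As a rough sketch (only Violation_ID and Zip_Code are names we actually use in queries later; the remaining column names depend on what survived your Dataprep recipe, all typed STRING):

Violation_ID:STRING,Zip_Code:STRING

…and so on for each remaining column in the CSV.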

Then click “Create table”! You can see successful (or unsuccessful) job loads in Job history.

We’ll similarly load our DHCA Rental Survey table and call it “rental”. Schema below:

Yay! All our tables are loaded.
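Before moving on, a quick row-count sanity check is cheap insurance; a minimal sketch, assuming the dataset and table names above:

SELECT
  'code_violations' AS table_name,
  COUNT(*) AS row_count
FROM
  `<PROJECT-ID>.moco_rent_violations.code_violations`
UNION ALL
SELECT
  'rental',
  COUNT(*)
FROM
  `<PROJECT-ID>.moco_rent_violations.rental`

If either count comes back zero, check Job history for a failed load before going any further.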

2. Instead of nested joins, create lots of views

I’m now going to join these tables on zip code, selecting the number of violations and the associated rent difference in each zip code. Enter the following in the BigQuery Query Editor:

SELECT
  zipcode,
  COUNT(DISTINCT Violation_ID) AS violations,
  AVG(Average_Rent_Percent_Difference_From_Previous_Year) AS rent_diff
FROM
  `<PROJECT-ID>.moco_rent_violations.code_violations`
JOIN
  `<PROJECT-ID>.moco_rent_violations.rental`
ON
  Zip_Code = zipcode
GROUP BY
  zipcode

All the green check marks mean our query is good to go. Feel free to “Save query” for future reference.

Now to get ready for the next step of visualization, I want to join GIS polygons in this query (with the ZCTA dataset in GCP marketplace). If you haven’t gathered from my Dataprep run-through, I really don’t like long code, so if I can save myself typing a couple nested joins then that’s a win.

I can do this by saving the aggregated query results as a “View” (I named mine “violations_rental”) and essentially interacting with it as a table; should the underlying data in the original tables change, the View surfaces those updates in its aggregation.
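If you’d rather not click “Save view” in the UI, BigQuery’s DDL can do the same thing; a minimal sketch wrapping the aggregation query above:

CREATE OR REPLACE VIEW `<PROJECT-ID>.moco_rent_violations.violations_rental` AS
SELECT
  zipcode,
  COUNT(DISTINCT Violation_ID) AS violations,
  AVG(Average_Rent_Percent_Difference_From_Previous_Year) AS rent_diff
FROM
  `<PROJECT-ID>.moco_rent_violations.code_violations`
JOIN
  `<PROJECT-ID>.moco_rent_violations.rental`
ON
  Zip_Code = zipcode
GROUP BY
  zipcode

The same trick works for the zip-polygon view we create next.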

Now I can join this violations_rental view with zip code polygons:

SELECT
  violations_rental.zipcode,
  violations,
  rent_diff,
  zip_area.zcta_geom AS zipcode_polygon,
  ST_GeogPoint(int_point_lon, int_point_lat) AS lat_lon
FROM
  `<PROJECT-ID>.moco_rent_violations.violations_rental` AS violations_rental
JOIN
  `bigquery-public-data.geo_us_boundaries.us_zip_codes` AS zip_area
ON
  zip_area.zip_code = violations_rental.zipcode
WHERE
  county LIKE '%Montgomery%'
  AND state_name = 'Maryland'
ORDER BY
  rent_diff DESC

Save the results as a View “violations_rental_zip.”

3. Visualize the situation today

Follow the instructions from the ZCTA dataset to get your project whitelisted to start using BigQuery Geo Viz.

Once whitelisted, paste the last query above in “1. Select data” then press “Run.” After processing, click “See results” to move to “2. Define columns,” and select “lat_lon” as the Geometry column.

Click “Add styles” to move to “3. Style.”

As an example, for fillColor, we’ll color as a Data-driven Function “linear” on Field “rent_diff.” You can then toggle the levels of coloring based on rent difference percentage.

For circleRadius, we’ll have a Data-driven Function “linear” on Field “violations” and set the circle radius levels.

If successful, we should have a circle plot where fill color reflects the rent increase and circle size reflects the number of violations.

4. Create, evaluate, and predict with ML

At first glance, there doesn’t seem to be an overt correlation. But we can quantify the correlation (or lack thereof) with linear regression. So let’s build an ML model using SQL! It’s easy.

DISCLAIMER: this is for demonstration purposes, as the number of samples (28) is tiny because the data is aggregated to the zip-code level; ideally we would have data more granular than zip code. For the model, I found zip codes 20905 (Colesville) and 20871 (Clarksburg) to be outliers based on rent percentages, so we’ll exclude them. This is the “query” to create our ML linear regression model:

CREATE OR REPLACE MODEL `moco_rent_violations.rent_violations_ML`
OPTIONS(model_type='linear_reg') AS
SELECT
  violations AS label,
  rent_diff
FROM
  `<PROJECT-ID>.moco_rent_violations.violations_rental_zip`
WHERE
  zipcode != '20905'
  AND zipcode != '20871'

In less than a minute, we’ll have what looks like a table in our dataset but is actually a trained ML model.
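If you’re curious how training went (iterations, loss, learning rate), BigQuery ML exposes that via ML.TRAINING_INFO:

SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `moco_rent_violations.rent_violations_ML`)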

How good is it? Let’s evaluate using this query:

WITH eval_table AS (
  SELECT
    violations AS label,
    rent_diff
  FROM
    `<PROJECT-ID>.moco_rent_violations.violations_rental`)
SELECT
  *
FROM
  ML.EVALUATE(MODEL moco_rent_violations.rent_violations_ML,
    TABLE eval_table)

To better understand the statistical results, take a look at this intro article.

The results seem all over the place. We could probably drop the violations outliers as well to decrease mean_absolute_error, but then we’d have too little data left. We’ll continue anyway for demonstration.
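For a linear regression, the fitted coefficients tell the correlation story in two numbers; ML.WEIGHTS returns the slope on rent_diff and the intercept:

SELECT
  *
FROM
  ML.WEIGHTS(MODEL `moco_rent_violations.rent_violations_ML`)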

Let’s do a hypothetical prediction with this model! We’ll SELECT zip codes with their rent_diff numbers and use that to predict the hypothetical number of code violations for a given rent increase.

WITH pred_table AS (
  SELECT
    ST_GeogFromText(zipcode_polygon) AS zip_polygon,
    rent_diff
  FROM
    `<PROJECT-ID>.moco_rent_violations.violations_rental_zip`
  WHERE
    zipcode != '20905'
    AND zipcode != '20871')
SELECT
  *
FROM
  ML.PREDICT(MODEL moco_rent_violations.rent_violations_ML,
    TABLE pred_table)
ORDER BY
  rent_diff ASC

Alrighty then. But let’s go one step further and visualize this ML model’s predictions. Paste this query into BigQuery Geo Viz, except this time make the Geometry column “zip_polygon.”

Satisfying

Click on one of the zip codes. In the example below, the model says that for a 1.8% increase in rent, we can expect around 6,500 code violations.
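You can also sanity-check a single hypothetical value without the map by feeding one row straight into ML.PREDICT; a sketch using the 1.8% figure above:

SELECT
  predicted_label
FROM
  ML.PREDICT(MODEL `moco_rent_violations.rent_violations_ML`,
    (SELECT 1.8 AS rent_diff))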

Conclusion: there are probably a lot of other factors at play, but this is a super easy, fast, and economical first step toward getting a visual read on the question. This entire process cost me about 25 cents, including failures and missteps. None of it requires a PhD or a computer science background, and given more data, we could build a full TensorFlow model using AutoML Tables, which is literally point-and-click. Until next time.


Willis Zhang

Google Cloud engineer. Current obsessions: Kubernetes and data analytics for local governments. Posts and opinions are my own.