Four Seasons (and 5 Boroughs) in One Post with BigQuery

Using Google BigQuery to explore how weather affects NYC

Reto Meier
Google Cloud - Community

--

Originally published on the Google Cloud Big Data and Machine Learning blog

With over 150GB of New York City public data, parsing it all for patterns and insights is a challenge. One solution? Combine it with another 30GB of weather data, and use the CORR function to find correlations for you.

Correlation doesn’t imply causation, but it can help you identify patterns worth exploring. By finding the highest correlations between weather variables and the NYC datasets, I’m going to try and answer a number of weather-related questions about the city:

  • Does the temperature affect Citi Bike and taxi rides?
  • Does weather affect the causes of motor vehicle accidents?
  • Do wind gusts cause runaway cars?
  • Which 311 complaint is most closely related to low temperatures?
  • What’s the effect of snow on manhole covers and illegal parking?
  • Are rats and dead trees related to warmer temperatures?
  • Do noise complaints drop when wind speeds rise?
  • What happens to dogs in New York in the summer?

Does the temperature affect the number of Citi Bike and taxi rides?

Inspired by Sara Robinson’s exploration of Citi Bike data, I thought I’d look at the relationship between Citi Bike rides and temperature. We can see that there’s an apparent relationship between temperature, and the rides taken on the Citi Bike network.

Here I’m comparing the number of riders each day, normalized against the average number of rides for that day-of-the-week. As the temperature rises, so does the number of rides being taken. The heatmap shows us the concentration of starting stations — as the temperature increases, more rides begin in the outer boroughs.

Let’s apply the same experiment to the NYC Yellow Cab ride data. I’m going to simplify the graph by bucketing temperatures into degree increments, and finding the average number of trips taken within each bucket.

Use this BigQuery Query to get the data used in this graph

There’s some noise at the upper and lower end, where we have fewer samples, but nothing that indicates a real relationship. But hot, cold or temperate — New Yorkers take a lot of cab rides.

Inside the NOAA “global summary of day” (GSOD) weather dataset

Before answering the rest of our questions, let’s take a quick detour to simplify things.

The GSOD dataset from NOAA, one of several public datasets available in BigQuery, includes temperature, precipitation, snow and wind from over 9,000 stations dating back to 1929. To reduce time and cost, let’s extract only the data we need.

The following query selects only the weather variables I need from the observations recorded between the 2009 to 2016 span of our NYC data, from the two NYC airports — JFK and La Guardia, calculating the average results between them for each day.

SELECT 
-- Create a timestamp from the date components.
timestamp(concat(year,"-",mo,"-",da)) as timestamp,
-- Replace numerical null values with 0s
AVG(IF (temp=9999.9, 0, temp)) AS temperature,
AVG(IF (visib=999.9, 0, visib)) AS visibility,
AVG(IF (wdsp="999.9", 0, CAST(wdsp AS Float64))) AS wind_speed,
AVG(IF (gust=999.9, 0, gust)) AS wind_gust,
AVG(IF (prcp=99.99, 0, prcp)) AS precipitation,
AVG(IF (sndp=999.9, 0, sndp)) AS snow_depth
FROM
`bigquery-public-data.noaa_gsod.gsod20*`
WHERE
CAST(YEAR AS INT64) > 2008
AND
(stn="725030" OR -- La Guardia
stn="744860") -- JFK
GROUP BY
timestamp

The total size of the new table is under 200k, so from here on, our weather joins cost basically nothing.

Does weather affect the likely causes of motor vehicle accidents?

This query uses the CORR function to calculate the Pearson Coefficient — the linear dependence (correlation) — between each weather variable (temperature, precipitation, snow, wind and visibility) and the number of motor vehicle accidents for each primary cause.

Use this BigQuery Query to generate this output table.

The results suggest slippery pavements cause more accidents when there’s more snow on the ground, and that there are more driverless / runaway when it rains and there are gusty winds.

By graphing snow depth and slips, we can see a possible pattern. The greatest impact appears to be on the heaviest day of snow.

My gut tells me gusty winds could cause more cars to slip away from their drivers and escape down slippery, rain-covered streets — but with only four months of data, graphing the results doesn’t really support it.

Calculating correlations for weather variables and 311 complaints

311 is a service offered by New York City (and many other municipalities) for non-emergency inquiries and non-urgent community concerns.

Let’s start by running queries to find which 311 complaint categories have the highest correlations with temperature, snow, wind speed and rain, and take a look at the top results to answer our questions.

Which 311 complaint is most closely related to low temperatures?

The strongest correlation is the least surprising: The call volume for complaints regarding heating are significantly higher when the weather is colder. Looking at the data, you can see the trend line is a good fit for an exponential relationship.

What was more surprising was seeing a similar correlation strength for missing manhole covers related to the depth of snow.

What’s the relationship between snow, missing manhole covers and illegal parking?

Looking at the scatter plot of call volume against snow depth and temperature, we can see that there does appear to be a relationship between snow depth and missing manhole covers.

If we plot our data against time, we get a clear picture that snowfall is likely a significant contributor.

There’s a pattern of spikes in both snow and call volume throughout the past 7 years. It’s unlikely to be a causal relationship (unless people are using the cover of a snowfall to steal manhole covers. Or — if the causation is in the other direction — manhole covers are the only thing between New York and a winter apocalypse.)

Graphing illegal parking mainly shows a trend of New Yorkers doing so more often. But blocked hydrants seem unrelated to snowfall, with a trend towards more annual reports and higher snowfall likely causing the false correlation.

Double parking features two spikes consistent with heavy snow in 2010 and 2016, but little correlation with snowfall in between those years.

Are rats and dead trees related to high temperatures?

The primary correlation for dead or dying trees appears to be seasonal — with call volumes growing in spring and peaking at the start of summer.

Use this BigQuery Query to recreate the graph

I showed this to Lak Lakshmanan, an expert on BigQuery and weather data, and he suggested that we seasonally adjust these figures against the monthly averages.

The following graph measures the difference from the monthly average temperature, against the proportion of monthly call volume. Without the seasonal effects, the relationship looks very weak.

If we seasonally adjusted rat sightings, there might be a trend of high temperatures with increased rodent sightings. The scatter point is noisy, but if we bucket our results by finding the average per degree in temperature, we get a much clearer indication of a relationship.

The bucketed graph offers a clearer signal, but by hiding the variance it makes the relationship appear stronger than it really is.

Which is why I was thrilled when Lak introduced me to the hexbin graph!

Hexbin graphs group together all the points that land within a given hex, and assigns that hex a color according to the number of points within it.

It results in a clearer pattern than the scatter plot, but retains the variation. The distinctive diagonal band from lower left to upper right, shown in both the overall shape and in the area of greatest concentration, suggests a weak correlation between heat and rat sightings.

Do noise complaints drop when there’s higher wind speeds?

Barking dogs, loud talking, car music and loud parties have a correlation coefficient of at least -0.75 with wind speed. But, we also know New York City has windier weather in the winter, so let’s seasonally adjust, and jump straight to our hexbin graph to see if there’s more to it.

There’s a triangular pattern in the lower left corner of each graph that implies a weak, negative correlation — ​ a greater likelihood that there’ll be more noise complaints when the wind is low, and significantly fewer when the wind is 5 knots above the monthly average.

The strongest pattern is for parties, which demonstrates the same trend in the overall distribution, as well as a similar pattern within the region of most concentrated results.

Except for barking dogs. What do you think is going on here?

Use this BigQuery Query to recreate this graph.

Dig in and find other nuggets hidden within the data

I’ve shared a larger subset of possible weather/complaint correlations in this Google Sheets spreadsheet. It’s filtered to show only complaints with a Pearson Coefficient stronger than +/-0.7 for temperature, snow depth or wind speed — and +/- 0.5 for precipitation.

Once you dig in, you’ll find multiple confounding factors, including differences between boroughs, seasonality, annual growth, the relationships between variables and many others. When you’re done, make sure you share the results using #TILwBQ.

If you’re new to BigQuery follow these instructions to get started, and remember that everyone gets 1TB at no charge every month to run queries.

--

--

Reto Meier
Google Cloud - Community

Developer Advocate @ Google, software engineer, and author of “Professional Android” series from Wrox. All opinions are my own.