Investigating New York City Public Datasets with BigQuery

Reto Meier
Google Cloud - Community
5 min readFeb 15, 2017

--

Originally published on the Google Cloud Big Data and Machine Learning blog

New York City is home to 8.5 million residents, and more than 50 million people visit this vibrant and dynamic city each year. With so many sights and sounds, it’s easy to get lost in the details, and lose sight of the big picture: How do New Yorkers actually survive the concrete jungle?

Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we’ve incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:

There’s no cost for the first terabyte of data you process each month, and because BigQuery is serverless, there’s no infrastructure you need to manage or maintain. That means we can focus on querying, joining and visualizing this data to learn more about New York City and the people who make up this bustling metropolis.

As you’ll see below, all these new data sets can be used with existing data, such as NOAA GSOD to discover trends based on changes in weather. We’ll also be continually adding new datasets from other cities, so soon you’ll be able to compare habits and trends between cities and countries around the globe — using BigQuery to better understand the world around us.

Find the New York City streets are you most likely to find a loud party

If there’s something strange in your neighborhood, the right number to call is 311; created specifically for non-emergency municipal inquiries and non-urgent community concerns. What does that include?

The graph below shows the top five reasons why New Yorkers call 311 over the past 4 years.

SELECT
Extract(YEAR from created_date) AS year,
REPLACE(UPPER(complaint_type), "HEATING", "HEAT/HOT WATER")
AS complaint,
COUNT(*) AS count
FROM
`bigquery-public-data.new_york.311_service_requests`
GROUP BY complaint, year
ORDER BY count DESC
LIMIT 1000

(To run this query yourself, you can copy/paste the above SQL into BigQuery, or follow this link to my shared query.)

Call volume tells us that it gets noisy in New York, and it also gets very cold. By joining the 311 calls to the NOAA GSOD weather table, we confirm that most calls about faulty heat and hot water happen when the temperature drops — while noise remains a constant annoyance.

Use this BigQuery Query to recreate the graph

There were also 267,887 calls about dead, damaged or dying trees, so you might wonder if there are any healthy trees left in NYC.

Find the Virginia Pines in New York City

The decennial NYC tree surveys from 1995, 2005, and 2015 are all available in BigQuery, and the 2015 survey found the London Planetrees, Honeylocusts and Callery Pears represented almost a third of all trees outside of parks.

Use this BigQuery Query to recreate this heatmap

Find the only collision caused by an animal that injured a cyclist

There’s a lot of traffic in New York, and while the number of accidents has increased each year, the number of injuries has remained fairly consistent.

Fortunately, the number of deaths has dropped by an average of 9% each year.

As you can see below, “Driver Inattention/Distraction” is the most likely cause of accident and injury, but disregarding traffic control (such as running a red light) is the most common cause of death.

The following graphs show that most traffic accidents happen in Brooklyn, but it’s Midtown and Downtown Manhattan that have the highest concentration of collisions — and Staten Island the highest proportion of deaths per accident.

Use this BigQuery Query to recreate the graphs

With motor vehicle accidents resulting in 6 motorist deaths for each cyclist death (and no Citi Bike rider deaths), you might be safer taking a Citi Bike.

Find the Citi Bike record for the longest distance in the shortest time (on a route with at least 100 rides)

Comparing the average duration of 5 of the most popular Citi Bike routes, to taxi journeys beginning and ending within an approximately 50-meter radius of the corresponding Citi Bike stations, we see that for trips under 10 minutes there’s not much difference between taking a taxi or riding a bike.

Use this BigQuery Query to recreate the graph

There are countless ways to slice, dice, join and visualize this data

Share your own insights and visualizations using the hashtag #TILwBQ, and join us here every week for Today I Learned with BigQuery, as we dig into these tables, launch new public datasets, demonstrate BigQuery, share protips and offer interviews with Big Data industry experts.

If you’re new to BigQuery remember that everyone gets 1TB at no charge every month to run queries; you can follow these getting started instructions to… get started.

--

--

Reto Meier
Google Cloud - Community

Developer Advocate @ Google, software engineer, and author of “Professional Android” series from Wrox. All opinions are my own.