How to Find New York City Trends

Sofia Rubio-Topete
7 min read · Dec 5, 2019

--

We live in a data-driven, technology-based society, and companies that collect and analyze data have increasingly acquired a negative connotation. Yet accurate, open data is available to anyone, which makes it possible to look for the truth in the data itself rather than in what we see on social media. NOAA's Global Historical Climatology Network (GHCN) is an integrated database of climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews; the data are obtained from more than 20 sources. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1 TB/mo of free-tier processing. The City of New York also publishes many open datasets, ranging from trees in NYC to motor vehicle collisions. These datasets can be found here: https://opendata.cityofnewyork.us/.

BigQuery conveniently provides access to both public datasets, which is helpful given how large they are. Two datasets I found to pair well were the motor vehicle collisions in NYC and the weather for each day. The Motor Vehicle Collision dataset includes several contributing factors, such as driver fell asleep, alcohol intoxication, view obstructed/limited, pavement slippery, and more. Analyzing these open datasets in Python lets us dig into the specifics of motor vehicle collisions in New York City, including how contributing factors such as obstructed/limited view and slippery pavement may have been a result of the weather. To incorporate the weather data and conduct a proper analysis, the first step is to clean and organize the datasets. I found that many of the columns in the collision dataset were mostly empty, and the query took a long time to run because of how much data was being pulled into the data frame I created, all_data.

Rather than dropping columns from the all_data data frame after the fact, I used SQL to create a new dataset with all columns except the ones that were mostly blank, such as the contributing-factor columns beyond the second and the vehicle-type codes beyond the second. I was able to do this with SELECT * EXCEPT (unwanted columns). The returned data frame was already organized, which was very helpful given its size.
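A sketch of that query, assuming the public collision table path and column names as they appear in BigQuery (verify the exact names in the console before running):

```sql
-- Keep everything except the mostly-empty columns: contributing factors
-- beyond the second and vehicle type codes beyond the second.
SELECT
  * EXCEPT (
    contributing_factor_vehicle_3,
    contributing_factor_vehicle_4,
    contributing_factor_vehicle_5,
    vehicle_type_code_3,
    vehicle_type_code_4,
    vehicle_type_code_5
  )
FROM
  `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions`
```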

One issue with merging the weather dataset with the NYC collisions is that the two files use different timestamp formats. I used SQL again to split the NYPD collision data by timestamp, extracting the year, month, day, day of week, and so on. To get the total number of incidents to show up as a column in the nypd_data dataset, I used COUNT(*) in the SELECT part of my query.
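The timestamp split and the incident count can be sketched in one query like this (the table path and the `timestamp` column name are assumed from the public dataset):

```sql
-- Break the collision timestamp into calendar parts and count the
-- number of incidents in each hourly bucket.
SELECT
  EXTRACT(YEAR FROM timestamp) AS year,
  EXTRACT(MONTH FROM timestamp) AS month,
  EXTRACT(DAY FROM timestamp) AS day,
  EXTRACT(DAYOFWEEK FROM timestamp) AS dow,
  EXTRACT(HOUR FROM timestamp) AS hour,
  COUNT(*) AS num_incidents
FROM
  `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions`
GROUP BY
  year, month, day, dow, hour
ORDER BY
  year, month, day, hour
```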

This data frame contains the number of reported motor vehicle collisions by hour and by day, ranging from 2012 to 2018, with about 30 contributing-factor categories and their respective incident counts. The purpose of adding the extra columns for year, hour, day, etc. is not only to allow a more holistic analysis, but also so that other data, such as the weather, can be joined onto the collision report.

One quick visualization I was able to make to show the top collision factors between 2012 and 2016 is the bar plot below. Creating it with matplotlib's plt was a simple way to help readers see what the top causes of motor vehicle collisions have been over time.
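A minimal sketch of that bar plot. The factor names and counts below are made up for illustration; the real numbers come from the query results in nypd_data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Illustrative counts only -- the real values come from the query results.
factor_counts = {
    "Driver Inattention/Distraction": 45210,
    "Failure to Yield Right-of-Way": 18903,
    "Fatigued/Drowsy": 9120,
    "Pavement Slippery": 7410,
    "View Obstructed/Limited": 6100,
}

# Sort descending so the tallest bar comes first.
factors = sorted(factor_counts, key=factor_counts.get, reverse=True)

plt.figure(figsize=(8, 4))
plt.bar(factors, [factor_counts[f] for f in factors])
plt.xticks(rotation=45, ha="right")
plt.ylabel("Reported collisions, 2012-2016")
plt.title("Top contributing factors in NYC motor vehicle collisions")
plt.tight_layout()
plt.savefig("top_factors.png")
```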

To break the data up further by year, I created new data frames from the all_data data frame I had made previously, organizing the incidents by year. By filtering on the year in the timestamp, I got seven data frames, one for each year from 2012 through 2018.
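The per-year split can be sketched in pandas like this; the tiny all_data frame below is a stand-in for the real query result:

```python
import pandas as pd

# Stand-in for the real all_data frame pulled from BigQuery.
all_data = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2012-03-01 08:00", "2015-07-04 22:00", "2018-11-30 17:00",
    ]),
    "num_incidents": [3, 5, 2],
})

# One data frame per year, keyed by the year extracted from the timestamp.
frames_by_year = {
    year: all_data[all_data["timestamp"].dt.year == year]
    for year in range(2012, 2019)
}
```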

The weather data had many data points, most of which would not be necessary. I ran another query to pull only specific fields so the returned data frame would be organized in a way I could understand better. I found precipitation, snow, temperature, visibility, and wind to be the most important weather columns, since those are the factors most likely to contribute to motor vehicle collisions.

Getting the weather data was harder: I had to go through the original dataset in BigQuery and read the documentation to format the query correctly and use the right values, such as max/min temperature, snow depth, and precipitation. One clause I found very useful for filtering values into a data frame was WHERE. I was able to select only data points where the location was in NYC by specifying the latitude/longitude and the state. The resulting data frame contains weather data from different stations in New York City. Because the time-of-day values are not populated for all of these, I decided to stick with a daily analysis. The full data manual can be found on the NOAA website.
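A rough version of that filter, assuming the GHCN daily tables and station metadata as hosted in BigQuery; the bounding box below is an approximation of the NYC area, not the exact one used:

```sql
-- Keep only readings from stations inside an approximate NYC bounding
-- box, for the handful of weather elements used in the analysis.
SELECT
  wx.id, wx.date, wx.element, wx.value
FROM
  `bigquery-public-data.ghcn_d.ghcnd_2016` AS wx
JOIN
  `bigquery-public-data.ghcn_d.ghcnd_stations` AS st
  ON wx.id = st.id
WHERE
  st.state = 'NY'
  AND st.latitude BETWEEN 40.5 AND 40.9
  AND st.longitude BETWEEN -74.05 AND -73.7
  AND wx.element IN ('PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN')
```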

The ability to use JOIN was also very useful. A join clause merges two FROM items so that the SELECT clause can query them as a single source; the join type and the ON or USING clause (the "join condition") specify how rows from the two items are combined and discarded. The returned data frame follows the same timestamp structure as the collision data frame, which lets us merge the two data frames together.
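In sketch form, joining the daily collision counts to the daily weather looks something like this; collisions_by_day and weather_by_day are illustrative names for the two intermediate results, not the actual table names:

```sql
-- Join daily collision counts to daily weather on the shared date parts.
SELECT
  c.year, c.month, c.day,
  c.num_incidents,
  w.prcp, w.snow, w.tmax
FROM
  collisions_by_day AS c
JOIN
  weather_by_day AS w
  ON c.year = w.year
  AND c.month = w.month
  AND c.day = w.day
```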

Combining the data took a few steps. First, the daily weather observations were grouped by station into a single set of values per day; I then took the median across all stations to arrive at a single value for each day. Next, I grouped the NYPD incidents by day and kept only the top 15 collision causes. These two datasets were then joined together into a single pandas data frame.
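Those steps can be sketched on toy data: a median across stations per day, then a merge with the daily incident counts on the date column.

```python
import pandas as pd

# Toy stand-ins: two stations reporting precipitation on two days, and
# the daily incident counts from the collision data.
weather = pd.DataFrame({
    "day":     ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-02"],
    "station": ["A", "B", "A", "B"],
    "prcp":    [0.0, 0.2, 1.1, 0.9],
})
incidents = pd.DataFrame({
    "day": ["2015-01-01", "2015-01-02"],
    "num_incidents": [412, 388],
})

# One weather value per day: the median across all stations.
daily_weather = weather.groupby("day", as_index=False)["prcp"].median()

# One row per day with both incident counts and weather.
combined = incidents.merge(daily_weather, on="day")
```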

When plotting this data, there appears to be a gap in the data between 2016 and 2017. To work around it, one could drop the rows from the period where data is missing, or simply restrict the analysis to the period before the gap begins. I then created another data frame that merges in the weather by day, by merging the collision dataset on the day column.

I also created a correlation plot, though I am not sure how accurate it is given the large chunk of missing data. The weather variables run along the bottom of the chart, with the other contributing factors on the right.
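The correlation itself is a one-liner in pandas. The numbers below are toy values; on the real frame the rows from the missing 2016-2017 span should be dropped first so they do not distort the coefficients.

```python
import pandas as pd

# Toy daily frame: two weather variables and the incident count.
df = pd.DataFrame({
    "prcp":          [0.0, 0.5, 1.0, 0.2, 0.8],
    "snow":          [0.0, 0.0, 2.0, 0.0, 1.0],
    "num_incidents": [300, 340, 410, 310, 390],
})

# Drop rows with missing values before computing pairwise correlations.
corr = df.dropna().corr()
```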

Going forward, I am interested in trying different methods of merging and creating datasets. One challenge I ran into during this analysis was filling in missing values, and I hope to explore that further so I can understand how it is done in an efficient and organized way. I also want a deeper understanding of data cleaning, such as how to fill in a missing location value like a zip code from other fields like latitude and longitude. I found it really powerful to use both queries and pandas to merge, join, and organize datasets.
