Combining Datasets

Sofia Rubio-Topete
6 min read · Oct 30, 2019


Combining BigQuery public datasets in Python.

In module 2, I started using BigQuery to access specific data points from larger datasets. One dataset I had been analyzing previously was NYPD Motor Vehicle Collisions from NYC Open Data, which can be found at https://opendata.cityofnewyork.us/. This dataset has been updated frequently since 2012. For this module, I was interested in seeing how I could pull other datasets from BigQuery and combine them with the Motor Vehicle Collisions dataset, so that I could further understand how other factors, such as weather, may have contributed to the number of accidents during a certain time. The Motor Vehicle Collisions dataset includes several contributing factors, such as driver fell asleep, alcohol intoxication, view obstructed/limited, pavement slippery, and more. I am interested in seeing how contributing factors such as obstructed/limited view and slippery pavement may have been a result of the weather.

To better clean and organize this data, I decided to use data frames so that I could combine them later on. I first saved all the motor vehicle collision data to all_data. This gave me a better understanding of what we are working with, so that I could clean the data frames to later combine and analyze them.
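As a rough sketch, pulling the table into a data frame with the google-cloud-bigquery client might look like the following; the public table path is my assumption, and credentials are presumed to be configured already.

```python
from google.cloud import bigquery

# Assumes credentials are configured, e.g. via the
# GOOGLE_APPLICATION_CREDENTIALS environment variable.
client = bigquery.Client()

# Pull the full public collisions table into a pandas data frame.
# The table path below is an assumption; check the BigQuery UI for
# the exact location of the public dataset.
sql = """
SELECT *
FROM `bigquery-public-data.new_york.nypd_mv_collisions`
"""
all_data = client.query(sql).to_dataframe()

print(all_data.shape)
print(all_data.columns.tolist())
```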

This dataset also has many blank columns and columns that could be dropped, so I ran another query to get all data except for the unwanted columns, such as the contributing factors for vehicles 3, 4, and 5, which were often blank, along with the data on those additional vehicle types. I was able to do this using SELECT * EXCEPT (unwanted columns). The returned data frame was already organized, which was very helpful given how large it is.
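A hedged sketch of what that query could look like; the column names listed under EXCEPT are illustrative stand-ins for whichever columns turn out to be blank in all_data.

```python
# SELECT * EXCEPT keeps every column except the ones listed.
# Column names here are illustrative; substitute the ones that are
# actually blank or unwanted in all_data.
sql = """
SELECT * EXCEPT (
  contributing_factor_vehicle_3,
  contributing_factor_vehicle_4,
  contributing_factor_vehicle_5,
  vehicle_type_code_3,
  vehicle_type_code_4,
  vehicle_type_code_5
)
FROM `bigquery-public-data.new_york.nypd_mv_collisions`
"""
collisions = client.query(sql).to_dataframe()
```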

Now that I had a data frame containing all the data, I decided to make another that focused just on the collisions over the years. This data frame contains the number of reported motor vehicle collisions by hour and by day, ranging from 2012 to 2018. There are about 30 types of reasons, along with their respective incident counts.
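One way to build that count, assuming the table exposes a timestamp column and a first contributing-factor column under the names shown, is a GROUP BY over day, hour, and factor:

```python
# Count reported collisions per day, per hour, per contributing factor.
# Column names are assumptions based on the public schema.
sql = """
SELECT
  DATE(timestamp) AS day,
  EXTRACT(HOUR FROM timestamp) AS hour,
  contributing_factor_vehicle_1 AS factor,
  COUNT(*) AS num_collisions
FROM `bigquery-public-data.new_york.nypd_mv_collisions`
GROUP BY day, hour, factor
ORDER BY day, hour
"""
collisions_by_hour = client.query(sql).to_dataframe()
```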

Having the reasons for the motor vehicle collisions isolated alongside the dates and locations is very helpful because it allows me to segment injuries by reason, date, and location. The ability to segment by day and location helps me match up the weather, so that I can then see whether there are any correlations between the collision factors and the weather. I am hoping to further understand whether collisions attributed to reasons such as slippery pavement occurred on days when it was snowing or raining.

I created a bar plot to show the top collision factors between 2012 and 2016. Creating this visualization makes it easier to understand what the top causes of collisions have been over time.
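A minimal version of such a plot with pandas and matplotlib might look like this; filtering out the catch-all "Unspecified" label and keeping the top ten are my own choices here, not necessarily the original ones.

```python
import matplotlib.pyplot as plt

# Sum incidents per factor and keep the ten largest; dropping the
# catch-all "Unspecified" label keeps the plot informative.
top_factors = (
    collisions_by_hour[collisions_by_hour["factor"] != "Unspecified"]
    .groupby("factor")["num_collisions"]
    .sum()
    .nlargest(10)
)

top_factors.plot(kind="barh", figsize=(8, 5),
                 title="Top contributing factors to collisions")
plt.xlabel("Number of collisions")
plt.tight_layout()
plt.show()
```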

To break the data up even further by timestamp, I created new data frames from the all_data data frame I had made previously, organizing the incidents by year. By filtering on the year in the timestamp, I got seven data frames, one for each year from 2012 to 2018.
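In pandas, that per-year split can be done with a dictionary comprehension, assuming the timestamp column comes back from BigQuery as a datetime type:

```python
# One data frame per year, filtered on the year of the timestamp column.
frames_by_year = {
    year: all_data[all_data["timestamp"].dt.year == year]
    for year in range(2012, 2019)
}

# Quick sanity check of how many incidents landed in each year.
print({year: len(frame) for year, frame in frames_by_year.items()})
```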

The weather data had a lot of data points, many of which would not be necessary. I decided to run another query to pull only specific points, so that the returned data frame would be organized in a way that would allow me to understand it better.

I struggled more with the weather data because I had to go through the original dataset in BigQuery and read the documentation to format the query correctly and use the right values, such as TMIN, TMAX, and PRCP. One clause I found very useful for filtering the values pulled into a data frame was WHERE. I was able to select only data points where the location was in NYC by specifying the lat/long and the state. This data frame contains the weather data from different stations in New York City. Because the time values are not populated for all of these, I decided to stick with daily analysis. The full data manual can be found on the NOAA website.
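For illustration, a WHERE-based filter against the GHCN-D station metadata table might look like the following; the bounding box is only a rough approximation of the city's latitude/longitude range.

```python
# Filter the GHCN-D station list down to New York City using WHERE.
# The bounding box is a rough approximation, not an exact city boundary.
sql = """
SELECT id, name, state, latitude, longitude
FROM `bigquery-public-data.ghcn_d.ghcnd_stations`
WHERE state = 'NY'
  AND latitude BETWEEN 40.5 AND 40.9
  AND longitude BETWEEN -74.3 AND -73.7
"""
nyc_stations = client.query(sql).to_dataframe()
```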

The ability to use JOIN was also very useful. A JOIN clause merges two FROM items so that the SELECT clause can query them as one source. The join type and the ON or USING clause (the "join condition") specify how to combine and discard rows from the two items to form a single source. The returned data frame is organized by the same timestamp structure as the collision data frame, which enables us to merge the two data frames together.
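A sketch of such a join against one yearly shard of the GHCN-D observations, matching station IDs and keeping only NYC stations and the elements of interest:

```python
# JOIN the daily observations to the station metadata so both can be
# queried as one source. Values follow the NOAA manual's units
# (tenths of degrees C for TMIN/TMAX, tenths of mm for PRCP).
sql = """
SELECT wx.id AS station_id, wx.date AS day, wx.element, wx.value
FROM `bigquery-public-data.ghcn_d.ghcnd_2016` AS wx
JOIN `bigquery-public-data.ghcn_d.ghcnd_stations` AS st
  ON wx.id = st.id
WHERE st.state = 'NY'
  AND st.latitude BETWEEN 40.5 AND 40.9
  AND st.longitude BETWEEN -74.3 AND -73.7
  AND wx.element IN ('TMIN', 'TMAX', 'PRCP')
"""
weather = client.query(sql).to_dataframe()
```

The GHCN-D observations are sharded into one table per year, so covering the full 2012 to 2018 range would mean a UNION ALL over the yearly tables or a wildcard table query.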

The combination of the data was done in a few steps. First, the daily weather observations were grouped by station into a single set of values per day. I then took the median value across all the stations to arrive at a single value for that day. Next, I grouped the NYPD incidents by day and kept only the top 15 causes of collisions. These two datasets were then joined together, with all of the data in a single pandas data frame.
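Those steps might translate to pandas roughly as follows, reusing the weather and collisions_by_hour frames from the earlier sketches; the exact aggregation choices are my assumptions.

```python
# Median value per element per day, collapsing all NYC stations.
daily_weather = (
    weather.pivot_table(index="day", columns="element",
                        values="value", aggfunc="median")
    .reset_index()
)

# Daily incident counts for the 15 most common contributing factors.
top15 = (
    collisions_by_hour.groupby("factor")["num_collisions"]
    .sum()
    .nlargest(15)
    .index
)
daily_collisions = (
    collisions_by_hour[collisions_by_hour["factor"].isin(top15)]
    .pivot_table(index="day", columns="factor",
                 values="num_collisions", aggfunc="sum")
    .reset_index()
)

# One row per day, weather and collision-factor columns side by side.
combined = daily_collisions.merge(daily_weather, on="day", how="inner")
```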

When plotting this data, there appears to be a gap between 2016 and 2017. To work around this, one could drop the data from the period where it is missing, or simply conduct the analysis on the period before the data starts to go missing. I created another data frame to merge in the weather by day, which I was able to do by merging with the collision dataset on the day column.
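One way to apply the drop-the-gap option; the cutoff date here is illustrative rather than the actual point where the data goes missing.

```python
import pandas as pd

# Normalize the day column to datetimes, then keep only the span before
# the gap. Choose the cutoff from where the plot shows data go missing.
combined["day"] = pd.to_datetime(combined["day"])
complete = combined[combined["day"] < "2016-07-01"]
```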

I also created a correlation plot, but I am not sure how accurate it is given that there is such a large chunk of missing data. I put the weather variables along the bottom of the correlation chart, with the other contributing factors on the right.
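A correlation plot over the combined frame could be produced with pandas' corr and matplotlib, as a stand-in for whatever plotting approach was actually used:

```python
import matplotlib.pyplot as plt

# Pairwise correlations between daily weather values and daily counts
# of each collision factor.
corr = complete.drop(columns=["day"]).corr()

fig, ax = plt.subplots(figsize=(10, 8))
image = ax.matshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image)
plt.tight_layout()
plt.show()
```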

Going forward, I would love to continue trying different methods of merging and creating datasets from two different sources. I found it really powerful to use both queries and pandas to merge, join, and organize datasets. I hope to continue working with this module so that I can create more extensive visualizations and come to better conclusions.
