Using BigQuery to Load NYC Collision Data

For this module, I was really interested in the different methods that can be used in Jupyter notebooks to analyze massive datasets. The focus of our module this week was to explore different methods of cleaning data, which is a necessary step in data analysis. In the past, I have had experience cleaning datasets in Excel and in Python using Pandas. One limitation of the methods I have used before is that I struggle with analyzing massive datasets. One question I had was: how are companies able to analyze massive datasets in order to optimize their marketing? I recently came across an interesting dataset containing several data points on motor vehicle collisions in NYC. It tied perfectly into my question because I wanted to better understand how insurance companies, for instance, might analyze motor vehicle collision data to gain more insight into the factors that contribute to these incidents, and where they occur, in order to improve their marketing. This dataset is very large, so my goal was to investigate the different methods that data analysts might use to conduct their analysis, and the reasons they might choose those methods.

BigQuery is a web service from Google, part of the Google Cloud Platform, that is used for handling and analyzing big data. As a NoOps (no operations) data analytics service, BigQuery lets users manage data using fast SQL-like queries for real-time analysis, and it can run queries against multiple terabytes of data in a matter of seconds. I have been interested in using BigQuery to access very large datasets for a while, so this week I decided to use it to analyze the NYC Motor Vehicle Collisions public dataset. One of the reasons I was drawn to BigQuery is that, since the file is so large, it would allow me to run specific queries and pull data from the set rather than needing to work straight from the CSV file. I have not used SQL in the past and was very interested to dive into BigQuery and begin exploring.

The first step of this process was to set up an account on the Google Cloud Platform. I was able to get a $300 free credit for the first year, which was great since the APIs do cost money. BigQuery currently charges a flat rate of $0.02 per GB per month for stored data, so it is nice to be able to put the $300 credit towards that. You can use BigQuery straight from the console, but I was curious how I could run BigQuery from my Jupyter notebooks so that in the future I could continue to access BigQuery datasets there. After I created my profile, I had to create a service account. I used the GCP console to create and download a JSON file containing the service account key.

Before you can access BigQuery from Jupyter Notebooks, you need to set an environment variable if you plan to use a service account.

I was able to access the JSON file I had downloaded in my terminal. This step provides the authentication credentials to the application code because the JSON file contains your service account key.
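Here is a rough sketch of what that step might look like; the key path is a hypothetical placeholder, and GOOGLE_APPLICATION_CREDENTIALS is the variable the Google client libraries look for:

```python
import os

# Tell the Google client libraries where to find the service account key.
# The path is a placeholder; the same thing can be done in the terminal with
#   export GOOGLE_APPLICATION_CREDENTIALS="/path/to/my-key.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/my-key.json"
```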

To set up a local environment, I had to run a couple of install commands in my terminal so that the BigQuery client library could hand its results off to Pandas. This part was a bit tricky for me because I had to figure out the best way to set up the local environment from my terminal. I found that using the terminal was the most helpful because it gave an immediate response on whether the desired libraries were downloading or whether there was an error. Once the install was successful, I went into my notebook and defined my credentials once again.
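A minimal sketch of those two pieces, assuming a hypothetical key path and project ID (the terminal install is shown as a comment):

```python
# Installed from the terminal first, roughly:
#   pip install google-cloud-bigquery pandas pyarrow

from google.cloud import bigquery
from google.oauth2 import service_account

# Re-create the credentials inside the notebook from the same key file.
credentials = service_account.Credentials.from_service_account_file(
    "/path/to/my-key.json"  # placeholder path
)
client = bigquery.Client(credentials=credentials, project="my-gcp-project")
```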

To access the public dataset, I tried a couple of different approaches from the documentation to get more familiar with the structure of BigQuery. The BigQuery client library provides a cell magic, %%bigquery, which runs a SQL query and returns the results as a Pandas DataFrame.
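A small example of that pattern, assuming the collisions data lives in the public table bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions and that %load_ext google.cloud.bigquery has already been run in its own cell:

```python
%%bigquery preview_df
-- Store the query result in a Pandas DataFrame called preview_df
SELECT borough, timestamp, number_of_persons_injured
FROM `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions`
LIMIT 10
```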

I created the DataFrame df and loaded all of the data into it so that I could analyze the NYC collision dataset with Pandas, a library I know better for data analysis. Being able to store any query as a DataFrame is very helpful because it allows you to create and structure the data tables however you want. When dealing with extremely large data frames, it can be very time consuming to go back and clean them up afterwards rather than constructing clean datasets from the start.
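Roughly, creating df looked like this, using the client object defined earlier and again assuming the public nypd_mv_collisions table:

```python
# Pull the full collisions table into a Pandas DataFrame.
# (This is a lot of rows, so in practice a LIMIT or a narrower SELECT
#  may be a better idea.)
query = """
    SELECT *
    FROM `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions`
"""
df = client.query(query).to_dataframe()

print(df.shape)
df.head()
```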

Since this dataset is so large, I wanted to divide it up into separate datasets for each year.

Creating separate datasets for each year will be very useful in the future for comparing collision factors over time. BigQuery allows you to save these as their own tables, which you can then query later to create visualizations and continue the analysis.
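One sketch of how this can be done with the Python client, writing each yearly slice to its own destination table; the project, dataset name, and year range are placeholders, and the timestamp column name assumes the public table's schema:

```python
# Create a dataset of my own to hold the yearly slices.
client.create_dataset("nyc_collisions", exists_ok=True)

for year in range(2015, 2021):  # placeholder year range
    job_config = bigquery.QueryJobConfig(
        destination=f"my-gcp-project.nyc_collisions.collisions_{year}",
        write_disposition="WRITE_TRUNCATE",  # overwrite if the table exists
    )
    sql = f"""
        SELECT *
        FROM `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions`
        WHERE EXTRACT(YEAR FROM timestamp) = {year}
    """
    client.query(sql, job_config=job_config).result()  # wait for each job
```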

Another way to create smaller datasets out of this larger one is to write more SQL queries. For instance, to analyze collision factors between the boroughs, I requested only the borough, the contributing factor, and the number of persons injured.
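Something like the following, where the column names (borough, contributing_factor_vehicle_1, number_of_persons_injured) are my reading of the public table's schema:

```python
# Request only the columns needed for a borough-level comparison.
factors_sql = """
    SELECT
        borough,
        contributing_factor_vehicle_1,
        number_of_persons_injured
    FROM `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions`
    WHERE borough IS NOT NULL
"""
factors_df = client.query(factors_sql).to_dataframe()
factors_df.head()
```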

Being able to run specific queries is very useful because it breaks the data down up front, which makes the cleaning process much more efficient.

I would like to continue building off of step 1 so that I can conduct a full analysis using SQL and BigQuery. I found this module to be really useful because it allowed me to investigate different methods one can use to approach messy and complicated datasets. I had always heard of BigQuery in the past, so I really enjoyed getting the opportunity to fully dive into this new tool for data aggregation and analysis. While BigQuery has several amazing features and abilities, it was a nice starting point to learn how to get data from the public datasets it hosts and the structure behind using this powerful technology. I think one of the most challenging parts of my step 1 process was actually getting BigQuery initialized and installed on my computer, as well as understanding its basic use cases. To go back to my original question, one reason larger companies may use BigQuery for their data analysis is that the ability to query very large datasets in seconds is extremely useful. For instance, an insurance company might conduct an analysis like this to see how many reported collisions are occurring around its targeted locations. Using BigQuery, it could also access its own Google Ads data, which would allow it to compare the collision analysis with its own customer acquisition and company data. If it wanted to target the boroughs with the highest collision rates, this type of analysis would show where exactly these collisions are happening and why, and the company could then increase its marketing efforts in those areas and target people there based on their insurance needs.

I am really excited to see how I can keep on using BigQuery in the future, and I hope that I can continue to gain a better understanding of how to use it for various use cases. In conclusion, I found that the ability to write queries for a massive data set is an extremely important and useful tactic for approaching data cleaning. It also forces you to think conceptually about the process you need to go through in order to get the necessary data for your analysis. This module highlighted the importance of outlining your process and identifying what pieces of data you need to perform your analysis before jumping into the code. As a data scientist, thinking conceptually about different ways to interpret data is extremely important. We have the responsibility to not only code and analyze the data, but also to create meaningful interpretations of it.
