New York City Vehicle Collisions: A Study of Brooklyn

Photo by Luis Cortés on Unsplash

What are the main causes of vehicle collisions? What weather conditions record the highest accident rates? These are some of the questions we asked ourselves, and then asked the data.

In Fall 2021, I took the class Processing Big Data for Analytics Applications at NYU Courant, which introduced us to the Hadoop ecosystem and a wide range of tools and software for, as the title suggests, processing big data. I had the chance to run various pieces of code on huge datasets thanks to being granted access to the NYU HPC cluster.

As part of the class, I worked with a fellow teammate on a final project in which we wrote MapReduce code and Spark scripts to analyze two public datasets we acquired from NYC Open Data. The result of our analytics was a study into vehicle collisions in New York City in association with a number of environment elements (weather condition, temperature, precipitation, time of the day, etc.) and the relationship between them. The question we were most curious about was, can environment information be predictive of vehicle collisions? This article summarizes the steps of the project and the findings we drew from it.

For more details, you can find the code written for the project on my GitHub.

Motivation — Why study it?

The US Department of Transportation estimates an average of over 5.8 million vehicle crashes annually, around 21% of which (making up over 1.2 million crashes) are deemed to be weather-related. Traffic safety is no doubt an important subject.

A more anecdotal view into how we decided on the subject (please feel free to skip): I started out clicking through pages of available datasets, looking for a project idea. My teammate and I agreed to narrow our scope to New York City. Neither of us is familiar with the city, and we’re only here for a short time (we’re both NYU Abu Dhabi students spending a semester in New York), so why not take the opportunity to learn something about it while we’re here? The vehicle collisions dataset caught my attention first: it records a large number of traffic accidents, with data from 2012 onwards. With such a huge dataset sitting there, surely there was something we could explore and mine from it. That’s when we decided on the first dataset and started to look for a second. The Brooklyn Bridge pedestrian counts dataset seemingly had nothing to do with the first, and its title makes it easy to overlook, but aside from the pedestrian counts it also carries a range of weather information. We settled on these two as our main datasets. This project can probably also be considered a study in mining seemingly unrelated data to extract insights. The data we used was public data put out there by government agencies. It’s the era of Big Data (after all, that’s what I’m studying). The question is what you do with all that.


As mentioned, the data was obtained from NYC Open Data: vehicle collisions & Brooklyn Bridge pedestrian counts.

The first dataset, vehicle collisions, has information from NYPD reports of all collisions in NYC that resulted in some form of human casualty or in at least $1,000 of damage. What I cared about the most is that each row in the dataset (each collision) is recorded with when and where it happened.

The second dataset, which we called Brooklyn Bridge for short, has information from the DOT’s project for automated counting of pedestrians on the Brooklyn Bridge. The counts themselves were not our main focus; what we needed was the information that comes along with them: the weather.

Cleaning the data

This was a vital step: I got to know what I was working with so I could prepare myself. I was responsible for the vehicle collisions dataset, which really took me some time.

Cleaning the data was done using MapReduce code written in Java, which meant batch processing: I was not looking into the specific rows as I went; instead, I needed to write data cleaning code that would scale well to millions of rows in the dataset. Two main problems emerged:

  1. Some lines were missing several values: the original dataset schema told me there were 29 columns. It should follow that I would get 29 values whenever I split a line at the commas, but I was getting lines with 24 or 25 values. This turned out to be a quirk of Java I hadn’t known about: if the line ends with consecutive empty cells (i.e. trailing commas), split() silently drops the trailing empty strings. My solution was to append an arbitrary placeholder string to the end of every line, forcing split() to keep these empty cells.
  2. Some lines had extra values: I was getting lines with more than 29 values, which was also not supposed to happen according to the schema. It turned out that the addresses very often contained commas, and a plain split() on commas cannot tell those apart from the commas that separate values. The workaround for this was a bit longer; my Natural Language Processing class came in handy here, and I used a bit of regular expression.
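Both problems can be reproduced in a few lines of plain Java. The sketch below is not the project's MapReduce code: the class name, the sample lines, and the regular expression are illustrative. It assumes that fields containing commas are wrapped in double quotes, as standard CSV exports do. For the first problem it uses a negative limit in split(), which is an alternative to appending a placeholder string.

```java
public class CsvSplitSketch {
    // Matches a comma only when an even number of double quotes follows it,
    // i.e. when the comma is not inside a quoted field (assumes balanced quotes).
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // A negative limit tells split() to keep trailing empty strings.
    static String[] splitCsvLine(String line) {
        return line.split(COMMA_OUTSIDE_QUOTES, -1);
    }

    public static void main(String[] args) {
        // Problem 1: trailing empty cells are dropped by the default split().
        String trailingEmpty = "2021-10-01,BROOKLYN,,,";
        System.out.println(trailingEmpty.split(",").length);       // 2
        System.out.println(splitCsvLine(trailingEmpty).length);    // 5

        // Problem 2: a comma inside a quoted address produces an extra value.
        String quotedAddress = "10/01/2021,\"ATLANTIC AVENUE, BROOKLYN\",2";
        System.out.println(quotedAddress.split(",").length);       // 4
        System.out.println(splitCsvLine(quotedAddress).length);    // 3
    }
}
```

The lookahead re-scans the rest of the line for every comma, so on very long lines a proper CSV parser would likely be the better choice.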

Data extraction: Vehicle collisions

First, I extracted the relevant data from the collisions. The first thing I needed was the date and time: the second dataset has the date and time in a single column, but this one splits them into two. All of my prior experience with data processing and analytics was in Python, and for this project I was writing Scala code, so it took me some time to grasp the syntax. As a result, though, I learned a lot of interesting and useful functions. One example is the set I used to solve the date-time problem.

I first treated the two columns as string columns and produced a third column by concatenating them with concat(), with a space in the middle. Since this new column had the same format in every row, I could pass it to to_timestamp() to cast the entire column to Spark’s timestamp data type. With the column in a proper timestamp type, I could also run comparisons on it with gt() and lt().

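Another interesting function I learned about while working with the timestamp data type is trunc(), which returns a date truncated down to the start of a given unit (year, month, etc.). For timestamps, Spark's counterpart is date_trunc(), which also accepts finer units such as day and hour. I used this to truncate each collision time down to the start of its hour.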
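The same idea (concatenate, parse, truncate to the hour) can be sketched outside Spark with java.time. This is only an illustration of the logic, not the Scala code from the project. The class name is made up, and the input formats (MM/dd/yyyy for the date, H:mm for the time) are an assumption about how the two columns look.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class CrashTimestampSketch {
    // Assumed layout of the concatenated "date time" string, e.g. "10/01/2021 8:45".
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MM/dd/uuuu H:mm");

    // Joins the two string columns with a space, parses the result,
    // and truncates it down to the start of the hour.
    static LocalDateTime toHour(String date, String time) {
        String combined = date + " " + time;
        return LocalDateTime.parse(combined, FORMAT).truncatedTo(ChronoUnit.HOURS);
    }

    public static void main(String[] args) {
        LocalDateTime morning = toHour("10/01/2021", "8:45");
        LocalDateTime evening = toHour("10/01/2021", "17:05");
        System.out.println(morning);                   // 2021-10-01T08:00
        System.out.println(evening);                   // 2021-10-01T17:00
        // Once parsed, timestamps can be compared directly.
        System.out.println(morning.isBefore(evening)); // true
    }
}
```

Truncating both datasets to the hour gives them a shared key, which is what makes an hour-level join possible.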

After extracting the relevant information, it was time to join the two datasets based on date & time (assuming my teammate was done with processing the other dataset).

Logistic Regression

To return to the previously mentioned question of whether environment information can be predictive of vehicle collisions, we used a Logistic Regression model from the Spark ML library. Before feeding data into the model, we encoded and normalized the variables.

The weather description column contains 10 unique values, so we encoded them using values from 0 to 9. Using the number of people injured and the number of people killed, we created a new casualties column equal to their sum. Any rows without a value in this column are hours in which no collisions happened. We recorded this in an additional column “has accident”: rows with non-null casualties (even 0) were assigned a value of 1, and rows with null casualties were assigned 0. This column would serve as the target label for the model. We normalized all relevant columns so that all numerical values were in the range from 0 to 1.
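As a rough illustration of the last two steps, here is a plain-Java sketch of the label rule and of min-max scaling, which is one common way to map values into the range 0 to 1. The project did this on Spark DataFrames; the class and method names here are hypothetical, and min-max scaling is an assumption about which normalization was used.

```java
import java.util.Arrays;

public class FeaturePrepSketch {
    // Label rule from the text: a null casualty count means no collision
    // was recorded in that hour; any non-null value (even 0) means one was.
    static int hasAccident(Integer casualties) {
        return casualties == null ? 0 : 1;
    }

    // Min-max scaling: (x - min) / (max - min), so the smallest value maps
    // to 0 and the largest to 1. A constant column is mapped to all zeros.
    static double[] minMaxScale(double[] values) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scaled[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        System.out.println(hasAccident(null)); // 0
        System.out.println(hasAccident(0));    // 1
        System.out.println(Arrays.toString(minMaxScale(new double[] {10, 20, 30})));
        // [0.0, 0.5, 1.0]
    }
}
```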

The finalized variables used for the Logistic Regression model include: temperature, precipitation, weather description, day of the week, hour of the day, pedestrian counts, and collision occurrence, in which the last variable was the target value and the rest served as predictor variables.

The model used about 2700 data points for training and evaluating, in which 70% (about 1900 data points) was used for training and the remaining 30% (about 800 data points) was for testing.

We configured model hyper-parameters as follows:

When evaluated on the test set, we got an accuracy score of about 87.8%. On the train set, the accuracy score was about 88.5%. The parameters learned by the model are as follows:

Coefficients: [0.0, 1.1674480524231052, 1.7884268499883167, 0.0, 3.0271637837630037, 8.083478791471325]
Intercept: -3.0356003153483573

The coefficients correspond to: temperature, precipitation, weather description, day of the week, hour of the day, pedestrian counts.

Some variables (precipitation, weather description, hour of the day, and the number of pedestrians) have large positive coefficients, so an increase in one of these predictors raises the predicted likelihood of a collision when the other variables are held constant. The coefficients for temperature and day of the week are 0, indicating that these variables have essentially no effect on the prediction.
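To make the coefficients concrete, the sketch below plugs them into the standard logistic formula, p = 1 / (1 + exp(-(intercept + w · x))). It is a hand-rolled illustration, not the Spark ML model itself. The class name and sample inputs are made up, and it assumes all six features are already scaled to the range 0 to 1, in the order listed above.

```java
public class CollisionLogitSketch {
    // Order: temperature, precipitation, weather description,
    // day of the week, hour of the day, pedestrian counts.
    static final double[] COEFFICIENTS = {
        0.0, 1.1674480524231052, 1.7884268499883167,
        0.0, 3.0271637837630037, 8.083478791471325
    };
    static final double INTERCEPT = -3.0356003153483573;

    // Logistic (sigmoid) of the linear score for one feature vector.
    static double collisionProbability(double[] features) {
        if (features.length != COEFFICIENTS.length) {
            throw new IllegalArgumentException("expected 6 features");
        }
        double score = INTERCEPT;
        for (int i = 0; i < features.length; i++) {
            score += COEFFICIENTS[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-score));
    }

    public static void main(String[] args) {
        // Hypothetical hours: all features at their minimum, then the same
        // hour with the pedestrian count halfway up its range.
        double[] quietHour = {0, 0, 0, 0, 0, 0};
        double[] busierHour = {0, 0, 0, 0, 0, 0.5};
        System.out.println(collisionProbability(quietHour));
        System.out.println(collisionProbability(busierHour));
    }
}
```

With every feature at its minimum, the prediction depends on the intercept alone and comes out below 5%. Raising only the pedestrian count moves it up sharply, which matches that variable having the largest coefficient.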

Results with the Logistic Regression model showed that it is possible to incorporate weather and environment variables in predicting the likelihood of vehicle collisions. Keep in mind that the model was built on fewer than 3,000 data points, so a more sophisticated model would need more, and more diverse, data. We would also like to look into the statistical significance of the model.

Collision factors and weather conditions

We first merged the two datasets in the opposite direction from the one used for the Logistic Regression model. This time, all crashes from the vehicle collisions dataset were supplied with the weather information from the Brooklyn Bridge pedestrian counts dataset. Then, we iterated through all weather conditions and got the counts of collisions happening during days with a given weather condition. Within each weather condition type, we iterated through each of the factors contributing to the crashes and recorded the counts as well. Finally, we computed an additional "fraction" column, equal to the number of collisions attributed to a given (weather condition, contributing factor) pair divided by the total number of collisions happening during that weather condition.
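The counting and the "fraction" column can be sketched in plain Java with nested maps. This is an illustration of the computation only. The project ran it as Scala loops over Spark DataFrames, and the class name and sample rows below are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FactorFractionSketch {
    // Each input row is {weatherCondition, contributingFactor}.
    // Returns weather -> factor -> share of that weather's collisions.
    static Map<String, Map<String, Double>> fractions(List<String[]> collisions) {
        Map<String, Map<String, Integer>> pairCounts = new TreeMap<>();
        Map<String, Integer> weatherTotals = new HashMap<>();
        for (String[] row : collisions) {
            pairCounts.computeIfAbsent(row[0], k -> new TreeMap<>())
                      .merge(row[1], 1, Integer::sum);
            weatherTotals.merge(row[0], 1, Integer::sum);
        }
        Map<String, Map<String, Double>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> byWeather : pairCounts.entrySet()) {
            double total = weatherTotals.get(byWeather.getKey());
            Map<String, Double> shares = new TreeMap<>();
            for (Map.Entry<String, Integer> byFactor : byWeather.getValue().entrySet()) {
                shares.put(byFactor.getKey(), byFactor.getValue() / total);
            }
            result.put(byWeather.getKey(), shares);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<String[]> sample = Arrays.asList(
            new String[] {"rain", "Pavement Slippery"},
            new String[] {"rain", "Pavement Slippery"},
            new String[] {"rain", "Pavement Slippery"},
            new String[] {"rain", "Driver Inattention/Distraction"},
            new String[] {"clear", "Driver Inattention/Distraction"});
        System.out.println(fractions(sample).get("rain").get("Pavement Slippery")); // 0.75
    }
}
```

Dividing by the per-weather total rather than the overall total keeps the fractions comparable between common conditions (clear) and rare ones (sleet, fog).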

I’d never written a Scala loop prior to this. This piece of code was a proud moment.

Collision factors, weather conditions, and casualties

I wrote a piece of code similar to the one in the previous section, but instead of inspecting weather condition and contributing factor pairs by collision fraction, we looked into casualties.


So what have we learned from the datasets after all that Java and Scala code?

  1. Environment elements most predictive of vehicle collisions using the Logistic Regression model: precipitation, weather condition, hour of the day, and number of pedestrians.
  2. Days with high visibility (clear, partly cloudy) are more prone to accidents than days with more severe weather conditions (rain, snow, fog, etc.).

Collision counts by weather conditions

3. In all weather conditions, when specified, the majority of accidents are due to human factors (distraction, following too closely, etc.) rather than vehicle or infrastructure defects. Based on this, we recommend that government agencies focus first on mitigating human-caused collisions, with strategies or solutions that target drivers and pedestrians, such as more traffic dispersion during rush hours to reduce the situations in which vehicles move too close to one another.

(We say “when specified” because the top factor contributing to vehicle collisions is Unspecified, as in it is written as “Unspecified” or left empty.)

Collision factor counts during clear day condition (top 10 factors excluding “Unspecified”).
Collision factor counts during rain condition (top 10 factors excluding “Unspecified”).

4. In more severe conditions (rain, snow, fog), weather is shown to contribute to accidents. During these conditions, we see an increase in external factors contributing to collisions. (Pay attention to accidents happening in sleet, snow, and fog due to slippery pavement.)

5. Human factors during high visibility days also result in the highest casualties.

Of course, we were working under a set of assumptions and generalizations here: the weather data we used was recorded for Brooklyn Bridge, and we were generalizing it to the entire borough of Brooklyn. Another example is our use of pedestrian counts: we included them in the model under the assumption that the number of pedestrians on Brooklyn Bridge could be an indicator of the volume of traffic on the streets of Brooklyn in general during that hour.


I worked on a project that examined public traffic and weather datasets to study the impact of weather conditions and environment elements on vehicle collisions. The main goal was to investigate whether environment information can be predictive of traffic accidents in New York. Preliminary tests of our simple Logistic Regression model, trained and tested on about 3,000 data points, show it is possible to incorporate environment data in predicting vehicle collisions.

Further exploratory analysis highlights the role of human factors (and external factors in case of severe weather conditions) in causing the majority of collisions.

Future analytics on the same topic will benefit from more detailed and specific data sources, and can be developed to be used in more sophisticated models.

That concludes my very first Medium post and my project for the class. I’d like to thank Prof. Malavet for the class content and all guidance and feedback on the project.

A corner of Brooklyn as viewed from NYU Tandon. Photo by me :) I’d like to use this image caption to admit I’ve never seen Brooklyn Bridge. I hope I can see it before flying out of New York.




Computer Science student @ NYU

Ngoc N. Hoang
