One of my favorite things to do is ride my bike. I ride almost every day — whether it’s part of my commute or a weekend exploration, I’m always riding. People ride for all sorts of reasons — leisure, exercise, and transportation, just to name a few. The number of cyclists in the U.S. saw a considerable increase between 2012 and 2014 and has remained fairly stable since. In 2016, around 12.4 percent of Americans cycled on a regular basis.

My noble steed— photo taken from atop Hawk Hill

Cycling has become especially popular in urban areas, particularly in densely populated major cities like San Francisco and New York. These cities have all different types of riders — utility cyclists, such as delivery and messenger services; recreational cyclists; and, increasingly, commuters.

New York City’s Central Park
Fun Fact: NYC developed the country’s first bike path in 1894, the ‘Coney Island Cycle Path’ — a five-and-a-half-mile stretch from Prospect Park in Brooklyn to the popular resort at Coney Island.
Coney Island Cycle Path — Brooklyn, New York

NYC has been at the forefront of the national initiative to make cycling safe. Despite these efforts, many cyclists are killed or seriously injured every year. In the US in 2014, 21,287 cyclists were injured in reported road accidents, including 3,514 who were killed or seriously injured. These crashes harm not only the riders; the person operating the motor vehicle involved may be injured as well. The compensatory and/or punitive damages resulting from a crash can be devastating. Given the costs involved, cyclists and drivers alike must be aware of the dangers and learn how to share the road harmoniously.

So, what makes riding in big cities so dangerous? As a rider myself, riding my bike one way or another in San Francisco almost daily, I approached this problem from the perspective of ‘what would I want to know about the dangers of riding in a big city?’ I also had some preconceived notions about which factors make cycling most dangerous: weather, time of day, and perhaps time of year. Intuitively, one would think that if it’s raining, or it’s late in the day, or the city is busier because it’s the holiday season, more crashes would occur on the streets.

In a city like San Francisco, the roads become jam-packed during commuting hours. Riding on Market Street in downtown SF between the hours of 4 and 6pm is quite terrifying. There are countless bikes, cars, taxis, buses and pedestrians. Also, not everyone has the proper reflective gear or lighting to make themselves easily visible. This also holds true for New York.

Market Street during rush hour

I wanted to put my theories to the test. Do weather, time of day, and time of year truly affect your safety while riding your bike?

In order to accomplish this, I needed data. I was able to find NYPD motor vehicle crash data spanning 2013–2015. Included in this massive dataset was every type of crash imaginable — car on car, car on bike, taxi on pedestrian, bus on bike…you get the idea. I also got weather data covering the same timeframe. This dataset was extremely descriptive, with many different weather-related measurements for each day across the three years. Finally, I was able to get a dataset providing the number of riders on the road for the borough of Manhattan.

click to see where the cyclist counts took place

There were a total of 43,766 riders counted and 3,475 crashes during the three-year window.

Using this information, I could then see the proportionate trend of the amount of crashes against the number of riders on the road over time. Unfortunately, this was only for Manhattan since that was the only borough with an easily accessible cyclist count dataset.

Total cyclists in Manhattan vs. total crashes in Manhattan

The ratio of crashes to total riders:

Ratio of total crashes out of total riders in Manhattan

So, the number of Manhattan riders dipped a bit in 2014, but shot back up in 2015. Helpful for my analysis? Not tremendously, but it is good to know that the crash ratio declined a bit. Could this mean that some of the initiatives NYC has implemented are working?

Next, I wanted to look at NYC as a whole instead of just focusing on Manhattan. Over the three-year span there were 11,252 reported cycling crashes on the roads of NYC. More than likely, a significant number of additional crashes went unreported.

Cycling crashes by NYC borough by year

As evidenced by the above graph, Brooklyn consistently has the highest number of crashes, followed by Manhattan, the Bronx, Queens, and Staten Island. Unfortunately, I only had cyclist count data for Manhattan, so I can’t definitively say that Brooklyn has the most riders, but it’s probably a safe assumption.

click for an interactive heat map of 1,000 NYC crashes

Next, I wanted to see if the time of year had any impact on the number of crashes. I first grouped the crashes by month.
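That grouping step is straightforward; here’s a minimal sketch using a handful of made-up dates standing in for the real NYPD records:

```python
from collections import Counter
from datetime import date

# Hypothetical stand-ins for the crash dates in the NYPD dataset.
crash_dates = [
    date(2013, 6, 4), date(2013, 6, 18), date(2013, 7, 2),
    date(2014, 1, 9), date(2014, 7, 21), date(2015, 8, 30),
]

# Count crashes by calendar month, pooled across all three years.
crashes_by_month = Counter(d.month for d in crash_dates)
print(crashes_by_month)  # Counter({6: 2, 7: 2, 1: 1, 8: 1})
```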

Number of crashes by month

Based on the total number of crashes shown in the above graph, the summer months of June, July, August, and September see the most cycling crashes.

I didn’t want to look only at the sheer number of crashes, so I added a ‘crash ratio’ feature to my data — crashes divided by the number of cyclists on the road each year. Again, since I only had cyclist counts for Manhattan, the below graph covers only that borough. As you can see, the crash ratio for the most part peaks during the summer months.
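Computing that ratio is simple division; here’s a toy sketch with invented yearly totals (the real Manhattan counts differ):

```python
# Hypothetical yearly totals; the real Manhattan counts differ.
riders_per_year = {2013: 15000, 2014: 13500, 2015: 15266}
crashes_per_year = {2013: 1200, 2014: 1100, 2015: 1175}

# 'Crash ratio' feature: crashes per counted rider, by year.
crash_ratio = {
    year: crashes_per_year[year] / riders_per_year[year]
    for year in riders_per_year
}
for year, ratio in sorted(crash_ratio.items()):
    print(year, round(ratio, 3))
```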

Crash ratio over time

My assumption here is that during these months of the year, there is the least amount of precipitation. Just to check, I plotted the total precipitation across the three years.

Precipitation by month

As evidenced by the graph, the amount of precipitation stays relatively stable throughout the year in NYC, aside from a couple of spikes. May and June of 2013 saw an uncharacteristically high amount of rain, which correlates with the spike in crash rate for those same months.

Another interesting finding was that Brooklyn received the most rainfall over the three years. Brooklyn also saw the highest number of crashes over the three years.

How about crashes by day of the week?

Crashes by day of week

There were typically more crashes during the first few days of the week, with the number dropping as the week progressed. This made sense to me for a couple of reasons. First — people are more inclined to start the week off with a healthy mentality, meaning they’re more likely to hop on their bike and ride to work. With that comes more people driving to work as well, and therefore more congestion on the road, which leads to more crashes. Second — as the week progresses, more people typically work from home, which means fewer drivers on the road. The fewer drivers on the road, the safer the roads are for cyclists.

Note: These assumptions are based on personal experience.

Now let’s take a look at the number of crashes by time of day. Before beginning my analysis, I thought for sure I’d see a really significant relationship between the hour of day and the number of crashes on the road.

Crashes by time of day

And…I did. First, I segmented the 24-hour day into a handful of buckets, sized so that the commuting hours landed in buckets of their own — evening: 3–6pm, and mid-morning: 7–9am. Ideally, this would give my model more predictive power. Sure enough, the two commuting buckets had the highest number of crashes.
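The bucketing itself can be a simple lookup. The exact cut points below are my own reading of the segments described, so treat the boundaries as assumptions:

```python
def hour_bucket(hour: int) -> str:
    """Map an hour (0-23) to a coarse time-of-day segment.

    The cut points are illustrative; the idea is to isolate the
    morning and evening commute windows in their own buckets.
    """
    if 7 <= hour <= 9:
        return "morning_commute"   # 7-9am
    if 15 <= hour <= 18:
        return "evening_commute"   # 3-6pm
    if 10 <= hour <= 14:
        return "midday"
    if 19 <= hour <= 23:
        return "night"
    return "early_morning"         # midnight-6am

print(hour_bucket(8), hour_bucket(17))  # morning_commute evening_commute
```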

Now that I’ve sliced and diced the data, it’s time to build a model that will hopefully give me some power to predict a cycling crash using the features documented thus far.

Time to model. But first, a quick note: something I learned from this project is that predicting an anomaly is extremely difficult. To provide some context, my dataset had a total population of 475,591 data points. Of those, only 11,252 were crashes. In other words, I needed to build a model that could predict something that happens only about 2% of the time. After exploring the different variables and looking into their coefficients, I knew I would want to create some additional features to incorporate.
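The imbalance is easy to quantify from those two numbers, and it also shows why raw accuracy is a misleading yardstick here:

```python
total_points = 475_591   # all data points in the set
crashes = 11_252         # reported cycling crashes

crash_rate = crashes / total_points
print(f"{crash_rate:.1%} of rows are crashes")  # 2.4% of rows are crashes

# A degenerate model that always predicts "no crash" is already
# ~97.6% accurate, which is why accuracy alone means little here.
baseline_accuracy = (total_points - crashes) / total_points
print(f"baseline accuracy: {baseline_accuracy:.1%}")  # baseline accuracy: 97.6%
```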

Part of my analysis was to look into each borough individually and explore the most dangerous intersections, or the ‘hot spots’ where the most crashes had occurred over the three year span.

For purposes of this project, I modeled with Manhattan data. Eventually I will expand to include the other boroughs.

Plot of all Manhattan crashes
A cool part about the dataset I worked with was that it provided latitude and longitude coordinates for every crash, which made identifying the ‘top’ intersections fairly easy.

click for a map of Manhattan’s most dangerous intersections

click for a map of Brooklyn’s most dangerous intersections

click for a map of Bronx’s most dangerous intersections

click for a map of Staten Island’s most dangerous intersections

click for a map of Queens’ most dangerous intersections

So, what makes an intersection a ‘hot spot’? After much thought, I decided that any intersection with more than 4 crashes over the three years warranted being a hot spot. In Manhattan, there were 59 such intersections. I didn’t want to use all 59 as my hot spots, so I used a K-Means clustering algorithm to plot 10 centroids in the most active areas of Manhattan.
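Here’s a sketch of that clustering step with scikit-learn, using randomly generated coordinates as stand-ins for the 59 real intersections:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical (lat, lon) pairs standing in for the 59 Manhattan
# intersections with more than 4 crashes; bounds roughly match Manhattan.
hot_spots = np.column_stack([
    rng.uniform(40.70, 40.80, size=59),    # latitude
    rng.uniform(-74.02, -73.93, size=59),  # longitude
])

# Reduce the 59 hot spots to 10 representative centroids.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(hot_spots)
centroids = km.cluster_centers_
print(centroids.shape)  # (10, 2)
```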

10 centroids plotted over the 59 hot spots identified

Now that I had the centroids plotted, it was time to compare their locations to the locations of all Manhattan crash sites. The intuition was fairly simple — the closer a site is to one of the hot spots, the greater the chance of a crash.

I decided to use the Haversine formula to calculate the distance between the crash sites and the centroids. Haversine distance is the great-circle distance between two geographical points, taking the curvature of the earth into account, which makes it a little more accurate than Euclidean distance for map coordinates.
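The formula itself is only a few lines; here’s a pure-Python version (the example coordinates are approximate Manhattan landmarks I picked for illustration):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    EARTH_RADIUS_KM = 6371.0
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Roughly: Columbus Circle to Union Square, Manhattan (~3.7 km).
d = haversine_km(40.7681, -73.9819, 40.7359, -73.9911)
print(round(d, 2))
```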

So, I took each Manhattan crash site and calculated its distance from each of the hot spots. The resulting distances would act as predictors for my target variable — crash vs. no crash.

I decided to use a Random Forest model and a K-Means clustering model with the following predictors: hour of the day, day of the week, month of the year, and the 10 hot-spot Haversine distances. The Random Forest scored higher than the K-Means model, but not by much.
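A minimal scikit-learn sketch of the Random Forest setup, using synthetic stand-ins for the real features (I use the classifier variant here, since the target is binary crash vs. no crash):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000

# Synthetic stand-ins for the real predictors: hour, weekday, month,
# and distances to 10 hot-spot centroids. Labels are imbalanced (~2%).
X = np.column_stack([
    rng.integers(0, 24, n),       # hour of day
    rng.integers(0, 7, n),        # day of week
    rng.integers(1, 13, n),       # month
    rng.uniform(0, 5, (n, 10)),   # 10 Haversine distances (km)
])
y = (rng.uniform(size=n) < 0.02).astype(int)  # ~2% "crash"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # high accuracy, driven by the majority class
```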

Both models scored high. This isn’t necessarily because they’re great at predicting whether there will be a crash, though. The thing to remember when predicting anomalies is that when you train your model, it will teach itself to predict the majority class — in my case, the non-crashes. My models scored high because they were predicting the majority class, and predicting it correctly.

Using Random Forest, I was able to correctly identify 60% of the actual crashes. And of the crashes the model flagged, 90% were real.
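Those two figures are what a classification report calls recall and precision. Here’s how they fall out of hypothetical confusion-matrix counts chosen to roughly reproduce them:

```python
# Hypothetical confusion-matrix counts chosen to reproduce the
# reported figures (recall 60%, precision ~90%).
true_positives = 60    # crashes the model caught
false_negatives = 40   # crashes the model missed
false_positives = 7    # non-crashes flagged as crashes

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
print(f"recall={recall:.0%} precision={precision:.0%}")  # recall=60% precision=90%
```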

Random Forest classification report

Using K-Means Clustering, I was able to correctly identify 98% of the actual crashes, and 98% of the crashes the model flagged were real.

K Means Clustering classification report

In summary…

This project was challenging for me in many ways. I started with a target that is intrinsically an anomaly — predicting cycling crashes. The number of riders on the road heavily outweighs the number of riders involved in crashes.

Through exploratory data analysis, I was able to find relationships between cycling crashes and other variables that inspired me to explore further. However, when I began to model my data, I ran into some difficulties: the features I thought were important were actually not providing much predictive power. Why did this happen? Well, for example, the number of crashes during commuting hours was higher than during any other hours of the day. However, those same hours also saw a spike in all the other data points in my set. Given this, it was difficult for my model to see through the noise.

Something else I learned during this process is that the model-building process never ends. You can spend hours and hours exploring the data, creating more features, and trying out different models, but when do you call it complete? That’s a blessing and a curse, I suppose. It’s exciting to know there’s always room for improvement, but it’s also daunting. I’ll continue to fine-tune this model and post updates when I do.

That’s it for now — See you soon.