Using Machine Learning to Predict Car Accident Risk

Published in

GeoAI

12 min readMay 3, 2018

This article reflects some of the work we are doing at Esri to define spatially focused artificial intelligence and machine learning, however: The opinions in this article are my own, and not necessarily the opinions of my employer. This article is intended to be a simple technical introduction to one application of geo-spatial machine learning, not a fully mature solution. There’s a lot of excitement and some great, truly innovative work going on at Esri. I’m very excited to be a part of it!

Car accidents are a huge problem in our world. Nearly 1.3 million people die internationally every year from car accidents and in addition up to 50 million people are injured (ASIRT). Can machine learning help save lives? I believe the answer is yes, and this article details one possible approach.

Many governments collect accident records and make these data publicly available. In addition, numerous road infrastructure data sets. We will make use of publicly available road infrastructure data and weather feeds to attempt to predict accident risk per road segment, per hour within the state of Utah using supervised machine learning.

(Disclaimer: I’m using a slightly different accident data set that goes back to 2010, but isn’t available online)

Our Approach

We pose the car accident risk prediction as a classification problem with two labels (accident and no accident). It could equally be posed as a regression problem (number of accidents), but on our timescale (one hour) we don’t expect to see more than one accident per road segment so this simplifies the problem a bit. There are of course other approaches, but this is the one we take here. Commonly traffic is modeled by a Poisson or Negative binomial model. Choosing a small road segment and time interval allows us to treat each observation as a Bernoulli random variable (and thus use a cross-entropy loss function as the objective)

We can use the seven years and roughly half a million car accident records as our positive examples. You might then ask, you have positive labels, where are your negative labels? Great question! Every single road segment/hour combination is a possible negative example. Over 7 years and 400,000 discrete road segments, this amounts to roughly 24.5 BILLION potential negative examples.

Machine learning practitioners will notice an issue here, namely, class imbalance. SEVERE class imbalance. Essentially, if we were to use all of this data to train a model, our model would be heavily biased towards no accidents. This is a problem if we want to estimate accident risk.

To counter this problem, we instead don’t use all 24.5B negative examples, instead we take a sampling approach that will be detailed later in this article.

Data Exploration

What do nearly half a million accidents look like?

That’s not a picture of the roads: this is just a heatmap of car accidents

Anyone who commutes definitely understands the impact of time on car accidents. We can visualize what this looks like for 7 years of accidents on the following chart:

This follows our intuition: accidents are occurring mostly on weekday afternoons during peak rush hour. Another observation by looking at the vertical cross section is that accidents tend to peak in the December/January time-frame. Utah has frequent heavy snow and ice during this time so this is certainly not unexpected. This highlights the importance of good weather data as an input to this model. Utah sees about 15 accidents on an average day at the peak of rush hour.

Exploration of some of the data using ArcGIS Pro

The inputs?

Now that we know what we want to predict, what are the inputs? What could cause a car accident. The answer, of course, is numerous factors, some of which we include in this analysis.

**Left:** Accidents typically cluster around intersections, particularly signalized ones. **Right:** Accidents happening more frequently in curvy road segments

Weather (temperature, wind speed, visibility, rainy/snowy/icy, snow depth, rain depth, etc)
Time features: hour of day, day of the week, month of the year, solar azimuth/elevation
Static features such as speed limit, road curvature, average traffic volumes, proximity to intersections, road north/south/east/west alignment, road width, road surface type, etc
Human factors such as population density and distractions such as billboards
Graph derived features of the road network such as centrality and flow
Countless others

This is where the geospatial portion of this analysis becomes important. This is inherently a spatial problem and our machine learning model needs to take into account many different geospatial data sources and their relationship with each other. This will include many geoprocessing operations, which can be computationally expensive. For this purpose, I’m using the ArcGIS platform.

(Disclaimer: I’m an Esri employee, but I came from an open source geo background. You can definitely perform most of this analysis without using ArcGIS, but it will be more challenging. If you aren’t an ArcGIS user, I still highly recommend taking a look at the ArcGIS API for Python, if only for the data wrangling capabilities as much of this data is available from various ArcGIS based services. The developers of this API have taken care to fallback on many open source utilities, such as shapely when ArcGIS isn’t available. A spatial database such as PostgreSQL with PostGIS will go a long way.)

There are really two distinct parts of the inputs: the static features and the dynamic features.

The static features are the parts of the input data that, for the most part, do not change with time. This includes features derived from the road geometry, such as curvature or other properties, such as speed limit or population density. Of course, these aren’t static per say, but they are slowly changing so we can treat them as constant for all intents and purposes.

The dynamic features change depending on when we are making the prediction. These are the weather feeds, solar geometry, and time variables (hour, month, day, etc).

We need to compute all of these features for each road segment, of which we have around 400,000. We scripted this process using the Arcpy Python library included with ArcGIS Pro. Let’s take a look at an example of that really quick:

billboards_url = 'https://maps.udot.utah.gov/arcgis/rest/services/FI_Mandli2012/MapServer/2'# Calc proximity to billboard
_ = arcpy.analysis.Near('centerlines_merged',
                        billboards_url)
_ = arcpy.management.CalculateField('centerlines_merged','proximity_to_billboard','!NEAR_DIST!')
_ = arcpy.management.DeleteField('centerlines_merged',['NEAR_DIST','NEAR_FID'])

The above code snippet used the “Near” tool to find the proximity to the nearest billboard for every road in our centerlines data. We computed numerous proximity based features to form our static feature dataset.

Proximity to billboards. This feature captures some possible driver distractions, but also represents heavily traveled areas because advertisements are intended to be seen by a large audience.

We also had to derive features from the road geometries themselves. For example to estimate road curvature we used “sinuosity” as the metric. Sinuosity is the ratio between the path length and the shortest distance between the endpoints. Again, using Arcpy, we calculated this:

# Calc Sinuosity
code_block = \
'''
import math
def getSinuosity(shp):
    x0 = shp.firstPoint.x
    y0 = shp.firstPoint.y
    x1 = shp.lastPoint.x
    y1 = shp.lastPoint.yeuclid = math.sqrt((x0-x1)**2 + (y0-y1)**2)
    length = shp.length
    if euclid > 0:
        return length/euclid
    return 1.0
'''
_ = arcpy.management.CalculateField('centerlines_merged','sinuosity','getSinuosity(!Shape!)',code_block=code_block)

Let’s talk about weather now. There are many different weather feeds but we chose to use a reliable hourly weather source from NOAA. We have a handful of weather stations, but we need to know the weather at each road segment. One approach is to interpolate the weather at the surface stations to the individual road segments. To do this, we can use a technique known as “kriging” in the geostatistics community or Gaussian process regression in the machine learning community. ArcGIS has a built in “empirical Bayesian kriging” tool in the geostatistics toolbox that contains a robust implementation of this technique that uses an empirical prior distribution based on the data and removes a lot of the parameter twiddling. If this isn’t an option for you, there are other techniques such as inverse distance weighting or simple spatial joins (I did this initially for simplicity’s sake). If you have other data to estimate more precisely how geography affects weather features (such as elevation or more sophisticated climate models), you could load those into a geographically weighted regression model to gain even more accuracy.

Temperature interpolated between weather stations

In summary, there are numerous spatial operations performed to construct a useful feature set for accident prediction. These features will then be used to create a training set for the supervised machine learning model.

Building the training set

With the geospatial processing completed, we can turn our attention to actually building the training set. For this we use the ArcGIS Python API, pandas, and a few other miscellaneous Python libraries. Most of this is standard data wrangling, but a critical piece of any of this working is the creation of negative samples; that is, what are counter examples to when accidents occurred.

One approach is to randomly sample some number of roads/times when accidents didn’t occur, but this has some shortfalls. There are a lot of times and roads when accidents simply do not occur often, but the much more important problem to solve is differentiating accident vs not on roads where accidents happen frequently. What is it that causes accidents?

We chose to use a sampling approach that builds up a set of negative examples that are very similar to our positive examples so that the machine learning model can learn to find the fine differences between when there is and isn’t an accident. Of course, there’s an element of randomness to it as well so we also sample situations that are very different. The approach is as follows:

Randomly select an accident record from positive examples
Randomly alter: the road segment, the hour of the day, or the day of the year.
If the new sample isn’t with in the accident records, add to the list of negative samples
Repeat until we have a large number of negative samples (a few times the number of positive samples)

This gives us a training set that is challenging to work with because it’s very hard to tell the positive and negative examples apart. This isn’t an issue — this is a difficult problem and we’re not concerned with making our numbers look good, we care about practical results. This is a commonly used approach for situations such as this.

As shown in the above gist, categorical variables such as hour, weekday, and month are one-hot encoded. All continuous variables are transformed to z-scores using the scikit-learn StandardScaler. We also transform the sinuosity with a logarithmic transform since most values are near 1 and we want to pick up more on the small difference than the large ones (long windy roads).

The Model

My go to machine learning approach is gradient boosting, particularly with the XGBoost library. The approach builds upon a very intuitive machine learning concept called a decision tree. Decision trees work by finding splits on different features that separate the labels. Unfortunately, decision trees have a tendency to over-fit to a training set, meaning they don’t generalize to new data, which is a crucial part of a predictive model. Gradient boosting works by combining the results from many different decision trees together and is extremely fast and powerful. Gradient boosting often times outperforms other approaches for many different problems and should be in any data scientist’s toolbox. XGBoost is a particularly good implementation of gradient boosting. We also trained other models, including a deep neural network, but found that not only did gradient boosting give the best overall performance (ROC AUC), it gave us more insight into why decisions where made. It should be noted that the deep neural network we built achieved higher recall at equal precision, but the curve was steep and the ROC AUC was a little smaller.

I won’t go into detail on the hyperparameter optimization and training, but here are the final model choices:

params = {
    'max_depth':6,
    'min_child_weight': 5.0,
    'reg_lambda': 1.0,
    'reg_alpha':0.0,
    'scale_pos_weight':1.0,
    'eval_metric':'auc',
    'objective':'binary:logistic',
    'subsample':0.8,    
    'eta':0.3
}

We trained until convergence with 10 rounds of early stopping, achieving a final ROC AUC of around 0.828 on a holdout set. The final model had 80 trees.

With a threshold of 0.19, we get the following performance characteristics:

Test Accuracy: 0.685907494583
Test F1: 0.461524665975
Test Precision: 0.311445366528
Test Recall: 0.890767937497
Test AUC: 0.828257459986
Test AP: 0.388845428164
Train Accuracy: 0.68895115694
Train F1: 0.466528546103
Train Precision: 0.314947399182
Train Recall: 0.899402111551
Train AUC: 0.836489144112
Test AP: 0.410456610829

Like I said earlier, we intentionally made the training set difficult to separate because we wanted the results to reflect the true performance of the model. The recall value of 0.89 means we are able to predict nearly 90% of car accidents, and the precision value of 0.31 means we are correct about those predictions about 30% of the time. It’s not perfect, but it’s a great start and definitely tells us something about our ability to predict accidents. There are numerous things we can do to improve the performance of this model, and perhaps I’ll revisit this in a future post.

Some Results

Great, now we have a model that achieves decent performance, what can we gather about how this model makes predictions? What features are used? What values are important to consider?

XGBoost Feature Importance (How often do they appear in a tree to make decisions?)

You can see from the above plot that solar position and temperature are most important. There are several reasons why these are important features. Temperature acts as a proxy variable to things that are difficult to measure. For instance temperature has a relationship to the season and time of day during the season. It also tells us how likely the roads are to be icy.

Obviously solar position tells us time of day and season, but it also allows us to model an interesting factor: is the sun in driver’s eyes? Let’s take a look at the split histogram for both solar elevation and road orientation.

**Left:** Notice the peak around 0 degrees elevation? This is near sunrise. **Right:** There are peaks at both 90 degrees and 0/180 degrees, but the peak at 0/180 degrees is higher. This indicates that east/west roads are more often used by the model to make a determination.

Going down the list we notice population density, accident_counts, proximity_to_billboard, proximity_to_signal are important features. These certainly make sense from our analysis. You may notice hour, weekday, and month are broken into multiple features through one-hot encoding. If we aggregate the importance from each hour, they actually make up a large contribution to the model. Less common weather features, such as icy (not often reported in our weather feed) and hailing appear in the model, but not as useful for making predictions.

What about the temporal performance of our model? Can it make predictions over time?

It’s not perfect, but for the most part we’re doing a good job. We were fairly far off around Feb 25th, but that was a particularly icy/snowy day and there wasn’t enough weather information to make that determination.

Keep in mind that this is not a time series model, we are simply estimating the expected counts for all road segments and aggregating them. We perform a mean/variance adjustment to the resulting time series since we had a training set biased towards accidents.

Finally, let’s take a look at the resulting model spatially.

Generated Risk Map vs Real Accidents: A particularly icy hour/day in December. Only one of these accidents occurred on a low risk segment.

Same hour, different area. This time all accidents occurred on high risk segments.

You’ll notice from the above graphics that there are plenty of places we’re predicting as high risk, but there are no accidents there. We are attempting to quantify risk using this model, but accidents are still a random occurrence. We believe with more features, especially real time traffic information, construction, important events, higher resolution weather, etc. that we can improve this model significantly, but it seems good for a first start.

Why do we care?

We built a decent model to predict accident risk, but what can we do with it? There are numerous possible applications, including the following that we have considered for applications:

Safe route planning
Emergency vehicle allocation
Roadway design
Where to place additional signage (e.g. to warn for curves)

Some engineers at Esri took this example and built a prototype safe routing app! It’s a pretty cool concept, take a look at it here.

That’s all I have today, thank you for reading! Have a better approach? Are we doing something poorly? Post in the comments below!