# Using Machine Learning to Predict Car Accident Risk

May 3, 2018

This article reflects some of the work we are doing at Esri to define spatially focused artificial intelligence and machine learning. The opinions in this article are my own, and not necessarily those of my employer. This article is intended to be a simple technical introduction to one application of geospatial machine learning, not a fully mature solution. There's a lot of excitement and some great, truly innovative work going on at Esri, and I'm very excited to be a part of it!

Car accidents are a huge problem in our world. Nearly 1.3 million people die internationally every year from car accidents and in addition up to 50 million people are injured (ASIRT). Can machine learning help save lives? I believe the answer is yes, and this article details one possible approach.

Many governments collect accident records and make these data publicly available, along with numerous road infrastructure data sets. We will make use of publicly available road infrastructure data and weather feeds to attempt to predict accident risk per road segment, per hour, within the state of Utah using supervised machine learning.

(Disclaimer: I’m using a slightly different accident data set that goes back to 2010, but isn’t available online)

We pose car accident risk prediction as a classification problem with two labels (accident and no accident). It could equally be posed as a regression problem (number of accidents), but on our timescale (one hour) we don't expect to see more than one accident per road segment, so this simplifies the problem a bit. There are of course other approaches, but this is the one we take here. Traffic is commonly modeled with a Poisson or negative binomial model; choosing a small road segment and time interval allows us to treat each observation as a Bernoulli random variable (and thus use a cross-entropy loss function as the objective).
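Concretely, treating each road-segment/hour observation as a Bernoulli trial with predicted accident probability $\hat{p}_i$ gives the standard binary cross-entropy objective:

```latex
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \,\Big]
```

where $y_i \in \{0, 1\}$ indicates whether an accident occurred in that segment-hour.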

We can use the seven years and roughly half a million car accident records as our positive examples. You might then ask, you have positive labels, where are your negative labels? Great question! Every single road segment/hour combination is a possible negative example. Over 7 years and 400,000 discrete road segments, this amounts to roughly 24.5 BILLION potential negative examples.
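The 24.5 billion figure is straightforward arithmetic:

```python
# Back-of-the-envelope count of potential negative examples
segments = 400_000
hours = 24 * 365 * 7                      # seven years of hourly intervals
potential_negatives = segments * hours
print(potential_negatives)                # 24,528,000,000 ≈ 24.5 billion
```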

Machine learning practitioners will notice an issue here, namely, class imbalance. SEVERE class imbalance. Essentially, if we were to use all of this data to train a model, our model would be heavily biased towards no accidents. This is a problem if we want to estimate accident risk.

To counter this problem, we don't use all 24.5B negative examples; instead, we take a sampling approach that is detailed later in this article.

What do nearly half a million accidents look like?

Anyone who commutes definitely understands the impact of time on car accidents. We can visualize what this looks like for 7 years of accidents on the following chart:

This follows our intuition: accidents are occurring mostly on weekday afternoons during peak rush hour. Another observation by looking at the vertical cross section is that accidents tend to peak in the December/January time-frame. Utah has frequent heavy snow and ice during this time so this is certainly not unexpected. This highlights the importance of good weather data as an input to this model. Utah sees about 15 accidents on an average day at the peak of rush hour.

Now that we know what we want to predict, what are the inputs? What could cause a car accident? The answer, of course, is numerous factors, some of which we include in this analysis:

• Weather (temperature, wind speed, visibility, rainy/snowy/icy, snow depth, rain depth, etc)
• Time features: hour of day, day of the week, month of the year, solar azimuth/elevation
• Static features such as speed limit, road curvature, average traffic volumes, proximity to intersections, road north/south/east/west alignment, road width, road surface type, etc
• Human factors such as population density and distractions such as billboards
• Graph derived features of the road network such as centrality and flow
• Countless others

This is where the geospatial portion of this analysis becomes important. This is inherently a spatial problem and our machine learning model needs to take into account many different geospatial data sources and their relationship with each other. This will include many geoprocessing operations, which can be computationally expensive. For this purpose, I’m using the ArcGIS platform.

(Disclaimer: I’m an Esri employee, but I came from an open source geo background. You can definitely perform most of this analysis without using ArcGIS, but it will be more challenging. If you aren’t an ArcGIS user, I still highly recommend taking a look at the ArcGIS API for Python, if only for the data wrangling capabilities as much of this data is available from various ArcGIS based services. The developers of this API have taken care to fallback on many open source utilities, such as shapely when ArcGIS isn’t available. A spatial database such as PostgreSQL with PostGIS will go a long way.)

There are really two distinct parts of the inputs: the static features and the dynamic features.

The static features are the parts of the input data that, for the most part, do not change with time. This includes features derived from the road geometry, such as curvature, and other properties, such as speed limit or population density. Of course, these aren't static per se, but they change slowly enough that we can treat them as constant for all intents and purposes.

The dynamic features change depending on when we are making the prediction. These are the weather feeds, solar geometry, and time variables (hour, month, day, etc).

We need to compute all of these features for each of our roughly 400,000 road segments. We scripted this process using the ArcPy Python library included with ArcGIS Pro. Let's take a quick look at an example:

```python
import arcpy

billboards_url = 'https://maps.udot.utah.gov/arcgis/rest/services/FI_Mandli2012/MapServer/2'

# Calc proximity to billboard
_ = arcpy.analysis.Near('centerlines_merged', billboards_url)
_ = arcpy.management.CalculateField('centerlines_merged',
                                    'proximity_to_billboard',
                                    '!NEAR_DIST!')
_ = arcpy.management.DeleteField('centerlines_merged',
                                 ['NEAR_DIST', 'NEAR_FID'])
```

The above code snippet uses the "Near" tool to find the proximity to the nearest billboard for every road in our centerlines data. We computed numerous proximity-based features like this to form our static feature dataset.

We also had to derive features from the road geometries themselves. For example, to estimate road curvature we used "sinuosity" as the metric. Sinuosity is the ratio between the path length and the shortest distance between the endpoints. Again, using ArcPy, we calculated this:

```python
# Calc Sinuosity
code_block = '''
import math

def getSinuosity(shp):
    x0 = shp.firstPoint.x
    y0 = shp.firstPoint.y
    x1 = shp.lastPoint.x
    y1 = shp.lastPoint.y
    euclid = math.sqrt((x0 - x1)**2 + (y0 - y1)**2)
    length = shp.length
    if euclid > 0:
        return length / euclid
    return 1.0
'''
_ = arcpy.management.CalculateField('centerlines_merged',
                                    'sinuosity',
                                    'getSinuosity(!Shape!)',
                                    code_block=code_block)
```

Let’s talk about weather now. There are many different weather feeds but we chose to use a reliable hourly weather source from NOAA. We have a handful of weather stations, but we need to know the weather at each road segment. One approach is to interpolate the weather at the surface stations to the individual road segments. To do this, we can use a technique known as “kriging” in the geostatistics community or Gaussian process regression in the machine learning community. ArcGIS has a built in “empirical Bayesian kriging” tool in the geostatistics toolbox that contains a robust implementation of this technique that uses an empirical prior distribution based on the data and removes a lot of the parameter twiddling. If this isn’t an option for you, there are other techniques such as inverse distance weighting or simple spatial joins (I did this initially for simplicity’s sake). If you have other data to estimate more precisely how geography affects weather features (such as elevation or more sophisticated climate models), you could load those into a geographically weighted regression model to gain even more accuracy.
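As a minimal sketch of the interpolation idea, here is Gaussian process regression (the machine-learning analogue of kriging) applied to a handful of stations with scikit-learn. The station coordinates and temperature readings are made-up example values, not the NOAA feed:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# (x, y) locations of weather stations and their observed temperature (°C)
station_xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
station_temp = np.array([5.0, 7.0, 4.0, 6.0])

# RBF kernel models spatial correlation; WhiteKernel absorbs sensor noise
kernel = RBF(length_scale=5.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(station_xy, station_temp)

# Predict temperature (and uncertainty) at road-segment centroids
segment_xy = np.array([[5.0, 5.0], [2.0, 8.0]])
temps, std = gp.predict(segment_xy, return_std=True)
```

The predictive standard deviation is a nice bonus: segments far from any station get wide uncertainty bands, which the empirical Bayesian kriging tool handles in a similar spirit.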

In summary, there are numerous spatial operations performed to construct a useful feature set for accident prediction. These features will then be used to create a training set for the supervised machine learning model.

With the geospatial processing completed, we can turn our attention to actually building the training set. For this we use the ArcGIS Python API, pandas, and a few other miscellaneous Python libraries. Most of this is standard data wrangling, but a critical piece of any of this working is the creation of negative samples; that is, what are counter examples to when accidents occurred.

One approach is to randomly sample some number of road/time combinations when accidents didn't occur, but this has some shortcomings. There are many times and roads where accidents simply do not occur often; the much more important problem is differentiating accident vs. no accident on roads where accidents happen frequently. What is it that causes accidents?

We chose to use a sampling approach that builds up a set of negative examples that are very similar to our positive examples so that the machine learning model can learn to find the fine differences between when there is and isn’t an accident. Of course, there’s an element of randomness to it as well so we also sample situations that are very different. The approach is as follows:

1. Randomly select an accident record from positive examples
2. Randomly alter: the road segment, the hour of the day, or the day of the year.
3. If the new sample isn't within the accident records, add it to the list of negative samples
4. Repeat until we have a large number of negative samples (a few times the number of positive samples)
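The steps above can be sketched as follows. The `accidents` set of `(segment_id, day_of_year, hour)` keys and all names here are illustrative, not the actual pipeline code:

```python
import random

def sample_negatives(accidents, n_segments, n_negatives, seed=42):
    """Build negative examples by perturbing one attribute of real accidents."""
    rng = random.Random(seed)
    positives = list(accidents)
    negatives = set()
    while len(negatives) < n_negatives:
        seg, day, hour = rng.choice(positives)   # 1. pick a real accident
        field = rng.randrange(3)                 # 2. alter segment, hour, or day
        if field == 0:
            seg = rng.randrange(n_segments)
        elif field == 1:
            hour = rng.randrange(24)
        else:
            day = rng.randrange(365)
        candidate = (seg, day, hour)
        if candidate not in accidents:           # 3. keep only non-accidents
            negatives.add(candidate)
    return negatives                             # 4. loop until enough samples
```

Because only one attribute is perturbed at a time, most negatives stay close to a real accident in feature space, which is exactly what makes the training set hard.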

This gives us a training set that is challenging to work with because it’s very hard to tell the positive and negative examples apart. This isn’t an issue — this is a difficult problem and we’re not concerned with making our numbers look good, we care about practical results. This is a commonly used approach for situations such as this.

Categorical variables such as hour, weekday, and month are one-hot encoded. All continuous variables are transformed to z-scores using the scikit-learn StandardScaler. We also apply a logarithmic transform to the sinuosity, since most values are near 1 and we want to pick up more on the small differences than the large ones (long, windy roads).
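A minimal sketch of that preprocessing with pandas and scikit-learn (the toy DataFrame and column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'hour': [7, 17, 23],
    'weekday': [0, 4, 5],
    'temperature': [-2.0, 10.0, 25.0],
    'sinuosity': [1.01, 1.2, 3.5],
})

# Log transform spreads out values near 1 and compresses long, windy roads
df['sinuosity'] = np.log(df['sinuosity'])

# One-hot encode the categorical time features
df = pd.get_dummies(df, columns=['hour', 'weekday'])

# z-score the continuous features
scaler = StandardScaler()
df[['temperature', 'sinuosity']] = scaler.fit_transform(
    df[['temperature', 'sinuosity']])
```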

My go-to machine learning approach is gradient boosting, particularly with the XGBoost library. The approach builds on a very intuitive machine learning concept called a decision tree. Decision trees work by finding splits on different features that separate the labels. Unfortunately, decision trees have a tendency to over-fit to a training set, meaning they don't generalize to new data, which is a crucial property of a predictive model. Gradient boosting works by combining the results from many different decision trees and is extremely fast and powerful. It often outperforms other approaches on many different problems and should be in any data scientist's toolbox; XGBoost is a particularly good implementation. We also trained other models, including a deep neural network, but found that not only did gradient boosting give the best overall performance (ROC AUC), it gave us more insight into why decisions were made. It should be noted that the deep neural network we built achieved higher recall at equal precision, but the curve was steep and the ROC AUC was a little smaller.

I won’t go into detail on the hyperparameter optimization and training, but here are the final model choices:

```python
params = {
    'max_depth': 6,
    'min_child_weight': 5.0,
    'reg_lambda': 1.0,
    'reg_alpha': 0.0,
    'scale_pos_weight': 1.0,
    'eval_metric': 'auc',
    'objective': 'binary:logistic',
    'subsample': 0.8,
    'eta': 0.3
}
```

We trained until convergence with 10 rounds of early stopping, achieving a final ROC AUC of around 0.828 on a holdout set. The final model had 80 trees.

With a threshold of 0.19, we get the following performance characteristics:

```
Test Accuracy:   0.685907494583
Test F1:         0.461524665975
Test Precision:  0.311445366528
Test Recall:     0.890767937497
Test AUC:        0.828257459986
Test AP:         0.388845428164
Train Accuracy:  0.68895115694
Train F1:        0.466528546103
Train Precision: 0.314947399182
Train Recall:    0.899402111551
Train AUC:       0.836489144112
Train AP:        0.410456610829
```

Like I said earlier, we intentionally made the training set difficult to separate because we wanted the results to reflect the true performance of the model. The recall value of 0.89 means we are able to predict nearly 90% of car accidents, and the precision value of 0.31 means we are correct about those predictions about 30% of the time. It’s not perfect, but it’s a great start and definitely tells us something about our ability to predict accidents. There are numerous things we can do to improve the performance of this model, and perhaps I’ll revisit this in a future post.
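For reference, this is how a probability threshold turns model scores into the precision/recall numbers above. The arrays here are tiny illustrative values, not actual model output:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.25, 0.40, 0.15, 0.10, 0.90])

# Anything scoring at or above the chosen threshold is labeled "accident"
y_pred = (y_prob >= 0.19).astype(int)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
```

Lowering the threshold trades precision for recall; 0.19 was chosen to favor catching accidents at the cost of more false alarms.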

Great, now we have a model that achieves decent performance, what can we gather about how this model makes predictions? What features are used? What values are important to consider?

You can see from the above plot that solar position and temperature are most important. There are several reasons why these are important features. Temperature acts as a proxy variable to things that are difficult to measure. For instance temperature has a relationship to the season and time of day during the season. It also tells us how likely the roads are to be icy.

Obviously solar position tells us time of day and season, but it also allows us to model an interesting factor: is the sun in driver’s eyes? Let’s take a look at the split histogram for both solar elevation and road orientation.
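One way such a "sun in the eyes" indicator could be derived from solar geometry and road orientation (the function name and threshold values here are illustrative assumptions, not the actual model feature):

```python
def sun_in_eyes(solar_elevation, solar_azimuth, road_bearing,
                max_elevation=20.0, max_offset=15.0):
    """Low sun roughly aligned with the road axis (angles in degrees)."""
    # Roads carry traffic both ways, so compare bearings modulo 180°
    offset = abs((solar_azimuth - road_bearing) % 180.0)
    offset = min(offset, 180.0 - offset)
    return 0.0 < solar_elevation < max_elevation and offset < max_offset
```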

Going down the list, we notice that population density, accident_counts, proximity_to_billboard, and proximity_to_signal are important features. These certainly make sense from our analysis. You may notice that hour, weekday, and month are broken into multiple features through one-hot encoding; if we aggregate the importance from each hour, they actually make a large contribution to the model. Less common weather features, such as icy (not often reported in our weather feed) and hailing, appear in the model but are not as useful for making predictions.

What about the temporal performance of our model? Can it make predictions over time?

Keep in mind that this is not a time series model; we are simply estimating the expected counts for all road segments and aggregating them. We perform a mean/variance adjustment to the resulting time series since our training set was biased towards accidents.
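A sketch of that adjustment: rescale the aggregated predicted series so its mean and standard deviation match the observed accident counts (the arrays here are illustrative):

```python
import numpy as np

predicted = np.array([30.0, 42.0, 55.0, 38.0])   # summed model risk per hour
observed = np.array([10.0, 18.0, 25.0, 12.0])    # actual accident counts

# Standardize the predictions, then rescale to the observed moments
adjusted = (predicted - predicted.mean()) / predicted.std()
adjusted = adjusted * observed.std() + observed.mean()
```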

Finally, let’s take a look at the resulting model spatially.

You’ll notice from the above graphics that there are plenty of places we’re predicting as high risk, but there are no accidents there. We are attempting to quantify risk using this model, but accidents are still a random occurrence. We believe with more features, especially real time traffic information, construction, important events, higher resolution weather, etc. that we can improve this model significantly, but it seems good for a first start.

We built a decent model to predict accident risk, but what can we do with it? There are numerous possible applications, including the following that we have considered:

• Safe route planning
• Emergency vehicle allocation
• Where to place additional signage (e.g. to warn for curves)

Some engineers at Esri took this example and built a prototype safe routing app! It’s a pretty cool concept, take a look at it here.

That’s all I have today, thank you for reading! Have a better approach? Are we doing something poorly? Post in the comments below!

## GeoAI

Geospatial Artificial Intelligence: thoughts about where AI and GIS intersect

Written by

## Daniel Wilson

I’m the Co-founder and CTO of SeerAI
