Mapping the UK’s Traffic Accident Hotspots

João Paulo Figueira
Towards Data Science
5 min read · Aug 19, 2018


While looking for some interesting geographical data to work with, I came across the Road Safety Data published by the UK government. This is a very comprehensive road accident data set that includes the incident’s geographical coordinates, as well as other related data such as the local weather conditions, visibility, police attendance and more. There is data available for as far back as 2009, and all the way up to 2016, so this is a very interesting data set to explore for geographical and machine learning purposes.

When I looked at the data, I immediately thought about visualizing it on a map. Maybe it would be possible to figure out the areas with higher traffic accident density? The naïve approach of dumping all the points into an interactive map proved impossible due to the sheer size of a year’s worth of data, over 136K points for 2016 alone. From my experience, that is way too much for an interactive web-based map. Another approach is needed here, and the obvious idea is to explore density-based clustering algorithms for this purpose.

Unlike clustering algorithms that partition the whole input space into complementary areas or clusters, here we are only concerned with the areas where the traffic accident density is higher, and we discard all other points as noise. For each of these higher-density areas, we will create a geofence that acts as an envelope around it and use it as a graphical representation of the points contained therein. This new geographical entity, a polygon, can be stored in a geographic database and later be used for purposes such as driving assistance when negotiating the very busy streets of the United Kingdom. Imagine an extra feature of your vehicle’s GPS system that would notify you about entering a road accident hotspot, much like when it warns you about approaching speed cameras. I bet that you would try to drive more safely there, wouldn’t you?

Let’s imagine that we are tasked with implementing such a system using this data from the UK Government. We must somehow convert that long sequence of geographic locations into geofences that encircle the areas with a higher geographic density of road accidents. With a geofence, which is a polygon expressed in geographic coordinates, we can easily test whether your vehicle is approaching one such hotspot, has entered it, or has left it.
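To make the geofence test concrete, here is a minimal sketch using shapely. The polygon, the track, and the `geofence_events` helper are all hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
from shapely.geometry import Point, Polygon

# Hypothetical hotspot geofence as (longitude, latitude) pairs; not real data.
hotspot = Polygon([
    (-0.1280, 51.5070),
    (-0.1260, 51.5070),
    (-0.1260, 51.5085),
    (-0.1280, 51.5085),
])


def geofence_events(fence, positions):
    """Yield ("enter" | "exit", position) as a vehicle moves along positions."""
    inside = False
    for lon, lat in positions:
        now_inside = fence.contains(Point(lon, lat))
        if now_inside and not inside:
            yield ("enter", (lon, lat))
        elif inside and not now_inside:
            yield ("exit", (lon, lat))
        inside = now_inside


# Hypothetical sequence of GPS fixes crossing the hotspot from west to east
track = [
    (-0.1300, 51.5075),
    (-0.1270, 51.5075),
    (-0.1265, 51.5080),
    (-0.1250, 51.5080),
]
events = list(geofence_events(hotspot, track))
```

Detecting an approach could probably be handled the same way, by testing against a slightly enlarged copy of the fence.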

Density-based Clustering

So, in order to detect the accident hotspots, we must find the areas with a high density of accident locations and draw a polygon around each one of them. First, we must clarify what we mean by density, and how to measure it. Also, we must understand what to do with low-density areas.

The fundamental idea behind density-based techniques for data analysis is that the dataset of interest represents a sample from an unknown probability density function (PDF), which describes the mechanism or mechanisms responsible for producing the observed data. [1]

Here, we will define clusters as the regions that enclose high-density areas, while all other points will be considered noise and thus discarded from the analysis. There are several algorithms that handle this type of data, and the one I selected for this article is DBSCAN [2].

The main reason why we recognize the clusters is that within each cluster we have a typical density of points which is considerably higher than outside of the cluster. Furthermore, the density within the areas of noise is lower than the density in any of the clusters. [2]

You can read a very good description of this clustering algorithm here, and the implementation used in this article was provided by scikit-learn [3]. Note that there are two very important parameters to set: the maximum distance at which two points are still considered neighbors (eps) and the minimum number of neighboring points required to form a dense region (minPts). To get a better understanding of how these parameters work and how they affect the clustering result, you can set them separately in a notebook cell:

Here, the minPts parameter was creatively named num_samples. These parameters determine what belongs in a cluster and what is to be considered as noise, so these will have a direct impact on the number and size of the final set of clusters.
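A parameter cell of that kind might look like the sketch below. The values are illustrative only, since the ones used for the article are not stated here. The conversion to radians assumes that the clustering is run with a haversine metric, which is an assumption on my part.

```python
# Illustrative values only; the article does not state the ones it used.
eps = 100.0       # neighborhood radius, in meters
num_samples = 10  # the article's name for minPts

# Assumption: the clustering uses the haversine metric, which works on angles
# in radians, so the metric radius is divided by the Earth's mean radius.
EARTH_RADIUS_M = 6_371_008.8
eps_in_radians = eps / EARTH_RADIUS_M
```

Raising eps or lowering num_samples should make clusters larger and more numerous, while the opposite pushes more points into the noise category.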

Running DBSCAN is actually quite simple:

One-liner for running DBSCAN. See the GitHub repository for more information.

After running DBSCAN on the data, we end up with a collection of clusters and their corresponding points. Noise points are marked with cluster number -1 and excluded from our analysis.
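The clustering step could be sketched as follows with scikit-learn's `DBSCAN`. The synthetic points, the parameter values, and the choice of the haversine metric are assumptions made for illustration, not necessarily what the original notebook does.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_008.8
eps = 100.0      # meters (illustrative)
num_samples = 5  # illustrative

# Synthetic (latitude, longitude) points standing in for accident locations:
# a tight group of 20 points plus three isolated ones.
rng = np.random.default_rng(42)
dense = np.array([51.5074, -0.1278]) + rng.normal(scale=1e-4, size=(20, 2))
isolated = np.array([[51.60, -0.20], [51.40, 0.05], [51.55, 0.10]])
coords = np.vstack([dense, isolated])

# The haversine metric expects [latitude, longitude] in radians,
# and eps expressed as an angle rather than a length.
labels = DBSCAN(
    eps=eps / EARTH_RADIUS_M,
    min_samples=num_samples,
    metric="haversine",
    algorithm="ball_tree",
).fit_predict(np.radians(coords))

# Noise points carry the label -1 and are dropped from the analysis
clustered = coords[labels != -1]
```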

Bubbles

Now is the time to do interesting things with these data, and in this article, we will use the shape of each cluster’s cloud to draw a geofence. There are several strategies to do this, like drawing a convex hull or even a concave hull, but here I will use a very simple approach that could be called “coalescing bubbles”. The idea is quite simple: draw a circle around each point and merge them all together. Like this:

The “coalescing bubbles” process of geofence calculation.

This is a two-step process: we first “inflate” all location points into circles of a given radius, and then merge all the circles into a single polygon. The circle creation (buffering) code is as follows:

The projection code is needed to convert between meters, the unit we use for the circle radius, and geographic coordinates (latitude and longitude pairs). As for the radius, we use 0.6 times eps to avoid having very large circles.
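One possible way to write the buffering step is sketched below, using pyproj's `Transformer` together with `shapely.ops.transform`. The `buffer_in_meters` helper and the azimuthal equidistant projection centered on each point are my assumptions; the original notebook may handle the projection differently.

```python
import pyproj
from shapely.geometry import Point
from shapely.ops import transform

WGS84 = pyproj.CRS.from_epsg(4326)


def buffer_in_meters(lon, lat, radius_m):
    """Return a lon/lat polygon approximating a circle of radius_m meters."""
    # Azimuthal equidistant projection centered on the point: distances from
    # the center are preserved, so a planar buffer is a real-world circle.
    local = pyproj.CRS.from_proj4(
        f"+proj=aeqd +lat_0={lat} +lon_0={lon} +datum=WGS84 +units=m"
    )
    to_wgs84 = pyproj.Transformer.from_crs(local, WGS84, always_xy=True).transform
    # Buffer around the projection origin, then map back to lon/lat
    return transform(to_wgs84, Point(0, 0).buffer(radius_m))


eps = 100.0         # illustrative value, in meters
radius = 0.6 * eps  # circle radius as described in the text
bubble = buffer_in_meters(-0.1278, 51.5074, radius)
```

At London's latitude the resulting shape is wider in degrees of longitude than in degrees of latitude, which is what one would expect from a true circle on the ground.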

Let’s see how this works in code. First, we must group the points by their cluster identifiers. Remember that noise points are marked with -1.
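As a rough sketch, the grouping could be done with a pandas `groupby`. The column names and the sample rows below are hypothetical.

```python
import pandas as pd

# Hypothetical column names; the labels would come from the clustering step.
df = pd.DataFrame({
    "longitude": [-0.1278, -0.1279, -0.1277, 0.1000, -0.3000, -0.3001],
    "latitude": [51.5074, 51.5075, 51.5073, 51.6000, 51.4000, 51.4001],
    "cluster": [0, 0, 0, -1, 1, 1],
})

# Drop the noise points, then collect the (lon, lat) pairs of each cluster
points_by_cluster = {
    cluster_id: list(zip(group["longitude"], group["latitude"]))
    for cluster_id, group in df[df["cluster"] != -1].groupby("cluster")
}
```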

Now we can start the bubbling process. For increased efficiency, I am using the cascaded_union function from shapely.

Bubble creation process
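The merge itself can be sketched in a few lines. Note that this sketch swaps in shapely's `unary_union`, which recent Shapely releases recommend over the deprecated `cascaded_union`. To keep it short, it also assumes the points are already in a projected coordinate system measured in meters. The `coalesce_bubbles` helper and the sample points are hypothetical.

```python
from shapely.geometry import Point
from shapely.ops import unary_union


def coalesce_bubbles(points_xy, radius):
    """Buffer each point into a circle and merge all circles together."""
    return unary_union([Point(x, y).buffer(radius) for x, y in points_xy])


radius = 60.0  # 0.6 * eps for an illustrative eps of 100 m

# Hypothetical cluster in projected coordinates (meters)
cluster_points = [(0.0, 0.0), (50.0, 0.0), (100.0, 20.0)]
geofence = coalesce_bubbles(cluster_points, radius)
```

Because neighboring points here are closer together than two radii, the circles overlap and the result is a single polygon. Circles that do not touch would come back as a multi-part geometry instead.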

Now we can create a GeoDataFrame using the lists created above, and simply plot it. It’s as easy as this:

Create a geopandas GeoDataFrame and plot it.
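A minimal sketch of that step follows, with hypothetical lists standing in for the cluster identifiers and merged polygons produced earlier. The placeholder polygons are buffered directly in degrees purely to keep the example short.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical inputs: one merged polygon per cluster identifier.
cluster_ids = [0, 1]
polygons = [
    Point(-0.1278, 51.5074).buffer(0.001),
    Point(-0.0900, 51.5200).buffer(0.001),
]

# Geometries are in longitude/latitude, hence the WGS84 coordinate system
gdf = gpd.GeoDataFrame(
    {"cluster": cluster_ids},
    geometry=polygons,
    crs="EPSG:4326",
)
```

Calling `gdf.plot()` on this frame should then draw the polygons, provided matplotlib is installed.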

Finally, we can send the whole thing to an interactive map with two lines of code:

Show the interactive map.

The full code is available in the associated GitHub repo. Enjoy!

London traffic accident hotspots example using data between 2015 and 2016.

Required Packages

In order to run the notebook, you must first install a few packages, namely geopandas and all its dependencies. This is not an easy task, but thankfully Geoff Boeing made it a bit easier for us in his excellent blog post: Using geopandas on Windows.

You will also need to install the descartes package in order to render the polygons on the map.

Finally, you will also need mplleaflet to render the interactive map in a browser.

References

[1] Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

[2] Ester, M., Kriegel, H.P., et al. (1996) A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD, 226–231

[3] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.

[4] Making maps in Python, Michelle Fullwood
