Cluster F(x): Nesting PostgreSQL KMeans in DBScan for responsive maps.

tl,dr; You can play with an example cluster map here.

This is a screenshot! Follow the link above to interact with the map.

Cluster maps are indispensable for analysis and representation of ‘big’ point location datasets. I leave it for another discussion what constitutes as big data and whether PostgreSQL may have an edge over the various noSQL platforms.

This article is part of a series which explains the inner workings of XYZ, the open source mapping framework we are actively developing at GEOLYTIX. Please note that at the time of writing this article I still need to update the corresponding documentation on the github repository. All code examples are currently in the dev branch.

XYZ provides endpoints for spatial tables in a PostGIS 2.3+ database. XYZ web applications with a Leaflet.js map interface may consume Cluster, Grid or MVT layers which are requested and processed in the node.js middleware. I have recently written about the nature of XYZ Hex Grid layer here and plan a series on Mapbox Vector Tiles (MVT) layers prior to FOSS4G 2018.

There is a wealth of information in regards to PostGIS clustering on the web. This 2016 blog post by Dan Baston is where we started off. For an excellent introduction (with animations) to the KMeans and DBScan algorithm make sure to read The 5 Clustering Algorithms Data Scientists need to know.

Looking at the GEOLYTIX Retail Points dataset with 13433 retail locations in the UK we create two test cluster.

Fot the DBScan cluster we set a desired distance of 0.23 with a minimum of 1 location in each cluster. This returns 49 cluster from the retail points table. Most of England having a dense distribution of retail locations which are aggregated into one mega cluster with a majority of the remaining cluster distributed on islands and in the Scottish Highlands.

For the KMeans cluster we set a desired return of 50 cluster from the retail points table. These are evenly distributed but islands and coastal areas miss out.

Retail Points / DBScan cluster / KMeans cluster

In a next step we nest the KMeans cluster query within the DBScan query.

We group the cluster in a window function partioned by their cluster id. This applies a distance (0.1) grouping (DBScan) to a set number (10) of cluster which are returned from the KMeans function.

Please ignore the JSON bits for now. We will get on to these later.

450 Cluster are returned from the nested group query. We added a count() to the function in order to set the marker size on the next map. ST_Union() within ST_PointOnSurface() was used to make sure that the point locations were on land.

Retail Points being a GEOLYTIX open source dataset (GPL) we included this table as an endpoint on https://geolytix.XYZ/OPEN.

This link(https) will return a JSON representation of the data layer.

The request parameter are by the geolytix.xyz/open middleware sent on to our database server.

The SQL queries are built, sent and their results processed in the geolytix/xyz/cluster module.

Some work to protect better against SQL Injections, advanced filtering, and a more detailed readme on github are still outstanding.

You can access the geolytix.xyz/open map interface on this link:

Have a play with the cluster settings and explore other datasets within the open locale which can be accessed from the gazetteer.

https://geolytix.xyz/open/?locale=UK

The KMeans is the expected minimum number of cluster groups. These are then further grouped by the DBScan distance. The Retail England Mega Cluster is no more.

The Traffic layer shown on the first screenshot of this article applies a graduate theme to the cluster locations. The colour represents the calculated sum of cars counted in 2016.

The size of a cluster point always represents the count() number of locations that are grouped in the cluster.

The XYZ middleware checks for the number of items contained in the map envelope (bounding box) in order to curtain the max KMeans parameter. A ratio of the data extent and the map envelope is applied to scale the DBScan distance. The Device Pixel Ratio is also applied in order to account for cluster densities on higher resolution mobile devices.

Mobile Display with Window Pixel Ratio 2.

Please check for layer_cluster.js in the xyz development branch on github.

Watch this space for more articles on xyz/cluster and xyz/mvt in the upcoming months.

There will hopefully be a workshop at this years FOSS4G. See me in Dar if you want to talk code or contact GEOLYTIX for a secure hosted solution.

Get in contact first on twitter Dennis Bauszus or @GEOLYTX.