Geospatial Clustering: Types and Use Cases

Deep dive into all the different kinds of clustering with their use cases.

Anubhav Pattnaik
Locale
7 min read · Jan 21, 2020


Previously, we put together a guide for anyone to step into the geospatial world with the right tools, libraries, and datasets. You can check it out here.

In this article, we cover the types of geospatial clustering, their use cases, and the pros and cons of each.

Geospatial Clustering

Geospatial clustering is the method of grouping a set of spatial objects into groups called “clusters”. Objects within a cluster show a high degree of similarity, whereas the clusters themselves are as dissimilar as possible. The goal of clustering is to generalize and to reveal relationships between spatial and non-spatial attributes.

Let’s understand spatial clustering with a small example.

Suppose you are the head of a food delivery chain in a metro city and you want to find out the preferences of your customers so that you can scale up your business. Since it’s not feasible to look at the details of each and every customer, you segregate them into groups and strategize a business plan for each group/cluster.

Spatial clustering can be divided into five broad types:

1. Partition clustering

2. Hierarchical clustering

3. Fuzzy clustering

4. Density-based clustering

5. Model-based clustering

Let’s take a deep dive and understand each of them in detail!

Partition Clustering

Partition clustering is a segregation of the data points into non-overlapping subsets (clusters) such that each data point is in exactly one subset. Basically, it classifies the data into groups by satisfying these two requirements:

1. Each data point belongs to one cluster only.

2. Each cluster has at least one data point.

Partition clustering is of three types: K-means clustering, K-medoids clustering (PAM), and CLARA (Clustering LARge Applications).

1. K-means Clustering

K-means clustering is a partitioning method that decomposes the dataset into a set of K partitions based on the attributes of the data points, assigning each point to the cluster with the nearest centroid. You can read more about k-means here.
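To make this concrete, here is a minimal sketch using scikit-learn on made-up 2-D delivery-order coordinates; the data and the choice of K = 3 are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: (x, y) coordinates of delivery orders in a city grid.
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),  # neighborhood A
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),  # neighborhood B
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),  # neighborhood C
])

# Decompose the dataset into K = 3 partitions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.labels_[:10])      # cluster assignment of the first 10 points
```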


2. K-medoids clustering/PAM

K-medoids clustering is a partitioning method like K-means clustering, except that each cluster is represented by a medoid: an actual data point in the cluster that has minimum dissimilarity to all other points in the cluster. Check this out to know more about PAM.
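scikit-learn itself doesn’t ship k-medoids, but the companion package scikit-learn-extra does. A minimal sketch, assuming that package is installed:

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

# Illustrative 2-D point cloud.
rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(200, 2))

# method="pam" runs the classic Partitioning Around Medoids swap search.
pam = KMedoids(n_clusters=3, method="pam", random_state=0).fit(points)
print(pam.cluster_centers_)  # each "center" is an actual data point (a medoid)
print(pam.labels_[:10])
```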


3. CLARA

This is an extension of the PAM method for large datasets. Instead of finding medoids for the entire dataset, CLARA considers small samples of the data with fixed size, applies PAM to each sample to generate an optimal set of medoids for that sample, and keeps the medoids that fit the full dataset best.
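There is no standard CLARA implementation in scikit-learn, so here is a hypothetical clara helper sketching the idea under the same scikit-learn-extra assumption as above; the sample count and sample size are illustrative:

```python
import numpy as np
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import pairwise_distances_argmin_min

def clara(points, k, n_samples=5, sample_size=100, seed=0):
    """Sketch of CLARA: run PAM on several small samples, keep the best medoids."""
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(points), size=sample_size, replace=False)
        medoids = KMedoids(n_clusters=k, method="pam").fit(points[idx]).cluster_centers_
        # Score these medoids over the FULL dataset, not just the sample.
        _, dists = pairwise_distances_argmin_min(points, medoids)
        cost = dists.sum()
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids

points = np.random.default_rng(1).uniform(0, 10, size=(5000, 2))
print(clara(points, k=4))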


Use Cases:

Usually, partition-based clustering is used to find groups that have not been explicitly labeled in the data, and it helps assign any new data point to the correct cluster. Businesses use partition-based clustering to segment purchase history, group inventory by sales activity, identify groups in health monitoring, etc.

Uber uses clustering for intent detection in its one-click chat/suggested-reply system [Source]

Hierarchical Clustering

Hierarchical clustering is a clustering method like partition-based clustering, but the way it classifies the data points is different. It first considers each data point to be a separate cluster, then merges the most similar clusters that are closest to each other. It keeps iterating until all clusters are merged.

The two types of hierarchical clustering are agglomerative and divisive.

Agglomerative Hierarchical Clustering

This works with a simple algorithm: the proximity matrix of points/clusters is calculated, and at every iteration the closest points/clusters are merged and the proximity matrix is updated. This continues until one cluster (or K clusters) remains. This can be considered a “bottom-up” approach.
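A minimal sketch using SciPy’s hierarchical-clustering tools: linkage builds the merge tree bottom-up, and fcluster cuts it into a flat K-cluster assignment (the data and K = 2 are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D points in two well-separated blobs.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

# Build the merge tree bottom-up; "ward" merges the pair of clusters whose
# union has the smallest increase in within-cluster variance.
Z = linkage(points, method="ward")

# Cut the tree to obtain a flat assignment into K = 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels[:10])
```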


Divisive Hierarchical Clustering

This works exactly opposite to agglomerative clustering. Initially, all data points are considered to belong to a single cluster. In each iteration, the data points that are not similar are separated from the cluster and treated as a cluster of their own. This process continues until we have K clusters. This can be considered a “top-down” approach.
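Classic divisive algorithms such as DIANA are less common in mainstream libraries, but scikit-learn (version 1.1 and later) ships BisectingKMeans, a practical top-down strategy in this spirit: it starts from one all-encompassing cluster and repeatedly bisects a cluster with 2-means until K clusters remain. A minimal sketch:

```python
import numpy as np
from sklearn.cluster import BisectingKMeans  # requires scikit-learn >= 1.1

# Illustrative 2-D points.
rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(300, 2))

# Top-down (divisive) strategy: repeatedly split a cluster in two.
divisive = BisectingKMeans(n_clusters=4, random_state=0).fit(points)
print(divisive.labels_[:10])
```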


Use Cases:

Though hierarchical clustering can be computationally expensive, it produces intuitive results. It is useful when not much is known about the data at hand, since hierarchical clustering doesn’t need many assumptions. Real-life scenarios where hierarchical clustering is used extensively include mapping viruses during an epidemic and customer segmentation in the banking or retail industry.

Using hierarchical clustering to cluster US Senators into their respective parties: red points are Republicans, blue points are Democrats, black points are independents [Source]

Fuzzy Clustering

In fuzzy c-means clustering, we find the centroids of the data points and then calculate the distance of each data point from the given centroids, repeating until the clusters formed become stable. It differs from partition-based clustering in that it allows data points to be partially classified into more than one cluster.

Fuzzy clustering of satellite imagery [Source]

Every data point can theoretically belong to all groups, with a membership grade ranging between 0 and 1: 0 means the data point is at the farthest possible point from a cluster’s center, and 1 means it coincides with the center.
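A minimal NumPy sketch of the two alternating updates (membership grades, then membership-weighted centers); production implementations exist, e.g. in the scikit-fuzzy package, and the data, c = 3 clusters, and fuzzifier m = 2 here are illustrative:

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means sketch: returns (centers, membership matrix)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships: one row per point, rows sum to 1.
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        um = u ** m
        # Centers are membership-weighted means of the points.
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        # Distance from every point to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)  # avoid division by zero
        # Membership update: closer centers get grades nearer to 1.
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u

X = np.random.default_rng(1).uniform(0, 10, (200, 2))
centers, u = fuzzy_c_means(X)
print(u[0])  # membership grades of the first point across the 3 clusters
```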

Use Cases:

Fuzzy clustering is useful when you need to do image segmentation, for instance when your goal is to segment water, vegetation, and rock areas in satellite images. It is also useful in cases where the number of clusters can’t be decided a priori; in such cases, clusters with weak boundaries can be merged. Fuzzy clustering is computationally expensive compared to K-means, since for each point it calculates the degree to which it belongs to every cluster.

Density-based Clustering

Density-based clustering works by grouping regions of high density and separating them from regions of low density. The most well-known density-based clustering algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

The density is calculated using two parameters:

1. eps: This defines the neighborhood around a data point, i.e. if the distance between two points is less than or equal to eps, they are said to be neighbors.

2. MinPts: This defines the minimum number of data points that form a neighborhood. As a rule of thumb, larger datasets call for a larger MinPts.

The DBSCAN algorithm visits every point; if a point has at least MinPts points within eps, cluster formation starts, otherwise the point is labeled as noise. This continues until a density-connected cluster is complete, and then the algorithm restarts from a new unvisited point.

Left figure: traditional clustering; right figure: DBSCAN clustering (DBSCAN allows clusters to take up any shape)
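A minimal geospatial sketch with scikit-learn’s DBSCAN; the coordinates are made up, and with the haversine metric the inputs must be in radians, so eps becomes an angular distance (neighborhood radius divided by the Earth’s radius):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative lat/lon points (degrees), e.g. sensor or station locations.
rng = np.random.default_rng(0)
coords = np.vstack([
    rng.normal((28.61, 77.21), 0.01, (50, 2)),          # one dense cluster
    rng.normal((28.70, 77.10), 0.01, (50, 2)),          # another dense cluster
    rng.uniform((28.4, 76.9), (28.9, 77.4), (20, 2)),   # scattered noise
])

# eps as an angular distance: neighborhood radius (km) / Earth radius (km).
eps_rad = 1.0 / 6371.0  # ~1 km neighborhood
db = DBSCAN(eps=eps_rad, min_samples=5, metric="haversine").fit(np.radians(coords))
print(set(db.labels_))  # cluster ids; -1 marks noise points
```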

Use Cases:

DBSCAN is mostly used for clustering in planar space. Good results can be achieved when mapping the effects of natural disasters or plotting the locations of weather stations in a city. It can also be used when the data is composed of non-discrete points, and it handles outliers well, since points in low-density regions are labeled as noise. Recommender engines/systems make use of DBSCAN to recommend products/shows to their customers.

Model-Based Clustering

This method of clustering uses certain models for clusters and tries to optimize the fit between the data and the models. In the model-based approach, the data are viewed as coming from a mixture of probability distributions, each of which represents a different cluster. Each component (i.e., cluster) is typically modeled by a normal (Gaussian) distribution.

Expectation-maximization (EM) is a well-known algorithm for fitting such models. A particular model-based clustering algorithm is said to work well when the data conform to the model.
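A minimal sketch using scikit-learn’s GaussianMixture, which fits the mixture with EM; the two elongated blobs are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: two elliptical blobs with different orientations.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), (1.0, 0.3), (150, 2)),
               rng.normal((4, 4), (0.4, 1.2), (150, 2))])

# Fit a 2-component Gaussian mixture with EM; full covariance lets each
# cluster stretch into its own elliptical shape.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.predict(X[:5]))        # hard assignments
print(gmm.predict_proba(X[:5]))  # soft (mixed) membership probabilities
```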

Use Cases:

This is useful in cases where clusters have weak borders and data points have mixed membership between clusters. It is also much more flexible in terms of cluster covariance: cluster assignment is softer, and the clusters can take any shape the fitted distributions allow.

With Locale, we’re committed to making this data accessible to every business with moving assets on the ground! Originally posted here.

Similar Reads:

How Uber uses geospatial data to decide its surge pricing:

Visualizing Tesla Superchargers in France geospatially using Python and Folium, from scratch:

Hi! I am a final year CS undergrad at KIIT University, passionate about data science, economics, climate change, growth hacking and starting up. I love to read anything and everything I lay my eyes and cursor on! You can reach out to me on LinkedIn or Twitter.
