Clustering a satellite image with Scikit-learn
The case study is the Bay of Gibraltar, at the southern tip of Spain, next to Gibraltar, which is a British Overseas Territory. I got the Sentinel-2 multispectral optical image from the Sentinel Hub and took a subset around the area of interest with SNAP.
The near-infrared (NIR) band of the studied Sentinel-2 image is well suited to detecting water. Band 8 is the NIR band in Sentinel-2 products and has a 10 m resolution. Water strongly absorbs near-infrared radiation, so water pixels appear much darker than land in this band. Therefore, we propose to discriminate water from land by clustering this band into two classes using K-means clustering. From this link you can download the subset of the NIR band in GeoTIFF format, which is the file processed in the example below.
K-means clustering is one of the most basic unsupervised classification algorithms out there. Unsupervised means that the classifier doesn’t require a training dataset that was labelled beforehand.
In a nutshell, K-means in its original form works as explained in this blog post:
- The K-means algorithm starts by randomly initializing as many centroids as the number of clusters we eventually want to obtain.
- Each point in the dataset is assigned to the cluster whose centroid is the closest (e.g. by Euclidean distance).
- At the end of every iteration, the centroid of each cluster is updated to the mean of the points assigned to that cluster.
- The algorithm stops when the cluster assignments no longer change.
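For illustration only, the four steps above can be sketched in a few lines of NumPy. This is a minimal sketch, not the Scikit-learn implementation: the function name `kmeans_sketch` and its parameters are made up for this example, and the initial centroids are simply drawn from the data points.

```python
import numpy as np


def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Toy K-means (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # One row per individual, one column per feature.
    points = np.asarray(points, dtype=float).reshape(len(points), -1)

    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Step 4: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 3: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)

    return labels, centroids
```

Calling `kmeans_sketch([1, 2, 3, 101, 102, 103], k=2)` puts the three small values in one cluster and the three large values in the other; which cluster gets label 0 depends on the random initialization.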
There is no need to worry about implementing K-means in this tutorial, since we are going to use Scikit-learn, which includes many machine learning algorithms, K-means clustering among them.
Before getting into the heart of the matter, we need to import GDAL and the clustering module from Scikit-learn:
First off, the satellite image is read with the GDAL Python wrapper, and from it we extract the band we are interested in classifying:
Python-gdal makes our lives much easier by reading the data into a NumPy array, which facilitates performing different array operations on it. This will prove useful later when Scikit-learn comes into play to classify the NumPy array:
The classification is performed at the pixel level (i.e. each pixel represents a statistical individual to classify). So prior to the clustering, we first need to preprocess the dataset by reshaping the input image from its original 2D dimensions (rows x columns) to a column vector of individuals ([[x1] [x2] … [xn]], where xi is the intensity of pixel i):
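The reshaping can be sketched as follows. The tiny `img` array is synthetic and stands in for the NIR band, with low values playing the role of water and high values the role of land.

```python
import numpy as np

# Synthetic 2 x 3 stand-in for the NIR band (not real Sentinel-2 data).
img = np.array([[12, 15, 210],
                [10, 200, 220]], dtype=np.uint16)

# -1 lets NumPy infer the number of rows; the single column is the one
# feature (pixel intensity) that Scikit-learn will cluster on.
X = img.reshape((-1, 1))

print(X.shape)  # (6, 1)
```

Scikit-learn estimators expect a 2D input of shape `(n_samples, n_features)`, which is why the result is a column vector rather than a flat 1D array.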
Afterwards, we initialize the classifier by providing the number of clusters as input, and we fit it to the preprocessed dataset to cluster it (no labelled training data is needed, as explained above):
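A minimal sketch of this step, using synthetic intensities in place of the real pixels. The values, `n_init=10` and `random_state=0` are choices made for this example so that it is reproducible.

```python
import numpy as np
from sklearn import cluster

# Synthetic stand-in for the reshaped band: 500 dark "water" pixels and
# 500 bright "land" pixels (made-up intensities, not real data).
rng = np.random.default_rng(0)
water = rng.normal(300, 30, size=500)
land = rng.normal(2500, 200, size=500)
X = np.concatenate([water, land]).reshape((-1, 1))

# Two clusters: one is expected to capture water, the other land.
k_means = cluster.KMeans(n_clusters=2, n_init=10, random_state=0)
k_means.fit(X)

print(k_means.cluster_centers_.ravel())  # one low and one high centre
```

After fitting, `k_means.labels_` holds one cluster index per pixel and `k_means.cluster_centers_` holds the mean intensity of each cluster.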
The classified image can be retrieved from the labels that were assigned to each pixel. However, this labels array is shaped as a vector and needs to be reshaped back into an image (rows x columns):
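The sketch below shows the reshaping on a small synthetic image; the pixel values and the name `water_label` are assumptions made for this example.

```python
import numpy as np
from sklearn import cluster

# Synthetic 3 x 4 stand-in for the NIR band: low values play the role of water.
img = np.array([[100, 120, 2400, 2500],
                [110, 2300, 2450, 2600],
                [90, 105, 115, 2550]], dtype=np.uint16)

X = img.reshape((-1, 1))
k_means = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# labels_ is a flat vector with one entry per pixel; give it the image shape back.
X_cluster = k_means.labels_.reshape(img.shape)

# Cluster indices are arbitrary, so identify water as the darker cluster.
water_label = int(np.argmin(k_means.cluster_centers_.ravel()))
```

Because K-means numbers its clusters arbitrarily, comparing the cluster centres is a simple way to tell which label corresponds to water.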
We return to GDAL to save the image as a GeoTIFF. As when the original NIR image was opened, we start by creating a dataset with the same dimensions as the input image. Then we save the clustered image array as a single band in it: