Clustering a satellite image with Scikit-learn

Apr 8, 2018

The case study is the Bay of Gibraltar, the bay bordering Gibraltar, a British Overseas Territory on Spain's southern coast. I got the Sentinel-2 multispectral optical image from the Sentinel Hub and took a subset around the area of interest with SNAP.

The Near-Infrared (NIR) band of the studied Sentinel-2 image is well suited to detecting water. Band 8 is the NIR band in Sentinel-2 products and has a 10 m resolution. Water absorbs strongly at these wavelengths, so it appears dark in this band while land appears brighter. We therefore propose to discriminate water from land by clustering this band into two classes using K-means clustering. From this link you can download the subset of the NIR band in GeoTIFF format, which is the one processed in the example below.

K-means clustering is one of the most basic unsupervised classification algorithms out there. Unsupervised means that the classifier doesn’t require a training dataset that was labelled beforehand.

In a nutshell, K-means in its basic form works as explained in this blog post:

The K-means algorithm starts by randomly initializing as many centroids as the number of clusters we eventually want to obtain.
Each point in the dataset is assigned to the cluster whose centroid is the closest (e.g. by Euclidean distance).
At the end of every iteration, the centroid of each cluster is updated to the average of the points assigned to that cluster.
The algorithm stops when the cluster assignments no longer change.
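
To make these steps concrete, here is a minimal NumPy sketch of the procedure for one-dimensional data such as pixel intensities. The function name `kmeans_1d` and its parameters are made up for illustration, and this is not how Scikit-learn implements K-means internally:

```python
import numpy as np


def kmeans_1d(values, k, max_iter=100, seed=0):
    """Toy K-means for 1D data (illustrative helper, not Scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)

    # Initialization: use k randomly chosen data points as the centroids.
    centroids = rng.choice(values, size=k, replace=False)
    labels = None

    for _ in range(max_iter):
        # Assignment: each point goes to the cluster with the nearest centroid.
        distances = np.abs(values[:, None] - centroids[None, :])
        new_labels = distances.argmin(axis=1)

        # Stopping condition: the assignments did not change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Update: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = values[labels == j]
            if members.size:
                centroids[j] = members.mean()

    return labels, centroids


# Three dark ("water-like") and three bright ("land-like") made-up intensities.
labels, centroids = kmeans_1d([0.05, 0.10, 0.08, 0.60, 0.70, 0.65], k=2)
print(labels, centroids)
```

Which cluster ends up with label 0 or 1 depends on the random initialization, so only the grouping itself is meaningful.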

There is no need to worry about implementing K-means in this tutorial, since we are going to use Scikit-learn, which includes many machine learning algorithms, K-means clustering among them.

Before getting into the heart of the matter, we need to import GDAL and the clustering module from Scikit-learn:
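
The original import snippet is not reproduced here, so the following is a sketch of what it could look like. The GDAL Python bindings are imported from the `osgeo` package and K-means lives in `sklearn.cluster`; the availability check and the `HAS_GDAL` name are my own additions, so that a missing GDAL install gives a readable message:

```python
import importlib.util

import numpy as np
from sklearn import cluster  # the K-means implementation lives in sklearn.cluster

# The GDAL Python bindings are exposed through the "osgeo" package.
HAS_GDAL = importlib.util.find_spec("osgeo") is not None
if HAS_GDAL:
    from osgeo import gdal

    gdal.UseExceptions()  # make GDAL raise Python exceptions on errors
else:
    print("GDAL Python bindings not found; install them before reading the image.")
```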

First off, the satellite image is read with the GDAL Python wrapper, and from it we extract the band we are interested in classifying:
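
A possible way to do this is sketched below. The helper name `open_band` and the file name in the usage comment are hypothetical; the band index is 1 on the assumption that the downloaded subset contains only the NIR band:

```python
def open_band(path, band_index=1):
    """Open a raster file with GDAL and return (dataset, band).

    Illustrative helper: requires the GDAL Python bindings to be installed.
    """
    from osgeo import gdal  # imported here so the helper can be defined without GDAL

    gdal.UseExceptions()  # a missing or unreadable file raises instead of returning None
    dataset = gdal.Open(path, gdal.GA_ReadOnly)
    band = dataset.GetRasterBand(band_index)  # GDAL band indices start at 1
    return dataset, band


# Usage (hypothetical file name for the NIR subset):
# dataset, band = open_band("nir_subset.tif")
# print(dataset.RasterXSize, dataset.RasterYSize)
```

Keeping a reference to `dataset` is useful later, since its dimensions are needed when writing the result back to disk.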

Python-gdal makes our lives much easier by reading the data into a NumPy array, which facilitates performing different array operations on it. This will prove useful later when Scikit-learn comes into play to classify the NumPy array:
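
With GDAL, the array comes from calling `ReadAsArray()` on the band. The sketch below uses a tiny made-up array as a stand-in for that result (the values and the `uint16` type are assumptions), just to show the kind of array operations that become available:

```python
import numpy as np

# In a real session this would be: img = band.ReadAsArray()
# Here a small made-up array stands in for the NIR band.
img = np.array(
    [[120, 130, 2500],
     [110, 2400, 2600]],
    dtype=np.uint16,
)

print(img.shape)  # (rows, columns), i.e. height first, then width
print(img.dtype)
print(img.min(), img.max(), img.mean())
```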

The classification is performed at the pixel level (i.e. each pixel is a statistical individual to classify). So prior to the clustering, we first need to preprocess the dataset by reshaping the input image from its original 2D dimensions (“height x width”, since NumPy stores rows first) to a vector of individuals ([[x1] [x2]….[xn]] where xi is the intensity of each pixel):
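
As a sketch, this reshaping is a single NumPy call. The small array below is a made-up stand-in for the NIR band, and the names `img` and `X` are my own choice:

```python
import numpy as np

# Made-up 2 x 3 image standing in for the NIR band.
img = np.array(
    [[0.05, 0.06, 0.40],
     [0.04, 0.42, 0.45]]
)

# -1 lets NumPy infer the number of pixels; 1 is the number of features per pixel.
X = img.reshape((-1, 1))

print(img.shape)
print(X.shape)
```

Scikit-learn expects a 2D array of shape (number of samples, number of features), which is why each intensity ends up in its own one-element row.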

Afterwards, we initialize the classifier by providing the number of clusters as input, and we fit it to the preprocessed dataset to cluster it (no training is needed as explained above):
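
Here is a minimal sketch of that step with `sklearn.cluster.KMeans`. The image is synthetic (a dark "water" half and a bright "land" half with made-up intensities), and `n_init` and `random_state` are set only to make the run reproducible:

```python
import numpy as np
from sklearn import cluster

# Synthetic stand-in for the NIR band: dark water on the left, bright land on the right.
rng = np.random.default_rng(0)
img = np.empty((40, 60))
img[:, :30] = rng.normal(0.05, 0.01, size=(40, 30))
img[:, 30:] = rng.normal(0.40, 0.03, size=(40, 30))

# One row per pixel, one feature (the intensity) per row.
X = img.reshape((-1, 1))

# Two clusters: water and land.
kmeans = cluster.KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_.ravel())  # one centroid intensity per cluster
print(np.bincount(kmeans.labels_))      # number of pixels in each cluster
```

The cluster numbers are arbitrary, so the cluster with the lower centroid is the one to interpret as water.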

The classified image can be retrieved from the labels that were assigned to each pixel. However, this labels array is shaped as a vector and needs to be reshaped as an image with the same dimensions as the input (height x width):
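
A sketch of this step, again on a synthetic image with made-up intensities. The names `X_clustered` and `water_mask` are my own, and picking the water cluster by its lower centroid is an extra step that the text does not describe:

```python
import numpy as np
from sklearn import cluster

# Synthetic stand-in for the NIR band: dark water on the left, bright land on the right.
rng = np.random.default_rng(1)
img = np.empty((20, 30))
img[:, :15] = rng.normal(0.05, 0.01, size=(20, 15))
img[:, 15:] = rng.normal(0.40, 0.03, size=(20, 15))

kmeans = cluster.KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(img.reshape((-1, 1)))

# labels_ has one entry per pixel; give it back the shape of the image.
X_clustered = kmeans.labels_.reshape(img.shape)

# Water absorbs NIR, so the cluster with the lower centroid is taken as water.
water_label = kmeans.cluster_centers_.ravel().argmin()
water_mask = X_clustered == water_label

print(X_clustered.shape)
print(water_mask.sum(), "pixels labelled as water")
```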

We return to GDAL to save the image as a GeoTiff. Similarly to when the original NIR image was opened, we start by creating a dataset with the same dimensions as the input image. Then we save the clustered image array as an individual band in it:
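
The saving step could look like the sketch below. The helper name `save_geotiff` and the file name in the usage comment are hypothetical. Copying the geotransform and projection from the input dataset is my own addition so that the output stays georeferenced:

```python
def save_geotiff(path, array, reference_dataset):
    """Write a 2D label array as a single-band GeoTIFF (illustrative helper).

    reference_dataset is the GDAL dataset of the input image; its
    georeferencing is copied to the output.
    """
    from osgeo import gdal  # imported here so the helper can be defined without GDAL

    gdal.UseExceptions()
    rows, cols = array.shape
    driver = gdal.GetDriverByName("GTiff")

    # Same dimensions as the input, one band, 8-bit values (enough for a few labels).
    out = driver.Create(path, cols, rows, 1, gdal.GDT_Byte)
    out.SetGeoTransform(reference_dataset.GetGeoTransform())
    out.SetProjection(reference_dataset.GetProjection())

    out.GetRasterBand(1).WriteArray(array.astype("uint8"))
    out.FlushCache()  # make sure the data is written to disk
    out = None        # dropping the reference closes the file


# Usage (hypothetical names, with dataset and X_clustered from the earlier steps):
# save_geotiff("nir_clustered.tif", X_clustered, dataset)
```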


Written by Hakim Benoudjit