Area Monitoring: How to train a binary classifier for built-up areas

Matic Pecovnik · Sentinel Hub Blog · Feb 9, 2022 · 16 min read
Sentinel-2 observation (2021–06–29) of the region of Spodnja Savinjska dolina, Slovenia, with built-up classifier predictions overlaid in blue.

Why would you train a binary classifier for built-up areas?

Before we dive into the hows, we should probably answer the whys, since these drive the specifics of our effort. We will focus on area monitoring since that is our bread and butter, but the approach is very general and could be easily reused for your own application.

Motivation

Identification of built-up areas provides important input for studying our impact on the environment, for security, and for land administration, particularly in developing countries. Existing datasets in this area (e.g. the Global Urban Footprint by DLR or the Global Human Settlements Layer by JRC) provide good-quality results, but are extremely lengthy to process and, more importantly, static in time, which makes them impossible to use on an ongoing basis. One of the goals of this exercise is to create a workflow where the processing of Sentinel-2 pixels to discern built-up areas is pushed to Sentinel Hub, where it can be done on demand and, if the model permits it, at any time and place on Earth. That way, we can provide a scalable built-up area identification solution as part of our Built-Up use-case in the context of the Global Earth Monitor project.

Another important use-case for the identification of built-up areas is Area Monitoring, where we are most interested in monitoring agricultural parcels (often referred to as features of interest, or FOIs) in the context of “Checks by Monitoring”. Under this umbrella term, we wish to determine the adherence of a parcel to its claimed land use and land cover through the use of our well-established markers.

In some cases it is enough to find evidence of non-conformance to the claim, while in others it is also important to identify the land cover/land use to which the FOI has transitioned. Some of the most common transitions are:

  • between agricultural classes,
  • abandonment of agricultural parcels usually leading to overgrowth,
  • partial or even full transition of a parcel to the built-up land use class by introducing various non-agricultural artificial structures.

The transition of an agricultural parcel to the built-up land use class is the main motivation for the development of the binary classifier we explore in this blog post.

Requirements

Before we started training the model, we carefully considered the requirements. We wish for the model to be executable:

  • both on pixel- and object-level,
  • on a single observation/scene.

These requirements are tied to our desire to transform the model into a custom Sentinel Hub script. This would allow us to use it in the process of downloading signals for agricultural parcels (i.e. objects), where pixel-level information is aggregated into object-level data. Such aggregated data can then be fed into our markers. When investigating an agricultural parcel, this can serve various use-cases:

  • the mean of the thresholded values corresponds to the ratio of built-up pixels of a geometry for a particular observation. By applying multi-month temporal averages, we can find parcels which have consistently high built-up ratios and point them out as likely non-agricultural parcels,
  • a consistently high standard deviation of the thresholded values, or of the probabilities themselves, points to a parcel being heterogeneous (see the sketch after this list).
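To make these two aggregations concrete, here is a minimal sketch with made-up per-pixel probabilities; the names and values are purely illustrative:

```python
import numpy as np

# Hypothetical per-pixel built-up probabilities for one parcel on one observation.
pixel_probs = np.array([0.95, 0.91, 0.88, 0.12, 0.07, 0.83, 0.05, 0.90])

THRESHOLD = 0.5  # the default decision threshold; a stricter 0.8 is discussed below

built_up_mask = pixel_probs > THRESHOLD

built_up_ratio = built_up_mask.mean()  # share of built-up pixels in the parcel
heterogeneity = pixel_probs.std()      # a high std suggests a mixed parcel

print(f"built-up ratio: {built_up_ratio:.2f}, probability std: {heterogeneity:.2f}")
```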

Below is an example of an agricultural parcel, which would be interesting for both uses.

On the left, a Sentinel-2 visualization is shown, while on the right, the geometry of the FOI (in red) and the area covered by Sentinel-2 pixels (black mesh pattern) are overlaid on digital orthophotography. While the parcel is claimed to be grassland, it is clearly partially built-up, which is something we would like to detect.

How to train your binary classifier for built-up areas?

Now that we know why we are doing what we are doing and which requirements should be met, let’s take it step by step and get it done!

Step 1: Training on object- or pixel-level

The first choice we must make is the level of our labeled data. Two apparently distinct, but actually relatively interchangeable options arise immediately:

  • pixel-level, where each pixel has its own label,
  • object-level, where a collection of pixels with the same label are aggregated into one training instance with a single label.

At first, both options seem quite different. How can we train a model on pixels and apply it on much larger objects, and vice versa? In fact, once per-pixel values are averaged into a single data point per object, both approaches are almost completely equivalent. We chose the object-level approach for the following reasons:

  1. Ground truth labels for objects are more readily available, since the majority of developed countries maintain vector layers of various features such as buildings, roads, agricultural parcels, forests and so on. One could intersect these objects with the Sentinel-2 pixel grid and assign ground truth labels based on which object each pixel intersects, but…
  2. slight misalignments of vector layers with the actual ground truth (because of badly aligned layers, lack of maintenance, etc.) could lead to large portions of pixels with incorrect labels. Meanwhile, the same misalignment is relatively forgiving in the object-level approach, since the input features are pixel-averaged across geometries. More concretely, if one out of many pixels is actually vegetation and not built-up, its contribution to the mean is proportional to its representation weight and thus far smaller than if it were mislabeled on its own in a pixel-level approach. Finally, …
  3. creating a collection of geometries is simpler if one just uses or slightly adapts publicly available data. A pixel-level approach would entail much more data manipulation to extract the pixels as well as their corresponding labels.

We recognize two downsides to the object-level approach:

  1. Pixel-level time-series need to be aggregated into objects. You can do that on your own, but that would be relatively painful and slow-going. It is better to use Sentinel Hub’s Statistical API or even the new Batch Statistical API to do the heavy lifting for you,
  2. By aggregating pixels into objects you are losing relatively large numbers of training instances.

I would argue that the second downside is not such a big downside at all. As per the second point in the pros section, we know that pixel-level data will lead to a certain number of incorrect labels. It is always better to train a model on a smaller number of very high-quality labels than on a huge number of mediocre-quality ones. If the ground truth data were perfect, one wouldn’t lose much quality with the pixel-level approach, but that is hardly the case in reality. Following an object-based approach grants you a level of insurance that the quality of the labels will be better than on the pixel level. Perhaps even more importantly, pixels within a specific object are highly correlated, which means there is not much additional information in using ten of these pixels rather than just one.

Step 2: Creating reference data and geometries

Now that we have decided to use the object-level approach, we can start building the reference data with the corresponding geometries that will be used to download the time-series later.

We used two sources of geometries when building our reference data. The first is the publicly available vector layer for land-use in Slovenia. From it, we derived geometries for agricultural parcels with the following land uses/land covers:

  • permanent meadows,
  • other agricultural parcels (corn, wheat, …).

These are treated separately, since on the latter we expect to observe bare soil, which could be mistaken for built-up due to its lack of vegetation. Both types are shown superimposed on digital orthophotography of the area (visualized in QGIS) below:

Meadow vectors (top images). Arable-land vectors (bottom). The geometry of each vector is outlined in red, while the surface that is covered with Sentinel-2 pixels is drawn with a black mesh pattern.

We also used the same vector layer to derive forest and water geometries. Some example vectors for these classes can be seen below:

Water vectors (top images) with rivers (left) and lakes (right). Forest vectors (bottom) with small copses (left) in between agricultural parcels and deeper woods (right).

All of these are of course non-built-up and are treated as such when providing labels to the model later on. Since these are already polygons, there is not much to do but align them to the UTM CRS of the region you will do your training on.

What we are still missing is a vector layer of the various built-up land uses. We recognized a few sub-classes which are mostly distinguished by their ratio of artificial surface to vegetation. Ranked from highest to lowest:

  • buildings,
  • building-adjacent areas (parking, driveways, …),
  • motorways,
  • smaller primary roads.

These were sourced from publicly available OSM data. Roads in this representation are very long lines, so to make them manageable, we first cut them down to size. We provide a handy helper function that we used for this specific task, which you can access by clicking this link.

Note that to have distances in units of meters, we first need to project the road line geometries to UTM. We decided to cut the roads into 100-meter chunks. Additionally, we buffered the motorways by 10 meters and the primary roads by 8 meters to turn the lines into actual polygons. After this treatment we get nice examples for the built-up class, overlaid on orthophoto below; a sketch of the processing follows the image.

Primary roads (top left), building-adjacent surfaces (top right), buildings (bottom left) and motorways (bottom right).
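For reference, here is a minimal sketch, using geopandas and shapely, of how the cutting and buffering could look. The file name is hypothetical and the helper is a simplified stand-in for the one linked above:

```python
import geopandas as gpd
from shapely.geometry import LineString
from shapely.ops import substring

def cut_line(line: LineString, chunk_length: float) -> list:
    """Cut a line into consecutive chunks of (at most) chunk_length meters."""
    return [
        substring(line, start, min(start + chunk_length, line.length))
        for start in range(0, max(int(line.length), 1), int(chunk_length))
    ]

# Hypothetical input: OSM motorway lines, reprojected to UTM so lengths are in meters.
roads = gpd.read_file("osm_motorways.gpkg").to_crs(epsg=32633)  # UTM zone 33N

chunks = [chunk for line in roads.geometry for chunk in cut_line(line, 100)]

# Buffer the 100 m chunks into polygons: 10 m for motorways, 8 m for primary roads.
motorway_polygons = gpd.GeoSeries(chunks, crs=roads.crs).buffer(10)
```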

Finally, we merged the various datasets. We made sure to only include FOIs which have at least one Sentinel-2 pixel. Also, based on our experimentation, we left primary roads out of training, since they tended to decrease the performance of the model. As can be seen in the image, a non-negligible portion of them can be covered by trees, which would probably lead to confusion.

After all was said and done, the training dataset was made up of:

  • 12k meadow + arable-land (approximately half/half) objects,
  • 10k building objects,
  • 5k water objects,
  • 5k forests objects,
  • 2k building-adjacent-area objects,
  • 0.2k motorway objects.

Step 3: Download the time-series

Finally, the time-series were downloaded using the trusty Sentinel Hub Statistical API for L2A data. We chose L2A data since our markers are transitioning to it, so it makes sense for the model to be congruent with that.

There aren’t many good reasons to implement your own solution instead of using the StatAPI. It returns responses like:

… which contain both temporal information as well as the band/index values or other derivatives you define in a Javascript evalscript. Here, only the true-color bands were requested to serve as an example. You can find a full example of how to download data for a specific object in the Sentinel Hub StatAPI documentation. Such a JSON response is very handy, since it can easily be converted into a pandas DataFrame.
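For illustration, a request producing such a response could look roughly like the sketch below, built with the sentinelhub-py package. The polygon, CRS and credentials are placeholders, and the evalscript requests only the true-color bands, matching the example response:

```python
from sentinelhub import CRS, DataCollection, Geometry, SentinelHubStatistical, SHConfig

evalscript = """
//VERSION=3
function setup() {
  return {
    input: [{ bands: ["B02", "B03", "B04", "dataMask"] }],
    output: [
      { id: "bands", bands: 3 },
      { id: "dataMask", bands: 1 }
    ]
  };
}
function evaluatePixel(sample) {
  return {
    bands: [sample.B04, sample.B03, sample.B02],
    dataMask: [sample.dataMask]
  };
}
"""

config = SHConfig()  # assumes Sentinel Hub credentials are already configured

foi = Geometry(foi_polygon, crs=CRS.UTM_33N)  # `foi_polygon` is a shapely polygon

request = SentinelHubStatistical(
    aggregation=SentinelHubStatistical.aggregation(
        evalscript=evalscript,
        time_interval=("2019-01-01", "2019-12-31"),
        aggregation_interval="P1D",  # one entry per available observation
        resolution=(10, 10),         # meters, since the geometry CRS is UTM
    ),
    input_data=[SentinelHubStatistical.input_data(DataCollection.SENTINEL2_L2A)],
    geometry=foi,
    config=config,
)
stats = request.get_data()[0]  # the parsed JSON response for this FOI
```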

Out of all this data, we only need the mean values, since these are equivalent to pixel-level data. If we train a model on the aggregated mean values, we can later easily apply it on the pixel level.
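Assuming the response has the shape sketched above (a single `bands` output), extracting the per-observation means into pandas could look like this:

```python
import pandas as pd

rows = []
for interval in stats["data"]:  # `stats` is the StatAPI response from the sketch above
    record = {"date": interval["interval"]["from"]}
    for band_name, band_stats in interval["outputs"]["bands"]["bands"].items():
        record[f"{band_name}_mean"] = band_stats["stats"]["mean"]
    rows.append(record)

df = pd.DataFrame(rows)  # one row per observation, one column per band mean
```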

The use of the StatAPI is also very handy because you can access month- or even year-long time-series easily and consistently. In our case, we downloaded data for the entire year 2019, since we have plenty of intimate knowledge of that time-period. By feeding the model observations from different periods of the year, we can hopefully make it generalize better and prevent it from getting hung up on the particular circumstances of an observation or time-frame (like the season).

Step 4: Model optimization and training

After building the reference data and downloading Sentinel-2 L2A data for the dataset, we split it into 5 folds based on which H3 hexagons the FOIs fall into. This allows us to test geospatial generalizability by training on specific hexes and using the rest as validation. Admittedly, Slovenia is quite small and relatively homogeneous, but this might be more important in your case.
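A minimal sketch of such a split, assuming the h3 package (v3 API) and FOI centroids in WGS84; the FOI list here is made up:

```python
import h3

# Hypothetical FOI centroids as (id, latitude, longitude) tuples in WGS84.
fois = [("foi-1", 46.06, 14.51), ("foi-2", 46.23, 15.26), ("foi-3", 46.55, 15.65)]

N_FOLDS = 5
H3_RESOLUTION = 5  # coarse hexagons, so that nearby parcels share a fold

fold_of_foi = {
    # The H3 index is a hex string; reading it as an integer gives a
    # deterministic (if arbitrary) assignment of hexagons to folds.
    foi_id: int(h3.geo_to_h3(lat, lon, H3_RESOLUTION), 16) % N_FOLDS
    for foi_id, lat, lon in fois
}
```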

We chose the LightGBM classifier model architecture, since we know it quite well from its previous use in the bare-soil, homogeneity and observation-outlier models.

Training on three of the five folds and using the remaining folds as validation, we ran the SequentialFeatureSelector. We allowed the feature selector to utilize all L2A Sentinel-2 bands as well as some interesting indices and feature ratios. A collection of well-established ones can be found in the public Sentinel Hub custom script repository, but you can always design and test your own. We ended up with: B01, B02, B03, B04, NDVI, NDWI, NDVI_GREEN, NDVI_RE1, NDVI_RE2, NBSI, CL_GREEN and STI.
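A sketch of how such a selection could be set up with scikit-learn's SequentialFeatureSelector; the data here is a synthetic stand-in for the object-level band/index means, and the hyperparameters are illustrative:

```python
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector

# Stand-in data: in the real workflow these are the object-level mean features.
X, y = make_classification(n_samples=500, n_features=30, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(30)])

selector = SequentialFeatureSelector(
    LGBMClassifier(n_estimators=100),
    n_features_to_select=12,  # the number of features we ended up with
    direction="forward",
    scoring="roc_auc",
    cv=3,
)
selector.fit(X, y)

selected_columns = X.columns[selector.get_support()]
```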

Finally, to get clean test-set predictions without losing any labels, we trained the model on four folds and predicted on the fifth, rotating the training and prediction folds. This enabled us to get predictions for all FOIs without contaminating the prediction with a model that saw the specific FOI during training. The metrics we get by evaluating the performance with a 0.5 built-up threshold are:

Metrics table evaluated on the observation level. The default 0.5 threshold is used.
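The fold rotation described above can be expressed compactly with scikit-learn; a sketch under the same assumptions as before, where `X` and `y` are the object-level features and labels and `folds` holds the H3-based fold index of each FOI:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import PredefinedSplit, cross_val_predict

# Train on four folds, predict on the fifth, and rotate; every FOI ends up with
# a prediction from a model that never saw it during training.
cv = PredefinedSplit(test_fold=folds)

probabilities = cross_val_predict(
    LGBMClassifier(n_estimators=100), X, y, cv=cv, method="predict_proba"
)[:, 1]

predictions = probabilities > 0.5  # the default threshold used for the metrics above
```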

Obviously the metrics look very encouraging, but it is always nice to get visual confirmation as well, so we plotted the confusion matrix and ROC curve below:

ROC curve evaluated on the observation level shown on the left with the corresponding AUC. Confusion matrix evaluated on the observation level with the default 0.5 threshold is plotted on the right.

As clearly observed, the model performs very well on the dataset we built for training and evaluation, even with the default threshold. However, since we are keen to increase precision, we will likely use a 0.8 threshold in production.

How to evaluate your built-up binary classifier in the wild?

While evaluating a model on a test dataset is certainly important, it often does not tell the entire story. It is best to evaluate the model’s performance and get insight into its biases by evaluating it on a broad spectrum of pixels it has never seen before.

You could in principle repeat the entire process for another region or another year; however, this is time-, money- and labor-intensive. We suggest using the extremely powerful Sentinel Hub EO Browser. With a few convenient clicks, you can view the Earth through Sentinel-2 at any time-period and any location, at a minute fraction of the time-investment that would otherwise be needed if you downloaded the data like we did for training.

There is only one hiccup we need to resolve before we can use this beautiful tool. EO Browser visualizes observations based on a set of instructions given to it via scripts like the ones used to download the data above, therefore we need to transform our model into the expected format.

Step 5: Converting the model into an evalscript

In the notebook we linked above, you can find a useful code snippet where you specify your model location, its inputs and other parsing parameters. Afterwards, the decision trees of the LightGBM model are parsed into a Javascript script compatible with what Sentinel Hub expects. This is another advantage of using the LightGBM architecture: since it is basically a random forest model on steroids, it can in principle be “stringified” quite easily.
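The linked snippet is the authoritative version; to give a feel for the idea, here is a simplified sketch that walks LightGBM's model dump and emits nested JavaScript ternaries (categorical splits and missing-value handling are omitted):

```python
from lightgbm import LGBMClassifier

def node_to_js(node: dict) -> str:
    """Recursively turn one node of a LightGBM tree into a JavaScript expression."""
    if "leaf_value" in node:  # terminal node: just return its raw score
        return str(node["leaf_value"])
    feature = f"f{node['split_feature']}"  # the n-th input band/index
    left = node_to_js(node["left_child"])
    right = node_to_js(node["right_child"])
    return f"({feature} <= {node['threshold']} ? {left} : {right})"

def model_to_js(model: LGBMClassifier) -> str:
    """Sum the raw tree scores and squash with a sigmoid, as binary LightGBM does."""
    dump = model.booster_.dump_model()
    score = " + ".join(node_to_js(tree["tree_structure"]) for tree in dump["tree_info"])
    return (
        "function predict(f0, f1 /* ... one argument per input feature */) {\n"
        f"  return 1 / (1 + Math.exp(-({score})));\n"
        "}"
    )
```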

Step 6: Checking the behavior of the model

You are done! What remains is to input the evalscript into EO Browser and be mesmerized for the next day or so. You can make your life easier by creating a custom EO Browser configuration that holds your evalscript-defined WMS layer, so you can access the results a bit quicker. You can create it by following the instructions located at this link!

Of course, it usually takes a few iterations to get the model right, which is what makes the iterative build dataset -> training -> evalscript + EO Browser -> amend dataset approach so valuable. One cannot stress enough how much the inclusion of EO Browser in the workflow makes the whole process easier and more intuitive.

As an example, the final dataset we highlighted above was built iteratively. We were not THAT smart to include all sub-classes from the very start. Our first attempt was to train the model on agricultural parcels + various artificial surfaces, without forests and water. The result, visualized in EO Browser and animated as a GIF, is as follows:

Sentinel-2 L2A True color comparison with intermediate (no forests and water) and final models. Clear improvement in forests and in the river is shown.

While built-up pixels successfully trigger the model, so do forests and the river flowing through the town. This was not reflected in the evaluation of the model at first, since neither the train nor the test dataset contained such FOIs. It stood out only after we visualized the model as an evalscript in EO Browser! One cannot overstate our brain’s ability to find patterns in visual data, which lets us quickly find where a model is lacking.

Furthermore, since we didn’t have forests and water in the training/test datasets, we couldn’t have known the model would perform relatively poorly on these classes of pixels. This just goes to show how important it is that the training and test datasets are in line with reality; otherwise your measured performance will not reflect the performance in production.

Let’s look at our model!

Since you came this far, you will now be rewarded with some beautiful images and also a link to play with the model whenever and wherever you want.

DISCLAIMER: Keep in mind that the model was trained on data from Slovenia, and we have only shown that it works well there. We expect that it would struggle in drier climates and in areas with different soil types.

This assumption mostly stems from the fact that we noticed activations of the built-up model coinciding with bare-soil observations. This can be shown quite clearly on a larger scale using EO Browser. By clicking on the image below, you should be able to play with the model yourself inside EO Browser. It might take a second or two to load, since a lot of processing is occurring in the background.

Pixel-wise mask where intensity corresponds to the probability a pixel is built-up. Click on the image to play with the model yourself.

Here we show the pixel-level evaluation of the built-up model, where the intensity of the color corresponds to the probability that the pixel is built-up. We can see quite a few activations on the various fields which are bare at that specific moment.

We also show the same area again, but this time the probability is thresholded at 0.8. Pixels with probabilities larger than the threshold are assigned a value of one, while others are assigned zero, effectively producing a built-up prediction mask.

Pixel-wise mask where the coloring of a pixel corresponds to a built-up mask based on a threshold of 0.8.

Obviously there are far fewer bare-soil pixels mistaken for built-up, but some still persist. We think that encountering bare soil on a soil type the model has never seen before could make it struggle even more.

Other than the apparent systematic issues with bare soil, there aren’t many problems with the model, and it seems to work as advertised by the performance evaluation on the object level. So instead of forcing the reader to read, we would rather offer some nice images. To still make them a bit more “data-sciency”, we took a look at different geographical locations across Slovenia, to get a feeling for how well the model generalizes in that sense. You can click on the images to be taken to the area and date where the image was captured inside EO Browser.

Left: . Right: Spodnja Savinjska Region near Celje.

As can be seen, the model picks up buildings and other artificial structures extremely well. It can clearly distinguish large roads like motorways, but it can also find narrower roads. Of course, that is not entirely up to the model, since some luck is also involved: if a road aligns well with the UTM grid and the corresponding Sentinel-2 pixels, the probability that the model picks it up is large.

Conclusion

We are extremely happy with how the model has turned out, but perhaps even more importantly, we learned a lot while training it. An iterative approach works extremely well, especially coupled with the capabilities of EO Browser.

The model seems to be a success, and hopefully we have highlighted how the reader can build their own from scratch.

Check the Area Monitoring documentation for more information.

The project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement 101004112, the Global Earth Monitor project.

This post is one of the series of blogs related to our work in Area Monitoring. We have decided to openly share our knowledge on this subject as we believe that discussion and comparison of approaches are required among all the groups involved in it. We would welcome any kind of feedback, ideas and lessons learned. For those willing to do it publicly, we are happy to host them at this place.
