HIECTOR: Hierarchical object detector at scale

EO Research · Sentinel Hub Blog · May 5, 2022

Introducing a hierarchical object detection workflow for satellite imagery at large scale. HIECTOR leverages multiple satellite data collections of increasingly detailed spatial resolution for cost-efficient and accurate object detection over large areas. We open-source both the code, which is based on eo-learn, and the pre-trained weights needed to reproduce our prototype. HIECTOR was developed within Query Planet, an ESA/ESRIN-funded project supported by the Φ-lab.

Written by Devis Peressutti. Work performed by Devis Peressutti, Nejc Vesel, Sara Verbič, Matej Aleksandrov, Žiga Lukšič, and Matej Batič.

TL;DR:

We present a hierarchical framework for performing object detection on satellite imagery at scale. HIECTOR uses multiple satellite data collections of increasing spatial resolution to achieve large cost savings, while seeking to preserve high detection accuracy. HIECTOR estimates objects as Oriented Bounding Boxes, and is designed to be agnostic to both the data collections and the detected objects. We evaluated HIECTOR for building detection at the national scale using the Sentinel-2, Airbus SPOT and Airbus Pleiades data collections, and demonstrated that adding detection on Sentinel-2 leads to a cost saving of more than 60% with negligible loss of accuracy. We performed a thorough analysis of the cost-saving vs. accuracy trade-off, and open-sourced the code and pre-trained weights required to replicate our findings.

Introduction

Object detection (OD) is one of the most common tasks in computer vision and image processing, seeking to identify and locate instances of semantic objects in a given image or video. In practice, OD methods estimate bounding boxes surrounding each object, as well as the semantic class each object belongs to. If you are a scientist working in remote sensing and processing satellite imagery, you might have crossed paths with OD only seldom, definitely less often than with semantic image segmentation, i.e. classifying each image pixel, or with image scene classification, i.e. classifying the content of an entire image scene. The reasons for this are quite simple. OD in satellite imagery aims at detecting man-made objects, such as cars, airplanes, bridges, buildings and ships, which require imagery of high spatial resolution, i.e. with a pixel size below 3 m. Such Very High Resolution (VHR) imagery is only available commercially at high prices, limiting the application of OD to small areas, or to outdated aerial imagery.

HIECTOR aims at performing OD in satellite imagery at large scale, e.g. national, continental, and why not, global, in an efficient and cost-effective manner. To achieve this, HIECTOR works in a hierarchical fashion, finding areas of interest in imagery of low spatial resolution, and sequentially performing OD in imagery of increasingly higher spatial resolution. The main assumption behind HIECTOR is that objects of interest are highly sparse, and that buying and processing VHR imagery for the entire area-of-interest (AOI) is therefore inefficient and resource-wise wasteful. HIECTOR was designed to be agnostic to the multi-resolution data collections used, as well as to the types of detected objects. However, we have evaluated HIECTOR for building detection, using three data collections, namely ESA Sentinel-2 (pixel size of 10 m), Airbus SPOT 6/7 (pan-sharpened pixel size of 1.5 m) and Airbus Pleiades (pan-sharpened pixel size of 0.5 m). As a side note, we will try not to mix the terms spatial resolution and pixel size too much, as we are aware the two are not strictly interchangeable.

Figure 1. Example animation showing the three data collections used for the evaluation of HIECTOR for building detection. The images show the Red-Green-Blue bands for Sentinel-2 (Copernicus Sentinel data 2020), SPOT (© AIRBUS DS 2020) and Pleiades (© CNES 2020, Distribution AIRBUS DS) data collections of the same location. The animation highlights differences in spatial resolution, but also differences in acquisition viewing angle, spectral response and geo-location accuracy.

HIECTOR: how it works

HIECTOR is the result of two main contributions:

  • The hierarchical divide-and-conquer approach, which applies OD at each hierarchical level, and proceeds with ordering, buying and processing imagery at higher resolution only for the areas deemed necessary by the preceding level. In concrete terms, given a large AOI for which we want to detect buildings, we first detect built-up area in Sentinel-2 imagery using one OD method; we then order, buy and run OD on Airbus SPOT imagery over the detected built-up area only. OD on SPOT imagery reliably detects medium- to large-sized buildings, while it is less accurate for small buildings in densely packed areas, due to limitations of SPOT imagery's spatial resolution. For such less accurate areas, which can be determined from the SPOT predictions, we proceed with ordering higher spatial resolution imagery, e.g. Airbus Pleiades, and applying OD on such areas only. The final detected buildings are a combination of detections from SPOT and Pleiades imagery. Given that SPOT imagery is roughly an order of magnitude cheaper than Pleiades, and that Pleiades is processed for a small subset of the entire AOI, large cost savings are achieved compared to using Pleiades for the entire AOI. This hierarchical approach can be seamlessly implemented using Sentinel Hub functionalities, including the ordering, ingestion and processing of several commercial VHR optical data collections through the Third-Party Data Import API.
  • The detection of buildings and built-up area as Oriented Bounding Boxes (OBB), using the same OD algorithm. The majority of OD methods in computer vision delineate objects using Horizontal Bounding Boxes (HBB). However, in satellite imagery, objects are naturally not aligned to any axis, and can present any given rotation with respect to the x/y image axes (Fig. 2). In particular for buildings, OBBs represent a better approximation of their footprint compared to HBBs (Fig. 3). Following the evaluation of state-of-the-art methods for OBB detection, we settled on the Single-Stage Rotation-Decoupled Detector (SSRDD) algorithm (paper, code), which ranked high on the DOTA benchmark. We have extended the method to take as input a 4-channel image, where the channels are the Blue, Green, Red, Near Infrared bands of the data collections. For the proposed use-case, we trained one SSRDD model for detection of built-up area on Sentinel-2 imagery, and a separate SSRDD model for building detection on both SPOT and Pleiades imagery.
Figure 2. Example of HBB (centre) and OBB (right) representation of buildings compared to the original (left) building footprint, shown on Pleiades imagery. Since buildings can exhibit any orientation, OBBs better approximate building footprints than HBBs. © CNES 2020, Distribution AIRBUS DS.
Figure 3. Histogram of ratios between the area of the original footprint and the area of the derived HBB (in red) and OBB (in green) for the investigated AOI. Values of 1 denote perfect alignment between the derived bounding box and the building footprint, while values tending to 0 denote very poor overlap between the footprint and the bounding box. Low ratio values for HBBs can happen, for instance, for very narrow and elongated buildings with an orientation of ±45° with respect to the image axes. The peak at 0.5 for HBBs confirms the fact that buildings can exhibit any rotation.
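As an aside, the footprint/HBB and footprint/OBB area ratios of Fig. 3 are straightforward to compute with the shapely library. A minimal sketch, with a hypothetical footprint rotated ~45° with respect to the image axes:

```python
from shapely.geometry import Polygon

# Hypothetical narrow, elongated footprint rotated ~45° w.r.t. the image axes
footprint = Polygon([(0, 0), (10, 10), (12, 8), (2, -2)])

hbb = footprint.envelope                    # horizontal (axis-aligned) bounding box
obb = footprint.minimum_rotated_rectangle   # oriented bounding box

print(f"footprint/HBB area ratio: {footprint.area / hbb.area:.2f}")  # ~0.28, poor fit
print(f"footprint/OBB area ratio: {footprint.area / obb.area:.2f}")  # 1.00, tight fit
```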

Why use OD to detect buildings rather than semantic segmentation, which could provide a better estimate of the footprint? you ask. Good question. There are mainly three reasons why we opted for OD. The first is that, for our application, estimating the footprint is not necessary, as we are more interested in accurately locating (new) buildings in a timely way, providing a monitoring service for urban planning and development. Secondly, when processing imagery of different pixel sizes, OD is a better choice as it is, to a certain degree, independent of the pixel size, and allows estimating OBBs at the sub-pixel level. This means that OBBs estimated on SPOT imagery can exactly match OBBs estimated on Pleiades imagery. Thirdly, deriving vector footprints from semantic segmentation rasters typically requires non-trivial post-processing vectorization techniques, which can be very demanding resource-wise, especially when processing large AOIs. OD, on the contrary, does not require further post-processing of the estimated OBBs.

The following sections describe how we trained the SSRDD models and how we used HIECTOR for inference at large scale, as well as the OD performance results and cost saving evaluation. Don’t miss the final section where links to the open-source code and pre-trained SSRDD model weights are provided!

AOI and building footprints

The AOI chosen to evaluate HIECTOR is Azerbaijan, a country of approximately 86,600 sqkm. Building footprints for the entire country were provided by the State Service on Property Issues. The footprint database collects information from different sources, which can be outdated and therefore incorrect. The aim of HIECTOR is to provide accurate and timely updates to such a database, to inform on the progress of urban development and support urban planning.

Fig. 4 shows the distribution of the buildings across the country, with the areas for which Airbus Pleiades imagery was available for both training and evaluation of the SSRDD algorithm overlaid in red. Despite not being accurate at the building level, the database of buildings could be used to create reference OBBs for training the SSRDD model on Sentinel-2 imagery. In this case, one OBB does not delineate a single building, which cannot be singly discerned at 10 m pixel size; rather, one OBB denotes an area with buildings, with the number of overlapping OBBs proportional to their density. Why on Earth would you do that? you ask. Why not!

Figure 4. Distribution of building footprints across the chosen AOI shown in green. Red polygons denote areas for which Airbus Pleiades imagery was available for training and evaluation of HIECTOR.

Given that the building database contains many outdated entries and missing buildings, it could not be used directly to train the SSRDD model that takes SPOT/Pleiades imagery as input. We therefore manually reviewed ~66,000 buildings of the existing database, distributed across the entire AOI, as shown in Fig. 5, covering a total area of 9 sqkm, which is equivalent to 0.01% of the country's area. The reviewed areas spanned the main cities and villages, ensuring that the variability in building shapes and appearances was well represented in the reviewed database. We used this set of reviewed footprints to generate the OBBs used as reference for training and evaluating the SSRDD model. In our use-case, a building is defined as a roofed and walled structure built for residential or industrial use. To start with, we didn't differentiate between building types, so the SSRDD model only predicts one class.

Figure 5. Areas over the AOI for which we manually reviewed the existing building footprints. Copernicus Sentinel data 2020.

HIECTOR: training

As for any machine learning method, the quality and quantity of the training data largely influence its performance. In the case of HIECTOR, while Sentinel-2 imagery is distributed freely thanks to the Copernicus programme, commercial imagery is not, so we need to be parsimonious when constructing the training dataset for the SSRDD model that uses SPOT/Pleiades imagery.

For this reason, for training the SSRDD model that infers built-up area on Sentinel-2 imagery, we used images acquired over the entire AOI and the entire database of outdated building footprints to build the training dataset. On the other hand, for the SSRDD model jointly trained on SPOT/Pleiades imagery, we only used images acquired over the areas with manually reviewed buildings, corresponding to 0.01% of the AOI. We will show you that, even though this is a very tiny percentage of the AOI, we can achieve promising detection results.

The workflow for training the SSRDD models is the same regardless of the data collection and the area covered by the images/labels, and is summarised in Fig. 6. Given the AOI (or a tiny subset of it), we split it into a regular grid of a given cell size, allowing us to handle the data more efficiently and parallelize the process. For each cell, we retrieve the data collection (e.g. Sentinel-2, SPOT or Pleiades) and the corresponding building footprints, compute OBBs from the footprints as the minimum rotated rectangles, and save these data as EOPatches. Once EOPatches over the AOI are created, we further split them into smaller image chips to facilitate fast parallel loading during training of the SSRDD model. Image chips of different sizes, e.g. 128, 256 and 512 pixels, are extracted and then resampled to the same size, e.g. 256, effectively creating an artificial multi-scale dataset from the same data collection. The image chips and corresponding labels are then divided into train/validation/test folds, the image bands are normalized by their mean and standard deviation, and the SSRDD model is trained until convergence.

Figure 6. Schematic representation of the workflow for training the SSRDD models. The training AOI is split into a grid, and, for each cell, images and OBBs are retrieved. The cell is further split into sub-grids of different sizes to implement an artificial multi-scale dataset. The derived image chips and corresponding OBBs are split into validation folds and the model trained until convergence. © CNES 2020, Distribution AIRBUS DS.
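As an illustration of the gridding step, here is a minimal sketch using the sentinelhub-py package; the AOI geometry and cell size are hypothetical placeholders:

```python
from sentinelhub import CRS, UtmZoneSplitter
from shapely.geometry import box

# Hypothetical AOI in WGS84; in practice this would be the country geometry
aoi = box(49.0, 40.0, 50.0, 41.0)

# Split the AOI into 10 km x 10 km UTM-aligned cells; each cell becomes one EOPatch
splitter = UtmZoneSplitter([aoi], crs=CRS.WGS84, bbox_size=10000)
bbox_list = splitter.get_bbox_list()
print(f"AOI split into {len(bbox_list)} cells")
```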

We modified the data loaders to interface with EOPatches, as well as to seamlessly read/write from and to cloud storage, since the amount of data can quickly reach hundreds of gigabytes. Training was performed on a single GPU, using the weights of a ResNet34 network pre-trained on natural images as the initial training state.
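For illustration, here is a minimal sketch of how an ImageNet-pretrained ResNet34 can be adapted to the 4-channel (Blue, Green, Red, NIR) input mentioned earlier, using PyTorch/torchvision. It shows the general idea rather than the exact SSRDD implementation:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

backbone = resnet34(pretrained=True)

# Replace the 3-channel stem with a 4-channel one, reusing the RGB weights and
# initialising the extra NIR channel with their mean (one common heuristic).
old_conv = backbone.conv1
new_conv = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight
    new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)
backbone.conv1 = new_conv

out = backbone(torch.randn(1, 4, 256, 256))   # sanity check on a dummy 4-band chip
print(out.shape)                               # torch.Size([1, 1000])
```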

HIECTOR: inference at large scale

Once the SSRDD models are trained (one for detection on Sentinel-2 images, one for detection on SPOT/Pleiades images), we are ready to run HIECTOR on our entire AOI, as shown in Fig. 7.

Figure 7. Schematic representation of the inference workflow. Once the SSRDD models are trained, we split the entire AOI into the first grid (light blue cells), and apply OD at the first level of the pyramid, i.e. built-up detection on Sentinel-2 imagery. At the second level, we use a finer grid (orange cells) splitting only the built-up area detected. For such cells, SPOT imagery is acquired and OD applied. At the third level of the pyramid, we determine areas for which higher resolution imagery would be required (red cells), and proceed to order and run OD on Pleiades imagery. OBB detections derived from SPOT (orange cells) and Pleiades (red cells) are finally combined.

While built-up area detections on Sentinel-2 imagery determine the areas for which we will order and process SPOT imagery, as shown in Fig. 8, we need a way to identify areas where higher resolution imagery, e.g. Pleiades, is required.

Figure 8. Example of estimated built-up polygon (in red) using Sentinel-2 imagery. The OBBs predicted by the SSRDD model on Sentinel-2 imagery with a confidence higher than a given threshold are merged into a single polygon. The finer grid is shown in blue. Airbus SPOT imagery will be requested and processed only for the grid cells overlapping with the estimated built-up polygon. Copernicus Sentinel data 2020.

To determine such areas, we defined a drill-down index (DDI) based on the OBBs predicted on SPOT imagery as follows:

DDI = Σ (1 − Pᵢ), for i = 1, …, N

where N is the number of predicted buildings in the cell, and Pᵢ is the confidence score of the i-th estimated OBB. DDI is computed for each cell of a yet finer sub-grid. By this definition, DDI will be larger for cells with a large number of predicted buildings of low confidence. In practice, using the DDI means that Airbus Pleiades imagery is only ordered and processed for a small fraction of the area of interest, as shown in Fig. 9.
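In code, the index reduces to a one-liner; a minimal sketch with hypothetical confidence values:

```python
def drill_down_index(confidences):
    """Sum of (1 - P_i) over the N OBBs predicted in a grid cell."""
    return sum(1.0 - p for p in confidences)

dense_low_conf = [0.35, 0.42, 0.28, 0.55, 0.31]   # many uncertain detections
sparse_high_conf = [0.90, 0.85]                   # few confident detections

print(drill_down_index(dense_low_conf))     # 3.09 -> above 1.5, drill down to Pleiades
print(drill_down_index(sparse_high_conf))   # 0.25 -> SPOT detections suffice
```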

Figure 9. Example of drill-down index values for the Baku area, as imaged by SPOT. Areas of densely packed buildings can be noticed in different locations of the city. The overlaid grid shows the values of the drill-down index for each cell of size 80 m. Darker red values denote drill-down index values greater than 1.5. Such values correlate well with areas of small-sized, dense buildings. On the other hand, cells with small drill-down index values correlate well with larger buildings or sparsely populated areas. © AIRBUS DS 2020.

The following list summarises the steps for HIECTOR inference at large scale:

  1. Split AOI/country into grid with cell size 10000 m with overlapping cells;
  2. Download Sentinel-2 imagery for each cell;
  3. Perform detection on Sentinel-2 imagery;
  4. Merge predicted OBBs into multi-polygon;
  5. Split multi-polygon into grid with cell size 804 m with overlapping cells (Fig. 8);
  6. Order/ingest SPOT only over the grid defined in step 5;
  7. Perform detection on SPOT imagery;
  8. Further split into grid with cell size 80 m;
  9. Compute DDI, as shown in Fig. 9;
  10. Order/ingest Pleiades for cells with DDI > 1.5;
  11. Perform detection on Pleiades imagery;
  12. Combine SPOT/Pleiades detections obtained in steps 7 and 11.

The values of the cell sizes and the DDI threshold can be customised, easing the adaptation of HIECTOR to the detection of objects other than buildings.
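To make steps 4-6 concrete, here is a minimal sketch using shapely; the predicted OBBs, confidence threshold and grid parameters are hypothetical placeholders:

```python
from shapely.geometry import Polygon, box
from shapely.ops import unary_union

# Hypothetical predicted OBBs with confidence scores
predictions = [
    (Polygon([(100, 100), (180, 120), (160, 200), (80, 180)]), 0.85),
    (Polygon([(900, 900), (960, 910), (950, 970), (890, 960)]), 0.15),
]
CONFIDENCE_THRESHOLD = 0.4  # hypothetical value

# Step 4: merge high-confidence OBBs into a single built-up multi-polygon
built_up = unary_union([obb for obb, score in predictions if score >= CONFIDENCE_THRESHOLD])

# Step 5: split into a finer grid of overlapping 804 m cells (step < cell size)
cells = [box(x, y, x + 804, y + 804) for x in range(0, 2000, 700) for y in range(0, 2000, 700)]
selected = [cell for cell in cells if cell.intersects(built_up)]

# Step 6: SPOT imagery would be ordered and ingested only for `selected` cells
print(f"{len(selected)} of {len(cells)} cells overlap built-up areas")
```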

The magic that allows HIECTOR to scale to large AOIs happens behind the curtains. Thanks to the relentless divide-and-conquer strategy, we are left with a large number of data chunks for each cell (i.e. EOPatches), which can be processed independently in parallel. The SSRDD inference step, in fact, is automatically distributed across a large number of cheap CPU processes/instances using Ray and AWS spot instances, as sketched below. The required pre-processing (splitting into image chips, normalization) is computed on-the-fly, further reducing storage costs. Once the OBB predictions for each cell are generated, duplicates over overlapping areas are removed and the remaining OBBs are merged into a single file, automatically handling boring issues like multiple coordinate reference systems.
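A minimal sketch of this distribution pattern with Ray; the stub task stands in for the actual SSRDD inference, and the EOPatch paths are hypothetical:

```python
import ray

ray.init()  # on AWS, this would connect to a cluster of spot instances

@ray.remote
def detect_buildings(eopatch_path):
    # In the real workflow: load the EOPatch, split it into chips on the fly,
    # normalize the bands, run the SSRDD model and return the predicted OBBs.
    return f"OBBs for {eopatch_path}"   # stub standing in for SSRDD inference

eopatch_paths = [f"s3://bucket/eopatches/patch_{i}" for i in range(8)]
futures = [detect_buildings.remote(path) for path in eopatch_paths]
results = ray.get(futures)   # blocks until all cells are processed
print(len(results))
```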

HIECTOR: evaluation and results

Enough chit-chat. Let’s look at the results!

Let’s start with built-up detections on Sentinel-2 imagery (Fig. 10). Built-up areas are accurately detected, even for a few isolated buildings. This is important, as we don’t want to lose any building in the first level of the detection pyramid. What is equally important is that false positive detections are low, in particular in desert or agricultural areas. This ensures that these areas are not processed in the following OD levels, increasing the cost saving related to not buying and processing commercial imagery. When splitting the detected built-up polygons into a finer grid, we retain 37.2% of the entire AOI, losing merely 0.6% of buildings, i.e. false negative detections. This means that, by using Sentinel-2, we process about 63% less data with minimal loss of accuracy.

Figure 10. Example of built-up detections using Sentinel-2 images shown in cyan. OBBs predicted by the SSRDD are merged into a single multi-polygon and used to define areas for which higher resolution imagery will be processed. Copernicus Sentinel data 2020.

As a comparison, we used the “residential”, “industrial” and “building” layers of OpenStreetMap (OSM) to locate built-up areas. Splitting such areas with the same grid as for Sentinel-2, we found a loss of buildings of 2.9%, compared to the 0.6% obtained using Sentinel-2. In fact, we found entire villages not mapped in OSM (Fig. 11), but correctly detected on Sentinel-2. The reliability and accuracy of OSM layers largely depend on the geographical region, with very accurate maps available for Europe and North America. For developing countries, however, the timely information provided by Sentinel-2 leads to more accurate and up-to-date results.

Figure 11. Examples of built-up layers from OpenStreetMap in Azerbaijan. Some villages are not mapped, and the “buildings” layer contains a tiny fraction of the total database of building footprints. Copernicus Sentinel data 2020.

What about building detections in SPOT and Pleiades imagery? Fig. 12 shows some visual results for SPOT, while Fig. 13 shows detection results obtained using Pleiades imagery. In general, SPOT predictions tend to be more accurate for medium- to large-sized buildings, while Pleiades predictions are accurate also on small and densely packed buildings. For both data collections, there is a large number of false positive detections, both on built-up structures other than buildings, e.g. roads, railways and parking lots, and on non-built-up areas, such as agricultural parcels or beaches. The latter can be explained by the absence of such examples in the training dataset, while the former is due to a combination of the low number of training samples and inconsistency in the training labels. There is also an implicit confusion that is harder to deal with, since some structures, like parking lots, can appear identical or very similar to buildings.

Figure 12. Examples of OBB predictions inferred from SPOT imagery (cyan polygons). In general, good agreement between predicted OBBs and reference OBBs can be found, in particular for medium- to large-sized buildings. A high number of False Positives can be seen in areas not seen at training time, i.e. along the sea coast and in agricultural areas. Increasing the amount of such samples during training would alleviate the problem and improve detection accuracy. In addition, multiple predictions can be seen for a single building. Optimisation of the thresholds applied to the results can remove such cases. © AIRBUS DS 2020.
Figure 13. Examples of OBB predictions inferred from Pleiades imagery (yellow polygons). In general, highly accurate detection of small and densely packed buildings can be seen. In such areas SPOT predictions are not reliable due to limitations in spatial resolution. Similar detection accuracy is observed across the entire Azerbaijan, meaning that the model is able to generalise to different geographical locations. © CNES 2020, Distribution AIRBUS DS.

Now that we have looked at some nice pictures, let’s crunch the numbers. Ideally, we want to achieve large cost savings compared to processing the entire AOI with VHR imagery, e.g. Pleiades, while retaining the same detection accuracy. Unfortunately, in real life, the blanket is always short and there is a trade-off between cost saving and detection accuracy, i.e. we can cover our head or our feet, but not both.

To measure detection accuracy, we used the mean Average Precision (mAP), which is commonly used to evaluate OD methods, both on natural images, e.g. the COCO benchmark, and in remote sensing, e.g. the DOTA benchmark. Once the SSRDD model generates predictions, only the OBBs with a pseudo-probability (Proba) larger than the defined threshold are retained. Given these OBBs, overlaps with the reference OBBs are computed and quantified using Intersection over Union (IoU). If the computed IoU is larger than a pre-defined threshold, the detection is considered a True Positive (TP), otherwise it is a False Positive (FP). Reference labels that don’t overlap with any predicted OBB are counted as False Negatives (FN). Precision and Recall values are computed from TP/FP/FN, and mAP is derived from the area under the Precision-Recall curve for a given set of IoU and Proba thresholds.
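For illustration, a minimal sketch of the per-prediction TP/FP decision, using shapely to compute the IoU of rotated boxes; the OBB coordinates, confidence and thresholds are illustrative:

```python
from shapely.geometry import Polygon

def iou(a: Polygon, b: Polygon) -> float:
    """Intersection over Union of two (possibly rotated) bounding boxes."""
    inter = a.intersection(b).area
    return inter / (a.area + b.area - inter)

PROBA_THRESHOLD, IOU_THRESHOLD = 0.2, 0.2   # thresholds reported in Fig. 14

pred = Polygon([(0, 0), (10, 2), (8, 12), (-2, 10)])   # predicted OBB
ref = Polygon([(1, 0), (11, 2), (9, 12), (-1, 10)])    # reference OBB
confidence = 0.7                                        # pseudo-probability

if confidence >= PROBA_THRESHOLD:                       # low-Proba OBBs are discarded
    label = "TP" if iou(pred, ref) >= IOU_THRESHOLD else "FP"
    print(label, f"(IoU = {iou(pred, ref):.2f})")
```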

Fig. 14 reports confidence distributions for OBBs estimated using SPOT and Pleiades, as well as the precision-recall curves for the two data collections. Both confidence and mAP values are larger for predictions inferred on Pleiades imagery, which is to be expected given its higher spatial resolution and ability to discern smaller buildings. The largest difference between the two data collections can be noticed for larger values of recall, meaning that SPOT predictions generate more FNs, missing buildings that are successfully detected in Pleiades imagery.

Figure 14. Distribution of OBB confidence values (left) for both SPOT and Pleiades. Predictions inferred on Pleiades are more confident compared to SPOT predictions. AP curves for Proba=0.2 and IoU=0.2 are shown on the right side. Higher precision is achieved for predictions on Pleiades imagery, in particular for smaller buildings.

Let’s put all of this together and try to quantify the benefits of using a hierarchical building detection scheme. Table 1 compares the savings and the accuracy of HIECTOR to using Pleiades imagery alone for the entire Azerbaijan (second column). Areas with validated ground truth data in Azerbaijan were used as reference. The simple addition of Sentinel-2 as a screening for built-up area leads to x2.6 cost savings with negligible loss of accuracy (remember the 0.6% of additional FNs). Replacing Pleiades imagery with SPOT leads to large cost savings, but at the expense of detection accuracy. The full HIECTOR approach combining the three data collections provides the best trade-off between detection accuracy and cost saving. The behaviour of HIECTOR can be customised by changing the DDI threshold, operating between the fifth column (Sentinel-2 + SPOT, DDI = 2, mAP = 0.333) and the third column (Sentinel-2 + Pleiades, DDI = 0, mAP = 0.449). This behaviour can be fine-tuned depending on the use-case and on the user's requirements. Cost estimates include the subscription to Sentinel Hub, which is peanuts compared to the price of VHR imagery.

Table 1. Cost saving and accuracy analysis of using HIECTOR for building detection in Azerbaijan.

The estimates provided above are of course dependent on the AOI, the percentage of built-up areas present and the type of buildings. In general, given that in the majority of countries built-up areas cover a minority of the land, these cost saving figures should only increase. As for the detection accuracy, the areas of improvement that we are pursuing are the following:

  • improve the labelling consistency of the training data, as the same type of building, e.g. a horse-shoe shape, can be represented differently, e.g. as a single OBB enclosing the entire building, or as 3 OBBs enclosing each building wing. This should increase the model confidence and reduce the number of generated predictions;
  • increase the size of the training data. Recall that for training the SPOT/Pleiades model we used data from only 0.01% of the AOI. This should drastically reduce false positive detections;
  • optimise Proba and IoU thresholds based on the predicted building size, as the parameters are likely to be dependent;
  • reduce acquisition differences between VHR data collections. Building detections are sensitive to differences in viewing angle and sun-azimuth angle, as the building roof appears shifted with respect to its footprint, and elongated shadows cover the areas surrounding the buildings.

HIECTOR: open-source

Finally!

Actually, you know what, we will provide more details about the material that we are open-sourcing in a separate blog post.

BUT, here is a sneak preview:

  • the HIECTOR codebase, based on eo-learn, which allows you to execute both training and inference, is available on GitHub at sentinel-hub/hiector;
  • pre-trained weights on a different AOI than Azerbaijan (more details in the next blog post) are available on the Query Planet open bucket. While you are there, check out the other amazing datasets/models available;
  • an example Jupyter notebook showing how to run HIECTOR inference in an end-to-end fashion is available on GitHub at sentinel-hub/eo-learn-examples/hiector;
  • Introduction and hands-on webinars describing the amazing work done in Query Planet are also available, featuring HIECTOR, a super-resolution algorithm for Sentinel-2 imagery, and forest mapping at the European level using Sentinel-2 time-series. Don’t miss them!

In an upcoming blog post we will describe how we used transfer learning to generalise the SSRDD models developed for Azerbaijan to Dakar in Senegal, and will present how results compare to existing machine-generated buildings, such as the Open Buildings dataset. Stay tuned!

Thanks for reading. Get in touch with us at eoresearch@sinergise.com for any question or comment about HIECTOR. If you are interested in the topic and are participating in the Living Planet Symposium in May 2022, make sure to get in touch with us, as we will be presenting HIECTOR at the “C1.07 ML4Earth: Machine Learning for Earth Sciences” poster session. We are interested in developing HIECTOR further to efficiently process continental areas, e.g. Africa, also as a tool to assist humanitarian and disaster relief activities. If you are also interested, reach out.

EO Research is a joint account for a team of data scientists from Ljubljana, Slovenia, working with satellite imagery and developing Sentinel Hub applications at Sinergise.