How we approached a foundational model and ended up with a detector — a real use case

GeoAlert · Published in Geoalert platform · Dec 12, 2024

Many problems in computer vision have been solved through a target-specific approach to sorting the wheat from the chaff, that is, the target class pixels from everything else. With the rise of foundational models, every pixel is classifiable and can be assigned to the target class just by prompting the model.
While doing our bit of research and solving analytical image classification tasks, we'd like to start sharing some use cases.

Business task definition

Get the addresses of households with open swimming pools for a company that provides construction and equipment services.

Preliminary study
In the preliminary research, we studied the area using aerial imagery. Based on visual interpretation, we defined a few options for developing the model:

  1. Develop a model that is capable of detecting and mapping out the precise contours of the swimming pools — a segmentator.
  2. Develop a model that identifies two classes: swimming pools + land plots.
    This is the most complex option, but if implemented, it would allow us to visually confirm and link the target (swimming pool) and household ownership by spatial intersection.
  3. Don’t pay heed to both approaches; just develop a detector that will outproduce the target object coordinates, which can be geocoded into street addresses.
Study area labeling — segmentator vs detector

Guess what we intended to choose? The simple detector option would be the fastest and easiest way to implement a solution. But…
We were intrigued by the foundational approach and wanted to give it a try.

Experimenting with text prompting + segmentation led to poor results

We tried Segment Anything (SAM), which we had already hosted in Mapflow after getting hooked by its shining star among image recognition models. Since then, it has been collecting the lowest user feedback ratings compared to the single-purpose models we developed ourselves. But maybe something was wrong with our application of SAM, and there were best practices we should try?
Right when we started this research, the SAMGeo project demonstrated amazing results by combining Grounding DINO (an object detection model with a text encoder) and Segment Anything into what could be called a foundational segmentation pipeline.

Despite the impressive sample results obtained on a small image (600×600 px), the zero-shot prompting concept completely failed in our study area. As we changed the size of the input image, recall remained too low, as you can see in the image below (based on the SAMGeo implementation).

text_prompt = "swimming pool"
# `sam` here is assumed to be SAMGeo's LangSAM wrapper (Grounding DINO + SAM)
sam.predict(image, text_prompt, box_threshold=0.24, text_threshold=0.24)
Not so good: a lot of missing objects while trying zero-shot prompting with different image sizes and thresholds 🤷‍♂️

Of course, we started implementing text prompting in the Mapflow API by extending the inference params with the text prompt and thresholds:

    "params": {
"url": "s3://users-data/2fce72a6-d5e5-469b-81c7-0fd2a0974297.tif",
"inference.text_prompt": "palm tree",
"inference.box_threshold": "0.15",
}jsonj
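For context, such a request can be submitted along these lines; this is a simplified sketch, and the endpoint path and authentication below are illustrative placeholders rather than the exact API call, so check the Mapflow API documentation for specifics.

import requests

payload = {
    "params": {
        "url": "s3://users-data/2fce72a6-d5e5-469b-81c7-0fd2a0974297.tif",
        "inference.text_prompt": "palm tree",
        "inference.box_threshold": "0.15",
    }
}

response = requests.post(
    "https://api.mapflow.ai/rest/processings",  # assumed endpoint path
    json=payload,
    auth=("MAPFLOW_LOGIN", "MAPFLOW_TOKEN"),    # placeholder credentials
)
response.raise_for_status()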

But the output was neither stable nor accurate enough:

Model’s output on aerial imagery (10cm/px) 👎🥲

Finally, we paused (not sure for how long) our experiments with the foundation model for swimming pool identification and decided to train a lightweight detector instead.

Using Mapflow for pseudo-labeling

As we usually do at Geoalert, we set up a new instance in our Mapping database and used QGIS to manually label a couple of hundred swimming pools with bounding boxes. As soon as we had trained the base model, we used Mapflow to generate results for a larger area. This drastically sped up the labeling process, improving the model accuracy and its transfer to the validation phase. It would take about three hours for a cartographer to interpret and annotate a study area manually; assisted by pseudo-labeling with the Mapflow QGIS plugin, it took about 30 minutes. Even more importantly, it gave us proof that we would complete the whole commercial project on time.
To facilitate validation, we applied a regular grid to the extended area and assigned the cells to cartographers to verify the detections, delete false positives, and add missing objects.

Project study areas and the validation grid (based on Kontur's population dataset)
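For those curious, a regular validation grid like that can be generated along these lines; this is a minimal sketch using geopandas and shapely, with a placeholder cell size and CRS rather than the project's actual settings.

import geopandas as gpd
from shapely.geometry import box

def make_grid(bounds, cell_size):
    # bounds = (minx, miny, maxx, maxy) in a metric CRS; cell_size in the same units
    minx, miny, maxx, maxy = bounds
    cells = []
    y = miny
    while y < maxy:
        x = minx
        while x < maxx:
            cells.append(box(x, y, x + cell_size, y + cell_size))
            x += cell_size
        y += cell_size
    return gpd.GeoDataFrame(geometry=cells, crs="EPSG:3857")

# e.g. 1 km cells over the study-area extent, then assign cells to cartographers
# grid = make_grid(study_area.total_bounds, cell_size=1000)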

Modifying the Neural Network

A decision was made to use a model that was as lightweight as possible. Also, since we don't need exact bounding boxes or segmentation masks, only coordinates, we could avoid complex object/instance decoders in favor of a simple fully convolutional network that detects swimming pool instances as a low-resolution mask.

Swimming pools are easily visually separable at a scale of ~0.3 m/px (zoom 19 in web mapping terminology), at which they have a distinctive size of about 15–30 pixels. Hence, the model should downsample the image by a factor of 16 to make a single swimming pool instance about one pixel large while keeping neighboring instances distinguishable.

The EfficientNet-B1 backbone was a good choice for marker extraction because of its high ratio of performance to number of parameters. In its deepest layers, it extracts feature maps downsampled by a factor of 32, while we need them downsampled by 16. The most straightforward way to avoid the excess downsampling while preserving the receptive field of neurons in deep layers is to replace the stride-2 convolution in the sixth stage with a dilated convolution:

Modifying the CNN architecture for our task
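For illustration, here is a minimal PyTorch sketch of that stride-to-dilation swap on a standalone convolution; the channel sizes are made up for the example, and this is not the exact training code.

import torch
import torch.nn as nn

def stride_to_dilation(conv: nn.Conv2d, dilation: int = 2) -> nn.Conv2d:
    # Copy a stride-2 conv into a stride-1, dilated conv with the same weights,
    # keeping roughly the same receptive field but without the extra downsampling.
    kh, kw = conv.kernel_size
    new_conv = nn.Conv2d(
        conv.in_channels, conv.out_channels, (kh, kw),
        stride=1, dilation=dilation,
        padding=(dilation * (kh // 2), dilation * (kw // 2)),
        bias=conv.bias is not None,
    )
    new_conv.weight.data.copy_(conv.weight.data)
    if conv.bias is not None:
        new_conv.bias.data.copy_(conv.bias.data)
    return new_conv

# A dummy stride-2 conv standing in for the backbone's sixth-stage convolution
conv = nn.Conv2d(192, 320, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 192, 32, 32)
print(conv(x).shape)                      # torch.Size([1, 320, 16, 16])
print(stride_to_dilation(conv)(x).shape)  # torch.Size([1, 320, 32, 32])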

Next, each feature map is projected into an embedding dimension (256) by a depthwise convolution, which allows us to sum them and, finally, convolve them into a binary mask, where the value of each pixel corresponds to the probability of a swimming pool in an area of approximately 5×5 meters.
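As a rough illustration of that head, here is a hedged PyTorch sketch; for simplicity it uses a plain 1×1 projection instead of the depthwise convolution mentioned above, and the module and parameter names are illustrative rather than the production code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MarkerHead(nn.Module):
    # Project each backbone feature map to a 256-dim embedding, resize to the
    # 1/16-resolution grid, sum, and convolve into a single-channel probability map.
    def __init__(self, in_channels, embed_dim=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels])
        self.classifier = nn.Conv2d(embed_dim, 1, kernel_size=3, padding=1)

    def forward(self, features):
        target_size = features[-1].shape[-2:]  # the deepest (1/16-resolution) map size
        fused = sum(
            F.interpolate(p(f), size=target_size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, features)
        )
        return torch.sigmoid(self.classifier(fused))  # per-pixel pool probability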

Detection of swimming pools. Modified model’s output — markers. Note that the output spatial size is 16 times smaller than the input size

The geocoding and data delivery

We chose the Google Maps API to resolve the swimming pool coordinates into addresses. We also tried Nominatim as a free alternative to Google Maps, but it turned out to be worse at finding the nearest address. As you can guess, an open swimming pool is mapped close to the house address and needs to be linked to it by an algorithm that is a bit trickier than minimum distance. The open cadastral data available for the U.S. (is it?) might play an important role in this kind of search. Frankly, we don't feel confident enough about the proper algorithm, but the fact that Google Maps made less than a 1% error on our test selection of addresses convinced us. Fortunately, we didn't have to develop parcel segmentation to minimize geocoding errors; let's postpone that for another challenge. What about other countries with swimming pools but limited open cadastral data? How accurate would the Google API be there? We haven't checked yet. 🤔
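As an illustration of that step, here is a hedged sketch of reverse geocoding with the official googlemaps Python client; the API key, the coordinates, and the take-the-first-result logic are placeholders rather than our production pipeline.

import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

def nearest_address(lat, lon):
    results = gmaps.reverse_geocode((lat, lon))
    # Take the first (closest-match) result; a production pipeline would also
    # check result types and the distance to the parcel, as discussed above.
    return results[0]["formatted_address"] if results else None

# nearest_address(33.7489, -84.3902)  # example coordinates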

Further improvements

  • Even though we got almost 100% precision on the test images, we realized that the model wouldn't keep the same metrics as the project scaled up. The model can misinterpret typical objects like blue roofs, blue vehicles, and even shadows cast by buildings. That's why we paid close attention to manual validation of the results and went through six cycles of model training before we became confident in its stability.
Example of false positive objects detected by the early version of the swimming pool detector
  • We applied a post-processing technique called non-maximum suppression to eliminate duplicate detections and keep the most confident ones (see the sketch after this list).
  • Object-specific cases built on a foundational model approach require significant computational resources and fine-tuning techniques, which is why we chose a straightforward, lightweight detector. However, even though it wasn't required in this particular project, we look forward to implementing more cases related to the classification of households and land plots, which can be powered by a foundational model.
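Below is a minimal sketch of such point-based non-maximum suppression; the function and parameter names are ours for illustration, and the actual implementation may differ.

import numpy as np

def nms_points(points, scores, min_dist):
    # points: (N, 2) projected coordinates; scores: (N,) confidences.
    # Greedily keep the most confident detection and drop any neighbor
    # closer than min_dist (in the same units as the coordinates).
    order = np.argsort(-scores)
    kept = []
    for i in order:
        if all(np.linalg.norm(points[i] - points[j]) >= min_dist for j in kept):
            kept.append(i)
    return kept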

