How we approached a foundational model and ended up with a detector — a real use case
Many problems in computer vision have been solved through a target-specific approach to sorting the wheat from the chaff, that is, separating the target class pixels from all the rest. With the rise of foundational models, every pixel is classifiable and can be assigned to the target class just by prompting the model.
While doing our own research and solving analytical image classification tasks, we'd like to start sharing some of our use cases.
Business task definition
Get the addresses of households with open-air swimming pools for a company that provides pool construction and equipment services.
Preliminary study
In the preliminary research, we studied the area using aerial imagery. Based on visual interpretation, we defined a few options for developing the model:
- Develop a model that detects and maps out the precise contours of the swimming pools (a segmentator).
- Develop a model that identifies two classes: swimming pools + land plots. This is the most complex option, but if implemented, it would allow us to visually confirm the target (swimming pool) and link it to household ownership by spatial intersection.
- Skip both of the above and just develop a detector that outputs the target object coordinates, which can then be geocoded into street addresses.
Guess what we intended to choose? The simple detector option would be the fastest and easiest way to implement a solution. But…
We were intrigued by the foundational approach and wanted to give it a try.
Experimenting with text prompting + segmentation led to poor results
We tried Segment Anything (SAM), which was already hosted in Mapflow after we got hooked by its shining star among image recognition models. Since then, it has been collecting the lowest user feedback ratings compared to the single-purpose models we developed ourselves. But maybe something was wrong with our application of SAM; maybe there are best practices we should try?
Right when we started this research, the SAMgeo project demonstrated amazing results by combining Grounding DINO (an object detection model with a text encoder) and Segment Anything into what can be called a foundational segmentation pipeline.
Despite the impressive sample results obtained on a small image (600x600 px), the zero-shot prompting concept completely failed in our study area. As we changed the size of the input image, recall dropped too low, as you can see in the image below (based on the SAMgeo implementation).
```python
from samgeo.text_sam import LangSAM  # samgeo's text-prompted Grounding DINO + SAM wrapper
sam = LangSAM()
text_prompt = "swimming pool"
sam.predict(image, text_prompt, box_threshold=0.24, text_threshold=0.24)
```
Of course, we started implementing text prompting in the Mapflow API: we extended the inference params with the text prompt and thresholds:
"params": {
"url": "s3://users-data/2fce72a6-d5e5-469b-81c7-0fd2a0974297.tif",
"inference.text_prompt": "palm tree",
"inference.box_threshold": "0.15",
}jsonj
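For context, a request could be sent like this; the endpoint path and auth below are assumptions for illustration, not the exact Mapflow API contract:

```python
import requests

# Illustrative request shape: the endpoint path and credentials are
# placeholders, not the exact Mapflow API contract
payload = {
    "params": {
        "url": "s3://users-data/2fce72a6-d5e5-469b-81c7-0fd2a0974297.tif",
        "inference.text_prompt": "palm tree",
        "inference.box_threshold": "0.15",
    }
}
resp = requests.post(
    "https://api.mapflow.ai/rest/processings",  # assumed endpoint
    json=payload,
    auth=("login", "api_token"),  # placeholder credentials
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```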
But the output was neither stable nor accurate enough:
Finally, we paused (not sure for how long) our experiments with the foundational model for swimming pool identification and decided to train a lightweight detector instead.
Using Mapflow for pseudo-labeling
As we usually do at Geoalert, we set up a new instance in our Mapping database and leveraged QGIS to manually label a couple of hundred swimming pools with bounding boxes. As soon as we trained the base model, we used Mapflow to generate results for a larger area. This drastically sped up the labeling process, improving the model's accuracy and its transfer to the validation phase. It would take ~3 hours for a cartographer to interpret and annotate a study area by hand; assisted by pseudo-labeling via the Mapflow QGIS plugin, it took ~30 minutes. Even more importantly, we got proof that we would complete the whole commercial project on time.
To facilitate the validation, we applied a regular grid to the extended area and assigned it to cartographers to verify the results, delete false positives, and add missing objects.
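For illustration, such a grid can be generated in a few lines; the cell size and CRS below are assumptions, not our exact setup:

```python
import numpy as np
import geopandas as gpd
from shapely.geometry import box

def make_validation_grid(bounds, cell_size=500.0):
    """Tile a study area's bounds into square cells (metric CRS assumed)."""
    xmin, ymin, xmax, ymax = bounds
    cells = [
        box(x, y, x + cell_size, y + cell_size)
        for x in np.arange(xmin, xmax, cell_size)
        for y in np.arange(ymin, ymax, cell_size)
    ]
    return gpd.GeoDataFrame(geometry=cells, crs="EPSG:3857")
```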
Modifying the Neural Network
We decided to use a model as lightweight as possible. Since we don't need exact bounding boxes or segmentation masks, only coordinates, we could avoid complex object/instance decoders in favor of a simple, fully convolutional network that detects swimming pool instances as a low-resolution mask.
Swimming pools are easily visually separable at ~0.3 m/px resolution (zoom 19 in web mapping terminology), at which they are typically about 15–30 pixels across. Hence, the model should downsample the image by a factor of 16 to make a single swimming pool instance about one pixel large while keeping neighboring instances distinguishable.
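A quick sanity check on those numbers (all values taken from the text above):

```python
gsd = 0.3            # m/px at zoom 19
pool_m = (4.5, 9.0)  # 15-30 px at 0.3 m/px -> typical pool size in meters
stride = 16          # total downsampling factor of the network

print([m / gsd / stride for m in pool_m])  # [0.9375, 1.875] -> ~1 output pixel per pool
print(stride * gsd)                        # 4.8 -> each output pixel covers ~5x5 m
```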
The EfficientNet-b1 backbone was a good choice for marker extraction because of its high ratio of performance to number of parameters. In its deepest layers, it extracts feature maps downsampled by a factor of 32, while we need a factor of 16. The most straightforward way to avoid the excess downsampling while maintaining a neuron's receptive field in the deep layers is to replace the stride=2 convolution in the sixth stage with a dilated convolution:
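A minimal PyTorch sketch of that substitution (the channel counts and kernel size are illustrative, not the exact EfficientNet-b1 layer configuration):

```python
import torch
import torch.nn as nn

# Original deep-stage conv: stride 2 halves the resolution (total factor 32)
conv_stride2 = nn.Conv2d(112, 192, kernel_size=5, stride=2, padding=2)

# Replacement: stride 1 + dilation 2 keeps the 1/16 resolution while
# preserving the receptive field of the subsequent layers
conv_dilated = nn.Conv2d(112, 192, kernel_size=5, stride=1, padding=4, dilation=2)

x = torch.randn(1, 112, 64, 64)
print(conv_stride2(x).shape)  # torch.Size([1, 192, 32, 32])
print(conv_dilated(x).shape)  # torch.Size([1, 192, 64, 64])
```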
Next, each feature map is projected into the embedding dimension (256) by a depthwise convolution, which allows us to sum them and, finally, convolve them into a binary mask, where the value of each pixel corresponds to the probability of a swimming pool in an area of approximately 5x5 meters.
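Put together, the head could look roughly like this. This is our reading of the description above; the 1x1 projections, channel counts, and resizing are assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarkerHead(nn.Module):
    """Projects multi-scale backbone features to one embedding dim,
    sums them on the 1/16 grid, and predicts a per-pixel pool probability."""

    def __init__(self, in_channels=(40, 112, 192), embed_dim=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, 1) for c in in_channels])
        self.to_mask = nn.Conv2d(embed_dim, 1, 1)

    def forward(self, feats):
        target = feats[-1].shape[-2:]  # coarsest (1/16) grid
        x = sum(
            F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        )
        # each output pixel ~ probability of a pool in a ~5x5 m area
        return torch.sigmoid(self.to_mask(x))
```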
Geocoding and data delivery
We chose the Google Maps API to get addresses from the swimming pool coordinates. We also tried Nominatim as a free alternative, but it turned out to be worse at finding the nearest address. As you can guess, an open swimming pool maps close to its house's address point, but linking the two requires an algorithm a bit trickier than a simple minimum distance. The open cadastral data available for the U.S. (is it?) might play an important role in that kind of search. Honestly, we don't feel confident enough about the proper algorithm, but the fact that Google Maps made less than a 1% error on our test sample of addresses convinced us. Fortunately, we didn't have to develop parcel segmentation to minimize geocoding errors; let's postpone that for another challenge. What about other countries with swimming pools but limited open cadastral data? How accurate would the Google API be there? We haven't checked yet. 🤔
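For reference, here's a minimal reverse-geocoding sketch against the Google Maps Geocoding API; the helper and its parameters are our own illustration, not production code:

```python
import requests

def reverse_geocode(lat, lon, api_key):
    """Return the nearest street address for a detected pool centroid."""
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"latlng": f"{lat},{lon}", "result_type": "street_address", "key": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return results[0]["formatted_address"] if results else None
```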
Further improvements
- Even though we got almost 100% precision on the test images, we realized that the model won't keep the same metrics as the project scales up. The model can misinterpret similar-looking objects such as blue roofs, blue vehicles, and even shadows cast by buildings. That's why we paid close attention to manual validation of the results and iterated through six cycles of model training before we got confident about its stability.
- We applied a post-processing technique called non-maximum suppression (NMS) to eliminate duplicate detections and keep only the most confident one (see the sketch after this list).
- Object-specific cases built on the foundational model approach require significant computational resources and fine-tuning techniques, which is why we chose a straightforward, lightweight detector. However, even though it wasn't required in this particular project, we look forward to implementing more cases related to the classification of households and land plots, which can be powered by the foundational model.
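For reference, a minimal NMS sketch using torchvision.ops (our detector works on a low-resolution mask, where the same idea applies to local maxima; boxes just make the mechanics clearer):

```python
import torch
from torchvision.ops import nms

# Two overlapping detections of the same pool plus one distinct detection
boxes = torch.tensor([[10., 10., 40., 40.],
                      [12., 11., 41., 42.],
                      [100., 100., 130., 130.]])  # xyxy format
scores = torch.tensor([0.90, 0.75, 0.80])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- the lower-score duplicate is suppressed
```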
Useful links:
- Segment Anything Demo by Meta AI
- Open repo with examples and cases of SAM in application to geospatial imagery (sam-geospatial)
- Mapflow AI