Swimming pool detection and classification using deep learning
— By Divyansh Jha and Rohit Singh
Object detection is one of the most important tasks in the field of Computer Vision. Locating a specific object in an image is a trivial task for humans, but can be quite challenging for machines. The field has recently witnessed groundbreaking research with state-of-the-art results, but taking this research out of the lab and solving real-world problems is still a challenge. Integrating the latest research in AI with ArcGIS, the industry-leading GIS, opens up a world of opportunities, ranging from feature identification and land cover classification to creating maps straight out of imagery.
At the plenary session of the Esri User Conference this year, we showcased one such integration to demonstrate the detection of swimming pools from aerial imagery. We went a step further and were even able to identify which pools are in a state of neglect and might need inspection by health inspectors to prevent the spread of vector-borne diseases.
The Problem
Tax assessors at local government agencies often have to rely on planimetric mapping services to create tax assessment rolls. Such surveys are expensive and infrequent, leading to inaccuracies in tax assessments. Swimming pools are typically added to assessment records because they impact the value of a property: home improvements such as adding a pool increase the property's value and thus its property taxes. Finding pools that are not on the assessment roll is therefore valuable to the assessor; it adds incremental value to the property and ultimately means additional revenue for the community.
Doing this through GIS and AI can greatly reduce the expensive human labor involved in updating the records through field visits to each property.
Additionally, the downturn and slow recovery of the residential real estate market has left many homes across the country with neglected pools that can be breeding grounds for mosquitoes. The sheer volume of affected properties in warmer climates has made detecting these risky properties challenging for many organizations. Public health and mosquito control agencies are responsible for providing the highest level of protection from vectors and vector-borne diseases. The problem with mosquitoes isn't just their annoying nature and itchy bites: the spread of mosquito-borne viruses such as West Nile and chikungunya is of grave concern to many agencies. To help with remediation efforts, these agencies need a simple solution that helps them locate neglected pools (green pools) from imagery and then use this intelligence to drive field activity and mitigation efforts. Such a solution also ties in nicely with existing mosquito control solutions.
This blog is a story of the successive failures and triumphs of this project. It reveals our approach, what worked and what didn't, and how integration with ArcGIS made the whole process much easier. Our notebooks and script files are published here on GitHub.
Creating training data
Deep learning models require a large number of training examples to produce good results. There is a golden rule: 'the more the data, the better the results.' Keeping this in mind, we started searching for data on the Internet, but after only a few hours we learned that there was no openly available labeled dataset for swimming pool detection from satellite imagery. We then labeled around 2,000 swimming pools in cities in Southern California. Surprisingly, it didn't take that long.
We chose ArcGIS Pro to label sample swimming pool locations. It provides access to a host of aerial, satellite and drone imagery from Esri and its partners; we used the Esri World Imagery basemap for labeling. The software includes an easy-to-use interface for labeling data as well as advanced GIS functionality, including tools for reviewing data to manage its quality. It also includes geoprocessing tools to create buffers and bounding boxes around labeled pool locations, and the 'Export Training Data for Deep Learning' tool for creating the labeled image chips needed to train a deep learning model.
Another option for creating image chips is the ArcGIS API for Python, which has methods for exporting images from imagery layers (e.g. NAIP imagery layers) as well as tile layers (such as the Esri World Imagery layer). We first created a shapefile containing the labeled pool locations using ArcGIS Pro, then chipped out 224x224 images from the aerial/satellite imagery using the locations in the shapefile. The GitHub repository includes a notebook demonstrating this approach.
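For illustration, a minimal sketch of this chipping step with the ArcGIS API for Python might look like the following. The imagery service URL and pool coordinates are placeholders, and the export parameters are assumptions rather than our exact code:

```python
# A hedged sketch: export 224x224 chips around labeled pool locations.
from arcgis.gis import GIS
from arcgis.raster import ImageryLayer

gis = GIS()  # anonymous connection to ArcGIS Online
naip = ImageryLayer('https://<naip-imagery-service>/ImageServer', gis=gis)

# Projected (x, y) pool coordinates read from the shapefile (placeholder values)
pool_locations = [(470712.0, 3768024.0), (471102.0, 3767500.0)]

half = 112  # meters; at NAIP's ~1 m/pixel, a 224 m extent yields a 224 px chip
for i, (x, y) in enumerate(pool_locations):
    bbox = f'{x - half},{y - half},{x + half},{y + half}'
    naip.export_image(bbox=bbox, size=[224, 224], f='image',
                      save_folder='chips', save_file=f'pool_{i}.png')
```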
Training deep learning models
For training the object detector, we used a Single Shot MultiBox Detector (SSD) inspired architecture with Focal Loss, as explained in the fast.ai course (lesson 9). The fast.ai library (by Jeremy Howard) is built on top of PyTorch and helps create state-of-the-art models quickly and train them efficiently using some slick optimizations. It provides a high-level API for creating and training deep learning models, while also allowing fine-grained control to customize everything.
We used ResNet-34 as the base model and added a Single Shot MultiBox Detector head on top of it using PyTorch. ResNet-34 is an image classification model that was trained on over 1 million images from the ImageNet visual recognition challenge. For visual intuition, the SSD head architecture is shown below.
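As a rough sketch (a single-grid simplification, not our exact architecture; a full SSD predicts at several grid resolutions), attaching an SSD-style head to ResNet-34 in PyTorch looks something like this:

```python
# Simplified single-grid SSD-style head on a ResNet-34 backbone.
import torch
import torch.nn as nn
from torchvision import models

class SSDHead(nn.Module):
    def __init__(self, k=9, num_classes=2):  # k anchor boxes per grid cell
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(512, 256, 3, stride=2, padding=1),  # 7x7 -> 4x4 grid
            nn.ReLU(inplace=True))
        self.box_out = nn.Conv2d(256, k * 4, 3, padding=1)            # box offsets
        self.cls_out = nn.Conv2d(256, k * num_classes, 3, padding=1)  # class scores

    def forward(self, x):
        x = self.conv(x)
        return self.box_out(x), self.cls_out(x)

class PoolDetector(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet34(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop pool/fc
        self.head = SSDHead()

    def forward(self, x):
        return self.head(self.backbone(x))

boxes, classes = PoolDetector()(torch.randn(1, 3, 224, 224))
```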
For training, we used the Adam optimizer with a one-cycle learning rate schedule. We also employed discriminative learning rates while fine-tuning the model. All of these techniques are provided by the fast.ai library.
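In plain PyTorch terms (the fast.ai API wraps all of this up for you), the setup is roughly the following sketch; the data loader and loss function are assumed to exist elsewhere:

```python
# Adam with discriminative learning rates (lower for the pretrained backbone,
# higher for the new head) and a one-cycle schedule.
# `train_loader` and `ssd_focal_loss` are placeholders, not real library calls.
import torch

model = PoolDetector()  # from the sketch above
optimizer = torch.optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-4},  # pretrained layers
    {'params': model.head.parameters(), 'lr': 1e-3},      # freshly initialized head
])
epochs = 10
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=[1e-4, 1e-3], total_steps=epochs * len(train_loader))

for _ in range(epochs):
    for images, targets in train_loader:
        loss = ssd_focal_loss(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # one-cycle: LR rises, then anneals, every step
```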
Imagery
An important consideration when training deep learning models is the choice of imagery. Using the most current and spatially accurate satellite imagery is important, and the resolution at which to perform training and inference, as well as which bands to use, can be critical.
The ArcGIS platform provides access to a large collection of aerial, satellite and drone imagery through the Living Atlas. This includes imagery from Landsat, Sentinel and NAIP (National Agriculture Imagery Program) programs. High resolution imagery is also available through partners and includes 7cm imagery from Nearmap and Vexcel.
NAIP imagery is acquired during the agricultural growing seasons throughout the continental US. It has 1-meter resolution, is collected every two years for a given area, and is free for counties to use for detecting pools. Additionally, several counties in the US collect their own orthoimagery (aerial imagery that has been geometrically corrected) every one or two years, which could be used as well. We chose NAIP imagery for detecting pools because it is free and available throughout the US.
Discriminating between clean and green pools, on the other hand, requires more recent imagery at a higher resolution to derive actionable insight. Nearmap and Vexcel imagery is collected much more frequently and at much higher resolution. We chose Nearmap imagery for classifying pools as clean or green, as it is included in the Esri World Imagery basemap for the Redlands area.
Which Bands to Use?
Satellite imagery often includes bands beyond the visible spectrum, so it might seem obvious to use all available bands for training the model. However, there are real advantages to using just three bands, and that approach worked quite well for us in practice.
First, the RGB bands are always available no matter which satellite / sensor is used. In theory, we could train a model on imagery from one satellite / sensor and deploy it on another. This strategy could also be used for data and test time augmentation and further improve model performance.
Second, and perhaps more importantly, we could use transfer learning. Even though satellite images are quite different from photographs of everyday objects, they do tend to have similar features such as edges, shadows, curves, textures and so on. These are the lower level features that convolutional neural networks (CNNs) first learn to recognize. A pre-trained neural network that has been trained on over 1 million images from the ImageNet corpus already knows how to extract such features and fine-tuning it is superior to training a new network from scratch using just a small number of satellite images.
We initially fine-tuned the model using the RGB bands of NAIP imagery, but the results were not good: the object detector predicted an object at the center of every image and also missed many pools. After inspecting the training data, we recognized a problem in the way we were chipping out the images with Python: because each chip was centered on a location from the shapefile, every pool ended up at the exact center of its image. This caused the model to overfit to pools at the center position. We changed this strategy and introduced asymmetry while chipping out the images, and to avoid missing pools at the edges, we also included the truncated bounding boxes that we had been ignoring before. After these additional data processing steps, we rechecked our results but were again unimpressed. The validation loss was around 40.88, which was very high compared to what we achieved with later models.
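The asymmetric chipping fix is simple to express in code. A sketch, with the jitter range as an assumption:

```python
# Shift each chip's extent by a random offset so the pool no longer always
# sits at the chip's center. Units are meters (projected coordinates).
import random

def jittered_extent(x, y, chip=224, max_jitter=80):
    """Chip extent (xmin, ymin, xmax, ymax) that contains the pool at (x, y),
    but with the pool at a random position rather than dead center."""
    cx = x + random.uniform(-max_jitter, max_jitter)
    cy = y + random.uniform(-max_jitter, max_jitter)
    half = chip / 2
    return (cx - half, cy - half, cx + half, cy + half)
```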
Next, we used NDVI (colorized) NAIP imagery from the Living Atlas. The normalized difference vegetation index (NDVI) is computed from the red and near-infrared bands and is often used to gauge the health of vegetation. Upon visual inspection of this layer, the swimming pools stand out in a bright red color, and we assumed the network would be able to take advantage of this. However, the validation loss was around 30 and the results weren't as good as expected. In retrospect, this makes sense: with NDVI we lose the information from one band altogether, and the neural network has much less information to work with.
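For reference, the index itself is a one-liner over the two bands:

```python
# NDVI from near-infrared and red reflectance arrays; values fall in [-1, 1].
import numpy as np

def ndvi(nir, red):
    return (nir - red) / (nir + red + 1e-8)  # small epsilon guards against 0/0
```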
The next logical step was to keep three bands but try a band combination other than RGB. The USA NAIP Imagery: Color Infrared layer uses the near-infrared, red and green bands, which makes pools stand out: water strongly absorbs near-infrared light, so pools appear blue in this composite. An example of the NAIP Color Infrared imagery (false color composite) is below.
We can easily locate the blue patches where the swimming pools are. We chipped out these images from the NAIP infrared imagery, trained our model, and finally saw improved results. We were able to get decent results with around 2,000 NAIP infrared images, but the model still failed to detect all pools. At this point, we recalled the golden rule. We did heavy data augmentation, taking 50 random jitters around each pool location, which turned those 2,000 images into roughly 100,000. Upon training the complete model again, the validation loss went down to around 18. We tried training further, but the model started to overfit after that. Let's look at some of the results after training completely on NAIP infrared imagery.
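With a jitter helper like the one sketched earlier, that augmentation step is a short loop:

```python
# Reusing jittered_extent and pool_locations from the earlier sketches:
# 2,000 labeled pools x 50 random jitters each ~ 100,000 training chips.
extents = [jittered_extent(x, y)
           for x, y in pool_locations
           for _ in range(50)]
```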
Inferencing
Once the model was fully trained and giving good results, we wanted to test it on a larger area than the small image chips used for training and validation. We created a script to export a larger area of NAIP imagery and find all pools within it. This was done by splitting the larger image into chips of the smaller size the model requires. All of these chips were passed as a single batch to the model, and the predictions were gathered, combined and visualized. Below is the result of that visualization.
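A sketch of that tiling-and-batching step, assuming the large exported image and the trained model are already in hand:

```python
# Split a large exported image into 224x224 tiles and run one batched
# forward pass. `large_image` is an HxWx3 numpy array; `model` is the
# trained detector from earlier.
import numpy as np
import torch

def chip_image(image, chip=224):
    h, w = image.shape[:2]
    return [image[r:r + chip, c:c + chip]
            for r in range(0, h - chip + 1, chip)
            for c in range(0, w - chip + 1, chip)]

chips = chip_image(large_image)
batch = torch.from_numpy(np.stack(chips)).permute(0, 3, 1, 2).float()
with torch.no_grad():
    boxes, classes = model(batch)  # predictions for every chip at once
```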
The whole point of this project was to operate at scale, so we decided to run our model on an entire city using the capabilities of the ArcGIS API for Python. We took the extent of the City of Redlands and exported NAIP images for that area, then used the simple pipeline described above to collect predictions on all chips within each exported image. The predictions were converted into a feature layer by transforming them from image to geographic coordinates. A feature layer is a grouping of similar geographic features, such as pools, which can later be visualized on a basemap using the ArcGIS platform.
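The image-to-map transformation is a simple affine mapping. A sketch for a single detection, given the chip's extent in projected map units:

```python
# Convert a pixel location inside a chip to map coordinates, given the
# chip's extent (xmin, ymin, xmax, ymax) in a projected CRS.
def pixel_to_map(px, py, extent, chip=224):
    xmin, ymin, xmax, ymax = extent
    mx = xmin + (px / chip) * (xmax - xmin)
    my = ymax - (py / chip) * (ymax - ymin)  # row 0 is the chip's top edge
    return mx, my
```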
Test Time Augmentation
On observing the visualizations carefully, we found that a number of pools were still missing, and we noticed a strange trend: the missing pools tended to lie in a line, either horizontal or vertical. After some more analysis, we found that the missing pools were at the edges of the chips. To overcome this, we performed test time augmentation. The idea of test time augmentation is that if we show our model a few slightly altered versions of the same image, it should do better overall than if it saw just a single version. Our main strategy was to move the pools on the edges into the center and detect them there. First, we reduced the stride when chipping out the imagery so that no pool was left at the edge of a chip.
Second, we ran predictions twice: first on the actual chip, and then on a center crop of the original chip. We selected the center crop such that pools positioned at the edges of the smaller chips now appeared at the center. This simple strategy allowed the missing pools to be detected. Extending this approach to five different center crops enabled us to increase the recall (the fraction of actual pools that were correctly detected) without negatively affecting the precision (the fraction of detections that were correct).
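One way to realize the center-crop pass in code is to tile the image a second time with the grid offset by half a chip, so that former edges become centers. A sketch, reusing chip_image and large_image from the earlier snippet:

```python
# Second TTA pass: tile the image again with the grid shifted by half a chip,
# so pools that sat on chip edges in the first pass now fall near centers.
def shifted_chips(image, chip=224, offset=112):
    h, w = image.shape[:2]
    return [image[r:r + chip, c:c + chip]
            for r in range(offset, h - chip + 1, chip)
            for c in range(offset, w - chip + 1, chip)]

all_chips = chip_image(large_image) + shifted_chips(large_image)
```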
Non-Max Suppression on Maps
The above augmentations at test time created multiple predictions for the same swimming pool when visualized on the map. We wrote a non-maximum suppression function that, given a cluster of detections within a specified distance, selects the pool with the highest score and suppresses all other pools within a range of k meters of it. We tuned this hyperparameter k and got the best results with k = 15 m. Now we could apply test time augmentation as many times as needed, and this algorithm would retain just the best detection. The results of this step are visualized below.
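A sketch of this distance-based suppression, with detections as (x, y, score) tuples in a projected, meter-based coordinate system:

```python
# Greedy non-max suppression by distance: keep the highest-scoring detection,
# discard any other detection within k meters of one already kept.
def nms_by_distance(detections, k=15.0):
    kept = []
    for x, y, score in sorted(detections, key=lambda d: -d[2]):
        if all((x - kx) ** 2 + (y - ky) ** 2 > k ** 2 for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept
```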
Using GIS to suppress false positives
Deep learning is great at what it does, but it can still make silly mistakes at times: for instance, we occasionally got false positives for pools on freeways as well as in the hills! Many of the false positives had low confidence and got filtered out, but some were high-confidence false positives, perhaps as a result of overfitting. There were false positives specifically over large water bodies, which actually contained water and appeared blue in the NAIP imagery. We considered several options for removing these false positives, such as training the network to detect a second class of water bodies, but that seemed to be overkill for this problem.
Since we are looking for pools in residential parcels, an easy way to discard the false positives is to simply overlay the detected pools with a layer of residential parcels and throw away the pools that don't intersect the parcels layer. ArcGIS Online provides analysis tools to do just that. This strategy gave even better results, shown below.
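We did this with the ArcGIS Online analysis tools; a local equivalent with shapely (parcels as Polygon geometries, detections as the (x, y, score) tuples from the NMS sketch) would be roughly:

```python
# Keep only detections that fall inside a residential parcel polygon.
from shapely.geometry import Point

residential_pools = [d for d in detections
                     if any(parcel.contains(Point(d[0], d[1]))
                            for parcel in parcels)]
```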
Identifying parcels with unassessed pools
Now that we had a good pool detector, we wanted to find the parcels containing swimming pools that were not being assessed correctly. The Join Features tool in ArcGIS Online came in handy, and we were able to create information products such as feature layers of the unassessed pools, as well as web maps for visualizing the results. Surprisingly, we identified approximately 600 pools that were not marked correctly in the database.
A web map containing the results of this analysis is at http://arcg.is/0r0HKP, and a couple of results are shown below:
Clean or Green?
Once we got good results for detecting pools, we went a step further and classified them as clean or green (i.e. neglected pools, sometimes also referred to as 'zombie pools'). Green pools often contain algae and can be breeding grounds for mosquitoes and other insects. Mosquito control agencies need a simple solution that helps them locate such pools and drive field activity and mitigation efforts.
Below are examples of some clean and green pools.
We exported about half of the detected pools from recent high-resolution Nearmap imagery (7 cm resolution) and manually labeled them as clean or green. The ratio of clean to green pools was heavily skewed: for every green pool, we had about 100 clean pools, which made training on this highly unbalanced dataset difficult. We augmented the data by jittering around the detected green pools, creating 100 images at different positions around each one, as shown below.
The above technique balanced our dataset. We then fine-tuned a ResNet-34 classifier on it and achieved an excellent F1 score of 97.6. Including this classifier in our inference pipeline allowed us to get good detections of these so-called zombie pools, as seen in this web map.
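The classifier itself is a standard fine-tuning setup. A minimal PyTorch sketch:

```python
# ResNet-34 with its final layer swapped for two classes (clean vs. green),
# then fine-tuned on the balanced chip dataset.
import torch.nn as nn
from torchvision import models

clf = models.resnet34(pretrained=True)
clf.fc = nn.Linear(clf.fc.in_features, 2)  # new head for clean/green
```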
Distributed inferencing
One of the things that bothered us was the relatively long time inference used to take: approximately 10 minutes on Google Cloud Platform to detect pools across the complete City of Redlands, which could be a problem for a live demo. So we got our hands dirty with distributed computing on GPUs. We wrote an inference script that used Python's subprocess module to spawn a separate process per GPU and run inference on pre-downloaded chips. On a single p2.16xlarge AWS instance, we were able to run inference over the entire city (roughly 100 sq km) within 50 seconds. At this speed, we could detect pools in San Bernardino County, the largest county in the contiguous US, in under an hour.
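A sketch of that fan-out; the worker script name and its sharding flags are hypothetical:

```python
# Pin one worker process to each GPU via CUDA_VISIBLE_DEVICES; each worker
# runs inference on its own shard of pre-downloaded chips.
import os
import subprocess

num_gpus = 16  # a p2.16xlarge exposes 16 GPUs
procs = [subprocess.Popen(
             ['python', 'inference.py',  # hypothetical worker script
              '--shard', str(gpu), '--num-shards', str(num_gpus)],
             env={**os.environ, 'CUDA_VISIBLE_DEVICES': str(gpu)})
         for gpu in range(num_gpus)]
for p in procs:
    p.wait()  # block until every shard finishes
```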
Deployment
A primary goal of this project was to apply the latest research in deep learning and use it to solve real-world problems, be it for updating outdated county records or to galvanize mosquito abatement drives.
The ArcGIS platform includes a host of capabilities, ranging from online mapping and analysis to collaboration and field mobility, that help achieve these goals.
Once we had obtained the locations of the detected swimming pools, it was relatively easy to use the analysis tools in ArcGIS Online to identify parcels that were not being assessed correctly. The Spatial DataFrame in the ArcGIS API for Python provided an easy-to-use, pandorable way to generate reports of such properties, as well as to create information products such as GIS layers that could be visualized in web maps or used for further analysis. The map widget for Jupyter notebooks not only enabled visualization of the detected pools and residential parcels, but also provided renderers and symbology to make the generated maps easy to understand. The maps could then be saved as web maps and shared with collaborators.
Esri has also recently introduced (in beta) an Image Visit configurable app template that lets image analysts visually inspect the results of an object detection workflow and categorize them as correct detections or errors. A live demo of the configured web app is here. This information could then be fed back into training, or used to filter the results and prioritize field activities.
That's where the field mobility capabilities of the ArcGIS platform come into play. Workforce for ArcGIS allows the creation of assignments for mobile workers, such as inspectors at mosquito control agencies, to drive field activity. We used the recently introduced apps module in the ArcGIS API for Python to automate the creation of Workforce assignments for field workers. These assignments make it easy for field workers to stay organized, report progress, and remain productive while conducting mosquito abatement drives based on the results of the neglected pool detection analysis.
We have only scratched the surface of what the integration of deep learning and GIS can do. This is just one application and there are countless others waiting to be powered by these amazing technologies. Stay tuned for more exciting stuff coming out soon!
If you liked the article, please hit that clap button. For any queries, doubts or errors, contact us on Twitter at @divyanshjha and @geonumist.
This project is among some of the innovative deep learning work that is being carried out at the Deep Learning Dev Center in New Delhi, India. Contact rsingh@esri.com if you’re interested in an internship opportunity and would like to bring deep learning to the Science of Where.