Using Deep Learning to classify facility rooftops and determine solar installation potential ☀️

Margaux Masson-Forsythe
Sep 1, 2021
Rooftops detection — Source: Omdena

Solar energy hasn’t reached its full potential as a clean energy source for the United States yet. According to National Renewable Energy Laboratory (NREL) analysis in 2016, there are over 8 billion square meters of rooftops on which solar panels could be installed in the United States, representing over 1 terawatt of potential solar capacity.

Therefore, significant work remains to be done to accelerate the deployment of solar energy in the US.

Problem Statement

The goal of this two-month challenge was to detect and classify commercial and industrial rooftops in North America in order to assess the solar installation potential of facilities in this region. A country's solar rooftop potential is the number of rooftops suitable for solar power, which depends on the uncluttered surface area, shading, direction, and location.

The challenge partner was a Techstars Energy tech startup called “EnergyTech Startup” that provides a digital map of these facilities' rooftops and energy profiles.

The broader goal of the challenge was to accelerate the growth of solar installations in the United States.

This article was originally published on Omdena’s blog. Learn more about Remote Sensing and Satellite imagery here.

Our Approach

In this article, we will focus on the detection and classification of commercial rooftops using Deep Learning methods. The first step was to gather, preprocess and label the data.

Data preparation 🎨

For this project, we used satellite images of the rooftops at specific GPS coordinates. The GPS coordinates were provided by the partner, and all the images were downloaded from Mapbox. We then manually labeled a large number of the images using CVAT (Computer Vision Annotation Tool), a free, open-source, web-based image annotation tool. It is a powerful and efficient tool that allows multiple collaborators to label images and then review each other's annotations.
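As a sketch of the download step, a static satellite image centred on a facility's GPS coordinate can be fetched from the Mapbox Static Images API. The endpoint format follows Mapbox's documentation; the token, zoom level, and helper names below are illustrative placeholders, not the project's actual values.

```python
# Hedged sketch: fetch one satellite tile per GPS coordinate from the
# Mapbox Static Images API. Token and zoom are placeholders.
import urllib.request


def satellite_image_url(lon, lat, zoom=18, size=(600, 580),
                        token="YOUR_MAPBOX_TOKEN"):
    """Build a Static Images API URL centred on (lon, lat)."""
    w, h = size
    return (
        "https://api.mapbox.com/styles/v1/mapbox/satellite-v9/static/"
        f"{lon},{lat},{zoom}/{w}x{h}?access_token={token}"
    )


def download_rooftop(lon, lat, out_path, **kwargs):
    """Save the satellite image for one facility to disk."""
    url = satellite_image_url(lon, lat, **kwargs)
    with urllib.request.urlopen(url, timeout=30) as resp, \
            open(out_path, "wb") as f:
        f.write(resp.read())
```

The 600×580 default matches the image size used for labeling later in the project.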

Demonstration of labeling done on a rooftop in CVAT — source: Omdena

One of the best features of CVAT, which was essential for this project, is that it supports labeling for object detection, image classification, and image segmentation. Thus, the annotation done on one image can be exported in any of several formats (e.g., bounding boxes, COCO JSON, segmentation masks).

GIF showing CVAT export formats — source: Omdena

Half of the project was actually spent on labeling the images. We chose to label 5 different classes of rooftops:

  • Flat: Rooftops with a single flat surface with/without clutters
  • Complex Rooftop: Rooftops with multiple surfaces at different heights
  • Existing Solar: Rooftops with solar panels
  • Heavy Industrial: Rooftops with pipes, and cluttered with machinery
  • Slope: Rooftops with an inclined surface
Examples of images and their label — Source: Omdena

Once our team was done with the labeling, we had 3,895 Flat, 2,278 Slope, 1,050 Complex, 450 Existing Solar, and 450 Heavy Industrial instances labeled on 600×580 images.

Rooftop Segmentation

We started by segmenting rooftops using semantic segmentation models. Semantic segmentation is the process of attributing a class to each pixel of an image. In our case, we wanted to classify every pixel as rooftop or non-rooftop, i.e., binary semantic segmentation of rooftops.
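To make the per-pixel decision concrete: the segmentation network outputs one rooftop logit per pixel, and applying a sigmoid and a threshold turns that map into the binary rooftop mask. The snippet below is a pure NumPy stand-in for this output stage only, not the U-Net itself.

```python
# Minimal sketch of the binary-segmentation output stage: per-pixel
# logits -> sigmoid probabilities -> thresholded rooftop mask.
import numpy as np


def logits_to_mask(logits, threshold=0.5):
    """Sigmoid over per-pixel logits, then threshold to a binary mask."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (probs >= threshold).astype(np.uint8)


# Toy 2x3 "image": positive logits -> rooftop (1), negative -> background (0)
logits = np.array([[2.0, -1.5, 0.3],
                   [-3.0, 1.2, -0.1]])
mask = logits_to_mask(logits)
# mask == [[1, 0, 1],
#          [0, 1, 0]]
```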

Example of mask used for labeling — source: Omdena

We first tried using the AIRS (Aerial Imagery for Roof Segmentation) dataset for initial training, with the plan to fine-tune the model afterwards, but the results were not convincing. The main issue was that the zoom level differed substantially between the AIRS dataset and our data, and even after cropping the AIRS images we did not obtain good enough results. However, once we had enough of our own data labeled, U-Net was able to segment the rooftops much more accurately:

U-Net binary segmentation of rooftops — source: Omdena

This training could be improved by running longer and with more images, but we decided to switch to other models because we wanted one that could accurately and efficiently detect and classify multiple classes. We also tried U-Net for multi-class semantic segmentation, but the results were poor.

Multi-class semantic segmentation results with U-Net — source: Omdena

Indeed, as the results above show, the network was not able to learn that a rooftop is a single entity: it predicted pixels of the same roof as different classes, which is not what we wanted for this project. For example, in the first two results, the network wrongly classified scattered pixels of the rooftop, and the predictions ended up being a mix of several classes for the same roof.

In the same scope, we also tried the DeepLab model. Here again we wanted to classify every pixel as a specific type of rooftop, so this is not binary but multi-class semantic segmentation, as attempted with U-Net, just with a different model.

The goal here is to classify pixels as belonging to one of the 5 classes presented earlier (flat, slope, heavy industrial, existing solar, complex).

With this model, we were able to start from a version pre-trained on the AIRS dataset and fine-tune it on our data. The overall IoU score we obtained was 0.48, which is not good enough.
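For reference, the IoU (intersection-over-union) score we report measures, per class, the overlap between predicted and ground-truth pixel sets divided by their union. A minimal NumPy version for binary masks:

```python
# IoU for one class: |prediction AND target| / |prediction OR target|.
import numpy as np


def iou_score(pred, target):
    """IoU between two binary masks (1 = pixel belongs to the class)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union if union else 1.0


pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
print(iou_score(pred, target))  # 2 pixels overlap, 4 in union -> 0.5
```

For the multi-class case, the overall score is the mean of the per-class IoUs.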

Deep-Lab predictions — source: Omdena

Here again, we see the same behavior as with the multi-class U-Net predictions.

At this point, we understood that semantic segmentation itself was too pixel-specific and that we needed a model that would converge faster and learn about the different types of rooftops as entities instead of only looking at the pixel level.

We, therefore, decided to focus on Instance Segmentation models instead.

Instance Segmentation

Instance segmentation differs from semantic segmentation in that a semantic segmentation model predicts:

“This pixel is a pixel belonging to a flat rooftop”

when with instance segmentation, the model predicts:

“This is a flat rooftop and here are all the pixels of the roof”.

So the instance segmentation model understands that the roof is an entity by itself.
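The difference also shows up in how the labels are stored. A small illustrative example (toy masks, not our actual data): semantic segmentation keeps one class id per pixel, so two touching flat roofs merge into a single blob, while instance segmentation keeps a separate mask per rooftop object.

```python
# Illustration of the two label representations on a toy 2x5 image.
import numpy as np

# Semantic: a single (H, W) map, 0 = background, 1 = flat roof.
# The two roofs are indistinguishable once encoded this way.
semantic = np.array([[1, 1, 0, 1, 1],
                     [1, 1, 0, 1, 1]])

# Instance: one binary mask (plus a class label) per rooftop object.
instances = [
    {"class": "flat", "mask": np.array([[1, 1, 0, 0, 0],
                                        [1, 1, 0, 0, 0]])},
    {"class": "flat", "mask": np.array([[0, 0, 0, 1, 1],
                                        [0, 0, 0, 1, 1]])},
]

# The semantic map is just the union of the instance masks:
union = np.zeros_like(semantic)
for inst in instances:
    union |= inst["mask"]
# Same pixels are covered, but the per-roof identity is lost.
```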

Here are some examples of the COCO-style labeled images, which show that each rooftop is its own object:

Labeled rooftops images exported into COCO format — source: Omdena

We tried two instance segmentation models: Mask R-CNN and Yolact.

1) Mask R-CNN with Detectron2

We used the Facebook AI Research library called Detectron2. This library provides state-of-the-art detection and segmentation algorithms such as Mask R-CNN. The documentation for this library can be found here.

We modified the original Detectron2 tutorial Google Colab notebook to work with our custom rooftop dataset. The dataset was created from CVAT, where we exported the annotations in COCO format. We initially had a single JSON file with all the annotations, which we split into train and validation JSON files using the cocosplit tool.
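In essence, the split step partitions the COCO images into train/validation sets and keeps each annotation with its image. The function below is a simplified stand-in for the cocosplit tool (fixed seed, in-memory dicts instead of its CLI and file handling):

```python
# Simplified version of a COCO train/validation split: shuffle the
# images, carve off a validation fraction, and route each annotation
# to the subset that contains its image.
import random


def split_coco(coco, val_fraction=0.2, seed=42):
    images = list(coco["images"])
    random.Random(seed).shuffle(images)
    n_val = int(len(images) * val_fraction)
    val_imgs, train_imgs = images[:n_val], images[n_val:]

    def subset(imgs):
        keep_ids = {img["id"] for img in imgs}
        return {
            "images": imgs,
            "annotations": [a for a in coco["annotations"]
                            if a["image_id"] in keep_ids],
            "categories": coco["categories"],
        }

    return subset(train_imgs), subset(val_imgs)
```

Each returned dict can then be dumped with `json.dump` to produce the per-split annotation files Detectron2 expects.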

Once everything was well set up and configured, we fine-tuned a COCO-pre-trained R50-FPN Mask R-CNN model on our rooftops dataset.

Our final metrics were (AP = Average Precision, in %):

  • AP-flat: 34.47
  • AP-slope: 9.34
  • AP-existing_solar: 1.75
  • AP-complex_rooftop: 14.34
  • AP-heavy_industrial: 5.06

So the model is learning the Flat and Complex classes but does not seem to understand Existing Solar, Slope, and Heavy Industrial very well. These classes are under-represented in our dataset, which is likely the reason for the low performance.
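One common mitigation for this kind of imbalance, sketched below rather than something we applied in the project, is to weight each class in the loss by its inverse frequency, so rare classes like Existing Solar and Heavy Industrial contribute more per instance. The counts are our labeled-instance totals from earlier:

```python
# Inverse-frequency class weights from our labeled instance counts.
# Illustrative only: these weights were not used in the project runs.
counts = {
    "flat": 3895,
    "slope": 2278,
    "complex": 1050,
    "existing_solar": 450,
    "heavy_industrial": 450,
}

total = sum(counts.values())  # 8123 labeled instances overall
# weight = total / (num_classes * class_count): >1 for rare classes,
# <1 for dominant ones, so the weighted loss averages back to ~1.
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}
```

Most detection/segmentation frameworks accept such per-class weights in their classification loss; oversampling the rare classes is an alternative with a similar effect.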

Mask R-CNN results — source: Omdena

From the results, we see that the net is doing a fairly good job at detecting and classifying flat rooftops.

Here are some examples of misclassified rooftops (mostly Slope and Existing Solar rooftops classified as Flat):

Examples of the wrong classification with Mask R-CNN — source: Omdena

We also saw some wrong predictions for the class “Existing Solar” (second image below), while the model missed the real solar panels in the first image. However, we can see that the network is learning the Slope class (first and third images below):

Predictions Mask R-CNN — source: Omdena

This imbalance in the results could be reduced by increasing the number of examples for under-represented classes such as Existing Solar. We could also improve the model by tuning the hyperparameters further.

For this project, however, we did not have time for that, so we chose the network that converged to satisfying results the fastest, which turned out to be the Yolact model.

That said, we should note that with further training, a higher learning rate, and more epochs, we were able to get better results with Mask R-CNN at the very end of the project, which shows that this network has some potential:

Mask R-CNN improved results with learning rate = 0.001 and 4000 epochs — source: Omdena

These results were much better than our first ones (for example, the results on the first image are now very good compared to those presented previously), but the network still had trouble generalizing to most of the images in the validation set.

2) Yolact

Yolact is a fully convolutional model for real-time instance segmentation. It also consumes COCO-style object detection JSON annotations. Our final model configuration was yolact_im700_config with 300 prototype masks and a ResNet-101 backbone, trained for a total of 63 epochs with a batch size of 8 (other parameters left at their defaults) on 3,092 images.
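For reference, selecting that configuration follows the invocation pattern from the Yolact repository's README; our exact dataset paths and config edits are not reproduced here.

```shell
# From the dbolya/yolact repo root, after pointing the dataset entries
# in data/config.py at the COCO-style rooftop annotations:
python train.py --config=yolact_im700_config --batch_size=8
```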

We used the deepest Yolact backbone, ResNet-101; other variants could be explored for further improvements, along with more hyperparameter fine-tuning.

The Yolact git repository also provides an improved version called Yolact++ that is supposed to yield better results, but it requires extra installation steps and additional configuration for which we did not have enough time.

Our final training losses on the full dataset for the Yolact ResNet-101 model were:

Box Localization Loss — 0.70
Confidence Loss — 1.6
Mask Loss — 1.1
Semantic Segmentation Loss — 0.40

This model had the best results and is the model we ended up using for our final deliverable. Here are some visualizations of the predictions on unannotated images:

Yolact’s results — source: Omdena

We can see here that we have pretty good results: the rooftops with solar panels were classified as “Existing Solar”, and some “Slope” and “Complex” rooftops were correctly classified as well. These results are satisfying and show that the model predicts more than just the predominant “Flat” class, which is exactly what we were looking for.

Final Deliverable

Our final product is a Streamlit application that uses the Yolact pre-trained weights to detect and classify rooftops on images that the user can either upload directly to the app or access through GPS coordinates.

Visual of the Streamlit application in the class prediction section of the application — source: Omdena

You can find a demo of the final Streamlit dashboard on Youtube.


This two-month project was a lot of fun, and we learned so much. Using satellite images is always a challenge, but I think the most challenging part of the whole project was the labeling effort done by the team, which was incredible!

Big thank you to all collaborators who made this project a success: Alisson Damasceno, Amal Mathew, Amardeep Singh, Ampatishan Sivalingam, Ansh Motiani, Ayushi, Chebrolu Harika, Dhruvan Choubisa, Hadi Babaei, Javier Smith, Kayalvizhi Selvaraj, Maitreyi Nair, Mihir Godbole, Nishrin Kachwala, Ozan Ahmet Çetin, Parth Dandavate, Praful Mohanan, Qasim Hassan, S.Koushik, Sanjay Parajuli, Sara Faten Diaz, Sarang Nikhare, Sudarshan Paul, Syeda Iffat Naz, Tawanda Mutasa.