Jhansi Anumula
Jun 19 · 5 min read

Aerial and satellite imagery gives us the unique opportunity to look down on the earth from a bird's-eye view. It is used to measure deforestation, map damaged areas after natural disasters, and spot looted archaeological sites, and it has many more current and untapped use cases. Because these images are high in resolution and cover large areas, it is difficult for the human eye to pick out the relevant information. This is where computer vision adds real value by extracting insights automatically.

Pixel-wise image segmentation is a challenging and demanding task in computer vision and image processing. This blog is about the segmentation of buildings from aerial (satellite/drone) images. The availability of high-resolution remote sensing data has opened up the possibility of exciting applications, such as per-pixel classification of individual objects in greater detail. With Convolutional Neural Networks (CNNs), segmentation and classification of images have become far more efficient and accurate.

What is Image Segmentation? What are the types of Image Segmentation available?

Segmentation partitions an image into distinct regions, each containing pixels with similar attributes. To be useful for image analysis and interpretation, the regions should strongly relate to the depicted objects or features of interest. Meaningful segmentation is the first step from low-level image processing (transforming a greyscale or colour image into one or more other images) to high-level image description in terms of features, objects, and scenes. The success of image analysis depends on the reliability of the segmentation, but accurately partitioning an image is generally a very challenging problem.

Two types of segmentation are available for images: semantic segmentation and instance segmentation.

Semantic Segmentation is the process of assigning a label to every pixel in the image. This is in stark contrast to Image Classification, in which a single label is assigned to the entire picture. Semantic Segmentation treats multiple objects of the same class as a single entity. On the other hand, Instance Segmentation treats multiple objects of the same class as distinct individual objects (or instances). Typically, Instance Segmentation is harder than Semantic Segmentation.

Comparison between semantic and instance segmentation. (Source)

Dataset: This project uses The Inria Aerial Image Labeling Dataset (link). The dataset consists of 180 aerial images of urban settlements in Europe and the United States, with every pixel labelled as one of two classes: building or not building. Every image is RGB with a resolution of 5000×5000 pixels, where each pixel corresponds to 30cm×30cm of the Earth's surface. This project was completed as part of fellowship.ai.

Taken from paper

Data pre-processing: The 5000×5000 images are too large to feed to the model directly, so we sliced each image into smaller 256×256 tiles with a few pixels of overlap and trained the model on these tiles. Each original image yields 400 tiles (a 20×20 grid). We also retained the location of each tile within its original image. For preprocessing, we used the PyTorch library.
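The slicing step can be sketched roughly as follows. This is not the project code: it uses NumPy rather than PyTorch, the function names are hypothetical, and spacing the tiles evenly so that the overlap is spread across the grid is an assumption about how the overlap was chosen.

```python
import numpy as np

def tile_offsets(size, tile, n_tiles):
    """Evenly spaced start offsets so that n_tiles tiles of width `tile` cover `size` pixels."""
    return np.linspace(0, size - tile, n_tiles).round().astype(int)

def slice_image(image, tile=256):
    """Cut an (H, W) or (H, W, C) array into overlapping tile x tile patches.

    Returns a list of ((row, col), patch) pairs. The (row, col) offset is kept
    so that predictions can later be placed back at the right location.
    Assumes both image sides are at least `tile` pixels long.
    """
    h, w = image.shape[:2]
    n_rows = -(-h // tile)  # ceiling division: fewest tiles that cover the height
    n_cols = -(-w // tile)  # same for the width
    rows = tile_offsets(h, tile, n_rows)
    cols = tile_offsets(w, tile, n_cols)
    return [((int(r), int(c)), image[r:r + tile, c:c + tile])
            for r in rows for c in cols]

# A 5000x5000 image needs ceil(5000 / 256) = 20 tiles per side, i.e. 400 tiles.
tiles = slice_image(np.zeros((5000, 5000), dtype=np.uint8))
```

With these assumptions, neighbouring tiles overlap by about 6 pixels, and the last tile in each row and column ends exactly at the image border.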

Train and validation split: For each city, the first five images were held out as validation images, as recommended in the paper (link).
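A minimal sketch of this split is shown below. It assumes the image files are named with a region prefix followed by a tile number, such as `austin1.tif`; that naming pattern and the helper name `split_train_val` are assumptions made for illustration, not details taken from the project.

```python
import re

def split_train_val(filenames, n_val=5):
    """Put tiles numbered 1..n_val of every region into validation, the rest into training."""
    train, val = [], []
    for name in filenames:
        # Assumed pattern: lowercase region name (may contain '-'), tile number, '.tif'
        match = re.fullmatch(r"([a-z\-]+)(\d+)\.tif", name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        tile_number = int(match.group(2))
        (val if tile_number <= n_val else train).append(name)
    return train, val

# Illustrative file list: 5 regions with 36 tiles each gives 180 images.
regions = ["austin", "chicago", "kitsap", "tyrol-w", "vienna"]
files = [f"{region}{i}.tif" for region in regions for i in range(1, 37)]
train_files, val_files = split_train_val(files)
```

With five regions this gives 155 training images and 25 validation images.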

Building the model: We built the model with the fastai library (which sits on top of PyTorch). We used a U-Net architecture with a pre-trained resnet18 as the encoder; resnet34 or resnet50 may give better results, but we stuck with resnet18 because of computational constraints. In addition to the default transformations provided by fastai, we set flip_vertical, max_lighting, max_zoom, and max_warp.

A pre-trained model has been trained on a different task from the one at hand, but it provides a useful starting point because the features learned on the old task are also useful for the new task.

If you want to understand the architecture of U-Net, here is a blog that explains it very well. It also covers the basic concepts of image segmentation in detail.

We trained the model for 15 epochs (fit_one_cycle), and each epoch took around 45 minutes over all the images (train: 155 × 400 = 62,000 images; validation: 25 × 400 = 10,000 images).

The model was trained using Google Colab GPU.


Loss function: A combination of binary cross entropy loss and dice loss with IOU as True.

Binary cross-entropy loss: Binary cross-entropy is a loss function used on problems involving yes/no (binary) decisions. For instance, in multi-label problems, where an example can belong to multiple classes at the same time, the model tries to decide for each class whether the example belongs to that class or not.
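For reference, the standard form of binary cross-entropy is written out below. The notation is assumed here rather than taken from the project: N is the number of pixels (or examples), y_i is the true label (0 or 1), and ŷ_i is the predicted probability.

```latex
\mathcal{L}_{\mathrm{BCE}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \left[\, y_i \log \hat{y}_i + (1 - y_i) \log\!\left(1 - \hat{y}_i\right) \right]
```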

Where ŷ is the predicted value.

Binary cross-entropy measures how far the prediction is from the true value (which is either 0 or 1) for each of the classes and then averages these class-wise errors to obtain the final loss. It is widely used as the training loss in image classification models.

Dice loss: This loss measures the overlap between the predicted and the ground-truth segmentation regions. In our case, the region is the part of the image covered by buildings. The Dice coefficient is closely related to the Jaccard index (IoU).

TP: True Positive, FP: False Positive, FN: False Negative
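In terms of these counts, the standard definitions are Dice = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN). The NumPy sketch below computes both from binary masks and then combines binary cross-entropy with a soft Dice loss. It is an illustration only: the function names are hypothetical, and summing the two losses with equal weight is an assumption, since the weighting used in the project is not stated.

```python
import numpy as np

def confusion_counts(pred_mask, true_mask):
    """Count true positives, false positives and false negatives for binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    tp = int(np.sum(pred & true))
    fp = int(np.sum(pred & ~true))
    fn = int(np.sum(~pred & true))
    return tp, fp, fn

def dice_coefficient(tp, fp, fn):
    """Dice = 2TP / (2TP + FP + FN); defined as 1.0 when both masks are empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def iou(tp, fp, fn):
    """IoU (Jaccard index) = TP / (TP + FP + FN); defined as 1.0 when both masks are empty."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0

def bce_dice_loss(probs, target, eps=1e-7):
    """Binary cross-entropy plus soft Dice loss (equal weighting is an assumption)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)  # avoid log(0)
    target = np.asarray(target, dtype=float)
    bce = -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))
    intersection = np.sum(probs * target)
    soft_dice = (2 * intersection + eps) / (np.sum(probs) + np.sum(target) + eps)
    return bce + (1 - soft_dice)
```

The two overlap measures always rank predictions the same way, because Dice = 2·IoU / (1 + IoU).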

Intersection over Union is simply an evaluation metric. Any algorithm that provides predicted bounding boxes as output can be evaluated using IoU.

More formally, to apply Intersection over Union to evaluate an (arbitrary) object detector, we need the following (the same applies to segmentation, with masks in place of bounding boxes):

  1. The ground-truth bounding boxes (i.e., the hand-labelled bounding boxes from the testing set that specify where in the image our object is).
  2. The predicted bounding boxes from our model.

As long as we have these two sets of bounding boxes, we can apply Intersection over Union.
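For the bounding-box case described above, IoU can be sketched as follows. The `(x_min, y_min, x_max, y_max)` box format and the function name `box_iou` are assumptions made for this example.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

ground_truth = (0, 0, 2, 2)
predicted = (1, 1, 3, 3)
score = box_iou(ground_truth, predicted)  # overlap area 1, union area 7
```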

Read more about IOU here.

Test images: We followed the same preprocessing steps as for the training images, so our predictions are masks for the sliced test images (256×256). After predicting the mask for each sliced test image, we stitched the predicted masks back into an image of the original size (5K×5K). The location information that we retained during preprocessing is what makes this possible.
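The stitching step can be sketched as below, assuming each predicted tile comes with the (row, col) offset it was cut from. How the overlapping pixels were resolved in the project is not stated; averaging the overlapping predictions is an assumption of this sketch, and `stitch_masks` is a hypothetical name.

```python
import numpy as np

def stitch_masks(tiles, out_shape):
    """Rebuild a full-size mask from ((row, col), patch) pairs.

    Pixels covered by more than one tile get the average of the overlapping
    predictions; pixels covered by no tile stay at 0.
    """
    total = np.zeros(out_shape, dtype=np.float32)
    count = np.zeros(out_shape, dtype=np.float32)
    for (r, c), patch in tiles:
        h, w = patch.shape
        total[r:r + h, c:c + w] += patch
        count[r:r + h, c:c + w] += 1
    return total / np.maximum(count, 1)  # avoid division by zero for uncovered pixels

# Two 4x4 tiles that overlap in the two middle columns of a 4x6 canvas
left = ((0, 0), np.ones((4, 4), dtype=np.float32))
right = ((0, 2), np.zeros((4, 4), dtype=np.float32))
full_mask = stitch_masks([left, right], out_shape=(4, 6))
```

Thresholding the stitched result (for example at 0.5) would then give the final binary building mask.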

Leaderboard result: After submitting the results to the Inria website, we got an accuracy of 96% and an IoU of 70%.

Left: Predicted image on top of test image (256×256); Right: Test image (256×256)

P.S.: I cannot share the code for this project (it was done as part of fellowship.ai).

I am happy to discuss further or clarify anything. Any suggestions to improve the blog or the project approach are appreciated.

Project Contributors: Jhansi Anumula, Pallavi Allada, and Zoheb Abai

The Startup


Written by Jhansi Anumula, Deep Learning Intern
