Part 1 (paper explained): Tiling and stitching segmentation output for remote sensing: basic challenges and recommendations

Maru Tech 🐞
8 min read · Mar 9, 2023


The goal of this paper is to shed light on various tiling and stitching techniques and how they can be exploited to maximize label accuracy, minimize computation, and overcome the translation variance problem.

  1. Introduction

Nowadays, CNNs have become one of the most widely used methods in computer vision, thanks to their unprecedented performance on several benchmark datasets. This success, however, has run into some crucial problems, headed by computational requirements.
When dealing with high-resolution data, the input image contains an enormous number of pixels, so the number of operations required to process it increases significantly. The larger the input, the more computation is required, which can make it hard for the GPU to keep up with the model's demands, leading to slower training and longer inference times.
To mitigate this problem, several techniques have been adopted; two of them, tiling and stitching, are common solutions, especially for remote sensing and biomedical data.

What is tiling

Tiling is the process of dividing a large image into smaller regions termed patches. This is often done to facilitate the processing of large images, which can be computationally expensive to analyze as a whole.
These patches are then processed individually, and the resulting label patches are merged back together in a process termed stitching, applying some kind of blending operation to create a seamless composite mosaic that looks as if it were produced in a single shot.

Most of the time, the input patches are deliberately extracted from the original image in an overlapping manner. The intuition behind this is to ensure that objects on the boundary between two adjacent tiles are not missed, besides gaining more context for each individual patch.

Note: the size and shape of the patches can vary depending on the application and the image being processed.
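As a concrete illustration, here is a minimal NumPy sketch of overlapped tiling; the function name `tile_image` and the toy sizes are mine, not from the paper:

```python
import numpy as np

def tile_image(image, patch_size, overlap):
    """Split a 2-D image into overlapping square patches.

    Returns the patches plus their top-left coordinates so the
    predicted labels can later be stitched back into place.
    """
    step = patch_size - overlap  # stride between patch origins
    h, w = image.shape[:2]
    patches, coords = [], []
    for y in range(0, h - patch_size + 1, step):
        for x in range(0, w - patch_size + 1, step):
            patches.append(image[y:y + patch_size, x:x + patch_size])
            coords.append((y, x))
    return patches, coords

# toy 8x8 "image", 4x4 patches with a 2-pixel overlap -> 3x3 = 9 patches
img = np.arange(64).reshape(8, 8)
patches, coords = tile_image(img, patch_size=4, overlap=2)
print(len(patches))  # 9
```

The `coords` list is what makes stitching possible later: each label patch can be pasted back exactly where its input patch came from.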

However, in pursuit of the best trade-off between computational cost and label accuracy, the authors arrived at what they call a central finding: all existing stitching methods are motivated by the absence of perfect translation equivariance.

Translation equivariance refers to the property that the output of a segmentation algorithm should translate together with the input image: if the input image is shifted by a certain amount, the output segmentation should be shifted by the same amount, preserving the locations, the boundaries, and especially the label values of objects.
However, as we said earlier, due to the challenges of processing large remote sensing images, it is often necessary to break the input image into smaller tiles and process them separately. Because the network is not perfectly equivariant, pixels near tile borders receive unreliable predictions, which introduces misalignment between tiles after they are concatenated back together and, by consequence, degrades the quality and translation equivariance of the whole output segmentation, as depicted in figure 1.

Fig. 1. Each row corresponds to the same image translated horizontally with respect to the CNN input grid; pixels near the patch edges receive a different label value in each row, producing dissimilar results.
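The border effect behind figure 1 can be reproduced with a toy zero-padded convolution standing in for the CNN: shifting the input and un-shifting the output recovers the interior pixels exactly, but not the border ones. This is a minimal sketch of the phenomenon, not the paper's experiment:

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive 3x3 convolution with zero padding ('same' output size)."""
    padded = np.pad(img, 1)
    out = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.sum(padded[y:y + 3, x:x + 3] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.random((16, 16))
kernel = rng.random((3, 3))

out = conv2d_same(img, kernel)
# shift the input 3 pixels right (circularly), run the "network", shift back
recovered = np.roll(conv2d_same(np.roll(img, 3, axis=1), kernel), -3, axis=1)

# interior columns agree: equivariance holds away from the border...
print(np.allclose(out[:, 4:12], recovered[:, 4:12]))  # True
# ...but border columns differ, because zero padding injects new context
print(np.allclose(out, recovered))  # False
```

A real CNN stacks many such padded layers, so the unreliable border band grows with depth, which is exactly why patch edges disagree between rows in figure 1.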

So, in this paper, the authors provide some recommendations for stitching that optimize accuracy and enhance the smoothness and quality of the output tiles.

To provide you with a comprehensive understanding of this paper, this blog post is organized into two parts. In the first part, we’ll briefly discuss the datasets and networks that were adopted, then explore the various types of stitching methods and their relationship with translation variance.

In the second part, we’ll focus on the causes of translation variance and offer some practical suggestions for achieving better tiling results.

2. Datasets :

2.1. The INRIA building labeling dataset (D1)

The INRIA Building and Road Detection Dataset, also known as the INRIA Aerial Image Labeling Dataset, is a popular benchmark dataset for object detection and image segmentation tasks related to aerial imagery.

It consists of 180 RGB images captured by an airborne camera, each with a resolution of 5000x5000 pixels. The authors considered only 5 of the 10 cities in their experiments (Austin, Chicago, Kitsap, Western Tyrol, and Vienna), with 36 samples over each city.

2.2. The solar array labeling dataset (D2)

A dataset designed for object detection and segmentation tasks related to solar panels in satellite imagery. It was created by the European Space Agency (ESA) as part of its efforts to support the development of renewable energy sources.

The dataset consists of high-resolution RGB satellite images captured over various locations in Europe, Africa, and Asia. The images were captured by different satellite sensors and have varying spatial resolutions, ranging from 0.5 m to 2.5 m. There are over 5000 images, covering an area of approximately 10,000 km².

3. Networks

3.1. U-Net :

U-Net is a popular deep learning architecture used for medical image segmentation tasks. It consists of a contracting path and an expansive path.

The contracting path is a series of convolutional and pooling layers that reduce the spatial resolution of the input image while increasing its depth to capture contextual information, while the expansive path consists of a series of up-sampling layers that recover the spatial information, i.e. the WHERE information of the input image.

The U-Net architecture has become popular for image segmentation tasks due to its ability to capture fine-grained details and produce accurate segmentation masks even for small objects. This is achieved with the help of skip connections, which are implemented by concatenating feature maps from the contracting path with the corresponding feature maps from the expansive path. This allows the network to combine low-level features from the input image with high-level features learned from the context, resulting in more accurate segmentation masks.

For this paper, the authors applied a small modification to the original U-Net architecture, halving the number of filters in each convolution.
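The skip-connection idea can be sketched in a few lines of NumPy. Here, pooling, nearest-neighbour upsampling, and channel-wise concatenation stand in for the real convolutional blocks; this is an illustration of the data flow, not the actual U-Net code:

```python
import numpy as np

def max_pool2(x):
    """2x2 max pooling on a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2(x):
    """Nearest-neighbour 2x upsampling on a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# a skip connection: concatenate contracting-path features with the
# spatially restored expansive-path features, channel-wise
enc = np.random.default_rng(0).random((8, 16, 16))  # encoder features
dec = upsample2(max_pool2(enc))                     # back to 16x16 spatially
fused = np.concatenate([enc, dec], axis=0)          # 8 + 8 = 16 channels
print(fused.shape)  # (16, 16, 16)
```

The concatenation is why the decoder sees both the precise WHERE information (from `enc`) and the contextual WHAT information (from the downsampled path).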

3.2. DeepLabV2 :

DeepLabv2 builds on a pre-trained convolutional backbone such as VGG or ResNet, in which atrous (dilated) convolutions enlarge the receptive field without reducing the feature-map resolution. Multi-scale contextual information is captured with atrous spatial pyramid pooling, and the resulting score maps are then upsampled back to the original image resolution.

In the paper, they used DeepLabv2 with a ResNet-101 backbone and eliminated the conditional random field (CRF) post-processing step.

Training:

- D1: the first 5 tiles of each city for validation, the rest for training
- D2: the first half of the samples in each city for validation, the rest for training
- Loss function: discrete cross-entropy
- Batch size: 5
- Patch size: 572x572 (U-Net), 321x321 (DeepLab)
- Optimizer: Adam (b1 = 0.9, b2 = 0.999, epsilon = 10^-8)
- Epochs: 100, with 8000 patches per epoch
- Grid search for hyper-parameters
- U-Net: LR = 10^-4, dropped to 10^-5 after 60 epochs
- DeepLab: LR = 10^-5, dropped to 10^-6 after 60 epochs
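The step learning-rate schedule above is simple enough to write down directly; this is a trivial sketch and the function name is mine:

```python
def step_lr(epoch, base_lr, drop_epoch=60, factor=0.1):
    """Divide the learning rate by 10 once `drop_epoch` epochs have passed."""
    return base_lr if epoch < drop_epoch else base_lr * factor

# U-Net: 1e-4 -> 1e-5 after 60 epochs; DeepLab: 1e-5 -> 1e-6
print(step_lr(10, 1e-4), step_lr(60, 1e-4))
```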

4. What is stitching

Stitching in remote sensing refers to the process of aligning and blending multiple images to create a seamless, composite mosaic of a larger area. This process can be done manually, but it is often automated using specialized software that can detect and correct for differences in image perspective, illumination, and other factors that can affect the image alignment.

5. Variations of stitching

5.1. Label clipping

Label clipping involves removing the pixels near the edges of the output label patches, assuming that they exhibit a higher error rate than central pixels due to zero padding, an effect similar to context loss. This is a special case of TV (translation variance): once a pixel moves towards a patch edge via translation, it receives a worse, i.e. different, label, so clipping the label edges helps overcome TV-induced mismatched predictions.
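A minimal sketch of label clipping; the 8-pixel margin here is an arbitrary illustration, not a value from the paper:

```python
import numpy as np

def clip_label(label_patch, margin):
    """Keep only the central region of a predicted label patch, dropping
    `margin` pixels on every side, where predictions are least reliable."""
    return label_patch[margin:-margin, margin:-margin]

pred = np.ones((64, 64), dtype=np.uint8)  # a dummy 64x64 label patch
core = clip_label(pred, margin=8)
print(core.shape)  # (48, 48)
```

Note that clipping only works if the input patches were tiled with enough overlap to cover the discarded borders, otherwise the stitched mosaic would have gaps.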

5.2. Label averaging

The process of label averaging typically involves first aligning the corresponding regions of the individual images or patches, and then computing a weighted average of the labels within each pixel or region. The weights are typically based on the quality or confidence of each label, and may be determined using methods such as Gaussian weights or morphological weights.

One of the benefits of label averaging is that it can improve the accuracy and consistency of the labeled image by reducing the impact of noise or errors in individual labels. By combining multiple labels, it is possible to capture a wider range of variations in the appearance and shape of objects, and to average out any errors or inconsistencies in individual labels.

Nevertheless, two overlapping patches are just one image with a translation with respect to the network input, and since these networks are not perfectly equivariant, the spatially coincident labels will have different values even though they encode the same pixels. This is what motivates label averaging methods, since averaging reduces the impact of noise or errors in the individual coincident labels.
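A minimal sketch of weighted label averaging over overlapping patches; the Gaussian weighting is one of the options mentioned above, and the function names are mine:

```python
import numpy as np

def gaussian_weight(size, sigma):
    """2-D Gaussian weight map peaking at the patch centre."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    return np.outer(g, g)

def stitch_average(patches, coords, out_shape, weight=None):
    """Blend overlapping (soft) label patches into one canvas by a
    per-pixel weighted average; uniform weights if `weight` is None."""
    ph, pw = patches[0].shape
    if weight is None:
        weight = np.ones((ph, pw))
    acc = np.zeros(out_shape)
    wsum = np.zeros(out_shape)
    for patch, (y, x) in zip(patches, coords):
        acc[y:y + ph, x:x + pw] += patch * weight
        wsum[y:y + ph, x:x + pw] += weight
    return acc / np.maximum(wsum, 1e-8)

# two 4x4 patches overlapping by 2 columns: the overlap gets the mean
patches = [np.full((4, 4), 0.0), np.full((4, 4), 1.0)]
coords = [(0, 0), (0, 2)]
merged = stitch_average(patches, coords, (4, 6))
print(merged[0])  # overlap columns 2-3 hold 0.5
```

Passing `weight=gaussian_weight(4, sigma=1.0)` instead of uniform weights trusts each patch most at its centre, where predictions are most reliable, which matches the intuition behind label clipping.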

5.3. Label concatenation

The process of label concatenation typically involves first aligning the corresponding regions of the individual patches and then concatenating the labeled regions without any clipping or averaging. This method forgoes the benefits of the previous ones and, in some cases, implicitly assumes that networks are perfectly translation equivariant.
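For contrast, plain label concatenation just pastes patches back with no blending; in this minimal sketch, where patches overlap, the later one simply overwrites:

```python
import numpy as np

def stitch_concat(patches, coords, out_shape):
    """Paste label patches back at their top-left coordinates with no
    clipping or averaging; overlapping pixels are simply overwritten."""
    canvas = np.zeros(out_shape, dtype=patches[0].dtype)
    ph, pw = patches[0].shape
    for patch, (y, x) in zip(patches, coords):
        canvas[y:y + ph, x:x + pw] = patch
    return canvas

# two 4x4 patches overlapping by 2 columns: the second one wins the overlap
patches = [np.zeros((4, 4), dtype=int), np.ones((4, 4), dtype=int)]
canvas = stitch_concat(patches, [(0, 0), (0, 2)], (4, 6))
print(canvas[0])  # [0 0 1 1 1 1]
```

Since the two patches disagree on the overlap but one answer is silently kept, any border unreliability shows up directly as seams in the stitched output.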

Conclusion

Well this is all for the first part of the blog post 🎉🎉🎉✨✨✨✨🎆🎆

Thank you for reading! I hope you’ve gained a better understanding of how these techniques can improve image segmentation tasks. However, we’ve only scratched the surface of this fascinating topic. In the second part of this series, we’ll delve deeper into the underlying causes and recommendations. So stay tuned, see you in the next post!

References :

Bohao Huang, Daniel Reichman, Leslie M. Collins, Kyle Bradbury, and Jordan M. Malof. "Tiling and stitching segmentation output for remote sensing: basic challenges and recommendations."
