On the Use of Attention Map for Land Cover Mapping

Published in Analytics Vidhya, Feb 17, 2021

Today I will review the work I did for my Master’s thesis, which applies AI to the field of remote sensing. I will briefly describe what remote sensing is and how it is used in this work.

Introduction

Remote sensing is the acquisition of information about an object or phenomenon without direct physical contact with it. It is used in many fields, especially geography, land surveying, and earth sciences such as hydrology, ecology, oceanography, glaciology, and geology, as well as in military, intelligence, commercial, economic, planning, and humanitarian applications.

At present, remote sensing generally refers to the use of airborne or satellite sensors to detect and classify objects on Earth, on the land surface, on the oceans, and in the atmosphere, based on propagated signals such as electromagnetic waves. It is called active remote sensing when the aircraft or satellite emits its own signal and measures what comes back, and passive remote sensing when the sensor only records energy from another source, such as sunlight reflected by the object.

The form of remote sensing that we are most familiar with is therefore an ordinary camera, because it gives us information about an object, such as its color, texture, and other characteristics, without touching the object.

Satellite images are now useful and essential in many fields. Weather forecasting uses satellite images to analyze changes in the Earth’s atmosphere. Disaster management uses them to identify and assess a situation so that a timely and suitable response can be developed. Governments can use them for crop monitoring to set better agricultural price policies. Satellite images are therefore valuable because they apply to many tasks. At the same time, research on artificial intelligence for images, or computer vision, has received a lot of attention, and both hardware and software have been developed to speed up image analysis. One of the most influential and popular techniques in artificial intelligence is deep learning, which is used as the main building block for many tasks and achieves higher accuracy than previous techniques. We therefore apply deep learning to satellite images in this work.

In this work, we design a convolutional neural network, and a way of training it, that reduces the labeling resources needed for segmentation of satellite imagery, that is, land cover mapping of objects on the Earth’s surface. Segmentation assigns a class to each pixel of the image, so that we can say which kind of object each area of the image contains. Satellite images are usually huge and contain a lot of information, and because satellites orbit continuously, a lot of data can be collected. Labeling it, however, takes a long time and requires experts to do it carefully, so preparing training data is slow, and good-quality labels are sometimes hard to find. We therefore changed how the model is trained. A typical segmentation model such as U-Net uses pixel-level labels that specify a class for every pixel in the image, which takes a lot of time and effort. In our work, we instead label each image only by stating whether each class appears in it or not, which saves time and expertise.

To achieve this goal, we use an attention layer with one channel per class to create attention maps for land cover mapping. The attention layer is part of the model structure, so it is learned along with the rest of the model during training. Finally, we post-process the attention maps and convert them into a segmented image.

You can see that pixel-level labeling, which identifies the class at every position, requires a lot of time. The more images there are and the higher their level of detail, the more time it takes.

Objectives

From the introduction above, the primary objectives of this work are as follows:

To design a model using deep learning for land cover mapping of satellite imagery by reducing the labeling resources for model training.

To reduce the time of data labeling.
To segment the image without needing to label every pixel.
To design a convolutional neural network architecture that gives the land cover mapping using the “Attention map.”

Materials and Methods

1. Dataset

Images from the UC Merced Land Use dataset (Yang & Newsam, 2010) were used in our experiment. The images in this dataset were extracted from large images in the USGS National Map. The dataset contains 2,100 images of 256 x 256 pixels with a resolution of 0.3 meters. All images are in the RGB color space, and the dataset has 17 land cover classes: airplanes, buildings, cars, court, dock, mobile homes, pavement, ship, tanks, sea, water, bare soil, sand, chaparral, field, grass, and trees.

The dataset used in this work is the UC Merced Land Use dataset, which consists of three-band (RGB) images with a resolution of 30 cm.

2. Pre-processing data

2.1 Combine similar classes

Because the number of training images is limited, we grouped the 17 classes into 4 land cover classes: impervious, water, land, and vegetation.

Impervious (Man-made) contains airplanes, buildings, cars, courts, dock, mobile homes, pavement, ships, and tanks.
Water contains sea and water.
Land contains bare soil and sand.
Vegetation contains chaparral, field, grass, and trees.

We combine classes because the data have too many classes relative to the number of images in each class, which leaves little data to train the model on each one. So we merge classes with similar meanings. Fewer classes also make the task easier for the model to learn because there is less confusion between classes, and merging helps reduce class imbalance.
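The grouping listed above can be written down as a small lookup table. This is only a sketch: the class-name strings follow the spelling used in this article, and the names `GROUPS`, `FINE_TO_COARSE`, and `merge_class` are illustrative assumptions, not identifiers from the original code.

```python
# Grouping of the 17 original classes into 4 land cover classes,
# as listed in this article. All identifier names are illustrative.
GROUPS = {
    "impervious": ["airplanes", "buildings", "cars", "court", "dock",
                   "mobile homes", "pavement", "ship", "tanks"],
    "water": ["sea", "water"],
    "land": ["bare soil", "sand"],
    "vegetation": ["chaparral", "field", "grass", "trees"],
}

# Invert the grouping so each original class points to its merged class.
FINE_TO_COARSE = {
    fine: coarse for coarse, fines in GROUPS.items() for fine in fines
}


def merge_class(fine_name: str) -> str:
    """Return the grouped land cover class for an original class name."""
    return FINE_TO_COARSE[fine_name]


print(merge_class("sand"))  # land
```

Keeping the mapping in one dictionary makes it easy to check that all 17 original classes are covered exactly once.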

Combining similar classes: the 17 classes are merged into 4 classes

3. Data normalization

Normalization can be done in many ways. In this work, we scale the pixel values by standardization, computed separately for each of the red, green, and blue channels: we find the mean and standard deviation of each channel over the training data, subtract the mean, and divide by the standard deviation. We found that this gives better results than scaling into the 0-1 range. We assume that our training data is large enough to be representative of the testing data, so we scale the testing data with the same mean and standard deviation that we computed from the training data.

We found that standardization computed separately for the R, G, and B channels gives better results
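A minimal NumPy sketch of this per-channel standardization is shown below. It assumes the images are stored as an array of shape (N, H, W, 3); the function names, the small `eps` term, and the random stand-in data are assumptions added for illustration.

```python
import numpy as np


def channel_stats(images: np.ndarray):
    """Mean and standard deviation of each RGB channel over all images.

    images has shape (N, H, W, 3).
    """
    mean = images.mean(axis=(0, 1, 2))
    std = images.std(axis=(0, 1, 2))
    return mean, std


def standardize(images: np.ndarray, mean: np.ndarray, std: np.ndarray,
                eps: float = 1e-7) -> np.ndarray:
    """Subtract the channel mean and divide by the channel std."""
    return (images - mean) / (std + eps)


# Random stand-in data (assumption): 8 training and 2 testing tiles.
rng = np.random.default_rng(0)
train = rng.integers(0, 256, size=(8, 64, 64, 3)).astype(np.float64)
test = rng.integers(0, 256, size=(2, 64, 64, 3)).astype(np.float64)

mean, std = channel_stats(train)            # statistics from training data only
train_std = standardize(train, mean, std)
test_std = standardize(test, mean, std)     # reuse the training statistics
```

The important design choice is the last line: the testing data is scaled with the training statistics rather than its own.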

4. Split image

Our network must learn the correspondence between the observed data in an image and the class labels by itself. For this, the dataset must include training samples that contain only one or two land cover classes. If every training image contained all land cover classes, the network could not learn the differences between classes. We therefore divide every image and its ground truth into smaller tiles of 64 x 64 pixels, resulting in a total of 33,600 images.

As an example, consider classifying dogs and cats. Suppose every image contains both a dog and a cat, the labels are image-level, so they only say whether a dog or a cat appears in the image, and the model has never seen dogs or cats before. The model will then be confused, unable to tell which thing is called a dog and which a cat. But if we also have images of a dog alone and of a cat alone, the model will eventually learn to distinguish the characteristics of the dog from those of the cat.
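The tiling step can be sketched with a reshape, assuming non-overlapping tiles and an image size that is a multiple of the tile size (256 = 4 x 64 here). The function name `split_into_tiles` is an assumption for illustration.

```python
import numpy as np


def split_into_tiles(image: np.ndarray, tile: int = 64) -> np.ndarray:
    """Split an (H, W) or (H, W, C) array into non-overlapping tiles.

    Returns an array of shape (num_tiles, tile, tile, ...), ordered row by row.
    """
    h, w = image.shape[:2]
    rest = image.shape[2:]
    if h % tile or w % tile:
        raise ValueError("image size must be a multiple of the tile size")
    rows, cols = h // tile, w // tile
    return (image.reshape(rows, tile, cols, tile, *rest)
                 .swapaxes(1, 2)
                 .reshape(rows * cols, tile, tile, *rest))


image = np.arange(256 * 256 * 3).reshape(256, 256, 3)
tiles = split_into_tiles(image)
print(tiles.shape)             # (16, 64, 64, 3)
print(2100 * tiles.shape[0])   # 33600
```

Each 256 x 256 image yields 16 tiles, so 2,100 images give the 33,600 tiles mentioned above. The same function can be applied to the ground-truth mask so that images and labels stay aligned.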

5. Label the image using the classes in the ground truth

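Since our work does not require pixel-wise labels, we only have to say which classes appear in the image. We can therefore create the new image-level label from the pixel-wise label by encoding the presence or absence of each class as a binary vector, in the style of one-hot encoding but with a 1 for every class that appears.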
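As a sketch of this conversion, the snippet below derives the image-level label from a pixel-wise mask. It assumes the mask stores integer class indices 0 to 3 in the order land, impervious, vegetation, water; the function name is hypothetical.

```python
import numpy as np

# Class order is an assumption matching the order used in this article.
CLASSES = ["land", "impervious", "vegetation", "water"]


def image_level_label(mask: np.ndarray, num_classes: int = 4) -> np.ndarray:
    """Turn an (H, W) mask of class indices into a presence/absence vector."""
    label = np.zeros(num_classes, dtype=np.float32)
    label[np.unique(mask)] = 1.0   # mark every class that occurs at least once
    return label


mask = np.zeros((64, 64), dtype=np.int64)   # left half: land (index 0)
mask[:, 32:] = 2                            # right half: vegetation (index 2)
print(image_level_label(mask))              # [1. 0. 1. 0.]
```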

Label the image using the classes in the ground truth

For an image of 64 x 64 pixels, a pixel-level label requires 4,096 labels per image, while an image-level label requires only 4, one for each of our 4 classes. The image-level labeling effort therefore depends on the number of classes, not on the size of the image. For one image, pixel-level labeling requires 1,024 times as many labels.

6. Proposed model architecture

In this work, we propose a neural network that localizes the classes of interest, and so segments the image, using only the presence or absence of land cover classes as labels. This does not require pixel-by-pixel labeling as in U-Net, or bounding-box labeling as in Faster R-CNN and YOLO. The network used in our paper borrows concepts from ResNet and NIN (Network in Network): skip connections through convolution layers with filters of size 1 × 1 are used to avoid vanishing gradients. Global Average Pooling (GAP) is used instead of a fully connected layer to summarize the features for prediction, which preserves the spatial information in the feature maps. GAP also acts as a regularizer that helps prevent overfitting. The model does not reduce the image size much because we want to maintain spatial resolution. The convolutional layers use filters of size 3 × 3, and each is followed by batch normalization and a ReLU activation function, in that order.

Our model is a deep neural network divided into two parts: the feature extraction part (the base network) and the attention mapping part. For the final decision, the attention map and the feature map are multiplied element by element. The attention map uses the softmax activation function, and its number of channels equals the number of classes (four in our case: land, impervious, vegetation, and water). After training, the model assigns each class to a different channel of the attention map. The difference between the attention layer and a feature layer is that the attention layer is followed by a softmax activation function, so its output can be read as a probability map in which each pixel value is between 0 and 1.

The softmax activation function maps the values of the attention map so that, at each pixel, the values across the channels sum to 1. We want each channel to represent one class, so the value of each pixel in each channel is the probability of that class.

In other words, the softmax activation function on the attention layer estimates a probability in the range 0–1. When the feature map is multiplied by the attention map, a feature at a position where the probability is close to 0 is not passed on, while a feature at a position where the probability is close to 1 passes through.

For decisions made from the filtered features, we found that applying the softmax activation function to the spatial features to convert them into probabilities gives the best results. Since each channel is mapped to one class output, the model learns to assign the channels to classes in the order defined by our labels: land, impervious, vegetation, and water. If we instead used the sigmoid activation function at the final output, the model could still correctly predict which classes appear, but the attention map would lose spatial information, and the model could not assign each channel to the intended class.

We also use both Global Average Pooling and Global Max Pooling to summarize the features, and merge the two by averaging. This gives significantly better results than using Global Average Pooling or Global Max Pooling alone. Global Average Pooling gives a summary of the features, which may miss the distinctive feature the model needs to classify the class, but it forces the attention layer to retain spatial information. Global Max Pooling picks out the distinctive feature that lets the model classify the class, but it makes the attention layer ignore spatial information and focus only on the final output. The right approach is therefore to combine them.
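The gating described here can be sketched in NumPy. This is not the trained model, only the arithmetic of the attention step: a softmax across the class channels followed by an element-wise product with a feature map of the same shape. The function names and the random inputs are assumptions for illustration.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def gate_features(features: np.ndarray, attention_logits: np.ndarray):
    """Apply channel-wise softmax attention to a feature map.

    Both inputs have shape (H, W, num_classes). Returns the gated
    features and the attention map.
    """
    attention = softmax(attention_logits, axis=-1)   # sums to 1 at each pixel
    return attention * features, attention


# Random stand-ins for the two branches of the network (assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 64, 4))
logits = rng.normal(size=(64, 64, 4))
gated, attention = gate_features(features, logits)
print(np.allclose(attention.sum(axis=-1), 1.0))      # True
```

Where one channel's attention is close to 1, the other channels at that pixel are close to 0, so only that class's feature passes through.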
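A small sketch of the combined pooling is given below, assuming the gated feature map has shape (H, W, C). It only shows the pooling and averaging; whatever layers follow in the actual model are not reproduced here, and the function name is hypothetical.

```python
import numpy as np


def pooled_summary(gated: np.ndarray) -> np.ndarray:
    """Average of global average pooling and global max pooling.

    gated has shape (H, W, C); the result has shape (C,).
    """
    gap = gated.mean(axis=(0, 1))   # summary over the whole map
    gmp = gated.max(axis=(0, 1))    # strongest single response
    return (gap + gmp) / 2.0


# One strong response in channel 0, nothing in channel 1.
gated = np.zeros((4, 4, 2))
gated[0, 0, 0] = 1.0
print(pooled_summary(gated))        # [0.53125 0.     ]
```

In this toy case the average alone would report only 0.0625 for channel 0 and the maximum alone would report 1.0, so the merged value reflects both how strong and how widespread the response is.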

Comparing the three methods, the third (the combination) gives the best results

Here, the attention map acts as a filter that enables or disables the features extracted in the feature map before they pass to the decision-making process. The feature maps and the attention maps have the same size and the same number of channels, and the multiplication layer multiplies them element-wise, so each feature map has a corresponding attention map. Thus, if a pixel in the channel of the attention map associated with a land cover class (say A) has a value of zero, the feature extracted at this pixel in the corresponding channel of the feature map is eliminated, and the final decision regarding the presence of class A has no information about the feature of class A at this pixel. In contrast, if the same pixel in the same channel has a value of one, the final decision is made with the class A feature at this pixel. From this idea, it follows that the attention map will have a value of one at a given pixel if the feature of the underlying class is present at that pixel.

The attention layer acts as a filter to enable or disable the feature extracted by the feature map passing through the decision-making process

7. Land cover mapping (Segmentation)

For land cover mapping, recall that the number of channels in the attention map is equal to the number of classes. At each pixel, we compare the values of the attention maps of all classes and choose the class with the highest value as the land cover class. Hence, every pixel is assigned as

$$S(x, y) = \arg\max_{c} A_c(x, y)$$

where $S(x, y)$ is the class of the pixel $(x, y)$, and $A_c$ is the attention map of class $c$.

Obtaining a land cover map consists of two main steps. The first step is to feed the input image into the model. The model provides two kinds of output: the probability that each class appears in the image, and the attention map of each class, which we process further to get the land cover map. The second step compares the pixel values of the attention maps and chooses the highest value to indicate the class.

The process of land cover mapping begins by feeding input images into the model, which we set to output the attention maps.

We compare the pixel probability values across the channels and select the maximum to assign a class to that pixel. The result can be displayed in color by mapping a color to each class.
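The two post-processing steps, taking the per-pixel maximum over the attention channels and mapping classes to colors, can be sketched as follows. The channel order follows the class order given in this article; the exact RGB values in `PALETTE` are assumptions chosen to match the color names used for the result figure, and the function names are hypothetical.

```python
import numpy as np

CLASSES = ["land", "impervious", "vegetation", "water"]

# RGB values are illustrative assumptions for the named colors.
PALETTE = np.array([
    [139, 69, 19],     # land: brown
    [128, 128, 128],   # impervious: gray
    [0, 128, 0],       # vegetation: green
    [173, 216, 230],   # water: light blue
], dtype=np.uint8)


def land_cover_map(attention: np.ndarray) -> np.ndarray:
    """Pick, at every pixel, the class whose attention value is highest.

    attention has shape (H, W, num_classes); the result has shape (H, W).
    """
    return attention.argmax(axis=-1)


def colorize(class_map: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of class indices to an (H, W, 3) RGB image."""
    return PALETTE[class_map]


# A tiny 2 x 2 attention map with one dominant class per pixel.
attention = np.array([
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]],
    [[0.2, 0.5, 0.2, 0.1], [0.1, 0.2, 0.6, 0.1]],
])
class_map = land_cover_map(attention)
rgb = colorize(class_map)
print(class_map.tolist())   # [[0, 3], [1, 2]]
```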

Results

Both our model and U-Net produce a segmented image, or land cover map. We measure the performance of our model with both image-level and pixel-level accuracy, while the Kappa coefficient is used only at the pixel level. Table 1 and Table 2 give the image-level and pixel-level accuracies, respectively, when both methods are trained with the same 107,520 labels. Our algorithm outperforms U-Net significantly, since 107,520 labels correspond to 26,880 images for our algorithm but to only 27 images for U-Net.
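The label counts in this comparison can be checked with a few lines of arithmetic. The sketch assumes that each training image is a 64 x 64 tile, that image-level labeling costs one label per class, and that pixel-level labeling costs one label per pixel; rounding up the U-Net image count is also an assumption, made because a fraction of an image cannot be labeled.

```python
import math


def labels_per_image(side: int, num_classes: int):
    """Number of labels one image needs under each labeling scheme."""
    pixel_level = side * side     # one label per pixel (U-Net)
    image_level = num_classes     # one presence/absence flag per class (ours)
    return pixel_level, image_level


pixel_labels, image_labels = labels_per_image(side=64, num_classes=4)
budget = 107_520                  # label budget used in Tables 1 and 2

print(pixel_labels // image_labels)       # 1024 times more labels per image
print(budget // image_labels)             # 26880 images for our method
print(math.ceil(budget / pixel_labels))   # 27 images for U-Net
```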


**Table 1** Image-Level Overall Accuracies at the same number of labels

**Table 2** Pixel-Level Accuracies at the same number of labels
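Overall accuracy and the Kappa coefficient, the pixel-level metrics used in this comparison, can both be computed from a confusion matrix. The sketch below uses the standard definition of Cohen's kappa; the function names and the toy labels are assumptions for illustration and are not the evaluation code behind the tables.

```python
import numpy as np


def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray,
                     num_classes: int) -> np.ndarray:
    """Rows are true classes, columns are predicted classes."""
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def overall_accuracy(cm: np.ndarray) -> float:
    """Fraction of pixels whose predicted class equals the true class."""
    return float(np.trace(cm) / cm.sum())


def cohen_kappa(cm: np.ndarray) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = cm.sum()
    observed = np.trace(cm) / n
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    return float((observed - expected) / (1.0 - expected))


y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 0])
cm = confusion_matrix(y_true, y_pred, num_classes=2)
print(overall_accuracy(cm))   # 0.75
print(cohen_kappa(cm))        # 0.5
```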

Next, we examined the performance of our algorithm and U-Net for different numbers of training samples. Table 3 shows that both algorithms perform better as the number of training samples increases. However, U-Net requires much more effort in the labeling process. At 100% of the training samples, the difference in overall accuracy between U-Net and our algorithm is less than 3%, while U-Net requires about 1,000 times more labels.

**Table 3** Pixel-Level Accuracies as a function of training sample size

In Table 4, we investigate the class-level accuracy. At 100% of the training samples, the F1 scores of U-Net are slightly higher than those of our proposed method. Note again that U-Net requires around 1,000 times more labeling.

**Table 4** Pixel-Level Accuracies in each class for the case of 100% training samples

We also investigated the effect of image size on accuracy. When the images are larger, there is a higher chance that an image contains two or more classes. If all training images contain both land and vegetation, our algorithm cannot find the features that distinguish these two classes. To be more specific, land is typically brown and vegetation is green. Our algorithm sees both brown and green and only knows that the image contains vegetation and land, but it cannot link brown to land and green to vegetation. Table 5 supports our hypothesis: as the image size increases, the accuracy of our algorithm decreases, whereas the performance of U-Net stays roughly constant.

**Table 5** Pixel-Level Accuracies as a function of image size for the case of 100% training samples

The figure plots the required number of labels against the overall accuracy (OA). If only 80% OA is required, our algorithm needs roughly 1,000 times fewer labels, which can save days or weeks of preparing training samples.

**The graph shows that, if an accuracy of 80% is required, our method needs approximately 1,000 times fewer labels, which saves both time and money.**

Finally, this is the resulting land cover mapping, or image segmentation, at 100% of the training samples (our method uses 1,024 times fewer labels than U-Net).

(a) attention map for land, (b) attention map for impervious, (c) attention map for vegetation, (d) attention map for water, (e) observed image, (f) ground truth, (g) proposed algorithm, (h) U-Net, (land: brown, impervious: gray, vegetation: green, water: light blue) (Results from the test set)

Conclusion and Recommendation

Conclusion

We proposed a new land cover mapping algorithm for remote sensing images based on an attention layer. The advantage of our approach is that each training image needs only four labels, stating the presence or absence of each land cover class in the whole image, rather than a label for every pixel. This labeling process allows a large training set to be built quickly. To achieve this, we designed our network so that each channel of the attention layer corresponds to one and only one land cover class. The attention layer thus acts as a switch that lets information from the feature extraction layer pass to the final decision process, so only pixels with a strong presence of a given land cover class pass through. We compared our algorithm’s performance against U-Net and found that, with the same number of labels, our algorithm achieves an overall pixel-level accuracy of 82.86%, whereas U-Net obtains only 59.71%.

Recommendation

We found that our model handles images with a small number of classes well, but it handles spatial detail and class location poorly, because the labels we use only say whether a class appears or not. The information used to train the model is therefore very limited, and the model does not learn the spatial position of each class or its density in the image.

For that reason, although our model works satisfactorily overall and achieves higher accuracy than U-Net at the same number of labels, this may be partly because up to 30% of the images in the test dataset contain a single class, while U-Net handles spatial resolution and location better. For our model to work better and solve the problems mentioned above, we may need to find other ways to increase its accuracy. One potential approach is active learning: if we need to do more labeling, how do we label smartly and effectively? How do we choose which data to label?

One way to do this is, instead of giving equal importance to all data and labels, to prioritize and label the data that are most important and most likely to help the model perform better. If we do not prioritize and assume that all data are equally informative, randomly selecting data to train on will sometimes give good results and sometimes not, because every sample is different. It also depends on the problem we are facing. In simple terms, we want to find data that are good representatives to train the model on.

With our approach, we initially only state which classes are present, which lets us train the model with a small number of labels and gives us a baseline model. We then use this model to determine how important each sample is, choose which data to label, and train a new model with the additional data. Repeating this loop makes model development more efficient in terms of the labeling resources required, while improving model performance.

There are several ways to prioritize the data to be labeled. One example is the so-called least confidence method. We use the model to predict the unlabeled data and, for each sample, take the maximum probability, that is, the probability of the class the model predicts. We then sort these maximum probabilities in ascending order, which gives the priority of each sample for labeling. We label the samples with low probability first, because low confidence means the model is still confused by these data.

Either the model has not learned them well, or they have features that are hard for the model to learn. Labeling these data should therefore help the model learn better overall: it should keep working well on simple data and work better on complex data, because we add the extra labels to training.

Another issue we encountered with the dataset is class imbalance, which we dealt with by combining classes and by using a weighted loss function. Other methods might also help, such as oversampling and undersampling. Undersampling discards some of the data, but it may make the model work better.
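The least confidence ranking described above can be sketched in a few lines. The snippet assumes the model's predictions on the unlabeled data are available as an (N, C) array of class probabilities; the function name and the example probabilities are made up for illustration.

```python
import numpy as np


def least_confidence_order(probs: np.ndarray) -> np.ndarray:
    """Rank unlabeled samples from least to most confident.

    probs has shape (N, C). The confidence of a sample is the maximum
    probability over its classes; samples are sorted in ascending order.
    """
    confidence = probs.max(axis=1)
    return np.argsort(confidence)


# Made-up predictions for three unlabeled samples.
probs = np.array([
    [0.90, 0.05, 0.03, 0.02],   # confident
    [0.40, 0.30, 0.20, 0.10],   # least confident
    [0.60, 0.20, 0.10, 0.10],
])
order = least_confidence_order(probs)
print(order)   # [1 2 0]
```

The first indices in `order` are the samples to send for labeling first; how many to take per round is a budget decision outside this sketch.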
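The article does not say how the loss weights were chosen. One common option, shown here purely as an assumption, is to weight each class by the inverse of how often it appears in the image-level labels, so that rare classes count more in the loss. The function name and the toy labels are hypothetical.

```python
import numpy as np


def inverse_frequency_weights(labels: np.ndarray) -> np.ndarray:
    """Per-class weights from (N, C) presence/absence labels.

    Rarer classes get larger weights; weights are scaled to have mean 1.
    """
    freq = labels.mean(axis=0)                 # fraction of images per class
    weights = 1.0 / np.maximum(freq, 1e-7)     # guard against empty classes
    return weights / weights.mean()


# Toy labels: class 0 appears in every image, class 1 in one of four.
labels = np.array([
    [1, 1],
    [1, 0],
    [1, 0],
    [1, 0],
], dtype=np.float64)
print(inverse_frequency_weights(labels))       # [0.4 1.6]
```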

Another thing worth trying is adapting the base network (the feature extraction part). We ran some experiments, but with the limited time we could not test the hypothesis enough to see the significance of this refinement, though as far as I tried, I think that there is a certain ...

Finally, thank you all for your attention. I hope this will be of some benefit to those who are interested. If there are any mistakes, I apologize.

References

[1] https://ieeexplore.ieee.org/document/9158220

[2] https://github.com/MailSuesarn/On-the-Use-of-Attention-Map-for-Land-Cover-Mapping (includes the dataset)