Tree Segmentation on RGB and nDSM Rasters Using Pyramid Scene Parsing Network (PSPNet) in ArcGIS Pro — Part 1: Without a single line of code!

Published in GeoAI · Apr 25, 2022

by Yasin Koçan, Oktay Eker

We decided to create an introductory series on tree segmentation using PSPNet and the deep learning tools in ArcGIS Pro. In Part 1, we show the whole workflow using only ArcGIS Pro’s UI, without writing a single line of code (the optional scripting sketches scattered below are just previews). In Part 2, we plan to perform the same operations using Python scripting inside ArcGIS Pro.

Introduction

Tree segmentation has been a widely studied topic in the geospatial and environmental domains for decades. In this experiment, we focused on checking whether a model’s accuracy can be increased by adding new tree samples from different data. More specifically, we answer the following question: “Can we increase the accuracy of an existing model by fine-tuning it with finer-resolution imagery taken in a different year?” We believe this is an important topic to experiment with because it might lead to improvements simply by using imagery that is already available, regardless of its spatial resolution, up to a degree.

In section “A. Training on 0.5 m per-pixel resolution RGBD 2014 data” below, we cover the training, validation, and testing steps of a model using ArcGIS Pro deep learning tools. A model is trained on relatively low-resolution imagery (a 0.5 m per-pixel RGBD raster), which is later fine-tuned with higher-resolution (0.1 m per-pixel RGBD) imagery.

In section “B. Training on 0.1 m per-pixel resolution RGBD 2021 data”, the data preparation steps are pretty much the same; however, there are a few tricks and minor differences that we cover.

In section “C. Fine-Tuning The Pretrained Model”, we explain how the existing model (2014 low-resolution model) is fine-tuned with 2021 high-resolution imagery.

In section “D. Accuracy Assessment”, we compare the following models in terms of precision, recall, and F1 score (defined just after the list below):

1. 0.5 m RGBD 2014 Data (Initial Model)

2. 0.5 m RGBD 2014 Data (Fine-Tuned Model)

3. 0.1 m RGBD 2021 Data (High-Resolution Model)

4. 0.1 m RGBD 2021 Data (Fine-Tuned Model)

5. 0.1 m RGBD Upsampled 2014 Data (Fine-Tuned Model)
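For reference, these metrics are computed from the counts of true positive (TP), false positive (FP), and false negative (FN) detections:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2\cdot\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}
```

Precision penalizes false alarms, recall penalizes missed trees, and F1 is their harmonic mean.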

Pyramid Scene Parsing Network

The Pyramid Scene Parsing Network, or PSPNet, is a semantic segmentation approach that employs a pyramid parsing module to leverage global context information through different-region-based context aggregation. The combination of local and global clues improves the accuracy of the final prediction.

For more information, you can check the original paper and the code in the links below:

· Pyramid Scene Parsing Network [1]

· The code [2]

We use PSPNet here to detect trees in a given RGB and nDSM pair. We show how to create training and test data and evaluate the results using ArcGIS Pro deep learning tools. ArcGIS Pro has a ready-to-use implementation of PSPNet inside ArcGIS API for Python, and users can rely on it to work with their geospatial rasters seamlessly.
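Although ArcGIS Pro wraps all of this for you, readers who like to peek under the hood may enjoy a minimal PyTorch sketch of the pyramid pooling module described in the paper [1]. This is our own illustrative reimplementation, not the arcgis.learn code, and all names in it are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Pools features at several grid scales, projects each pooled map
    with a 1x1 convolution, upsamples back to the input size, and
    concatenates with the original features, mixing local and global context."""

    def __init__(self, in_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        out_channels = in_channels // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),  # pool down to a b x b grid
                nn.Conv2d(in_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        outs = [x]
        for stage in self.stages:
            pooled = stage(x)
            outs.append(F.interpolate(pooled, size=(h, w),
                                      mode="bilinear", align_corners=False))
        return torch.cat(outs, dim=1)

# Example: ResNet backbone features (1, 2048, 16, 16) -> (1, 4096, 16, 16)
ppm = PyramidPoolingModule(2048)
print(ppm(torch.randn(1, 2048, 16, 16)).shape)
```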

You can check “How PSPNet works” on the ArcGIS Developer website.

Training, Validation, and Test Data

1. RGB 0.5 m resolution from 2014 data

2. 0.1 m resolution RGB, DSM, and DTM rasters for 2021 data*

3. DSM raster (2014) and DTM raster (2014)

4. Normalized DSM raster 0.1 m from 2021 data**

5. Normalized DSM raster 0.5 m from 2014 data**

*Our business partner Wingtra AG collected oblique imagery of a region in Zurich using their WingtraOne GenII light drone; no active lidar sensor was involved in the source data collection. SURE for ArcGIS was used to generate a photogrammetric mesh from the oriented imagery, and the resulting mesh was then converted into orthorectified RGB and nDSM rasters. For more information, check “Creating City-Wide 3D Meshes with Drones” by Jeremiah Johnson.

**Normalization was done in ArcGIS Pro by subtracting the DTM raster from the DSM, bringing ground pixels close to 0.0. By doing so, instead of dealing with absolute height values, we deal with height-above-ground values.
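If you’d rather script this step (a small preview of Part 2), map algebra with the Spatial Analyst module does the subtraction in a couple of lines; the file paths below are placeholders:

```python
import arcpy
from arcpy.sa import Raster

arcpy.CheckOutExtension("Spatial")  # map algebra needs the Spatial Analyst extension

# Height above ground = surface model minus terrain model,
# so ground pixels land close to 0.0.
ndsm = Raster(r"C:\data\DSM_2014.tif") - Raster(r"C:\data\DTM_2014.tif")
ndsm.save(r"C:\data\nDSM_2014.tif")
```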

Installing the Deep Learning Libraries

ArcGIS Pro, ArcGIS Server, and the ArcGIS API for Python all incorporate tools that use AI and deep learning to solve geospatial problems such as feature extraction, pixel classification, and feature categorization. To install the libraries, see Deep Learning Libraries Installer for ArcGIS.

Overview

Figure 1: The flowchart of the proposed experiment.

You may notice that we used different nDSM threshold values: the 2014 data is cleaner and trees are more detectable because it comes from lidar, whereas in the 2021 data, artifacts of the photogrammetric mesh make some shady areas look like trees in the nDSM, so a larger threshold was used.

A. Training on 0.5 m per-pixel resolution RGBD 2014 data

Within the scope of the experiment, we are trying to find the trees that hang over building roofs so we can eliminate them and reconstruct building geometries with smoother surfaces. Thus, we applied a three-meter threshold to the nDSM raster to eliminate trees and small bushes below that height. For the 2014 data, there is an alignment issue between the RGB and nDSM bands; the content may be shifted by up to a meter due to orthorectification errors, especially around tall buildings. Since the study relies not only on RGB (orthophoto) but also on nDSM (normalized elevation raster) in training, the training data must be prepared carefully. To be more specific, a pixel might be part of a tree in the orthophoto but not in the nDSM because of the three-meter threshold. We used layer blending in ArcGIS Pro to visualize RGB + nDSM as pseudo-3D. If you are curious how to do that, the video below shows a sample of how we create the tree polygons.
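The height threshold can be scripted the same way. A minimal sketch, assuming the nDSM path from the earlier snippet (3 m for the 2014 lidar data; section B uses 5 m for the noisier 2021 photogrammetric data):

```python
import arcpy
from arcpy.sa import Raster, SetNull

arcpy.CheckOutExtension("Spatial")

# Set pixels below the 3 m threshold to NoData, keeping only
# features tall enough to be trees.
ndsm = Raster(r"C:\data\nDSM_2014.tif")
tall_only = SetNull(ndsm < 3, ndsm)
tall_only.save(r"C:\data\nDSM_2014_over3m.tif")
```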

Figure 2: Creating training data. To get a pseudo-3D visual, first change the symbology of the elevation (nDSM) layer to hillshade in the desired color scheme. Then adjust the transparency of the RGB layer (orthophoto) and set the layer blending option to “Multiply” from the Appearance tab. This is not an obligatory step, but it enhances visualization for precise drawing.

Around 3,900 polygons represent the areas containing tree samples, taken from different regions of the map. When creating training data, try to digitize samples from different areas of the map; otherwise, your model might be prone to overfitting. Also note that you don’t have to draw individual tree polygons unless you are doing instance segmentation. Our main goal in this experiment is to detect tree pixels rather than individual trees, so we covered all tree pixels regardless of individual tree boundaries. It is good to keep the training data boundaries (simple coverage masks containing the training samples, shown in yellow below) as a separate layer to minimize false positives in these areas.

Figure 3: The training samples for 2014 Zurich data. There are around 3,900 tree polygons from different areas of the map to reduce overfitting. The areas represented in yellow show the training data boundaries.

After creating the training data samples and the masks as feature classes, use the “Export Training Data for Deep Learning” tool to create the image chips for training. The parameters for the tool are explained below, and a scripting sketch follows the list:

· Input Raster: The 2014 RGBD (D represents the nDSM band) raster. In this experiment, we used 32 bits per channel to preserve the details in the nDSM band.*

*If you are not familiar with how to create a multiband raster, check the “Composite Bands” function, which creates a single raster dataset from multiple bands.
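For reference, the scripted version of that band stacking could look like this (placeholder paths):

```python
import arcpy

# Stack the RGB orthophoto and the nDSM into one 4-band RGBD raster.
arcpy.management.CompositeBands(
    in_rasters=r"C:\data\RGB_2014.tif;C:\data\nDSM_2014.tif",
    out_raster=r"C:\data\RGBD_2014.tif",
)
```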

· Output Folder: The output folder for image chips, labels, Esri model definition file, etc. This folder will be used as input in the next step of training.

· Input Feature Class Or Classified Raster Or Table: The feature class that contains the tree samples.

· Class Value Field: In this experiment, we are doing a binary classification (tree or not-tree). It is good to add a field that represents the class to the feature layer that contains the tree samples. In this case, you can simply add a field, then use the Calculate Field function to set all the values to 1.*

*Important Note: If no field is specified, the system searches for a value or classvalue field. If the feature does not contain a class field, the system determines that all records belong to one class.

· Input Mask Polygons: A polygon feature class that specifies the area where image chips will be created.

· Image Format: Specifies the raster format that will be used for the image chip outputs. In this example, we used TIFF format.

· Tile Size X and Y: The size of the image chips in the x and y dimensions. For the 0.5 m data, we used a tile size of 128, which covers (128 × 0.5 m) × (128 × 0.5 m) = 4,096 square meters per chip.

· Stride X and Y: The distance to move in the x and y directions when creating the next image chips. For example, when the stride is equal to half the tile size, there is a 50 percent overlap. For the 0.5 m data, we used 50% overlap (stride = 128/2 = 64).

· Rotation Angle: We used 0 as the rotation angle.

· Reference System: The reference system that will be used to interpret the input image is specified here.

· Output No Feature Tiles: Leave this unchecked to export only the image chips that capture training samples.

· Metadata Format: ArcGIS Pro Deep Learning toolbox offers different models and solutions that run on different types of image chips. In this experiment, we used Classified Tiles for PSPNet.

· Environments Tab: In this tab, the processing extent and cell size for the output can be specified. In this experiment, we used the default values.

Important Note: Before running the tool, be sure to clear the selection. If one or more polygons are selected, the tool will run without error but the output folders might be empty or contain only a subset of the entire dataset depending on the size of the selection.

To learn more about the tool, see Export Training Data for Deep Learning.
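If you’d like to script the export (another preview of Part 2), here is a sketch using the Image Analyst module. Parameter names follow our reading of the tool’s documentation; paths and field names are placeholders to adapt:

```python
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")

trees = r"C:\data\training.gdb\tree_polygons_2014"

# Binary classification: add a constant class value field set to 1.
arcpy.management.AddField(trees, "classvalue", "LONG")
arcpy.management.CalculateField(trees, "classvalue", "1", "PYTHON3")

# Export 128 x 128 px chips with 50% overlap as Classified Tiles.
arcpy.ia.ExportTrainingDataForDeepLearning(
    in_raster=r"C:\data\RGBD_2014.tif",
    out_folder=r"C:\data\chips_2014",
    in_class_data=trees,
    image_chip_format="TIFF",
    tile_size_x=128,
    tile_size_y=128,
    stride_x=64,
    stride_y=64,
    output_nofeature_tiles="ONLY_TILES_WITH_FEATURES",
    metadata_format="Classified_Tiles",
    class_value_field="classvalue",
    in_mask_polygons=r"C:\data\training.gdb\training_masks_2014",
    rotation_angle=0,
)
```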

Depending on the size of the training data, it may take a while to create the image chips. Once the export completes, the output folder should look like this:

Figure 4: The folder that contains the image chips and the model definition file.

Important Note: We suggest opening the image chips after exporting and checking whether any tree samples were left undigitized. Otherwise, those trees will be treated as non-tree samples, which could decrease the accuracy of the model due to false negatives. After adding the missing tree samples to the training data feature class, run the Export Training Data for Deep Learning tool again to create corrected image chips. The final image chips should look as shown in Figure 5 below:

Figure 5: Corrected overextended image chips. The grey areas show the image chip boundaries where there should be no tree samples left undigitized.

Training the initial neural network

After verifying the image chips are correctly exported, you can train the model by using the “Train Deep Learning Model” tool. This tool utilizes the GPU and allows parallel processing. If you are not sure how to configure your GPU, see GPU processing with Spatial Analyst.

As mentioned above, we used PSPNet as the deep neural network in our experiments. For the 0.5 m 2014 data, the training parameters are given below; a scripting sketch follows at the end of this subsection. You can also set the parallel processing factor and processor type (GPU or CPU) from the Environments tab.

· Max Epochs: 20

· Batch Size: 8

· Learning Rate: Empty (Default)

· Backbone Model: RESNET 34

· Validation: 10%

· Stop when model stops improving: Checked

· Freeze Model: Checked

With the parameters given above, the training took around 24 minutes on a personal laptop with a GTX 1060 GPU. You can change the parameters in your own experiments. For detailed information about the tool and its parameters, see “Train Deep Learning Model”. The resulting metrics on the automatically chosen Validation set:

· Precision: 0.877554

· Recall: 0.877554

· F1-Score: 0.854696

Before the training starts, the tool randomly splits the input image folder into Training (90%) and Validation (10%) sets. The final metrics shown by the training tool at the end are calculated on the Validation set. Because the Validation set is picked randomly at training time, different models end up with different Validation sets even when they were trained on the same input training chips folder; we are not in control of it, whereas we are in control of the test samples. This is why all models are tested on common test data in a separate test area, and the results are discussed in the “D. Accuracy Assessment” section.
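As promised above, here is how this training step might look when scripted. It is a sketch: the parameter names and keyword values come from our reading of the Train Deep Learning Model documentation, and the paths are placeholders:

```python
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")
arcpy.env.processorType = "GPU"  # equivalent to the Environments tab setting

# Train a PSPNet pixel classifier on the exported 2014 chips.
arcpy.ia.TrainDeepLearningModel(
    in_folder=r"C:\data\chips_2014",
    out_folder=r"C:\models\pspnet_2014",
    max_epochs=20,
    model_type="PSPNET",
    batch_size=8,
    backbone_model="RESNET34",
    validation_percentage=10,
    stop_training="STOP_TRAINING",  # stop when the model stops improving
    freeze="FREEZE_MODEL",          # freeze the backbone layers
)
```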

B. Training on 0.1 m per-pixel resolution RGBD 2021 data

Besides low-resolution (0.5 m) 2014 data, we trained another model on higher resolution 2021 data with 0.1 m resolution. The purpose of the training is to see whether there is a correlation between spatial resolution and model accuracy.

The 2014 elevation raster was produced with a lidar scanner, whereas the 2021 data was produced with aerial photogrammetry techniques. Although the 2014 lidar data has a lower resolution, the trees in it are more distinguishable and less noisy. In contrast, the 2021 photogrammetry data contains noise in shady and dark areas that can look like trees. Elevation values of less than 5 meters were therefore filtered out of the nDSM to minimize non-tree samples and small bushes during the training data preparation step. The rest of the workflow is pretty much the same as in “A. Training on 0.5 m per-pixel resolution RGBD 2014 data”; however, different parameters are used due to the different resolution. A total of 670 tree samples were used in this training.

· Input Raster: The 2021 RGBD (D represents the nDSM band) raster. We used 32 bits per channel to preserve the details in the nDSM band.

· Input Feature Class Or Classified Raster Or Table: A feature class representing tree polygons for 0.1 m 2021 data.

· Class Value Field: Similar to the previous training, it is good to add a field representing the class to the feature layer and set the class value to 1.

· Input Mask Polygons: Training data mask for 2021 tree samples.

· Image Format: Specifies the raster format that will be used for the image chip outputs. In this example, we used TIFF format.

· Tile Size X and Y: For the 0.1 m 2021 data, we used 512 pixels as the tile size, which means 512 × 0.1 m = 51.2 m of height and width per tile. With this tile size, we can cover multiple tree samples in a single tile. If the tile size is so small that a majority of tree canopies are represented only partially, the model might learn poorly. You can try different tile sizes in your experiments.

· Stride X and Y: Data volume grows with the square of the spatial resolution. Since we are working with 32-bit, 4-band (RGBD) rasters in this experiment, we used zero overlap between chips for the 0.1 m 2021 data to decrease the tile count.

For the remaining parameters, we used the same values that were used in the “A. Training on 0.5 m per-pixel resolution RGBD 2014 data” section above.

“Train Deep Learning Model” parameters:

· Max Epochs: 20

· Batch Size: 8

· Learning Rate: Empty (Default)

· Backbone Model: RESNET 34

· Validation: 10%

· Stop when model stops improving: Checked

· Freeze Model: Checked

With the parameters given above, the training took around 57 minutes on a personal laptop equipped with a GTX 1060 GPU. Resulting metrics on the automatically picked Validation set:

· Precision: 0.805829

· Recall: 0.794827

· F1-Score: 0.800291

Similar to 2014 training, the model is tested in a different area to evaluate the accuracy on unseen data, which is discussed in “D. Accuracy Assessment”.

C. Fine-Tuning the Pretrained Model

In this section, we explain how to fine-tune the pre-trained 2014 model with the higher-resolution (0.1 m) drone data from 2021. If you want to learn more about fine-tuning, there is a great blog post called “Fine-Tune a Pretrained Deep Learning Model” by Kate Hess and Rami Alouta.

The purpose of the experiment is to see whether a model trained on outdated data with different characteristics (lidar) can be improved by fine-tuning it with up-to-date photogrammetric data. Both lidar and photogrammetric datasets have their pros and cons; the main advantages of aerial photogrammetry are its availability and the cost of the flight. One of the biggest problems in remote sensing is keeping data up to date. By fine-tuning existing models with up-to-date data, researchers and organizations can increase the accuracy of their existing models and make them more robust to varying conditions.

The tree samples for 2021 data are taken from different extents to increase the diversity of the samples.

Figure 6: Training, Validation, and Test Area

“Train Deep Learning Model” parameters (a scripting sketch with the pre-trained model input follows below):

· Max Epochs: 20

· Batch Size: 8

· Learning Rate: Empty (Default)

· Backbone Model: RESNET 34

· Validation: 10%

· Stop when model stops improving: Checked

· Freeze Model: Checked

With the parameters given above, the training took around 10 minutes on a personal laptop equipped with a GTX 1060 GPU. Resulting metrics on the automatically picked Validation set:

· Precision: 0.829278

· Recall: 0.880446

· F1-Score: 0.854096

Similar to previous training, the model is tested in a different area to evaluate the accuracy on unseen data, which is discussed in “D. Accuracy Assessment”.
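Script-wise, the only difference from the initial training is supplying the 2014 model package as the pre-trained model; a sketch under the same assumptions as before:

```python
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")

# Fine-tune the existing 2014 model on the 0.1 m 2021 chips.
arcpy.ia.TrainDeepLearningModel(
    in_folder=r"C:\data\chips_2021",
    out_folder=r"C:\models\pspnet_finetuned",
    max_epochs=20,
    model_type="PSPNET",
    batch_size=8,
    backbone_model="RESNET34",
    pretrained_model=r"C:\models\pspnet_2014\pspnet_2014.dlpk",
    validation_percentage=10,
    stop_training="STOP_TRAINING",
    freeze="FREEZE_MODEL",
)
```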

Upsampling 2014 Data

Another question surfaces: “What happens if we upsample the 0.5 m 2014 data into 0.1 m resolution and test the models on it?” For this experiment, we upsampled the RGB and nDSM rasters using the Resample tool with the “Nearest” resampling technique. We then applied the same training and testing steps to the upsampled data. The results are discussed in “D. Accuracy Assessment”.
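Scripted, the upsampling is a single tool call (cell size in map units, nearest-neighbor resampling; placeholder paths):

```python
import arcpy

# Upsample the 0.5 m 2014 RGBD raster to a 0.1 m cell size.
arcpy.management.Resample(
    in_raster=r"C:\data\RGBD_2014.tif",
    out_raster=r"C:\data\RGBD_2014_10cm.tif",
    cell_size="0.1 0.1",
    resampling_type="NEAREST",
)
```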

D. Accuracy Assessment

The validation accuracy for the three models is calculated and shown by the “Train Deep Learning Model” tool; however, high precision values on the validation set might be misleading. Because the validation set is randomly chosen from the training data by the training tool, the Validation set differs per model, so we need a common Test set. For these reasons, we created a new test area that was not used in training and then tested the models in this area.

After completing the training steps above, we have three different models: the Low-Resolution Model, the High-Resolution Model, and the Fine-Tuned Model. Before we can compute their accuracy, a few steps are needed:

1. Create test data.

2. Classify the pixels using deep learning over the test area.

3. Convert classified rasters to tree polygons.

4. Compute the accuracy.

Creating Test Data

Similar to the previous data preparation steps, we manually drew ground truth polygons in a different area; there are around 250 polygons in the region. In this experiment, we manually defined the test areas by drawing area masks and using them as the processing extent.

Classifying Pixels Using Deep Learning

After training, the models can be used to classify the pixels (binary classification in this case: tree vs. non-tree) in the given data. For the Input Raster, select the RGBD raster that you want to classify. For the Model Definition, select the model you want to use; in ArcGIS Pro, the model folder should contain a “.dlpk” file that represents the deep learning model package. These output rasters are later used for accuracy assessment. You can go to the Environments tab and set the Processing Extent to your test area so that only the pixels in that area are classified, which significantly reduces processing time. The resulting rasters from the “Classify Pixels Using Deep Learning” tool need to be converted into feature classes (vector) to compute the accuracy.
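A scripting sketch of this inference step, with placeholder paths and the processing extent restricted to the test area:

```python
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")

# Only classify pixels within the test area to save processing time.
arcpy.env.extent = r"C:\data\test.gdb\test_area_mask"

# Run the trained .dlpk model over the RGBD raster (tree vs. non-tree).
result = arcpy.ia.ClassifyPixelsUsingDeepLearning(
    in_raster=r"C:\data\RGBD_2021.tif",
    in_model_definition=r"C:\models\pspnet_finetuned\pspnet_finetuned.dlpk",
)
result.save(r"C:\data\classified_trees_2021.tif")
```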

Raster to Polygon Conversion

After classifying the rasters, they are converted into polygons using the “Raster to Polygon” tool. We used the default settings for this step. The resulting feature class is then cleaned as follows (a scripting sketch of the cleanup follows the list):

1. Remove large polygons: Some large polygons may occur during conversion and need to be cleaned. Since the resulting rasters differ, we can’t give an exact threshold value to filter them out.

2. Remove holes: The resulting polygon layer has a field called “gridcode”. The “gridcode” values of 0 represent holes. If you delete them, the feature count decreases dramatically.

3. Remove tiny polygons: Some tiny polygons result from the vector conversion, and it is good to remove them. First, the area of each polygon needs to be calculated; then you can apply a threshold of your choosing. To do that, open the attribute table of the polygon layer and add a new field (Area) to store the calculated area values. Once you have created the new field, you can use the “Calculate Geometry” tool to calculate the area of each polygon in the desired units. In this experiment, we removed polygons with an area of less than 1 square meter.
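The conversion and cleanup could be scripted as below; the field names (“gridcode”, “Area_m2”) and the 1 m² threshold follow the steps above, and the paths are placeholders:

```python
import arcpy

# 1) Convert the classified raster to polygons (default settings).
arcpy.conversion.RasterToPolygon(
    in_raster=r"C:\data\classified_trees_2021.tif",
    out_polygon_features=r"C:\data\test.gdb\tree_polygons_raw",
)

# 2) Remove holes: polygons with gridcode = 0.
lyr = arcpy.management.MakeFeatureLayer(
    r"C:\data\test.gdb\tree_polygons_raw", "tree_lyr"
)
arcpy.management.SelectLayerByAttribute(lyr, "NEW_SELECTION", "gridcode = 0")
arcpy.management.DeleteFeatures(lyr)

# 3) Remove tiny polygons: compute areas, then delete anything under 1 m².
arcpy.management.AddField(lyr, "Area_m2", "DOUBLE")
arcpy.management.CalculateGeometryAttributes(
    lyr, [["Area_m2", "AREA"]], area_unit="SQUARE_METERS"
)
arcpy.management.SelectLayerByAttribute(lyr, "NEW_SELECTION", "Area_m2 < 1")
arcpy.management.DeleteFeatures(lyr)
```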

Important Note: This article is written to give users an idea about tree segmentation using the UI. The steps above might be different depending on the experiment and data type. We strongly suggest you evaluate the data, try to understand its characteristics, and do the most appropriate data engineering.

Compute Accuracy for Object Detection

We used the “Compute Accuracy for Object Detection” tool in ArcGIS Pro with different Intersection over Union (IoU) values. According to ArcGIS Pro documentation, “The Intersection over Union (IoU) ratio is used as a threshold for determining whether a predicted outcome is a true positive or a false positive. The IoU ratio is the amount of overlap between the bounding box around a predicted object and the bounding box around the ground reference data.”

Figure 7: Intersection over Union (IoU). Source: https://pro.arcgis.com/en/pro-app/latest/tool-reference/image-analyst/compute-accuracy-for-object-detection.htm
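This step can be scripted as well. A sketch, with parameter names as we understand them from the tool’s documentation and the minimum IoU set to the 10% threshold reported below:

```python
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")

# Compare predicted tree polygons against the ground truth polygons
# at a minimum IoU of 0.1 (10%).
arcpy.ia.ComputeAccuracyForObjectDetection(
    detected_features=r"C:\data\test.gdb\tree_polygons_raw",
    ground_truth_features=r"C:\data\test.gdb\ground_truth_trees",
    out_accuracy_table=r"C:\data\test.gdb\accuracy_table",
    min_iou=0.1,
)
```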

Five different scenarios were tested, and results were calculated at 50% and 10% IoU:

1. 0.5 m 2014 Data (Test with Initial Model)

The initial model, trained on the 0.5 m RGBD image, is tested on the 0.5 m RGBD 2014 data in the test area. The validation and test results are given below:

· Validation Precision: 0.8058

· Validation Recall: 0.7948

· Validation F1-Score: 0.8003

· Test Precision (10% IoU): 0.5000

· Test Recall (10% IoU): 0.2344

· Test F1-Score (10% IoU): 0.3191

There is a significant drop in the test area, which indicates the model might be memorizing the training data rather than learning generalizable features.

2. 0.5 m 2014 Data (Test with Fine-Tuned Model)

· Test Precision (10% IoU): 0.5317

· Test Recall (10% IoU): 0.2969

· Test F1-Score (10% IoU): 0.3810

There is a slight improvement in classification when we fine-tune the model with additional polygons from the new data (2021). Yet, the classification results do not look promising on 0.5 m 2014 data.

3. 0.1 m 2021 Data (Test with High-Resolution Model)

We tested the accuracy of the high-resolution model (only trained with 0.1 m 2021 data) on 0.1 m 2021 data. The validation and test results are compared:

· Validation Precision: 0.8776

· Validation Recall: 0.8776

· Validation F1-Score: 0.8547

· Test Precision (10% IoU): 0.7432

· Test Recall (10% IoU): 0.6875

· Test F1-Score (10% IoU): 0.7143

Again, as expected, the accuracy drops at the test site, though far less severely than for the low-resolution model.

4. 0.1 m 2021 Data (Test with Fine-Tuned Model)

We decided to check the accuracy of the fine-tuned model on 0.1 m 2021 data:

· Validation Precision: 0.8293

· Validation Recall: 0.8804

· Validation F1-Score: 0.8541

· Test Precision (10% IoU): 0.7678

· Test Recall (10% IoU): 0.8438

· Test F1-Score (10% IoU): 0.8438

In this experiment, we observe that validation and test accuracies are close to each other. There is not a significant change, which means the model is robust to minor regional changes.

5. 0.1 m 2014 Upsampled Data (Test with Fine-Tuned Model)

In the last experiment, we examine model performance on the upsampled data. As mentioned previously, the 2014 data has 0.5 m resolution and was upsampled to 0.1 m using “Nearest” (nearest-neighbor) interpolation. The fine-tuned model was then tested on the upsampled 0.1 m data:

· Test Precision (10% IoU): 0.0148

· Test Recall (10% IoU): 0.3438

· Test F1-Score (10% IoU): 0.0284

The results are very poor in this experiment: nearest-neighbor upsampling adds no real detail, so the model encounters blocky, artificial textures quite unlike the genuinely high-resolution imagery it was fine-tuned on.

In Table 1, you can see the comparison of the proposed models for different IoU values (0.5 and 0.1):

Table 1: Model Performance Comparison.

The results indicate that the best-performing model is the fine-tuned model classifying the 0.1 m 2021 data, with a slight improvement over the model trained on 0.1 m data alone. Our experiments thus showed that a model pre-trained on older, lower-resolution data can still be a better starting point than training a model from scratch on the latest data only.

ACKNOWLEDGEMENTS

We are grateful to the Stadt Zurich Open Data Portal for making the dataset available to the public. Special thanks to Dmitry Kudinov, Robert Garrity, and the Esri Ankara R&D center for their valuable feedback and contributions.

REFERENCES

[1] Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid Scene Parsing Network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2017.660

[2] https://github.com/hszhao/semseg/blob/7192f922b99468969cfd4535e3e35a838994b115/model/pspnet.py#L29
