Tree Segmentation on RGB and nDSM Rasters Using Pyramid Scene Parsing Network (PSPNet) in ArcGIS Pro. — Part 2: Using ArcGIS API for Python

by Yasin Koçan, Oktay Eker
Published in GeoAI · Aug 19, 2022

Tree samples represented in yellow on RGBD data

In Part 1, “Tree Segmentation on RGB and nDSM Rasters Using Pyramid Scene Parsing Network (PSPNet) in ArcGIS Pro. — Part 1: Without a single line of code!”, we showed how to create training data, train a deep-learning model, and evaluate its performance using the ArcGIS Pro user interface. In Part 2, we demonstrate how to run the same pipeline using the ArcGIS API for Python.

Introduction

Recent innovations have made deep learning popular in the geospatial domain, where it is widely used for segmentation, classification, 3D reconstruction, and many other tasks. Although geospatial deep learning requires considerable computing power and training samples, once a model has been pre-trained with a sufficient amount of data, it can be reused for further work without repeating the training step. The advantage of these models is that they can be fine-tuned with additional labeled samples. For example, suppose you have a model trained to detect trees in 2014 data of Zurich. That model might perform poorly on 2021 data of Zurich because of seasonal changes, differences in sensing methods (photogrammetry or lidar), and sensor properties that directly affect spatial resolution. To overcome this, the pre-trained model can be fine-tuned with samples from the 2021 imagery so it detects trees in that imagery more reliably. By doing so, the information learned from the 2014 data is preserved and refined with up-to-date samples. For more information, check what we did in Part 1.

Setting Up the Environment

In Part 1, we already explained how to install the deep-learning libraries for ArcGIS Pro. You can also install them using the Deep Learning Libraries Installer for ArcGIS.

ArcGIS Pro ships with its own Conda distribution, which you can use to clone the base environment. Before you start installing the libraries, you can create a clone of the base environment in one of two ways:

1. Using ArcGIS Pro User Interface:

Click the Project tab in the top-left corner of ArcGIS Pro and open the Python tab. Click the Manage Environments button in the Python Package Manager, then clone the base environment by clicking Clone Default, or select a specific environment and clone it by clicking the Clone button next to it, as shown in Figure 2.

Figure 1: ArcGIS Pro Python Package Manager
Figure 2: Cloning environment in Python Package Manager

2. Using Conda Prompt

You can also check the environment directory in the Python Package Manager and copy it; you might need it when cloning from the Conda prompt. For more information, see the Cloning an Environment section of the official Conda documentation. There is also a detailed explanation about cloning on the ArcGIS Support page. Remember that ArcGIS Pro comes with a built-in Conda, so before you start, be sure to run the ArcGIS Pro Python Command Prompt to clone or edit environments.

conda create --clone <environment to clone> --name <new environment name>

Libraries

The libraries used for the tree segmentation tutorial are listed below. To use these libraries, you should have already installed the Deep Learning Libraries for ArcGIS.

The libraries used in this experiment
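As a reference, a minimal sketch of the imports covering the steps below could look like this (the exact imports used in the original notebook may differ):

import arcpy                                               # geoprocessing tools (arcpy.ia / arcpy.conversion)
from arcgis.learn import prepare_data, PSPNetClassifier    # data preparation and the PSPNet model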

The Pipeline

In this tutorial, we do the following:

  1. Export the training data, setting the desired output cell size.
  2. Create validation and test sets.
  3. Define the model and calculate the learning rate.
  4. Train the model using PSPNet.
  5. Classify a raster using this model.

Some functions may require additional extension licenses, such as Spatial Analyst or Image Analyst. To see the licenses required by a specific function, refer to the ArcGIS Pro geoprocessing documentation.

The parameters for the functions used are explained briefly in code snippets. If you want to dive deeper, check ArcGIS API for Python Documentation.

1. Export the Training Data

In this step, we take the RGBD raster and tree polygons as input and create training chips from the tree boundaries. For information about the Export Training Data For Deep Learning tool, check the documentation.

Export training data for deep-learning
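A minimal sketch of this export step is shown below. The 256 × 256 Classified Tiles settings match the chips used later in this article; the paths, stride, chip format, cell size, and the class value field name are assumptions for illustration:

import arcpy

# Hypothetical paths -- replace with your own data
in_raster  = r"C:\Medium\Data\RGBD_2021.tif"            # 4-band RGB + nDSM raster
tree_polys = r"C:\Medium\Data\TreePolygons_2021.shp"    # labeled tree boundaries
out_chips  = r"C:\Medium\TrainingData\Chips256"

arcpy.env.cellSize = 0.1   # desired output cell size in map units (assumed value)

arcpy.ia.ExportTrainingDataForDeepLearning(
    in_raster=in_raster,
    out_folder=out_chips,
    in_class_data=tree_polys,
    image_chip_format="TIFF",
    tile_size_x=256,
    tile_size_y=256,
    stride_x=128,
    stride_y=128,
    output_nofeature_tiles="ONLY_TILES_WITH_FEATURES",
    metadata_format="Classified_Tiles",                 # required for pixel classification models
    class_value_field="classvalue")                     # hypothetical field name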

Generate Python Script from User Interface

ArcGIS Pro has a handy feature: after you select a geoprocessing tool and set its parameters, Pro can generate the equivalent Python command for you. To do this, right-click an item in the Geoprocessing History pane and select Copy Python Command.

2. Create Validation and Test Sets

Batch Size

In machine learning, the batch size is the number of image tiles the GPU processes simultaneously. During training and inferencing, the imagery is divided into tiles, and the batch size determines how many of those tiles the GPU handles at once. Reduce the batch size if the tool runs into out-of-memory errors.

Validation and Training Sets

We used the same tree polygons (you can check Part 1 to learn how we created the tree vectors) to create the training samples. The image chips (454 in our case) are split into training and validation sets. “val_split_pct” defines the percentage of the image chips used as the validation set. In our experiment, we used 20% of the data for validation (90 chips) and 80% for training (364 chips). You can see the code snippet and the output below:

Preparing the data for training
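A sketch of that call, assuming the chip folder created in step 1 and the batch size of 5 used in this experiment:

from arcgis.learn import prepare_data

data = prepare_data(r"C:\Medium\TrainingData\Chips256",  # folder produced by the export step
                    chip_size=256,
                    batch_size=5,            # limited by the GTX 980 Ti used here
                    val_split_pct=0.2)       # 80/20 train/validation split
data                                         # displaying the object prints a summary like the one below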
Train: LabelList (364 items)
x: ArcGISSegmentationItemList
ArcGISMSImage (4, 256, 256),ArcGISMSImage (4, 256, 256),ArcGISMSImage (4, 256, 256),ArcGISMSImage (4, 256, 256),ArcGISMSImage (4, 256, 256)
y: ArcGISSegmentationLabelList
ArcGISImageSegment (1, 256, 256),ArcGISImageSegment (1, 256, 256),ArcGISImageSegment (1, 256, 256),ArcGISImageSegment (1, 256, 256),ArcGISImageSegment (1, 256, 256)
Path: C:\Medium\TrainingData\Chips256\images;

Valid: LabelList (90 items)
x: ArcGISSegmentationItemList
ArcGISMSImage (4, 256, 256),ArcGISMSImage (4, 256, 256),ArcGISMSImage (4, 256, 256),ArcGISMSImage (4, 256, 256),ArcGISMSImage (4, 256, 256)
y: ArcGISSegmentationLabelList
ArcGISImageSegment (1, 256, 256),ArcGISImageSegment (1, 256, 256),ArcGISImageSegment (1, 256, 256),ArcGISImageSegment (1, 256, 256),ArcGISImageSegment (1, 256, 256)
Path: C:\Medium\TrainingData\Chips256\images;

3. Define the Model and Calculate the Optimal Learning Rate

The learning rate, one of the training hyperparameters, controls how much the network’s weights are adjusted with respect to the loss gradient. It determines how quickly or slowly we approach the optimal weights. If the learning rate is too large, the loss can oscillate or diverge and the model fails to learn; if it is too small, training converges very slowly. It is generally good to keep the learning rate small (0.01 or 0.001, etc.); however, the smaller the learning rate, the longer training takes. There is an excellent explanation in the Stanford University Deep Learning for Computer Vision lecture notes. With the GPU available at the time (a GTX 980 Ti), we used a batch size of 5. We recommend using a higher batch size if your GPU allows it. Batch size also affects the choice of learning rate, but within the scope of this article we simply recommend maximizing the batch size as far as your GPU allows.

Figure 3: The illustration on the left shows the impact of different learning rates. With low learning rates, improvements are linear; with high learning rates they look more exponential. Higher learning rates decay the loss faster but get stuck at worse loss values (green line). On the right, a typical loss function over time is shown. https://cs231n.github.io/neural-networks-3/

We used a PSPNet in this experiment; the details are given in Part 1. You can check the PSPNet Classifier section of the ArcGIS API for Python documentation for more information. In the code snippet below, we define the model using the PSPNetClassifier class. Then, we use a built-in utility to find an optimal starting learning rate and plot the corresponding curve. You can adjust the learning rate after inspecting the loss graph, or simply use it as calculated.

The model definition and calculation of the optimal learning rate
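As a sketch (the backbone choice is an assumption; recent versions of lr_find() also return the suggested rate in addition to plotting the curve):

from arcgis.learn import PSPNetClassifier

# Define the PSPNet pixel classifier on the prepared data
model = PSPNetClassifier(data, backbone="resnet50")

# Plot loss vs. learning rate and get a suggested starting value
lr = model.lr_find()
print(lr)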

4. Train the Model Using PSPNet

The next step is to train the model, and check the “train_loss”, “valid_loss”, “accuracy”, “dice”, and “time” for each epoch. After evaluating the results, you can save the model for new data classification. For more information about the fit function, you can check its official documentation.

Training the model using PSPNet
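A sketch of the training call (the number of epochs and the model name are assumptions):

# Train with the learning rate found above; per-epoch metrics are printed as shown below
model.fit(epochs=20, lr=lr)

# Save the trained model (writes an .emd/.dlpk package that can be used for inferencing)
model.save("PSPNet_Trees_2021")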

Once the model is trained, you should see the results of each epoch as shown below:

Training loss, validation loss, accuracy, dice, and time over the epochs

Then you can plot the loss function and the results using the code block below:

Plotting the loss graph
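A sketch of the plotting calls:

# Plot training and validation loss over the epochs
model.plot_losses()

# Compare ground truth and predictions on a few validation chips
model.show_results(rows=4)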
Figure 4: Training and validation results

When the entire dataset is cycled forward and backward through the neural network once, it is referred to as an epoch. If the model is trained for only a few epochs, it may underfit; similarly, if we train it for thousands of epochs, it may overfit. In other words, as the number of epochs increases, the model moves from underfitting to an optimal fit and, with too many epochs, on to overfitting. The number of epochs needed for optimal training depends on multiple factors, including the network architecture, the loss function, the optimizer, the hyperparameters, and, of course, the training data. The best way to decide the optimal number of epochs is to check the loss graph to see whether the model underfits, fits well, or overfits. This is covered in more detail in the “Overfitting, Underfitting, and a Good Fit” section below.

Fine-Tuning the Pre-Trained Model

As we showed in the results in Part 1, fine-tuning a pre-trained model (initially trained with 2014 data, and later fine-tuned with 2021 data) gives more accurate results than a model trained with 2021 data only. In this case, we have a pre-trained model that is trained with relatively low-resolution imagery (0.5 m per pixel), which is later fine-tuned with higher resolution imagery (0.1 m per pixel).

To get the best classification result from inferencing, the training data should be consistent with the region of interest for the task. Low-resolution training samples could be used to train a model that is then applied to high-resolution imagery; however, because of the change in spatial resolution, a low-resolution pre-trained model performs poorly when inferencing on high-resolution data. In addition, it is not only the spatial resolution that changes here; the data characteristics also vary. To be more specific, the 2014 data was produced from lidar, whereas the 2021 data was generated from drone imagery, with no active lidar sensor involved in the source data collection. Hence, nDSMs (normalized Digital Surface Models) derived from different sensing methods behave differently.

ArcGIS Living Atlas offers various pre-trained deep-learning models that can be used directly; nevertheless, these models might not perform perfectly because of variations in the region of interest and the data, as explained above. Rather than training a model from scratch, these pre-trained models can be fine-tuned with training data prepared for your area of interest. For example, a model trained in Zurich to detect trees might not perform well in Ankara due to different tree characteristics, yet it can still serve as a great starting point. Since preparing and labeling training data is one of the most tedious tasks in deep-learning applications, using a pre-trained model and then fine-tuning it saves a lot of time.

Fine-tuning the pre-trained model
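A sketch of the fine-tuning step, assuming the 2014 model was saved earlier as an .emd file (paths, epoch count, and names are hypothetical):

from arcgis.learn import prepare_data, PSPNetClassifier

# Prepare the 2021 (0.1 m) chips and load the model pre-trained on the 2014 (0.5 m) data
data_2021 = prepare_data(r"C:\Medium\TrainingData\Chips256",
                         chip_size=256, batch_size=5, val_split_pct=0.2)
model = PSPNetClassifier.from_model(
    r"C:\Medium\Models\PSPNet_Trees_2014\PSPNet_Trees_2014.emd", data=data_2021)

# Continue training (fine-tuning) on the new, higher-resolution samples
lr = model.lr_find()
model.fit(epochs=20, lr=lr)
model.save("PSPNet_Trees_2014_finetuned_2021")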

Overfitting, Underfitting, and a Good Fit

The goal of this experiment is to train a robust model that performs well on unseen data. How can we be sure the model is working well? First, we should evaluate whether it is overfitting, underfitting, or a good fit. If the model overfits the data, the validation loss starts going up while the training loss keeps going down or remains stable. This is exactly how you can see that the model is losing its ability to generalize to unseen data (the validation set): it may be memorizing data-specific characteristics, including noise, that do not generalize, so it performs poorly and produces a high variance between ground truth and predictions. On the contrary, underfitting occurs when the model is too simple for the problem (for example, the number of trainable parameters is insufficient), or when it is not trained long enough or with suboptimal hyperparameters, such as too small a learning rate. The objective is to obtain a fit in which the model recognizes the patterns in the training data without memorizing its finer details. With a good fit, we expect the model to perform well on both validation and test data.

The validation set is split off automatically by the Python API, whereas a separate test set lets us run the Compute Accuracy For Object Detection tool manually and compare different models in a controlled way. In addition, separate test sets allow users to evaluate model performance in different regions, such as residential and rural areas. To be more specific, in this experiment we have samples spread over Zurich to train the model. Within the scope of the experiment, we were interested in detecting trees over rooftops rather than isolated trees, so the test site was selected where trees hang over rooftops, and the model was evaluated there. You can use the following procedure to check model performance:

Figure 5: Evaluating the model. https://www.v7labs.com/blog/overfitting

You can check the chart below for illustrations of typical loss graphs under underfitting, a good fit, and overfitting.

Figure 6: Graphical representation of underfitting, good fit and overfitting. https://www.kaggle.com/getting-started/166897

5. Classify a Raster Using This Model

After evaluating the loss graph, if the model shows a good fit, the next step is to test its performance on unseen data. In addition to the validation results, we selected a region that was excluded from training to see how the trained model performs on this test site. Inferencing might take a while depending on your GPU, so it is recommended to classify a small portion of the data first to see whether the model behaves as expected. If you don't encounter any problems, you can then process the whole extent.

Classify the raster using the model
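A sketch of the inferencing step with the Classify Pixels Using Deep Learning tool (paths and tool arguments are assumptions; an Image Analyst license is required):

import arcpy
arcpy.CheckOutExtension("ImageAnalyst")

# Optionally restrict processing to a small test area first, for example:
# arcpy.env.extent = arcpy.Extent(2683000, 1247000, 2684000, 1248000)

out_raster = arcpy.ia.ClassifyPixelsUsingDeepLearning(
    in_raster=r"C:\Medium\Data\RGBD_2021_TestSite.tif",
    in_model_definition=r"C:\Medium\Models\PSPNet_Trees_2014_finetuned_2021.dlpk",
    arguments="padding 64;batch_size 4")
out_raster.save(r"C:\Medium\Results\TreePixels_TestSite.tif")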

The classification output is a binary raster showing the pixels classified as trees. If you have ground-truth data for the test site, the raster result should be converted into a vector feature layer to compute the model accuracy on the test data. You can use the Raster to Polygon tool for the conversion. Once both the predictions and the ground truth are vector features, you can use these polygons to check the overlap between them.

As a result, the model initially trained with the 2014 data and later fine-tuned with the 2021 data performs better on the test site than the model trained from scratch with the 2021 data only. The comparison was done with the Compute Accuracy For Object Detection tool. After fine-tuning, the F1 score increased from 0.7143 to 0.8438, which shows that pre-trained models are a valuable starting point.

Accuracy assessment
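A sketch of that assessment (paths are hypothetical; Compute Accuracy For Object Detection requires the Image Analyst extension):

import arcpy
arcpy.CheckOutExtension("ImageAnalyst")

# Convert the binary classification raster into polygons for comparison
arcpy.conversion.RasterToPolygon(
    in_raster=r"C:\Medium\Results\TreePixels_TestSite.tif",
    out_polygon_features=r"C:\Medium\Results\TreePolygons_TestSite.shp",
    simplify="NO_SIMPLIFY")

# Compare predictions against ground-truth polygons; the output table reports precision, recall, and F1 score
arcpy.ia.ComputeAccuracyForObjectDetection(
    r"C:\Medium\Results\TreePolygons_TestSite.shp",   # detected features (predictions)
    r"C:\Medium\Data\TreeGroundTruth_TestSite.shp",   # ground-truth features
    r"C:\Medium\Results\Accuracy_TestSite.dbf")       # output accuracy table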

ACKNOWLEDGEMENT

Special thanks to Dmitry Kudinov and Robert Garrity for the valuable feedback.

REFERENCES

  1. https://www.v7labs.com/blog/overfitting
  2. https://www.kaggle.com/getting-started/166897
  3. https://cs231n.github.io/neural-networks-3/
