Cassava Semantic Segmentation

Natthasit Wongsirikul
Oct 11, 2021


Previous post: Thailand Cassava Smart Farming Using Drone-Based Imaging

Next post: Cassava Crop Counting

This is the 2nd part of my Cassava drone-imaging analytics series. In the previous post, I explained what technology we used to capture the data and what image processing techniques we used to process it. In this post, I will explain how we use deep learning for some of the analytical work, helping farmers monitor their fields with more precision.

Data Annotation

I am trying to do a fine-grained analysis of the NDVI of the Cassava plants only; I do not want to include the NDVI values of soil or other vegetation in my calculation. So, I am doing semantic segmentation to create a mask indicating which pixels in the map belong to Cassava plants.

Of the 3 major computer vision tasks (image classification, object detection, and semantic segmentation), image annotation for semantic segmentation is the most laborious. Looking at the dataset at hand, I realized that it is not practical to hand-annotate aerial images of Cassava plants. So, I needed to find a way to perform this annotation task more efficiently.

Traditional Semantic Segmentation Annotation Task
Unconventional Semantic Segmentation Annotation Task

To start, I performed semantic segmentation labeling but at a region level, drawing polygons to indicate areas where there is Cassava crop and areas where there is other vegetation such as weeds, trees, or other crops. I did not differentiate classes among the other types of vegetation because of class imbalance among the non-crop vegetation. There are 3 output classes in this segmentation task: Cassava, other-veg, and void. I used an open-source labeling tool called Labelme.
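
As a sketch of how such region labels can be rasterized into a per-pixel map (the JSON layout follows Labelme's standard polygon export; the numeric class ids are my own assumption):

import json
import numpy as np
from PIL import Image, ImageDraw

# Class ids are hypothetical; the post defines three classes: void, Cassava, other-veg
CLASS_IDS = {"cassava": 1, "other-veg": 2}  # unlabeled pixels stay 0 (void)

def labelme_to_region_map(json_path):
    """Rasterize Labelme polygon annotations into a region-label map."""
    with open(json_path) as f:
        ann = json.load(f)
    h, w = ann["imageHeight"], ann["imageWidth"]
    region = Image.new("L", (w, h), 0)           # start with everything void
    draw = ImageDraw.Draw(region)
    for shape in ann["shapes"]:                  # one entry per drawn polygon
        class_id = CLASS_IDS.get(shape["label"])
        if class_id is None or shape.get("shape_type") != "polygon":
            continue
        draw.polygon([tuple(p) for p in shape["points"]], fill=class_id)
    return np.array(region)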

What I realized is that I can use the multispectral images to pick out the vegetation. First, I create a mask excluding all regions where the NDVI value is below a certain threshold. This threshold is calculated by flattening the NDVI image into an NDVI distribution and setting the threshold at the 40th percentile. This separates the soil, weeds, and grasses from the larger and denser plants such as trees and crops.

Next, I created an exclusion mask using the near-infrared reflectance values below 0.2. What this does is remove shadowed areas. Shadows can cause discrepancies and may introduce errors down the road during analysis.

Lastly, I combine the shadow mask and the NDVI mask.
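
A minimal sketch of these masking steps, assuming the red and near-infrared bands arrive as aligned float reflectance arrays:

import numpy as np

def vegetation_mask(nir, red, ndvi_percentile=40, shadow_thresh=0.2):
    """NDVI-percentile mask AND shadow-exclusion mask, as described above."""
    ndvi = (nir - red) / (nir + red + 1e-8)                   # standard NDVI formula
    ndvi_mask = ndvi >= np.percentile(ndvi, ndvi_percentile)  # drop soil, weeds, grasses
    shadow_mask = nir >= shadow_thresh                        # drop shadowed areas (NIR < 0.2)
    return ndvi_mask & shadow_mask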

Now that I have a mask selecting only the vegetation part of the image, I can combine this mask layer with the region label. Recall that the region map is a label that assigns rough areas to the Cassava crop and to other non-crop vegetation. Using the AND operation to combine the region layer and the vegetation mask results in a fine-grained semantic segmentation label of the Cassava plants.
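
Continuing the snippets above, with the hypothetical class ids from earlier, the combination is a per-pixel AND:

# region_map: 0 = void, 1 = Cassava, 2 = other-veg (rasterized polygons)
# veg_mask: boolean output of vegetation_mask() above
label = np.where(veg_mask, region_map, 0)   # keep region classes only on vegetation pixels
cassava_mask = label == 1                   # fine-grained per-pixel Cassava label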

In addition to the aligned multispectral images, an orthomosaic multispectral map was also included in the dataset, with the map cut into 512x512 tiles. The resolution and detail of these tiles are lower, and they may include some warping introduced by the image-stitching algorithm.
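
The tiling itself is straightforward slicing; here is a sketch (the post does not say how edge remainders were handled, so discarding partial tiles is an assumption):

def tile_map(raster, tile=512):
    """Cut an orthomosaic array (H, W, C) into non-overlapping 512x512 tiles."""
    h, w = raster.shape[:2]
    return [raster[y:y + tile, x:x + tile]
            for y in range(0, h - tile + 1, tile)
            for x in range(0, w - tile + 1, tile)]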

Dataset Splitting

The entire dataset is made up of 6 separate plantation fields. One of these fields was flown twice, about 1 month apart. A total of 896 images from the dataset have been annotated; below is a table describing the strategy used to split the dataset into training, validation, and test sets.
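
The field names and per-field assignment below are placeholders rather than the actual split from the table; the sketch only illustrates keeping the split at field level so that no field leaks across sets:

# Hypothetical assignment; the real strategy is in the table above.
SPLITS = {
    "train": {"field_1", "field_2", "field_3", "field_4"},
    "val": {"field_5"},
    "test": {"field_6"},
}

def split_of(field_name):
    """Map a plantation field to its dataset split."""
    for split, fields in SPLITS.items():
        if field_name in fields:
            return split
    raise ValueError(f"unassigned field: {field_name}")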

The input images, which are in JPG format, can be either RGB or CIR, so 2 models were trained separately. This was to see if CIR images embed more information about different types of plants, allowing the model to better discriminate between Cassava and other vegetation.

Model Architecture

The model is an encoder-decoder with skip connections. The encoder extracts features at different spatial resolutions while down-sampling the feature map, which allows the model to learn features that can be used to classify each pixel into its respective class. It is composed of multiple convolutional blocks, each consisting of a Conv2D, Batch-Norm, and ReLU, and it down-samples the feature maps using max-pooling layers. The decoder up-samples the feature map back to the same dimensions as the input image. It is also composed of convolutional blocks, but it swaps the max-pooling layers for transposed-convolution layers. Along the down-sampling path, the feature map at each down-sampling stage is reserved and later concatenated with its associated up-sampling block. See the diagram illustrating the model architecture.
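
A sketch of this architecture in Keras (the post's loss terminology suggests TensorFlow; the depth and filter counts are my own assumptions, and skips are taken before pooling in the usual U-Net fashion):

import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """Conv2D -> Batch-Norm -> ReLU, the convolutional block described above."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_model(input_shape=(512, 512, 3), num_classes=3):
    inputs = tf.keras.Input(shape=input_shape)
    skips, x = [], inputs
    # Encoder: conv blocks + max-pooling, reserving feature maps as skips
    for filters in (64, 128, 256):
        x = conv_block(x, filters)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, 512)  # bottleneck
    # Decoder: transposed convolutions up-sample, skips are concatenated back in
    for filters, skip in zip((256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, filters)
    logits = layers.Conv2D(num_classes, 1)(x)  # per-pixel class logits
    return tf.keras.Model(inputs, logits)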

Model Training

The model was trained with a batch size of 4, with simple random data augmentation (vertical and horizontal flips) applied during data loading. The loss function was softmax cross-entropy with logits. The optimizer was RMSProp with a learning rate of 1.0E-4 and a decay factor of 0.995.
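
A sketch of this setup, reusing build_model() from above (the meaning of "decay factor 0.995" is ambiguous; it is read here as RMSProp's moving-average coefficient, though it could equally be a learning-rate decay schedule):

import tensorflow as tf

def augment(image, mask):
    """Random vertical/horizontal flips applied identically to image and mask."""
    if tf.random.uniform(()) > 0.5:
        image, mask = tf.image.flip_left_right(image), tf.image.flip_left_right(mask)
    if tf.random.uniform(()) > 0.5:
        image, mask = tf.image.flip_up_down(image), tf.image.flip_up_down(mask)
    return image, mask

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4, rho=0.995),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # softmax CE with logits
    metrics=["accuracy"],
)
# model.fit(train_ds.map(augment).batch(4), validation_data=val_ds.batch(4), epochs=...)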

Results

Below are the validation loss, accuracy, and IoU for each epoch of the RGB model during training.

A comparison table of model performance on the test set is shown below, comparing the RGB and CIR models along with some sample outputs. The CIR model outperformed the RGB model by a small margin.
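
For reference, the per-class IoU used in these comparisons can be computed from integer label maps like this (a sketch, not the post's exact evaluation code):

import numpy as np

def iou_per_class(pred, target, num_classes=3):
    """Intersection-over-union for each class."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious.append(inter / union if union else float("nan"))
    return ious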

Conclusion

What is useful about having a pixel-level class label for a Cassava plantation field is that I can create a mask that extracts only the Cassava pixels from the entire raster. This allows me to do a more fine-grained analysis. For example, I can now calculate the NDVI value based only on Cassava pixels, excluding soil, sand, bodies of water, and other vegetation. In addition, I can use this mask to make visualizations of crop health clearer.
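
The masked statistic reduces to a boolean index over the raster:

# ndvi: (H, W) NDVI raster; cassava_mask: per-pixel Cassava mask from the model
cassava_only = ndvi[cassava_mask]
print("mean NDVI, Cassava pixels only:", cassava_only.mean())
print("mean NDVI, whole raster:", ndvi.mean())  # diluted by soil, water, other vegetation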

Next post: Cassava Crop Counting


Natthasit Wongsirikul

I'm a computer vision engineer. My interests span from UAV imaging to AI CCTV applications.