How we tackled a plant image segmentation problem with Deep Learning.
Identifying weed plants in a cultivated field is a time-intensive task. Automating it can drastically reduce the time needed to identify and remove weeds, and thus increase both the yield and the productivity of the workers. In our latest project, my colleague Rawane Madi and I worked on a plant segmentation task that uses Deep Learning to distinguish weeds and their stems from crops. In this article, we will share our journey: finding relevant datasets, choosing the proper neural network architecture, training the model, evaluating its performance, and finally launching it.
Domain Data Availability
Our first step was to find datasets that match the segmentation task at hand. After reviewing several options online, we decided to:
- Find a large enough labeled dataset since our task is delicate and our labeling resources are limited.
- Discard images shot in aerial view by UAVs since we needed a close shot of the crops to be able to detect their stems.
- Discard the Plant Field Weed Image dataset (too small, with only 60 annotated images) and the Open Sprayer Images dataset (not labeled at the time).
Our choice finally landed on the Sugarbeet dataset, which turned out to be the most convenient for our task. The images were taken by a robot on a sugar beet farm in Germany over a period of 3 months. Crops were photographed from their emergence until an advanced growth stage. The dataset was labeled for crops and weeds but not for stems. We first trained our models on the Sugarbeet dataset and then on a custom dataset whose images were labeled for crops, weeds, and stems.
During our literature review, we found two interesting image segmentation papers:
- The first paper trains a neural network for pose regression to generate a plant-location likelihood map. The stems are then extracted from this heat map with centimeter-level accuracy.
- The second paper uses a novel joint model architecture based on FC-DenseNet. In a segmentation task, the encoder generally produces a compressed but information-rich representation of the input, while the decoder upsamples that representation back to the original input size to produce pixel-wise predictions. In this architecture, however, the encoder output is fed to two decoders. The plant decoder produces plant features, classifying each pixel as soil, plant, dicot weed, or grass weed, while the stem decoder performs stem detection for the plant-weed regions.
Masks Generation and Data Encoding for Models
The datasets were annotated by color, and occluded objects were segmented as well. The annotations consisted of red polygons for crops, green polygons for weeds, and blue circles for stems. Generating a mask meant producing a 2D matrix of the same size as the image, containing the labels of the three classes.
To generate a mask, the following steps were applied:
- Load the corresponding image
- Iterate over each channel of the image
- Associate non-zero pixel values in the red channel to 1, corresponding to crops
- Associate non-zero pixel values in the green channel to 2, corresponding to weeds
- Associate non-zero pixel values in the blue channel to 3, corresponding to stems
- Assign a value of 0 to any pixel with zero values in all 3 channels, corresponding to the background
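The steps above can be sketched in a few lines of NumPy (a minimal version; the function name is ours, and we assume the annotation channels don't overlap):

```python
import numpy as np

def generate_mask(annotation):
    """Convert an RGB annotation image of shape (H, W, 3) into a 2D label mask.

    Labels: 0 = background, 1 = crop (red), 2 = weed (green), 3 = stem (blue).
    Assumes the color channels are mutually exclusive per pixel.
    """
    mask = np.zeros(annotation.shape[:2], dtype=np.uint8)
    mask[annotation[:, :, 0] > 0] = 1  # non-zero red channel  -> crop
    mask[annotation[:, :, 1] > 0] = 2  # non-zero green channel -> weed
    mask[annotation[:, :, 2] > 0] = 3  # non-zero blue channel  -> stem
    return mask
```

Pixels left untouched by all three assignments keep their initial value of 0, i.e. background.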
In the case of the joint model, the labels were split between two masks. The first mask, corresponding to the first branch, contains labels 0, 1, and 2 with the same meaning as above. The second mask, corresponding to the second branch, contains labels 0, 1, and 2, where 2 serves the same purpose as label 3 above (stems).
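A rough sketch of this split (the meaning of label 1 in the stem mask is not spelled out here, so this simplified version only maps stems to 2 and everything else to background):

```python
import numpy as np

def split_mask(mask):
    """Split a full mask (0=bg, 1=crop, 2=weed, 3=stem) into the two masks
    consumed by the joint model's two decoder branches."""
    plant_mask = np.where(mask == 3, 0, mask)  # branch 1: 0, 1, 2 (stems dropped)
    stem_mask = np.where(mask == 3, 2, 0)      # branch 2: 2 marks stem pixels
    return plant_mask, stem_mask
```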
We followed the same approach for both model architectures:
- Inputting RGB images of size 128×128
- Preprocessing by normalizing the image colors
- Augmenting the dataset with horizontal and vertical flips (other augmentation techniques, such as brightness changes, distorted the masks, so caution must be taken)
- Randomly splitting each dataset into training (90%) and validation (10%) sets. The Sugarbeet dataset contained a total of 11,552 images and the custom dataset 4,693.
- Using Keras with the TensorFlow backend for the entire pipeline, including the data generators.
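To illustrate, the preprocessing, flip augmentation, and random split could look like this (a simplified NumPy sketch of the ideas above, not the actual generators from our repository):

```python
import numpy as np

def preprocess(image, mask, rng):
    """Normalize pixel values and apply random horizontal/vertical flips.

    The same flips are applied to the image and its mask so labels stay aligned.
    """
    image = image.astype(np.float32) / 255.0   # normalize colors to [0, 1]
    if rng.random() < 0.5:                     # random horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                     # random vertical flip
        image, mask = image[::-1], mask[::-1]
    return image, mask

def train_val_split(samples, rng, val_fraction=0.1):
    """Randomly split samples into training (90%) and validation (10%) lists."""
    indices = rng.permutation(len(samples))
    n_val = int(len(samples) * val_fraction)
    val = [samples[i] for i in indices[:n_val]]
    train = [samples[i] for i in indices[n_val:]]
    return train, val
```

Note that geometric augmentations must always be applied identically to image and mask; photometric ones (like the brightness changes mentioned above) should only touch the image.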
Further research led us to a U-Net model trained on the Sugarbeet dataset (link). Even though that model was only trained for weed detection, we used its weights to transfer the learning to our full task.
We also found pretrained weights for a ResNet-101 model. We believed ResNet might be a good candidate because it uses residual blocks, which are known to preserve spatial information through the encoding-decoding process. This model was originally trained on the ImageNet dataset, so we replaced its last layers with convolutional layers and transferred the learning to our task on the custom dataset.
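In Keras, replacing the head of a pretrained backbone looks roughly like this (a hypothetical sketch, not our actual ResNet-101 code; the function name is ours):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_transfer_model(encoder, n_classes=4):
    """Freeze a pretrained encoder and attach new convolutional layers
    that produce pixel-wise class predictions."""
    for layer in encoder.layers:
        layer.trainable = False  # keep the pretrained weights fixed initially
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(encoder.output)
    x = layers.Conv2D(n_classes, 1, activation="softmax")(x)  # per-pixel classes
    return keras.Model(encoder.input, x)
```

After the new head has converged, the encoder layers can be unfrozen for fine-tuning at a lower learning rate.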
Model with Stems as an Additional Channel
We decided to go with the most straightforward approach: adding an extra channel for the stems. The output image then has 4 channels: Background, Crop, Weed, and Stem. The code for the U-Net and ResNet implementations can be found in our Github repository.
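With this approach, the 2D label mask is one-hot encoded into the 4 output channels, for example:

```python
import numpy as np

def mask_to_channels(mask, n_classes=4):
    """One-hot encode a 2D label mask into an (H, W, 4) target tensor
    with one channel each for background, crop, weed, and stem."""
    return np.eye(n_classes, dtype=np.float32)[mask]
```

Each pixel then has exactly one channel set to 1, matching the softmax output of the network.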
Joint Model
This model implementation is inspired by the previously mentioned paper by Lottes et al. The model has one input and two outputs (see the previous diagram). The first output is the crop/weed mask and the second output is the stem mask of the weed plants. Therefore, two types of masks need to be generated, one for each output branch.
The joint model approach separates crop/weed identification from stem detection while still leveraging the learning from the first task to inform the second. Hence, instead of sequencing the two tasks, it combines them by sharing one encoder. The code for the U-Net, ResNet, and FC-DenseNet joint models is also available in our Github repository.
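The shared-encoder/two-decoder idea can be sketched in Keras as follows (a deliberately tiny toy network for illustration, nowhere near the real FC-DenseNet; layer sizes are arbitrary):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_joint_model(input_shape=(128, 128, 3)):
    """Toy joint model: one shared encoder feeding two decoder branches."""
    inputs = keras.Input(input_shape)
    # shared encoder: compressed representation of the input
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    encoded = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    # plant decoder: background / crop / weed predictions
    p = layers.UpSampling2D()(encoded)
    plant_out = layers.Conv2D(3, 1, activation="softmax", name="plant")(p)
    # stem decoder: background / crop-region / stem predictions
    s = layers.UpSampling2D()(encoded)
    stem_out = layers.Conv2D(3, 1, activation="softmax", name="stem")(s)
    return keras.Model(inputs, [plant_out, stem_out])
```

During training, each output branch receives its own mask and loss, and the gradients from both branches flow back through the shared encoder.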
We relied on the following metrics to test the performance of our models:
- Mean IoU, or mean Intersection over Union, is a common measure for image segmentation. It computes the Intersection over Union for each class separately and then averages over the number of classes. The higher the mean IoU, the better.
- We define stem accuracy as the ratio of the number of stems correctly predicted by the model to the actual number of stems.
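A straightforward implementation of mean IoU over label masks could look like this (our sketch; here we skip classes absent from both the prediction and the ground truth, a convention the article doesn't specify):

```python
import numpy as np

def mean_iou(y_true, y_pred, n_classes=4):
    """Mean Intersection over Union: per-class IoU averaged over the classes.

    y_true and y_pred are 2D integer label masks of the same shape.
    """
    ious = []
    for c in range(n_classes):
        t = (y_true == c)
        p = (y_pred == c)
        union = np.logical_or(t, p).sum()
        if union == 0:
            continue  # class absent from both masks; skip it
        intersection = np.logical_and(t, p).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))
```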
We trained the U-Net model for 360 epochs and the ResNet model for 200 epochs with mean IoU and accuracy as training metrics. For roughly the same number of epochs, U-Net outperformed ResNet in both mean IoU and stem accuracy. The final weights of the models can be found here.
Results on Custom Dataset
After training the U-Net model for 360 epochs on the custom dataset, the mean Intersection over Union (mean IoU) evaluated on 90% of the custom dataset was 0.6. The stem accuracy evaluated on the same 90% of the custom dataset was 0.8.
To improve our model's performance even further, here are some directions we want to investigate:
- Acquire more labeled images
- Experiment with the joint model approach
- Test the model on completely new datasets that it hasn't seen during training
NB: The work described in this blog was developed by Zaka as a project for a client. Zaka does not own the Intellectual Property for the ideas described in this blog.
Don’t forget to support with a clap!
Do you have a cool project that you need to implement? Reach out and let us know.
To discover Zaka, visit www.zaka.ai
Subscribe to our newsletter and follow us on our social media accounts to stay up to date with our news and activities: