Cassava Crop Counting

Natthasit Wongsirikul
10 min read · Oct 11, 2021


Previous post: Cassava Semantic Segmentation

Next post: Cassava Crop Line Detection

Following my previous work, I wondered what other features would be useful to field owners. After some conversations with agronomists and farmers, I decided to tackle the crop-counting problem. This feature is useful to field owners because it lets them roughly extrapolate their yield from the current standing crop.

Preprocessing

Field Boundary

The first thing I set out to do was draw the field boundary on the raster, which I use as a mask to limit where crop-count results are included. This lets a farmer draw a polygon defining the area in which crops should be counted. I used QGIS to draw the polygon and saved it as a vector file in GeoJSON format. In the world of GIS, polygons are defined by points in latitude and longitude based on the raster's CRS, but in the world of computer vision we work in pixel coordinates, so I also had to write a simple function to convert between the two.
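A minimal sketch of that conversion uses the raster's geotransform via GDAL, assuming a north-up raster with no rotation terms (the file name is a placeholder, and the helpers are illustrative rather than the original code):

```python
from osgeo import gdal

ds = gdal.Open("field_orthomosaic.tif")          # placeholder file name
x0, px_w, _, y0, _, px_h = ds.GetGeoTransform()  # assumes no rotation terms

def geo_to_pixel(x, y):
    """CRS coordinates -> (row, col) pixel coordinates."""
    return int((y - y0) / px_h), int((x - x0) / px_w)

def pixel_to_geo(row, col):
    """(row, col) pixel coordinates -> CRS coordinates."""
    return x0 + col * px_w, y0 + row * px_h
```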

Raster Tiling

Rasters are normally huge; this one, for example, is over 10,000 pixels in both width and height, so I cut it up into tiles of 512 by 512 pixels, which makes it more manageable. To keep track of things, the name of each tile references its position in the image (y_x.jpg). For example, the first tile at the top-left corner of the raster is named 0_0.jpg, the second tile to its right is 0_1.jpg, and the tile underneath the first tile is 1_0.jpg.

So a tile image has its own local coordinate system, while all the tiles are linked together by a global coordinate system. Keeping track of all tiles this way is useful for later operations and is ultimately used to stitch the tiles and their results back into a raster.
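The tiling scheme can be sketched as follows (an assumed helper, not the original code; edge tiles may come out smaller than 512 pixels and are written as-is here):

```python
import cv2

TILE = 512

def tile_raster(image, out_dir):
    """Cut a large raster image into TILE x TILE tiles named y_x.jpg."""
    h, w = image.shape[:2]
    for i in range(0, h, TILE):
        for j in range(0, w, TILE):
            tile = image[i:i + TILE, j:j + TILE]
            name = f"{i // TILE}_{j // TILE}.jpg"   # y_x.jpg naming
            cv2.imwrite(f"{out_dir}/{name}", tile)  # out_dir must exist

# A pixel (r, c) in tile (ty, tx) maps back to global raster coordinates
# as (ty * TILE + r, tx * TILE + c).
```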

Cassava Crop Annotation

Next, I went ahead and started labeling the crops in the image tiles. I represent each individual Cassava plant as a single circular Gaussian curve whose covariance (the spread of the curve) is tied to the plant's radius. The Gaussian curve is normalized so its range is 0 to 1 from base to peak. The spread is also multiplied by a factor alpha, which controls how much wider or thinner the curve is relative to the radius of the Cassava plant.
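Rendering one plant onto a heatmap might look like this sketch (function and parameter names are illustrative assumptions):

```python
import numpy as np

def add_gaussian(heatmap, cx, cy, radius, alpha=1.0):
    """Stamp a normalized circular Gaussian centered at (cx, cy)."""
    h, w = heatmap.shape
    sigma = alpha * radius                       # alpha scales the spread
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    # Peak is 1.0 at the center, so each plant's range is already 0..1;
    # take the max so overlapping plants don't sum above 1.
    np.maximum(heatmap, g, out=heatmap)
    return heatmap

heatmap = np.zeros((512, 512), dtype=np.float32)
heatmap = add_gaussian(heatmap, cx=100, cy=150, radius=12, alpha=1.2)
```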

Labeling each image by hand is labor intensive, so I created an algorithm to help me perform this task. To label each plant, I need to define its center in local coordinates and its radius. I experimented with different image processing techniques and arrived at the general method described below (a code sketch follows the list). Starting with an image tile containing Cassava crops:

1. Convert the image tile from RGB to a single-channel grayscale image, using an index calculation that highlights green leaves; here the Triangular Greenness Index (TGI) was used.

2. Run a blob-detector algorithm. I had to play around with the hyperparameters to get optimal results. The output of the blob detector is a list of blobs, each defined by x, y, and r in local coordinates.

3. Lastly, I create a heatmap populated by Gaussian curves at the locations of the detected blobs. Below is a visualization of the heatmap from 2D and 3D perspectives.
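Steps 1 and 2 can be sketched with OpenCV's SimpleBlobDetector. The TGI weights below are the common simplified form (G − 0.39R − 0.61B), and the detector parameters are illustrative, not the tuned values used here:

```python
import cv2
import numpy as np

def tgi_gray(img_bgr):
    """Triangular Greenness Index, rescaled to a 0-255 grayscale image."""
    b, g, r = cv2.split(img_bgr.astype(np.float32))
    tgi = g - 0.39 * r - 0.61 * b
    return cv2.normalize(tgi, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

params = cv2.SimpleBlobDetector_Params()
params.filterByArea = True
params.minArea = 50          # tune for crop size at your resolution
params.maxArea = 5000
params.filterByColor = True
params.blobColor = 255       # bright blobs = green vegetation after TGI

detector = cv2.SimpleBlobDetector_create(params)
img = cv2.imread("0_0.jpg")
keypoints = detector.detect(tgi_gray(img))
blobs = [(kp.pt[0], kp.pt[1], kp.size / 2) for kp in keypoints]  # (x, y, r)
```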

This process can be a little tedious, and I often had to go back and double-check the results, but overall I was able to annotate images at a rate not possible without it.

Model Architecture & Training Setup

The overall architecture of my model is based on the U-Net, composed of a down-sampling path and an up-sampling path, but without skip connections. I also made the model much smaller by reducing the number of filters and removing one down-sampling block.

Below is a summary of my model. I kept it very simple and very small because I had to train everything from scratch, and a small model helps prevent overfitting. The last layer outputs a heatmap with the same dimensions as the input image. I also apply a sigmoid to every pixel of the output to get a heatmap with values ranging from 0 to 1. I implemented this model in PyTorch.
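A rough sketch of this kind of slimmed-down encoder-decoder is shown below; the filter counts, depth, and 3-channel input are assumptions for illustration, not the actual trained configuration:

```python
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(),
    )

class CropCounter(nn.Module):
    """U-Net-style down/up paths with no skip connections."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(
            block(3, 16), nn.MaxPool2d(2),
            block(16, 32), nn.MaxPool2d(2),
            block(32, 64),
        )
        self.up = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), block(32, 32),
            nn.ConvTranspose2d(32, 16, 2, stride=2), block(16, 16),
            nn.Conv2d(16, 1, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.up(self.down(x)))  # heatmap in [0, 1]

model = CropCounter()
out = model(torch.randn(1, 3, 512, 512))   # -> (1, 1, 512, 512)
```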

Dataset Preparation

There was a total of 1,215 annotated image tiles drawn from 3 fields {A, B, C}. Field A was multispectral, while fields B and C were RGB. Multispectral cameras often have lower resolution than RGB cameras, and their rendered coloration and contrast also differ from their RGB counterparts. All 3 fields were flown with the same flight configuration. For the test set, a different field was selected, and 51 images were labeled by hand.

To prevent data similar to the training set from appearing in the validation set, the rasters of fields A and C were split by region into training and validation areas.

Training Setup

I used the MSE loss function and the Adam optimizer with a starting learning rate of 1e-4 and all other hyperparameters at their defaults. The training batch size was 16. I also implemented data augmentation during loading, with transformations including horizontal flip, vertical flip, rotation, contrast adjustment, brightness adjustment, and saturation adjustment. Each transformation has a 50% chance of occurring, and the contrast, brightness, and saturation adjustments use a factor of 0.1.
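A sketch of this setup follows. Note that geometric transforms must be applied to the image and heatmap together, while color jitter applies to the image only; the right-angle rotation choice is my assumption:

```python
import random
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def augment(img, heatmap):
    """Each transform fires with 50% probability; factor 0.1 for color."""
    if random.random() < 0.5:
        img, heatmap = TF.hflip(img), TF.hflip(heatmap)
    if random.random() < 0.5:
        img, heatmap = TF.vflip(img), TF.vflip(heatmap)
    if random.random() < 0.5:
        angle = random.choice([90, 180, 270])   # assumed right angles
        img, heatmap = TF.rotate(img, angle), TF.rotate(heatmap, angle)
    if random.random() < 0.5:
        img = TF.adjust_brightness(img, 1 + random.uniform(-0.1, 0.1))
    if random.random() < 0.5:
        img = TF.adjust_contrast(img, 1 + random.uniform(-0.1, 0.1))
    if random.random() < 0.5:
        img = TF.adjust_saturation(img, 1 + random.uniform(-0.1, 0.1))
    return img, heatmap

# `model` is the network from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# per batch: loss = F.mse_loss(model(imgs), heatmaps); loss.backward(); step
```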

Evaluation Metrics

All the image tiles are passed through the model, producing a corresponding heatmap for each. The next step is to convert these heatmaps into x-y crop locations using a peak-finding algorithm. I used peak_local_max from scikit-image, which returns a list of peaks in local pixel coordinates.
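Usage is a one-liner; the distance and threshold values here are illustrative:

```python
import numpy as np
from skimage.feature import peak_local_max

# heatmap: the model's (H, W) output as a NumPy array with values in [0, 1]
peaks = peak_local_max(heatmap, min_distance=10, threshold_abs=0.3)
# peaks is an (N, 2) array of (row, col) coordinates in local tile space
```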

I had to define a custom evaluation metric to measure the algorithm's performance in a meaningful way. The first thing to note is that a simple accuracy count will not do; some localization metric is needed to make sure the model identifies crops in the right places. This can be framed as an assignment problem in which predicted crop locations are matched to the ground truth. I used the Hungarian algorithm to solve it. I first create a cost matrix whose rows are the x-y coordinates of the predictions and whose columns are the x-y coordinates of the ground truth; each element is the L2 (Euclidean) distance between a prediction and a ground-truth point. Pairings are chosen to minimize total distance. True positives are successful pairings that do not exceed a maximum distance threshold; matched pairs with a distance over the threshold count as false positives, and any unmatched ground truths are false negatives.
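The matching step can be sketched with SciPy's Hungarian solver; the function below is an illustrative implementation of the metric as described, not the original code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def count_metrics(pred, gt, max_dist):
    """pred, gt: (N, 2) arrays of x-y points; max_dist: pairing threshold."""
    cost = cdist(pred, gt)                     # L2 distances, preds x gts
    rows, cols = linear_sum_assignment(cost)   # minimal-cost pairing
    tp = int(np.sum(cost[rows, cols] <= max_dist))
    fp = len(pred) - tp                        # unmatched or too-far preds
    fn = len(gt) - tp                          # unmatched or too-far gts
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```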

Results

Below is a plot tracking the MSE loss on the training and validation sets during training. It is a good sign that no overfitting occurred, as the validation loss follows the same downward trajectory as the training loss. Interestingly, in the early stages of training the validation loss sits quite far from the training loss, but toward the 400k-iteration mark the model begins to learn features that generalize to the validation dataset.

Below are some sample image tiles from the test-set field after being passed through the model, with their predicted heatmaps displayed.

With the evaluation metric defined above, I converted the predicted heatmaps to peak x-y coordinates, solved the assignment problem to obtain true positives, false positives, and false negatives, and finally calculated recall, precision, and F1-score. The performance of the algorithm is reported in the table below.

Post-Processing

Tile’s Edge Double Count Problem

The next step is combining all identified peaks (crop-location estimates); however, a problem arises at the edges of each tile. Consider the raster below, made up of 2 tiles sized 512 by 512. There are a total of 16 Cassava plants.

If a Cassava plant is positioned between two tiles, it is split in two and appears in both tiles. For example, the left tile below counts 9 plants while the right tile counts 10 plants, for a total of 19, meaning 3 extra plants were mistakenly counted. Remember that each tile is processed independently of the others by the U-Net model, so one crop appearing in two tiles can easily be found twice.

To solve this problem, I created a separate module that extracts crops at the edge of each tile in local coordinates and compares them with crops from the neighboring tile. From a global-coordinate perspective, crops on the edge that are counted twice have an abnormally small Euclidean distance between them. To make this module fully automated, the challenge lies in deciding which points count as edge points: how far from the edge of a tile must a point be before it is no longer susceptible to being double counted?

This becomes even harder when taking into account that different cameras yield different resolutions, so the apparent size of a crop can also vary drastically. Below is an example of 2 tiles (both 512 by 512), the left from a multispectral camera (Micasense) and the right from an RGB camera (Zenmuse X5S). Both cameras were fixed to the same drone flying the same flight. I have highlighted the same white balloon on the ground to give a sense of how different the resolutions are.

What I do know is that the size of a Cassava crop cannot exceed the distance between two crops within a row. So if I set my edge threshold to the average intra-row distance between two crops, I can safely extract crops on the edge irrespective of image resolution. (This mean intra-row crop distance is also used as the maximum threshold for the assignment problem in the evaluation metric.)

Below is a sample output from my algorithm, which goes through each point within the tile and searches for the closest neighboring point, which should be the neighboring crop in the same row. I then average those distances to get my edge threshold.
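This nearest-neighbor search and the duplicate removal can be sketched with a k-d tree (an illustrative implementation under the assumptions above, with NumPy point arrays):

```python
import numpy as np
from scipy.spatial import cKDTree

def edge_threshold(points):
    """Average distance from each crop to its nearest neighbor."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=2)   # column 0 is the point itself
    return float(np.mean(dists[:, 1]))

def merge_edge_points(pts_a, pts_b, threshold):
    """Drop points in pts_b abnormally close (in global coordinates) to a
    point in pts_a, i.e. the same plant detected in both tiles."""
    tree = cKDTree(pts_a)
    dists, _ = tree.query(pts_b, k=1)
    return pts_b[dists > threshold]
```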

Correcting Neighboring Tiles

Each tile has 4 edges {top, bottom, right, left}, which means the extracted points on all 4 edges need to be cross-checked against 4 neighboring tiles. The comparison relationships are summarized in the table below, and the diagram helps with visualization.

This could be chaotic if the tiles' global coordinates were not tracked, but luckily they are. I created a tile-sweeping algorithm that moves from left to right. During the sweep, it generates a graph linking each tile to its neighbors in the right and down directions only, so each shared edge is visited exactly once. The algorithm checks whether the neighboring tile exists, whether it is within the field's boundary, and whether there are any edge points. The end result is a graph I use to keep track of which tile edges have been checked and updated.
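The sweep might look something like this sketch, assuming a `tiles` dict keyed by (ty, tx) grid position (names and structure are my assumptions, not the original code):

```python
def build_tile_graph(tiles):
    """Link each tile to its right and down neighbors only, so every
    shared edge between two tiles is visited exactly once."""
    edges = []
    for (ty, tx) in sorted(tiles):
        for neighbor in [(ty, tx + 1), (ty + 1, tx)]:   # right, down
            if neighbor in tiles:   # exists and is inside the boundary
                edges.append(((ty, tx), neighbor))
    return edges

# Each edge is then processed once: compare the shared border's edge points
# from both tiles and remove duplicates with merge_edge_points(...) above.
```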

Below is an example of a tile before and after correction, where 4 duplicates are removed.

Creating Final Raster

The last step is to plot the points onto the raster map using the recorded global coordinates. However, these are in the pixel coordinate system, so I wouldn't be able to visualize the raster in GIS-based software or platforms; I have to perform a geo-referencing operation on my output. I used the GDAL module along with the CRS and extent information in my orthomosaiced raster. After this conversion, I can upload my map online.
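One way to sketch this is to copy the source orthomosaic's geotransform and CRS onto the result raster so GIS software can place it; file names and the `result_array` variable are placeholders:

```python
from osgeo import gdal

src = gdal.Open("orthomosaic.tif")
driver = gdal.GetDriverByName("GTiff")
dst = driver.Create("crop_count_map.tif", src.RasterXSize, src.RasterYSize,
                    1, gdal.GDT_Byte)
dst.SetGeoTransform(src.GetGeoTransform())     # pixel -> geo mapping
dst.SetProjection(src.GetProjection())         # CRS
dst.GetRasterBand(1).WriteArray(result_array)  # stitched result (assumed)
dst.FlushCache()
```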

This project would not have been possible without the help of Chaipat Suwannapoom and Danuwasin Sittiworachat. They helped with all aspects of the project, from R&D to implementation. Thank you!

Next post: Cassava Crop Line Detection


Natthasit Wongsirikul

I'm a computer vision engineer. My interests span from UAV imaging to AI CCTV applications.