Validation of Crop Pest and Disease Models

Prashun Chauhan
Published in Fasal Engineering
Apr 11, 2022

Crop pests and diseases have the potential to cause devastating epidemics that threaten the world’s food supply. In addition, the cost of pesticides is a substantial burden for growers, with considerable uncertainty involved in deciding when and how much to apply. Forecasting-based crop pest and disease models have the potential to improve the timeliness, effectiveness, and foresight of pest and disease control. The ability of these models to reproduce an observed infection pattern needs to be tested and validated using both historical data and observed field data.

Why do we do model validation?

A good model is one that generates value in real life, and model validation is crucial to ensure its accuracy and stability in actual scenarios. Crop pests and diseases represent one of the largest risks facing the long-term sustainability of agriculture. Crop disease modeling involves a complex interaction between the host, the pathogen, and micro-climate conditions like temperature, humidity, rainfall, leaf wetness, etc. Without checking and validating the model, it is not right to rely on its predictions. The ability of these models to reproduce an observed infection pattern needs to be validated against historical data for which actual observations are available.

How do we validate our model?

Let’s work through a concrete problem.

The coffee berry borer is the most serious pest of coffee in India.

Coffee Berry Borer: Hypothenemus hampei

The coffee berry borer model includes a complex algorithm that predicts the risk of the pest on a given day. A very general way of doing model validation is to compare the predictions from the pest/disease model (predicted results) with the data we observe directly in the field (actual results).

Since pest or disease risks are classified as “High”, “Medium”, or “Low”, these pest/disease models are classification models. For classification models like the coffee berry borer model, accuracy is validated using a confusion matrix.

Understanding the Confusion Matrix

A confusion matrix is an N x N matrix, where N is the number of target classes, that compares actual and predicted values. It is not an evaluation metric itself, but it provides the counts needed to evaluate the accuracy of a classification model. Let’s look at a 2 x 2 confusion matrix:

Confusion Matrix (2 x 2):

                  Predicted: Yes     Predicted: No
Actual: Yes       True Positive      False Negative
Actual: No        False Positive     True Negative

A 2 x 2 confusion matrix is used for binary classification, i.e., when the classification model has two prediction classes: ‘Yes’ (pest/disease occurrence) and ‘No’ (no pest/disease occurrence). First, let’s define the terms that will help in analyzing the accuracy of the model and, hence, validating it (a short code sketch follows the list):

1. True positive (TP): The number of instances where the model predicted “Yes” and pest/disease occurrence was also observed in actuality.

2. True negative (TN): The number of instances where the model predicted “No” and no pest/disease occurrence was observed in actuality.

3. False positive (FP): The number of instances where the model predicted “Yes” but no pest/disease occurrence was observed in actuality. In this case, the model wrongly predicted a pest/disease incidence. It is also known as a Type I error.

4. False negative (FN): The number of instances where the model predicted “No” but pest/disease occurrence was observed in actuality. In this case, the model missed predicting an actual pest/disease incidence. It is also known as a Type II error.
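These four counts can be read straight off a confusion matrix built from paired model predictions and field observations. Below is a minimal sketch using scikit-learn’s confusion_matrix; the “Yes”/“No” label arrays are hypothetical, for illustration only:

```python
# A minimal sketch, assuming daily risk predictions and field scouting
# results have been collapsed into binary "Yes"/"No" labels.
# The data below is hypothetical, for illustration only.
from sklearn.metrics import confusion_matrix

actual    = ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes"]  # field observations
predicted = ["No", "Yes", "Yes", "No", "No", "No", "No", "Yes"]  # model output

# With labels ordered ["No", "Yes"], rows are actual classes and columns
# are predicted classes, so ravel() returns TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=["No", "Yes"]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=2, TN=4, FP=1, FN=1
```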

Let’s look at example data from a disease model:

Confusion matrix observations for a crop disease model:

                  Predicted: Yes     Predicted: No
Actual: Yes       TP = 60            FN = 40
Actual: No        FP = 140           TN = 9000

In the above example, TP is 60 and TN is 9000, whereas FP is 140 and FN is 40. The lower the FP and FN values, the better the accuracy of your model will be.

Analysing accuracy using the confusion matrix

Accuracy in a classification model shows how many of the predictions are correct. With the help of the confusion matrix shown in the example, we can calculate the accuracy, i.e., the percentage of model predictions that were correct, using the formula below:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 9060 / 9240 ≈ 98.05%

However, accuracy doesn’t tell us what percentage of pest or disease predictions were actually true, nor what percentage of actual pest or disease incidences were correctly predicted by the model. To answer these questions, we can calculate two other extremely important model evaluation metrics, i.e., precision and recall.

Precision measures the fraction of positive class predictions that actually belong to the positive class, i.e., the percentage of pest or disease predictions that were actually true.

Precision = TP / (TP + FP) = 60 / (60 + 140) = 30%

Recall measures the fraction of actual positive examples that the model correctly predicted as positive, i.e., the percentage of actual pest or disease incidences that were correctly predicted by the model.

Recall = TP / (TP + FN) = 60 / (60 + 40) = 60%
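As a quick sanity check, all three metrics can be recomputed in a few lines of Python from the counts in the example:

```python
# Recomputing the example's metrics directly from the confusion-matrix
# counts given above (TP=60, TN=9000, FP=140, FN=40).
tp, tn, fp, fn = 60, 9000, 140, 40

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # 9060 / 9240
precision = tp / (tp + fp)                   # 60 / 200
recall    = tp / (tp + fn)                   # 60 / 100

print(f"Accuracy:  {accuracy:.2%}")   # Accuracy:  98.05%
print(f"Precision: {precision:.2%}")  # Precision: 30.00%
print(f"Recall:    {recall:.2%}")     # Recall:    60.00%
```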

In the above example, 60% of the actual disease incidences were correctly predicted by the model, and only 30% of the disease predictions were true. Since the data is imbalanced, with most of the actual observations being no-disease cases (9,140 of 9,240, ~98.9%), high accuracy alone isn’t enough to validate the model’s performance; precision and recall help in better understanding it. Python’s scikit-learn library also provides a range of other metrics for validating a model’s performance, and depending upon the requirements, different metrics can be used for model validation.
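For instance, scikit-learn’s accuracy_score, precision_score, and recall_score compute the same metrics directly from the label arrays. A short sketch, reusing the hypothetical labels from the earlier snippet:

```python
# A sketch of the same metrics computed with scikit-learn's built-in
# functions, reusing the hypothetical labels from the earlier example.
from sklearn.metrics import accuracy_score, precision_score, recall_score

actual    = ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes"]
predicted = ["No", "Yes", "Yes", "No", "No", "No", "No", "Yes"]

# pos_label marks "Yes" (pest/disease occurrence) as the positive class.
print(accuracy_score(actual, predicted))                    # 0.75
print(precision_score(actual, predicted, pos_label="Yes"))  # 0.666...
print(recall_score(actual, predicted, pos_label="Yes"))     # 0.666...
```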

The 2 x 2 confusion matrix above was used for simplicity. The actual output of the coffee berry borer model has three classes, i.e., “High”, “Medium”, and “Low”, so we use a 3 x 3 confusion matrix for model validation. In a multi-class classification problem, we don’t get TP, TN, FP, and FN values directly as in the binary case; we need to calculate them for each class, treating that class as positive and the remaining classes as negative.
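Here is a minimal sketch of the three-class case; the “High”/“Medium”/“Low” label arrays are hypothetical and not actual coffee berry borer model output. scikit-learn’s classification_report prints per-class precision and recall computed this way:

```python
# A minimal sketch of the 3-class case, with hypothetical "High"/"Medium"/
# "Low" risk labels (not actual coffee berry borer model output).
from sklearn.metrics import confusion_matrix, classification_report

labels = ["Low", "Medium", "High"]
actual    = ["Low", "Low", "Medium", "High", "Low", "Medium", "High", "Low"]
predicted = ["Low", "Medium", "Medium", "High", "Low", "Low", "Medium", "Low"]

# 3 x 3 matrix: rows are actual classes, columns are predicted classes.
print(confusion_matrix(actual, predicted, labels=labels))

# Per-class precision and recall, each computed one-vs-rest, i.e., treating
# that class as "positive" and the other two classes as "negative".
print(classification_report(actual, predicted, labels=labels))
```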

Conclusion

Every model has different metrics for evaluating its performance and accuracy, and the confusion matrix helps in analyzing these metrics for the validation of crop pest and disease models. However, there is no single metric that is optimal for every model. It is important to clearly define the requirements and choose metrics based on those requirements.
