A model for predicting the production yield of tomatoes

Trying to improve an already existing model using R — first shot

Dr. Marc Jacobs
22 min read · Dec 13, 2023

In this blog post I describe my work on trying to improve an already implemented model for predicting tomato yield. It is a commercial dataset, which unfortunately I cannot share, and the variables are in Dutch (that should not kill the fun). What I can show, however, is how we can assess, and possibly improve, an already existing model.

The improvement of the “older model” may not even be the most important outcome — perhaps newer machine learning models really cannot beat this model, and so it is best to just automate it and determine when it does (and when it does not) offer useful predictions.

The model is not a statistical model, but a model built on the expertise of the professional. That gives it a great deal of validity: even if it is not the most accurate model, it is already accepted by the user (who is also its creator). And a model that is accepted is a usable model.

When working in the field of data science you seldom start with a blank slate. Even if there is no model, there is always information that must be harvested. In this case we have a model, so for me the trick is to talk to its creator, and also to look at how the model performs and when it over- or underperforms.

In this blog post I will show you where I am now with the model, the steps I have taken and the thoughts I have. The post is not a polished tutorial; rather, it shows the thinking that accompanies a real example. And real examples are not that easy to find.

Data wrangling and exploration

First of all, we will need to do some data wrangling and see what it is that we have to deal with. I will load the libraries first, and then show you what kind of variables are included. As I mentioned, they are in Dutch, but for me it is about showing the structure and the dimensions of the dataset. It is a rather sparse dataset and it has some additional challenges, which I will address later. For now, let's load the libraries and do some wrangling.

#### Remove environment ----
rm(list = ls())
#### Load libraries ----
library(readxl)
library(ggplot2)
library(tidyverse)
library(lubridate)
library(caret)
library(doParallel)
library(lattice)
library(pdp)
library(naniar)
library(stringr)
library(tsibble)
library(fable)
library(fabletools)

Then off to do some wrangling. We have various variables, like week, date, temperature, the number of set fruits and their average weight, etc.

df <- read_excel("OogstPrognose 2022_2023.xls",sheet = "Sheet1")
df$year<-as.factor(year(df$Datum))
df$month<-as.factor(month(df$Datum))
df$week<-as.factor(week(df$Datum))

colnames(df)
[1] "Week" "Datum" "MaanPeriode"
[4] "GemetenEtmaalTemp." "HistorischeEtmaalTemperatuur" "GecorrigeerdeEtmaalTemperatuur"
[7] "AantalGezettevruchten" "GemiddeldVruchtGewicht" "kg/m2"
[10] "TotaalHa" "Totaalkg" "Prognosekg"
[13] "Prognosekg/m2" "Maancorrectie" "Werkelijkkg"
[16] "Verschil%" "CumulatiefPrognosekg" "CumulatiefWerkelijkkg"
[19] "CumulatiefVerschil%" "year" "month"
[22] "week"

df
A tibble: 156 x 22
Week Datum MaanPeriode GemetenEtmaalTemp. HistorischeEtmaalTemperatuur GecorrigeerdeEtmaalTe~1 AantalGezettevruchten
<dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2022-01-03 00:00:00 1 16.5 16 13.5 5.4
2 2 2022-01-10 00:00:00 2 16.3 16 13.3 3.3
3 3 2022-01-17 00:00:00 3 16.1 16.1 13.1 6.5
4 4 2022-01-24 00:00:00 4 16.2 16.1 13.2 3.8
5 5 2022-01-31 00:00:00 1 16 16.2 13 6
6 6 2022-02-07 00:00:00 2 16.4 16.6 13.4 3.3
7 7 2022-02-14 00:00:00 3 17.4 17 14.4 6
8 8 2022-02-21 00:00:00 4 16.9 17 13.9 8.7
9 9 2022-02-28 00:00:00 1 19 17.5 16 12.3
10 10 2022-03-07 00:00:00 2 19.2 18 16.2 14
# i 146 more rows
# i abbreviated name: 1: GecorrigeerdeEtmaalTemperatuur
# i 15 more variables: GemiddeldVruchtGewicht <dbl>, `kg/m2` <dbl>, TotaalHa <dbl>, Totaalkg <dbl>, Prognosekg <dbl>,
# `Prognosekg/m2` <dbl>, Maancorrectie <dbl>, Werkelijkkg <dbl>, `Verschil%` <dbl>, CumulatiefPrognosekg <dbl>,
# CumulatiefWerkelijkkg <dbl>, `CumulatiefVerschil%` <dbl>, year <fct>, month <fct>, week <fct>
# i Use `print(n = ...)` to see more rows

Let's plot the data to get a better grip on it. One of the predictors is the moon, and I have no idea how exactly the variable called moon correction is calculated. But it IS part of the model, and for a reason, so I am already curious to see how this variable behaves in future models.

ggplot(data=df)+
  geom_point(aes(x=Datum, y=Maancorrectie, col="Moon correction"))+
  geom_line(aes(x=Datum, y=Maancorrectie, col="Moon correction", group=1))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Moon correction",
       title="Predictors",
       col="Metric")
It seems that the moon correction variable takes only a handful of values that cycle over time. Since we have weekly data, these values most likely represent the lunar cycle.
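A quick way to verify this is to tabulate the variable; a minimal check (the exact values will of course depend on the data):

# How many distinct values does the moon correction take, and how often?
sort(unique(df$Maancorrectie))
table(df$Maancorrectie)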

Let's look at temperature as well — there are multiple temperature variables included, but we will pick one: the corrected temperature per 24 hours. That may sound a bit strange because the data are weekly values, but these are most likely weekly averages (in degrees Celsius).

ggplot(data=df)+
  geom_point(aes(x=Datum, y=GecorrigeerdeEtmaalTemperatuur, col="Temperature"))+
  geom_line(aes(x=Datum, y=GecorrigeerdeEtmaalTemperatuur, col="Temperature", group=1))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Temperature",
       title="Predictors",
       col="Metric")
Corrected temperature per 24 hours (averaged across the week). Since these tomatoes are grown in greenhouses, the temperature does not need to follow the seasons. Rather, the temperature is controlled as well as possible to support the plants in growing tomatoes.

Now it is time to look at the production data itself. So, what I will show now are two things:

  1. The weekly total kg of tomatoes harvested
  2. The prognosis of this weekly yield

Of course, from this, we can also look at the predicted versus the observed values and assess how well the farmer's model performs. These predictions are made quite some time in advance, although I am not entirely sure how far in advance. Also, farmers have a natural tendency to adjust their farming practices to their environment, meaning that they will step up or slow down if necessary. This is something that is almost impossible to predict on very short notice.

Nevertheless, this post is not about perfecting an existing model, or beating an existing model, but about assessing the model for its worth and digging deeper into the components it is made of. Data science is not always about improvement, or discovering the unknown. Often, it is already worthwhile to just get some more information on how things are related.

ggplot(data=df)+
  geom_point(aes(x=Datum, y=Werkelijkkg, col="Observed"))+
  geom_line(aes(x=Datum, y=Werkelijkkg, col="Observed", group=1))+
  geom_point(aes(x=Datum, y=Prognosekg, col="Predicted"))+
  geom_line(aes(x=Datum, y=Prognosekg, col="Predicted", group=1))+
  geom_bar(aes(x=Datum, y=Prognosekg-Werkelijkkg),
           fill="black", stat="identity", alpha=0.3)+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Production (kg)",
       title="Tomato production",
       subtitle="Observed vs Predicted",
       col="Metric")
A graph showing the tomato production as observed and as predicted, with the bars showing the difference. As you can see, the model is not bad, but what remains unclear is how much the model is already being used to intervene.

If we want, we can also plot the data in a more faceted way — using year to split it apart.

ggplot(data=df)+
  geom_point(aes(x=Datum, y=Werkelijkkg, col="Observed"))+
  geom_line(aes(x=Datum, y=Werkelijkkg, col="Observed", group=1))+
  geom_point(aes(x=Datum, y=Prognosekg, col="Predicted"))+
  geom_line(aes(x=Datum, y=Prognosekg, col="Predicted", group=1))+
  geom_bar(aes(x=Datum, y=Prognosekg-Werkelijkkg), fill="black", stat="identity", alpha=0.3)+
  theme_bw()+
  facet_grid(year~"Yield", scales="free")+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Production (kg)",
       title="Tomato production",
       subtitle="Observed vs Predicted",
       col="Metric")

Now, one of the things I learned in the past when dealing with predicting production data is that it is sometimes better to predict it in cumulative form. By plotting the cumulative data, the growth curve of the harvest becomes clearer and the predictions also make more sense on a larger scale. This is because production is an optimization process — it cannot always run at full throttle. For instance, if you push your plants to keep producing as much as possible, they will wear out and start to produce less. And reproduction is a costly endeavor. Hence, similar to predicting the feed intake of swine, I will take a closer look at the data using cumulative sums.
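To make the idea concrete, here is a minimal sketch of moving between the weekly and the cumulative scale (assuming we group per year); a model can be fit on one scale and its predictions translated back to the other:

# From weekly to cumulative yield, per year ...
df_cum <- df%>%
  drop_na(Werkelijkkg)%>%
  group_by(year)%>%
  mutate(cum_kg = cumsum(Werkelijkkg))%>%
  ungroup()

# ... and back: first differences recover the weekly yield
df_cum%>%
  group_by(year)%>%
  mutate(weekly_kg = cum_kg - lag(cum_kg, default = 0))%>%
  ungroup()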

ggplot(data=df)+
  geom_point(aes(x=Datum, y=CumulatiefWerkelijkkg, col="Observed"))+
  geom_line(aes(x=Datum, y=CumulatiefWerkelijkkg, col="Observed", group=1))+
  geom_point(aes(x=Datum, y=CumulatiefPrognosekg, col="Predicted"))+
  geom_line(aes(x=Datum, y=CumulatiefPrognosekg, col="Predicted", group=1))+
  geom_bar(aes(x=Datum, y=CumulatiefPrognosekg-CumulatiefWerkelijkkg), fill="black", stat="identity", alpha=0.3)+
  theme_bw()+
  theme(legend.position = "bottom")+
  facet_grid("Yield"~year, scales="free")+
  labs(x="Date",
       y="Cumulative Production (kg)",
       title="Tomato production",
       subtitle="Observed vs Predicted",
       col="Metric")
The cumulative production as realized and as predicted.

Looking at the data like this, one could argue that the model is actually doing quite well. However, and this is always the issue with cumulative data, any difference is bound to be large in absolute terms, precisely because the values accumulate. So yes, the lines overlap, but they are by no means a perfect match. You only need to take a look at the scale of the y-axis to see this.
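To put a number on that, a small sketch that computes the end-of-series gap per year (using max() under the assumption that the cumulative columns are non-decreasing):

# Final cumulative gap between prognosis and realization, per year
df%>%
  group_by(year)%>%
  summarise(final_obs  = max(CumulatiefWerkelijkkg, na.rm=TRUE),
            final_pred = max(CumulatiefPrognosekg, na.rm=TRUE),
            gap_kg     = final_pred - final_obs,
            gap_pct    = 100 * gap_kg / final_obs)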

Hence, time to take a step back and look at changes across both years from a monthly and weekly perspective. What is already clear from the previous plots is that the 2023 yield is bigger than in the previous years.

df%>%
  group_by(year, month)%>%
  summarise(gem_werkelijk = mean(Werkelijkkg, na.rm=TRUE))%>%
  ggplot()+
  geom_bar(aes(x=month, y=gem_werkelijk, fill=year), position="dodge", stat="identity")+
  theme_bw()+
  labs(x="Month",
       y="Production (kg)",
       title="Tomato production",
       fill="Year")
The year 2023 is already turning out to be a great year.

It will be difficult to predict 2023 from 2021 and 2022 by just using statistics, but the farmer's model already showed that it should be doable. Hence, let's continue our exploration and look per week. These are weeks of the year, not weeks since the start of harvest or anything else.

df%>%
  ggplot()+
  geom_bar(aes(x=week, y=Werkelijkkg, fill=year), position="dodge", stat="identity")+
  theme_bw()+
  labs(x="Week",
       y="Production (kg)",
       title="Tomato production",
       fill="Year")
Once again it is clear that 2021 and 2022 show quite some overlap, but for 2023 things are a bit different.

Okay, so I guess we have seen what we are dealing with. Time to venture into the modelling part. The goal is clear: to assess, and if possible improve, the current prediction model for tomato yield.

Modelling

When modelling, I like to start with a heavy ML model just to see how far it will bring me. A random forest or XGBoost using vanilla settings will give me a glimpse of how much raw math can get the job done. By no means are these my favourite models — to me, they are like using a nuke to kill a fly, but if a nuke cannot kill the fly, we have a serious challenge on our hands. That is what I am trying to assess.

Because we are dealing with time-series data, I cannot just apply ordinary cross-validation: the values are connected. So I will use a time-slice method, meaning that I will use a sliding window to assess the model. This way, I can see if certain parts of the data make it difficult for me to predict.

myTimeControl <- trainControl(method = "timeslice",
initialWindow = 12,
horizon = 12,
fixedWindow = FALSE,
allowParallel = TRUE)
df_comp<-df%>%select(Datum, month, week, MaanPeriode,
GemetenEtmaalTemp.,
HistorischeEtmaalTemperatuur,
GecorrigeerdeEtmaalTemperatuur,
AantalGezettevruchten,Werkelijkkg,
GemiddeldVruchtGewicht,
TotaalHa,
Maancorrectie,
Prognosekg)%>%drop_na()

# A tibble: 69 x 13
Datum month week MaanPeriode GemetenEtmaalTemp. HistorischeEtmaalTemper~1 GecorrigeerdeEtmaalT~2
<dttm> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 2022-03-21 00:00:00 3 12 4 19 18.5 16
2 2022-03-28 00:00:00 3 13 1 18.3 18.5 15.3
3 2022-04-04 00:00:00 4 14 2 17.1 18.5 14.1
4 2022-04-11 00:00:00 4 15 3 19.1 19 16.1
5 2022-04-18 00:00:00 4 16 4 19.5 19 16.5
6 2022-04-25 00:00:00 4 17 1 18.9 19 15.9
7 2022-05-02 00:00:00 5 18 1 19.3 19.5 16.3
8 2022-05-09 00:00:00 5 19 2 19.8 19.5 16.8
9 2022-05-16 00:00:00 5 20 3 18.5 19.5 15.5
10 2022-05-23 00:00:00 5 21 4 19.5 19.5 16.5
# i 59 more rows
# i abbreviated names: 1: HistorischeEtmaalTemperatuur, 2: GecorrigeerdeEtmaalTemperatuur
# i 6 more variables: AantalGezettevruchten <dbl>, Werkelijkkg <dbl>, GemiddeldVruchtGewicht <dbl>,
# TotaalHa <dbl>, Maancorrectie <dbl>, Prognosekg <dbl>
# i Use `print(n = ...)` to see more rows
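A small note on the allowParallel = TRUE flag above: caret only runs in parallel if a backend has been registered first. Since doParallel is already loaded, a minimal sketch (the number of workers is machine-dependent) would be:

# Register a parallel backend so caret can honour allowParallel = TRUE
cl <- makePSOCKcluster(4)  # pick a worker count that fits your machine
registerDoParallel(cl)
# ... run the train() calls ...
stopCluster(cl)  # release the workers when done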

Then I will of course create a train set (no test set, yet). I will try and model the weekly tomato yield. So, NOT cumulative yield, but weekly yield.

trainX <- df_comp%>%select(-c(Werkelijkkg, Datum, Prognosekg))
trainY <- df_comp[,"Werkelijkkg",drop=TRUE]
dim(trainX); length(trainY)
[1] 69 10
[1] 69

And then the vanilla random forest model. As you can see from the output, we really do not have much data to work with, and we are including 10 predictors nonetheless. Of course, random forests are known to handle such data reasonably well, averaging over many decorrelated decision trees instead of relying on a single deep tree. Hence why I like it so much as my first nuclear option.

rf.mod <- train(x=trainX,
                y=trainY,
                method = "rf",
                trControl = myTimeControl,
                metric='RMSE')
rf.mod

Random Forest

69 samples
10 predictors

No pre-processing
Resampling: Rolling Forecasting Origin Resampling (12 held-out with no fixed window)
Summary of sample sizes: 12, 13, 14, 15, 16, 17, ...
Resampling results across tuning parameters:

mtry RMSE Rsquared MAE
2 34099.68 0.6472561 26789.63
6 35878.97 0.6413870 28010.78
10 36860.70 0.5980553 29015.38

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.

Then I can take a look at which variables are deemed more important. Considering the time-series nature of the data, it should not come as a surprise that week and month rank highest, followed by temperature as measured.

plot(varImp(rf.mod))

We can also assess the RMSE which, to be honest, does not tell me that much if it is not compared with another model. Therefore, let's calculate the RMSE ratio of the models by first estimating:

  1. the RMSE for the model made by the farmer
  2. the RMSE for the model made by me

If my model is better, the ratio will be more than one. If it is worse, it will be less than one. As you can see, it is more than one.

df_comp<-df_comp%>%cbind(., pred=predict(rf.mod))
sqrt(mean((df_comp$Werkelijkkg - df_comp$Prognosekg)^2)) / sqrt(mean((df_comp$Werkelijkkg - df_comp$pred)^2))
[1] 1.437169
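Since this ratio will come back a few times, it is convenient to wrap it in a small helper function; a minimal sketch:

# RMSE of the old (farmer's) model divided by the RMSE of the new model:
# > 1 means the new model has the smaller error, < 1 means it is worse
rmse_ratio <- function(obs, pred_old, pred_new){
  sqrt(mean((obs - pred_old)^2, na.rm=TRUE)) /
    sqrt(mean((obs - pred_new)^2, na.rm=TRUE))
}
rmse_ratio(df_comp$Werkelijkkg, df_comp$Prognosekg, df_comp$pred)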

I can also visualize my model and the farmer's model by showing the raw predictions and the raw differences. The purple bars show the differences between the observed values and my model. They are not that bad, except for 2023. And this is not even a test set; had it been, my model would have looked even worse. Also, 2023 is really different from the two years before, which will make it more difficult to use that year as a test set (or test year).

ggplot(df_comp)+
  geom_point(aes(x=Datum, y=Werkelijkkg, col="Observed"))+
  geom_line(aes(x=Datum, y=Werkelijkkg, col="Observed", group=1))+
  geom_point(aes(x=Datum, y=Prognosekg, col="Old model"))+
  geom_line(aes(x=Datum, y=Prognosekg, col="Old model", group=1))+
  geom_bar(aes(x=Datum, y=Prognosekg-Werkelijkkg), fill="black", stat="identity", alpha=0.3)+
  geom_bar(aes(x=Datum, y=pred-Werkelijkkg), fill="purple", stat="identity", alpha=0.3)+
  geom_point(aes(x=Datum, y=pred, col="RandomForest"))+
  geom_line(aes(x=Datum, y=pred, col="RandomForest", group=1))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Production (kg)",
       title="Tomato production",
       subtitle="Observed vs Predicted",
       col="Metric")

Let's see what happens if I build a 2022 model to predict 2023. Remember, 2023 is a different year with a different curve, so I am expecting that my model will do worse than in the previous years. A lot worse.

#### Modelling 2022 to predict 2023 ----
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 5,
                              horizon = 5,
                              fixedWindow = FALSE,
                              allowParallel = TRUE)
df_train <- df_comp%>%filter(year=="2022")
trainX <- df_train%>%select(-c(Werkelijkkg, Datum, year, Prognosekg))
trainY <- df_train[,"Werkelijkkg",drop=TRUE]
dim(trainX); length(trainY)
rf.mod <- train(x=trainX,
                y=trainY,
                method = "rf",
                trControl = myTimeControl,
                metric='RMSE')
rf.mod

Random Forest

26 samples
9 predictor

No pre-processing
Resampling: Rolling Forecasting Origin Resampling (5 held-out with no fixed window)
Summary of sample sizes: 5, 6, 7, 8, 9, 10, ...
Resampling results across tuning parameters:

mtry RMSE Rsquared MAE
2 22512.83 0.2984249 20937.88
5 22153.40 0.3236742 20625.52
9 21934.12 0.2988279 20234.89

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 9.

plot(varImp(rf.mod))
As you can see, the importance of month has dropped considerably. Which makes sense, since with a single year of training data there are only a handful of observations per month.

We can once again do a ratio calculation. This time, the result is even better than before, with an RMSE for our model that is less than half that of the farmer's model. This is not what I expected.

df_train%>%
  mutate(RMSE = sqrt(mean((.$Werkelijkkg - .$Prognosekg)^2)) / sqrt(mean((.$Werkelijkkg - .$pred)^2)))%>%
  summarise(averageRMSE = mean(RMSE))
averageRMSE
1 2.343666
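With the helper defined earlier, the same number drops out of a single call:

rmse_ratio(df_train$Werkelijkkg, df_train$Prognosekg, df_train$pred)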

We can of course also plot our results to see exactly where we did better. First for the training set.

ggplot(df_train)+
  geom_point(aes(x=Datum, y=Werkelijkkg, col="Observed"))+
  geom_line(aes(x=Datum, y=Werkelijkkg, col="Observed", group=1))+
  geom_point(aes(x=Datum, y=Prognosekg, col="Old model"))+
  geom_line(aes(x=Datum, y=Prognosekg, col="Old model", group=1))+
  geom_bar(aes(x=Datum, y=Prognosekg-Werkelijkkg), fill="black", stat="identity", alpha=0.3)+
  geom_bar(aes(x=Datum, y=pred-Werkelijkkg), fill="purple", stat="identity", alpha=0.3)+
  geom_point(aes(x=Datum, y=pred, col="RandomForest"))+
  geom_line(aes(x=Datum, y=pred, col="RandomForest", group=1))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Production (kg)",
       title="Tomato production - 2022",
       subtitle="Observed vs Predicted",
       col="Metric")
The results on the training set. Looking good almost across the board!

And then the test set.

df_test <- df_comp%>%
  filter(year=="2023")%>%
  select(-year)%>%
  cbind(., pred=predict(rf.mod, newdata = .))

df_test%>%
  ggplot(.)+
  geom_point(aes(x=Datum, y=Werkelijkkg, col="Observed"))+
  geom_line(aes(x=Datum, y=Werkelijkkg, col="Observed", group=1))+
  geom_point(aes(x=Datum, y=Prognosekg, col="Old model"))+
  geom_line(aes(x=Datum, y=Prognosekg, col="Old model", group=1))+
  geom_bar(aes(x=Datum, y=Prognosekg-Werkelijkkg), fill="black", stat="identity", alpha=0.3)+
  geom_bar(aes(x=Datum, y=pred-Werkelijkkg), fill="purple", stat="identity", alpha=0.3)+
  geom_point(aes(x=Datum, y=pred, col="RandomForest"))+
  geom_line(aes(x=Datum, y=pred, col="RandomForest", group=1))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Production (kg)",
       title="Tomato production - 2023",
       subtitle="Observed vs Predicted",
       col="Metric")
Ouch. That looks quite different.

We can also see the performance drop in the RMSE ratio, which shows that our RMSE is now more than twice as large as that of the farmer's model. Overall, we predict less than the actual values, which makes sense, since that year got off to a flying start that could not have been predicted from the past year.

sqrt(mean((df_test$Werkelijkkg - df_test$Prognosekg)^2)) / sqrt(mean((df_test$Werkelijkkg - df_test$pred)^2))
[1] 0.4207684

Extending the train set and list of vanilla models

Perhaps the Random Forest is simply not a good model to use here. So, one option is to build several other vanilla models and see if there is a better one. We should also include 2021 in the training set (we did not do that in the previous exercise). Perhaps this will give us what we need to predict 2023 better.

df_train<-df_comp%>%filter(year%in%c("2021","2022"))
trainX <- df_train%>%select(-c(Werkelijkkg, Datum, year, Prognosekg))
trainY <- df_train[,"Werkelijkkg",drop=TRUE]
dim(trainX); length(trainY)

[1] 55 9
[1] 55

set.seed(7)
control = trainControl(method = 'repeatedcv',
                       number = 10,
                       repeats = 3,
                       returnResamp = "all",
                       savePredictions = "all")
rf.mod <- train(x=trainX,
                y=trainY,
                method = "rf",
                trControl = control,
                metric='RMSE')
lm.mod <- train(x=trainX,
                y=trainY,
                method = "lm",
                trControl = control,
                metric='RMSE')
glmnet.mod <- train(x=trainX,
                    y=trainY,
                    method = "glmnet",
                    trControl = control,
                    metric='RMSE')
nn.mod <- train(x=trainX,
                y=trainY,
                method = "nnet",
                trControl = control,
                metric='RMSE')
results <- resamples(list(RF=rf.mod,
                          LM=lm.mod,
                          GLMNET=glmnet.mod,
                          NN=nn.mod))
summary(results)


Call:
summary.resamples(object = results)

Models: RF, LM, GLMNET, NN
Number of resamples: 30

MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 3138.883 5646.562 7282.007 7371.89 8447.708 12040.70 0
LM 7008.175 11097.864 16786.467 16434.01 20957.548 26875.06 0
GLMNET 9666.052 14697.595 16471.773 16372.09 18048.150 21748.72 0
NN 52673.400 62385.833 64395.643 64843.59 69268.188 73093.80 0

RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 3395.947 7993.842 9067.048 8814.61 10418.70 14036.87 0
LM 7984.132 13153.885 19540.006 18722.74 23030.02 29414.09 0
GLMNET 10687.228 17003.295 19276.321 18466.10 20599.17 25114.35 0
NN 60884.323 67904.766 69361.079 69884.27 72861.68 75959.27 0

Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.604806013 0.8806545 0.9315836 0.8900139 0.9636651 0.9806561 0
LM 0.003274298 0.3762019 0.5177649 0.5465201 0.8009565 0.9487782 0
GLMNET 0.159203540 0.4819199 0.6068079 0.5946585 0.7392137 0.9485112 0
NN NA NA NA NaN NA NA 30

From the results I can see that the Neural Net does not work (not surprising given the small amount of data). Best to plot the results.

bwplot(results)
The random forest beats the other models by a long shot.
dotplot(results)
Same results, different way of showcasing.
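As an aside, the all-NA Rsquared row is a classic symptom when nnet is fed a regression target in the tens of thousands of kg: the network prefers inputs and outputs on roughly unit scale. A hedged sketch of what I would try before writing it off entirely: scaling the predictors via preProcess, shrinking the response, and tuning size and decay explicitly (whether this helps here is an open question).

# nnet struggles when predictors and response live on very different scales.
# Sketch: scale the predictors and shrink the response before training.
y_max <- max(trainY)
nn.mod2 <- train(x=trainX,
                 y=trainY/y_max,               # response on a [0, 1]-ish range
                 method = "nnet",
                 preProcess = c("center", "scale"),
                 tuneGrid = expand.grid(size = c(3, 5), decay = c(0.01, 0.1)),
                 linout = TRUE,                # linear output unit for regression
                 trace = FALSE,
                 trControl = control,
                 metric='RMSE')
pred_nn <- predict(nn.mod2) * y_max            # back to kg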

So, the Random Forest model seems to be the best vanilla choice for now. We can, and need to, visualize our results to see where and how the model performed best (and worst).

ggplot(df_train)+
  geom_point(aes(x=Datum, y=Werkelijkkg, col="Observed"))+
  geom_line(aes(x=Datum, y=Werkelijkkg, col="Observed", group=1))+
  geom_point(aes(x=Datum, y=Prognosekg, col="Old model"))+
  geom_line(aes(x=Datum, y=Prognosekg, col="Old model", group=1))+
  geom_bar(aes(x=Datum, y=Prognosekg-Werkelijkkg), fill="black", stat="identity", alpha=0.3)+
  geom_bar(aes(x=Datum, y=pred-Werkelijkkg), fill="purple", stat="identity", alpha=0.3)+
  geom_point(aes(x=Datum, y=pred, col="RandomForest"))+
  geom_line(aes(x=Datum, y=pred, col="RandomForest", group=1))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Production (kg)",
       title="Tomato production - 2021 & 2022",
       subtitle="Observed vs Predicted on the train set",
       col="Metric")
The 2021 and 2022 fits are not that bad. Our model is for sure better than the farmer's model here.

Now we must also see whether our performance holds on the test set, which is 2023.

df_test <- df_comp%>%
  filter(year=="2023")%>%
  cbind(., pred=predict(rf.mod, newdata = .))

df_test%>%
  # week is a factor here, so convert it before the numeric comparison
  filter(as.numeric(as.character(week)) < 20)%>%
  ggplot(.)+
  geom_point(aes(x=Datum, y=Werkelijkkg, col="Observed"))+
  geom_line(aes(x=Datum, y=Werkelijkkg, col="Observed", group=1))+
  geom_point(aes(x=Datum, y=Prognosekg, col="Old model"))+
  geom_line(aes(x=Datum, y=Prognosekg, col="Old model", group=1))+
  geom_bar(aes(x=Datum, y=Prognosekg-Werkelijkkg), fill="black", stat="identity", alpha=0.3)+
  geom_bar(aes(x=Datum, y=pred-Werkelijkkg), fill="purple", stat="identity", alpha=0.3)+
  geom_point(aes(x=Datum, y=pred, col="RandomForest"))+
  geom_line(aes(x=Datum, y=pred, col="RandomForest", group=1))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Production (kg)",
       title="Tomato production - 2023 for March till mid-May",
       subtitle="Observed vs Predicted on the test set",
       col="Metric")

df_train<-df_train%>%cbind(., pred=predict(rf.mod))

sqrt(mean((df_train$Werkelijkkg - df_train$Prognosekg)^2))
[1] 16425.74

sqrt(mean((df_train$Werkelijkkg - df_train$pred)^2))
[1] 4273.46

df_train%>%
  mutate(RMSE_Ratio = sqrt(mean((.$Werkelijkkg - .$Prognosekg)^2)) / sqrt(mean((.$Werkelijkkg - .$pred)^2)))%>%
  summarise(RMSE_Ratio = mean(RMSE_Ratio))
RMSE_Ratio
1 3.843663
RMSE values from both models

From the values and the graph, it seems that our model is doing better than the farmer's model: our RMSE is almost four times smaller than that of the original model.

Like I said, I consider the Random Forest to be too much of a nuke. Therefore, I am very curious to see how many other models we could create, considering that we have time-series data on our hands. Perhaps we should return to the nature of the data and build time-series models instead. They are, however, not the easiest models to use.

Deploying multiple time-series models

This is of course not the first time I have deployed time-series models. They are not easy to handle and require quite some pre-processing, considering their various needs and assumptions. Nevertheless, my goal is to get a feeling for how far traditional time-series models can take me.

Time-series models do not like missing data, so we must fill in the gaps we have. Let's see how our training set looks if we start filling them in.

attr(df_comp, 'interval') <- tsibble::new_interval(week = 1)

master_data_tbl <- df_comp%>%
  select(Datum, Werkelijkkg)%>%
  mutate(ID="1",
         Datum=as.Date(Datum))%>%
  distinct()%>%
  tidyr::complete(Datum = seq.Date(min(Datum), max(Datum), by="week"))%>%
  tsibble::as_tsibble(key = ID,
                      index = Datum)%>%
  select(-ID)

train_tbl <- master_data_tbl%>%
  filter_index(. ~ "2022-09-12")%>%
  dplyr::mutate(Werkelijkkg = replace_na(Werkelijkkg, 0))

test_tbl <- master_data_tbl%>%filter_index("2022-09-12" ~ .)%>%drop_na()

ggplot()+
  geom_line(data = train_tbl,
            aes(x=Datum, y=Werkelijkkg, col="Train"))+
  geom_point(data = train_tbl,
             aes(x=Datum, y=Werkelijkkg), col="black")+
  geom_line(data = test_tbl,
            aes(x=Datum, y=Werkelijkkg, col="Test"))+
  geom_point(data = test_tbl,
             aes(x=Datum, y=Werkelijkkg), col="black")+
  theme_bw()+
  labs(x="Week",
       y="Production (kg)",
       title="Train & Test dataset depicted",
       col="")
The train and test sets clearly show that dealing with harvest data, and imputing the missing values with zero, creates an artificial dataset that is very difficult to analyze. Time-series models in general have trouble with intermittent data.
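As an aside, tsibble ships with helpers for exactly this regularization step, which could be used instead of the tidyr::complete() call above; a minimal sketch:

# Make implicit weekly gaps explicit with tsibble's own tooling
master_data_tbl%>%has_gaps()                  # any implicit missing weeks?
master_data_tbl%>%fill_gaps(Werkelijkkg = 0)  # turn them into explicit rows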

So, what will happen if we run several time-series models on this data?

model_table <- train_tbl %>%
  model(
    naive_mod = NAIVE(Werkelijkkg + 1),
    snaive_mod = SNAIVE(Werkelijkkg + 1),
    drift_mod = RW(Werkelijkkg + 1 ~ drift()),
    ses_mod = ETS(Werkelijkkg + 1 ~ error("A") + trend("N") + season("N"), opt_crit = "mse"),
    hl_mod = ETS(Werkelijkkg + 1 ~ error("A") + trend("A") + season("N"), opt_crit = "mse"),
    hldamp_mod = ETS(Werkelijkkg + 1 ~ error("A") + trend("Ad") + season("N"), opt_crit = "mse"),
    arima_mod = ARIMA(Werkelijkkg + 1),
    dhr_mod = ARIMA(Werkelijkkg + 1 ~ PDQ(0,0,0) + fourier(K=2)),
    tslm_mod = TSLM(Werkelijkkg + 1 ~ 1),
    crost_mod = CROSTON(Werkelijkkg + 1),
    sba_mod = CROSTON(Werkelijkkg + 1, type = "sba")) %>%
  mutate(ensemble_sm_mod = combination_ensemble(arima_mod,
                                                ses_mod,
                                                crost_mod, sba_mod))

# A mable: 2 x 13
# Key: ID [2]
ID naive_mod snaive_mod drift_mod ses_mod hl_mod hldamp_mod arima_mod
<chr> <model> <model> <model> <model> <model> <model> <model>
1 1 <NULL model> <NULL model> <NULL model> <NULL model> <NULL model> <NULL model> <NULL model>
2 NA <NAIVE> <NULL model> <RW w/ drift> <ETS(A,N,N)> <ETS(A,A,N)> <ETS(A,Ad,N)> <ARIMA(0,0,0)>
# i 5 more variables: dhr_mod <model>, tslm_mod <model>, crost_mod <model>, sba_mod <model>,
# ensemble_sm_mod <model>

Nothing. Clearly, this is not the way to go. I cannot just create zeros where there are none, and the data is too sparse to create meaningful patterns. In summary, I do not like the demands of the traditional time-series models, so we must take a different approach. It is time to look deeper into the meaning of the harvest.

Looking deeper into harvest

In the previous plot you could already see what it means to have time-series data split across harvest and non-harvest periods. Below you can see the production per week of the year for three years. Clear to see is that harvest, for each year, starts and ends at (sort of) the same week.

ggplot(data=df)+
  geom_bar(aes(x=week, y=Werkelijkkg), fill="grey", stat="identity")+
  theme_bw()+
  theme(legend.position = "bottom")+
  facet_grid(year~"Production")+
  labs(x="Week",
       y="Production (kg)",
       title="Tomato production",
       subtitle="Looking for harvest periods",
       col="Metric")

Although we could imagine the weeks when there is no harvest, it might be good to include that data as well.

df%>%
  select(Datum, year, week, Werkelijkkg)%>%
  mutate(Datum=as.Date(Datum))%>%
  distinct()%>%
  tidyr::complete(Datum = seq.Date(min(Datum), max(Datum), by="week"))%>%
  mutate(Harvest = if_else(!is.na(Werkelijkkg), "Yes", "No"))%>%
  # a dummy value of 10000 kg makes the no-harvest weeks visible as bars
  dplyr::mutate(Werkelijkkg = replace_na(Werkelijkkg, 10000))%>%
  drop_na()%>%
  ggplot()+
  geom_bar(aes(x=week, y=Werkelijkkg, fill=Harvest), stat="identity")+
  theme_bw()+
  theme(legend.position = "bottom")+
  facet_grid(year~"Production")+
  labs(x="Week",
       y="Production (kg)",
       title="Tomato production",
       subtitle="Looking for harvest periods",
       col="Metric")
Once again, it seems that the start and end of the harvest are roughly the same each year, give or take a couple of weeks. Do note that for 2023 we have two harvest weeks with no harvest yield. I decided to keep the data as they are, but I am not sure how much this affected the predictions.

The week of the year is a pretty good anchor, but even better would be to define harvest weeks. We can already see that the harvest weeks are not equal across the three years, so let's anchor them together. This will also allow a better comparison across years, and across the weeks within a harvest.

df%>%
  select(Datum, year, week, Werkelijkkg)%>%
  drop_na()%>%
  group_by(year)%>%
  mutate(Harvest_week = row_number())%>%
  ggplot()+
  geom_line(aes(x=Harvest_week, y=Werkelijkkg, col=year))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Harvest week",
       y="Production (kg)",
       title="Tomato production",
       subtitle="Looking for growth",
       col="Year")
This plot gives quite a good impression of what we are dealing with — harvest yields take time to reach a peak, after which they slowly decrease. At the very end there seems to be a final-yield moment, resulting in one more peak. Once again, the year 2023 seems absurd.

The above plot showed the raw yield. Let's also take a look at the cumulative yield.

df%>%
  select(Datum, year, week, Werkelijkkg)%>%
  drop_na()%>%
  group_by(year)%>%
  mutate(Harvest_week = row_number(),
         Harvest_cum = cumsum(Werkelijkkg))%>%
  ggplot()+
  geom_line(aes(x=Harvest_week, y=Harvest_cum, col=year))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Harvest week",
       y="Cumulative production (kg)",
       title="Tomato production",
       subtitle="Looking for growth",
       col="Year")

This plot is even more impressive, highlighting the almost parallel lines of 2021 and 2022. And then there is of course that deviating year 2023.

The biggest problem with modelling time-series data, even when using traditional time-series models, is that the date variable has enormous predictive ability.

Not only can we determine a slope, a cycle, or seasonality, but we can also split time-series data into weeks, months, quarters, etc. Hence, the date variable becomes a very informative proxy, but it remains difficult if not impossible to derive informed decisions from it.
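Just to illustrate how cheaply such proxy features are derived from the date (which is exactly why they soak up so much predictive power):

# Calendar features fall straight out of the date with lubridate
df%>%
  mutate(quarter = quarter(Datum),
         month   = month(Datum),
         week    = isoweek(Datum),
         doy     = yday(Datum))%>%
  select(Datum, quarter, month, week, doy)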

No farmer will aim for a specific production yield because it is April the 17th. No, the yield of that particular day is based on the past yield and on the current (as in NOW) circumstances. That is the difficulty of predicting farm management — the farmer has eyes and ears and knowledge that are not easily captured in a model.

Hence, let's see how strong our model is if we do not include date or time. A correlation plot on the raw data is always a nice start. Given the sparsity of the data, I will not use any splits between years or months.

df_corr <- df%>%select(MaanPeriode,
                       GemetenEtmaalTemp.,
                       HistorischeEtmaalTemperatuur,
                       GecorrigeerdeEtmaalTemperatuur,
                       AantalGezettevruchten,
                       Werkelijkkg,
                       GemiddeldVruchtGewicht,
                       TotaalHa,
                       Maancorrectie)%>%
  drop_na()
df_corr_matrix <- round(cor(df_corr), 1)
ggcorrplot::ggcorrplot(df_corr_matrix)
The correlation plot shows some strong correlations between production yield, temperature, and the number of fruits that can be harvested.

Given this correlation matrix, it makes sense to build a model using predictors that have some biological meaning. This means we will not include time or date. I will again use the moving time window for cross-validation.

df_comp <- df%>%
  select(MaanPeriode,
         GecorrigeerdeEtmaalTemperatuur,
         AantalGezettevruchten,
         GemiddeldVruchtGewicht,
         Maancorrectie,
         CumulatiefWerkelijkkg,
         Datum,
         year,
         CumulatiefPrognosekg)%>%
  group_by(year)%>%
  mutate(Harvest_week = row_number())%>%
  ungroup()%>%
  drop_na()
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 5,
                              horizon = 5,
                              fixedWindow = FALSE,
                              allowParallel = TRUE)

colnames(df_comp)
str(df_comp)
gg_miss_var(df_comp)

df_train <- df_comp%>%filter(year=="2022")
trainX <- df_train%>%select(-c(CumulatiefWerkelijkkg, Datum, year, CumulatiefPrognosekg, Harvest_week))
trainY <- df_train[,"CumulatiefWerkelijkkg",drop=TRUE]
dim(trainX); length(trainY)
[1] 26 5
[1] 26

And of course stick to the vanilla Random Forest.

rf.mod <- train(x=trainX,
                y=trainY,
                method = "rf",
                trControl = myTimeControl,
                metric='RMSE')

Random Forest

54 samples
5 predictor

No pre-processing
Resampling: Rolling Forecasting Origin Resampling (5 held-out with no fixed window)
Summary of sample sizes: 5, 6, 7, 8, 9, 10, ...
Resampling results across tuning parameters:

mtry RMSE Rsquared MAE
2 496070.0 0.2751845 469790.8
3 489151.9 0.2579826 461127.9
5 505878.9 0.2609430 475273.9

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 3.

plot(varImp(rf.mod))
Temperature and weight stick out — again.

Let's take a look at the RMSE ratio. Clear to see is that our RMSE is about five times higher than that of the old model — the farmer's model.

df_train <- df_train%>%cbind(., pred=predict(rf.mod))
df_train%>%
  mutate(RMSE = sqrt(mean((.$CumulatiefWerkelijkkg - .$CumulatiefPrognosekg)^2)) / sqrt(mean((.$CumulatiefWerkelijkkg - .$pred)^2)))%>%
  summarise(averageRMSE = mean(RMSE))
averageRMSE
1 0.1968245

If we do the analysis separately, we can get a glimpse of how big the RMSE is for both models. Of course the RMSE values explode towards the end as the cumulative values increase, but then again, if we are interested in cumulative yield, we should also measure the RMSE on that metric. Exploding or not.

sqrt(mean((df_train$CumulatiefWerkelijkkg - df_train$CumulatiefPrognosekg)^2))
[1] 33921.19
sqrt(mean((df_train$CumulatiefWerkelijkkg - df_train$pred)^2))
[1] 172342.3
sqrt(mean((df_train$CumulatiefWerkelijkkg - df_train$CumulatiefPrognosekg)^2)) / sqrt(mean((df_train$CumulatiefWerkelijkkg - df_train$pred)^2))
[1] 0.1968245

And then there are the plots.

ggplot(df_train)+
  geom_point(aes(x=Datum, y=CumulatiefWerkelijkkg, col="Observed"))+
  geom_line(aes(x=Datum, y=CumulatiefWerkelijkkg, col="Observed", group=1))+
  geom_point(aes(x=Datum, y=CumulatiefPrognosekg, col="Old model"))+
  geom_line(aes(x=Datum, y=CumulatiefPrognosekg, col="Old model", group=1))+
  geom_bar(aes(x=Datum, y=CumulatiefPrognosekg-CumulatiefWerkelijkkg), fill="black", stat="identity", alpha=0.3)+
  geom_bar(aes(x=Datum, y=pred-CumulatiefWerkelijkkg), fill="purple", stat="identity", alpha=0.3)+
  geom_point(aes(x=Datum, y=pred, col="RandomForest"))+
  geom_line(aes(x=Datum, y=pred, col="RandomForest", group=1))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Cumulative production (kg)",
       title="Tomato cumulative production - 2021 & 2022",
       subtitle="Observed vs Predicted",
       col="Metric")
The graph shows that our model performs worse overall.

So, what about 2023?

df_test <- df_comp%>%
  filter(year=="2023")%>%
  select(-year)%>%
  cbind(., pred=predict(rf.mod, newdata = .))

df_test%>%
  ggplot(.)+
  geom_point(aes(x=Datum, y=CumulatiefWerkelijkkg, col="Observed"))+
  geom_line(aes(x=Datum, y=CumulatiefWerkelijkkg, col="Observed", group=1))+
  geom_point(aes(x=Datum, y=CumulatiefPrognosekg, col="Old model"))+
  geom_line(aes(x=Datum, y=CumulatiefPrognosekg, col="Old model", group=1))+
  geom_bar(aes(x=Datum, y=CumulatiefPrognosekg-CumulatiefWerkelijkkg), fill="black", stat="identity", alpha=0.3)+
  geom_bar(aes(x=Datum, y=pred-CumulatiefWerkelijkkg), fill="purple", stat="identity", alpha=0.3)+
  geom_point(aes(x=Datum, y=pred, col="RandomForest"))+
  geom_line(aes(x=Datum, y=pred, col="RandomForest", group=1))+
  theme_bw()+
  theme(legend.position = "bottom")+
  labs(x="Date",
       y="Cumulative production (kg)",
       title="Tomato cumulative production - 2023",
       subtitle="Observed vs Predicted",
       col="Metric")

Well, this is where it actually gets quite interesting, because it seems that for 2023 the model is not even doing that badly. Yes, the predictions still have ups and downs, and yes, overall it performs worse than the farmer's model, but it actually performs better in test than in train.

And if we calculate the RMSE ratio, we can now see that our model is only about twice as bad. I am not entirely sure why, which gives me material for the next conversation with the farmer.

sqrt(mean((df_test$CumulatiefWerkelijkkg - df_test$CumulatiefPrognosekg)^2))
[1] 78811.72
sqrt(mean((df_test$CumulatiefWerkelijkkg - df_test$pred)^2))
[1] 143826.7
sqrt(mean((df_test$CumulatiefWerkelijkkg - df_test$CumulatiefPrognosekg)^2)) / sqrt(mean((df_test$CumulatiefWerkelijkkg - df_test$pred)^2))
[1] 0.5479632

One last time, I can go and shoot with multiple other vanilla models.

control = trainControl(method = 'repeatedcv',
                       number = 10,
                       repeats = 3,
                       returnResamp = "all",
                       savePredictions = "all")
rf.mod <- train(x=trainX,
                y=trainY,
                method = "rf",
                trControl = control,
                metric='RMSE')
lm.mod <- train(x=trainX,
                y=trainY,
                method = "lm",
                trControl = control,
                metric='RMSE')
glmnet.mod <- train(x=trainX,
                    y=trainY,
                    method = "glmnet",
                    trControl = control,
                    metric='RMSE')
nn.mod <- train(x=trainX,
                y=trainY,
                method = "nnet",
                trControl = control,
                metric='RMSE')
results <- resamples(list(RF=rf.mod,
                          LM=lm.mod,
                          GLMNET=glmnet.mod,
                          NN=nn.mod))
summary(results)

Call:
summary.resamples(object = results)

Models: RF, LM, GLMNET, NN
Number of resamples: 30

MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 101463.08 206440.9 232107.5 260545.4 312073.9 527262.9 0
LM 95298.68 284426.9 362422.8 351370.0 431847.1 581031.1 0
GLMNET 161463.21 284543.0 329746.6 343262.7 412246.0 565464.6 0
NN 604648.00 709758.4 809272.4 819320.5 941546.4 1088515.3 0

RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 129667.7 244282.4 290079.9 318749.7 349480.5 742457.5 0
LM 118483.9 328530.3 447476.0 426938.5 491649.3 759670.0 0
GLMNET 194074.5 341022.9 402591.9 424214.2 475645.1 753914.0 0
NN 772813.0 904605.5 998230.1 1000706.7 1112814.9 1228418.5 0

Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.03792723 0.6786296 0.8129782 0.7396046 0.9170701 0.9687293 0
LM 0.04958415 0.3426814 0.5540238 0.5574358 0.8409590 0.9901062 0
GLMNET 0.00820621 0.4171464 0.6291538 0.5834208 0.8141671 0.9696284 0
NN NA NA NA NaN NA NA 30

bwplot(results)
And now we see that multiple models have something to offer.

This is where I will stop, because I first need to collect additional information. Will keep you guys updated!


Dr. Marc Jacobs

Scientist. Builder of models, and enthusiast of statistics, research, epidemiology, probability, and simulations for 10+ years.