Predicting the Feed Consumption of a Swine Farm

Using Time-Series, Linear Mixed Models and Generalized Additive Mixed Models in R

Dr. Marc Jacobs
7 min read · Feb 22, 2022

In this blog post I will show you a particular way of predicting the feed curve of a swine farm using several different types of models. Of course, the road to a successful model is not the technique itself, but rather the validity of the research question and its potential implications.

Let's walk through the example. Since it is commercial data, I cannot share the data with you. Then again, it is always best for the learning exercise if you try to apply the code and way of working to your own data. Any dataset will do, as long as it has repeated observations.

So, let's start by loading the libraries and the data.

As you can see, quite a nice dataset. It is not that clean, since it is commercial data. Clean data only exist in text-book examples.
Data on three farms, and the stables / pens within each farm.
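# Plot the share of missing values per column in the raw data
DataExplorer::plot_missing(combined4)
# Drop the columns that are missing in every single row
df <- combined4[, colSums(is.na(combined4)) < nrow(combined4)]
DataExplorer::plot_missing(df)
# Count the observations per identifier, and cross-tabulate two of them
table(df$ble_id)
table(df$vbf_id)
table(df$vbf_id, df$lcn_id)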
Quite some missing data.

Now, within each pen there are different feeding cycles. So, if we want to analyse the feed curve as a function of day-in-curve, I need to split the data further by adding another layer, which I call newField. In the end, we have data on farm level, pen level, and newField (run) level.
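A minimal sketch of how such a run identifier could be derived, assuming that a new feeding cycle starts whenever the day counter within a pen drops back. The toy data frame and the column names `pen` and `curve_day` are assumptions for illustration; only the name `newField` comes from this post.

```r
# Toy data: two pens, each with two feeding cycles (hypothetical values).
toy <- data.frame(
  pen       = rep(c("A", "B"), each = 6),
  curve_day = c(1, 2, 3, 1, 2, 3,  1, 2, 1, 2, 3, 4)
)

# Within each pen, a new run starts whenever curve_day does not increase.
new_run <- ave(toy$curve_day, toy$pen,
               FUN = function(d) cumsum(c(TRUE, diff(d) <= 0)))

# Combine pen and run counter into one identifier per feeding cycle.
toy$newField <- paste(toy$pen, new_run, sep = "_")
table(toy$newField)
```

With real data the rule for "a new cycle starts here" may well be different (for example a gap in dates), so the condition inside `cumsum()` is the part to adapt.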

The original data, the run identifier added, and the daily feed provision. The first two plots to the left show the cumulative feed provision. In this post I will show you how to model the data on the right to mimic the data on the left.

Let's proceed with the data exploration. Never think this part of the process is a waste of time! It has helped me out many, many times, often after building my first set of models, when it helped me understand why none of them made sense. You will often find yourself going back and forth between plots and models.

Feed curves plotted by farm and pen. Either cumulative or not, looking at components or not. There is no shortage of data to say the least.

My first instinct for modelling this type of data was to use time-series models. Besides the data being a time series by nature, there are several ways of creating time-series ensembles that can pick up seasonal influences. So, that is what I tried to do first: pick up any kind of temporal pattern.

Time-series models do not like it (some actually throw a temper tantrum) when you have missing data. So, I used the Kalman method of the imputeTS package to fill in the blanks.
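A small sketch of what Kalman-based imputation with imputeTS can look like, here on a simulated series rather than the farm data. The series, the positions of the gaps, and the choice of `model = "StructTS"` are assumptions for illustration.

```r
library(imputeTS)

set.seed(1)
# Hypothetical daily feed provision for one run, with a few gaps.
feed <- 50 + 2 * (1:60) + rnorm(60, sd = 5)
feed[c(10, 11, 25, 40)] <- NA

# Fit a structural time-series model and fill the gaps with
# Kalman-smoothed estimates; observed values are left as they are.
feed_filled <- na_kalman(feed, model = "StructTS", smooth = TRUE)

sum(is.na(feed))         # number of gaps before imputation
sum(is.na(feed_filled))  # number of gaps after imputation
```

For data with many runs, the imputation would have to be applied per run, since one run should not borrow values from another.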
Plotting a tsibble.

Now, let's use that tsibble and let loose several time-series models, including an ensemble.

And then THIS is what happens. To be honest, I still have not figured out why it produces NULL models, even if I ask it to only run on feed curves that have at least 30m data points included. There should also be no missing data, but it just does not work. If you try to run forecasts on this list of models it will stop immediately since the first model on the list is actually not there. It is a list with holes.
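One generic way to work around a list with holes is to drop the empty entries before forecasting. The sketch below uses base R's `arima()` as a stand-in, not the models from this post; the runs, the `fit_run()` helper and the threshold `min_obs` are hypothetical.

```r
set.seed(42)
# Hypothetical daily feed for three runs: two long enough, one far too short.
runs <- list(
  run_1 = rnorm(40, mean = 2, sd = 0.3),
  run_2 = c(1.2, 2.5),
  run_3 = rnorm(35, mean = 2, sd = 0.3)
)

# Fit one model per run; return NULL when a run is too short or fitting fails.
fit_run <- function(y, min_obs = 30) {
  if (length(y) < min_obs) return(NULL)
  tryCatch(arima(y, order = c(1, 0, 0)), error = function(e) NULL)
}
fits <- lapply(runs, fit_run)

# Drop the holes first, so the forecasting step never sees a NULL.
fits_ok   <- Filter(Negate(is.null), fits)
forecasts <- lapply(fits_ok, function(m) predict(m, n.ahead = 7)$pred)
names(forecasts)
```

This does not explain why a model comes back empty, but it keeps the rest of the pipeline running and makes it easy to list which runs failed.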

In the end, after some tweaking, I gave up and resorted to another set of models: mixed models. What I like so much about mixed models is their ability to deal with missing data (under certain assumptions) and to model longitudinal data in a nested data structure. Exactly what we have here.
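To make the idea concrete, here is a minimal sketch of a linear mixed model with a spline for the curve and nested random intercepts (run within pen). It uses nlme and simulated data; the package choice, the column names `pen`, `curve_day` and `cum_feed`, and the spline degrees of freedom are assumptions, not the models fitted on the farm data.

```r
library(nlme)
library(splines)

set.seed(123)
# Simulated stand-in data: 6 pens x 3 runs x 30 days.
d <- expand.grid(curve_day = 1:30, run = 1:3, pen = paste0("pen", 1:6))
d$newField <- interaction(d$pen, d$run, drop = TRUE)
pen_eff <- rnorm(6, sd = 20)[as.integer(d$pen)]
run_eff <- rnorm(18, sd = 10)[as.integer(d$newField)]
d$cum_feed <- 5 * d$curve_day + 0.3 * d$curve_day^2 +
  pen_eff + run_eff + rnorm(nrow(d), sd = 8)

# Knock out 10% of the outcome to mimic missing records.
d$cum_feed[sample(nrow(d), 54)] <- NA

# Natural spline for the curve shape, random intercepts for pen and
# for run within pen; rows with a missing outcome are simply dropped.
m_lmm <- lme(cum_feed ~ ns(curve_day, df = 3),
             random = ~ 1 | pen/newField,
             data = d, na.action = na.omit, method = "ML")
summary(m_lmm)$tTable
```

Dropping the incomplete rows is only harmless if the data are missing at random given what is in the model, which is the assumption referred to above.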

ANOVA to find the best possible model.
Comparing five models. Model 4 has some serious issues despite having the lowest AIC. Just using the AIC to determine which model to use is not the best strategy as overfitting may easily take place.
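A sketch of such a comparison on simulated data, assuming nlme models that differ only in how flexible the curve is. The three candidate models are illustrative and are not the five models compared in this post.

```r
library(nlme)
library(splines)

set.seed(7)
# Simulated stand-in data: 6 pens x 30 days with a curved feed pattern.
d <- expand.grid(curve_day = 1:30, pen = paste0("pen", 1:6))
d$cum_feed <- 5 * d$curve_day + 0.3 * d$curve_day^2 +
  rnorm(6, sd = 20)[as.integer(d$pen)] + rnorm(nrow(d), sd = 8)

# Fit with ML (not REML) so that models with different fixed effects
# can be compared on likelihood and AIC.
m1 <- lme(cum_feed ~ curve_day,             random = ~ 1 | pen,
          data = d, method = "ML")
m2 <- lme(cum_feed ~ ns(curve_day, df = 3), random = ~ 1 | pen,
          data = d, method = "ML")
m3 <- lme(cum_feed ~ ns(curve_day, df = 6), random = ~ 1 | pen,
          data = d, method = "ML")

cmp <- anova(m1, m2, m3)
cmp
```

The table gives AIC, BIC and likelihood-ratio tests side by side. A lower AIC for the most flexible model is a reason to inspect its standard errors and predictions, not a reason to pick it blindly.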

Let's look at model 4 and model 1 more closely.

Model 4 is not too bad to be honest, but I still do not like the standard error of the coefficient estimates.
Negative predicted values of the cumulative feed provision are not what I want to look at. Random effects look okay.
Residuals: total, across feed run (newField) and pen. Residuals by pen show some funny values. These values may easily disturb the model.
Calibration plot to the left, residuals by time, and QQ-plot by feed run (newField). The calibration plot is not that bad actually, but the residuals by time show some weird values. As if the curve_day variable itself is not really correct.

The model coefficients in a nice table below.

I already showed you one calibration plot. Let's look at several more. To be honest, calibration plots are not that handy for comparing models. It would have been better to show density plots of the residuals across models. However, by the time I got here, I had already made up my mind that Linear Mixed Models would not be the model set to use.
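For reference, a density plot of residuals across models takes only a few lines in base R. The two `lm()` fits below are simple stand-ins on simulated data, chosen only to show the plot; they are not the mixed models from this post.

```r
set.seed(11)
# Simulated curved feed pattern, five repeats of a 30-day curve.
day <- rep(1:30, times = 5)
y   <- 5 * day + 0.3 * day^2 + rnorm(length(day), sd = 8)

# Two stand-in models: a straight line and a quadratic curve.
fit_lin  <- lm(y ~ day)
fit_quad <- lm(y ~ poly(day, 2))

dens_lin  <- density(resid(fit_lin))
dens_quad <- density(resid(fit_quad))

# Overlay both residual densities on a shared x-axis.
plot(dens_quad, main = "Residual densities by model", xlab = "Residual",
     xlim = range(dens_lin$x, dens_quad$x))
lines(dens_lin, lty = 2)
legend("topright", legend = c("quadratic", "linear"), lty = c(1, 2))
```

The narrower and more centred a density is, the closer the predictions of that model sit to the observed values.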

Different colors mark different models, which reveal different levels of accuracy of the model fit. It does not matter anymore to be honest, since I wanted to shift to a different set of models by this time anyhow.

What I did above was add splines to linear mixed models, a procedure I have adopted numerous times across different species. The modelling of feed data is not new either. But this time I wanted to try a different way of modelling, using Generalized Additive Models in a Mixed Model format. In addition, I wanted to split the data into train and test sets to get a better grip on the model.
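A minimal sketch of that setup with mgcv, on simulated data: a smooth of the day in the curve, a random intercept per pen via `bs = "re"`, and whole runs held out as test data. The data, the column names, the split rule and the choice of predictors are all assumptions for illustration.

```r
library(mgcv)

set.seed(2022)
# Simulated stand-in data: 5 pens x 4 runs x 40 days, S-shaped curve.
d <- expand.grid(curve_day = 1:40, run = 1:4, pen = paste0("pen", 1:5))
d$newField <- interaction(d$pen, d$run, drop = TRUE)
d$cum_feed <- 400 / (1 + exp(-(d$curve_day - 20) / 6)) +
  rnorm(5, sd = 15)[as.integer(d$pen)] + rnorm(nrow(d), sd = 6)

# Hold out whole runs: the last run of every pen becomes the test set.
train <- d[d$run < 4, ]
test  <- d[d$run == 4, ]

# Smooth effect of day in curve plus a random intercept per pen.
m_gam <- gam(cum_feed ~ s(curve_day, k = 10) + s(pen, bs = "re"),
             data = train, method = "REML")

# Evaluate on runs the model has never seen.
test$pred <- predict(m_gam, newdata = test)
sqrt(mean((test$cum_feed - test$pred)^2))  # test RMSE
```

Splitting by run rather than by row matters here: with a random row split, neighbouring days of the same run end up in both sets and the test error looks better than it is.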

Train and test data of selected pens. THIS is the level at which I will be modelling using GAMs in a Mixed Model structure.
The model output.
And the functions for each of the predictors included to model the cumulative feed curve.
Calibration plots. For some pens, the model performs much better than for other pens.
Residuals as density plots, and predictions per feed run (newField).
Calibration plot, overall, per feed run. To the right, the observed and predicted feed curve per feed run. The GAMs do a very nice job, but there is some intermittent demand / provision going on that the GAMs do not directly pick up. I need to keep a careful eye on that.

So, the above exercise was on the total cumulative feed provided. But, what if we estimate the feed provision by component? As you can see, all the models I create are run on the individual level and predictions are then aggregated to form a cumulative feed curve. That seems to do the trick. It is not perfect, but no model is.
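The aggregation step itself is simple. A sketch with made-up daily predictions for two feed runs; the column names `pred_daily` and `pred_cum` are hypothetical.

```r
# Hypothetical daily predictions for two feed runs.
pred <- data.frame(
  newField   = rep(c("A_1", "A_2"), each = 4),
  curve_day  = rep(1:4, times = 2),
  pred_daily = c(10, 12, 15, 18,  8, 11, 13, 16)
)

# Sort by run and day, then accumulate the predictions within each run.
pred <- pred[order(pred$newField, pred$curve_day), ]
pred$pred_cum <- ave(pred$pred_daily, pred$newField, FUN = cumsum)
pred
```

Because errors accumulate along with the predictions, a small daily bias can become a visible drift in the cumulative curve, which is one reason to re-estimate regularly.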

Let's try out how things go for component 1.

Predictors and predictions. Not too bad.
Predictions on the individual and the cumulative level. Easy to see how the fit on the training data is better than on the test data, but the test data is really not that bad. Especially if you will deploy this model and re-estimate predictions daily.

On to component 2.

A bit more tricky, but it does seem to do the trick. Empty observed values are true empty values of component 2. The model keeps predicting though because component 1 is a predictor of component 2.
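A sketch of why predictions continue where component 2 itself is empty: once component 1 enters the model as a predictor, a prediction only needs the day and a value for component 1. The data and the column names `comp1` and `comp2` are simulated assumptions.

```r
library(mgcv)

set.seed(5)
n <- 200
# Simulated stand-in data: component 2 depends on day and on component 1.
d <- data.frame(curve_day = rep(1:40, times = 5))
d$comp1 <- 2 + 0.05 * d$curve_day + rnorm(n, sd = 0.2)
d$comp2 <- 0.5 + 0.8 * d$comp1 + 0.02 * d$curve_day + rnorm(n, sd = 0.2)

# Component 2 modelled as a smooth function of day and of component 1.
m_c2 <- gam(comp2 ~ s(curve_day) + s(comp1), data = d, method = "REML")

# New rows have no observed comp2 at all, yet a prediction is returned.
new_days <- data.frame(curve_day = 38:40, comp1 = c(3.9, 4.0, 4.0))
p_c2 <- predict(m_c2, newdata = new_days)
p_c2
```

If an empty value of component 2 truly means "nothing was fed", such predictions would need to be masked or set to zero afterwards.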

Now, let's add component 3 to the mix, add everything together and see how the predictions do.

Components 1 and 2 together are really not that bad!
Not bad at all! Okay, it is not perfect, but we are able to do it!

So, this was a small example of how to use Generalized Additive Mixed Models for estimating the feed curves of a swine farm. To be continued, as this was just a very small beginning!


Dr. Marc Jacobs

Scientist. Builder of models, and enthusiast of statistics, research, epidemiology, probability, and simulations for 10+ years.