Using Machine Learning to Predict the Weather in Basel — Pt. 3 Trees with Smile

This is the third part of a series of articles that focuses on doing data analysis and machine learning on the JVM with Scala, using libraries such as Smile, Deeplearning4j and EvilPlot. The complete code for this blog post can be found on GitHub.

In the last part of this series we looked at ordinary least squares regression with Smile, and found that it gives us only slightly better results than always predicting the mean precipitation of our training data. In this part we are going to look into doing regression with decision trees. If you are wondering how they work, I invite you to take a look at Dr. Saed Sayad's data science page, where you will find a nicely illustrated explanation.

Decision trees have the following advantages that I want to highlight here:

  1. The algorithm that builds them is easy to understand.
  2. They are quite flexible and have low bias if the number of leaves is high.
  3. After being constructed, they can be examined and interpreted.

However, they have one major disadvantage: they can have very high variance, especially when the number of leaves is high. Because their predictions are piecewise constant, they also result in non-smooth approximations.

Regression trees can be trained conveniently in Smile using the smile.regression.cart function. As with other Smile algorithms, the training data is expected to have the type Array[Array[Double]], and the corresponding target values are passed in as an Array[Double]. Last but not least, we have to specify the maximum number of leaf nodes.
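For illustration, here is a minimal sketch of training a single tree, assuming the Smile 1.x Scala API with array-based inputs (the synthetic data merely stands in for our weather features):

```scala
import smile.regression.cart

// Synthetic stand-in for the weather features (hypothetical data).
val rng = new scala.util.Random(42)
val features: Array[Array[Double]] = Array.fill(200)(Array.fill(5)(rng.nextDouble()))
val labels: Array[Double] = features.map(row => 3.0 * row(0) + rng.nextGaussian())

// The third argument is the maximum number of leaf nodes.
val tree = cart(features, labels, 20)

// Predict the target for a single feature vector.
val prediction = tree.predict(features.head)
```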

More nodes give the model more flexibility and thus a lower training error. On the other hand, if our trees are too large, they can easily overfit the training data and perform poorly on new data. To find the best balance between these two aspects, we could simply try different numbers of nodes and compare the test errors we get. After doing so, however, our test set would no longer be a test set, in the sense that it would no longer give us an unbiased estimate of the true error of our model. Thus what is commonly done is one of two things: setting aside a separate validation set, or using a resampling method such as cross validation or the bootstrap.

While setting aside a validation set is definitely the method of choice for medium and large amounts of data, cross validation and bootstrapping can give more accurate results, at the expense of additional computing power, when dealing with rather small amounts of labeled data.

In k-fold cross validation, we first divide our training set into k folds, and then successively use one of these folds to evaluate our model and the remaining folds to train it. After repeating this procedure k times, once for each fold, we take the average of the metric we are interested in. This is best explained by a diagram:

Image from Karl Rosaen's blog: http://karlrosaen.com/ml/learning-log/2016-06-20/

The more folds we use, the better the error estimates we get; of course, we have to pay for this with CPU cycles. In the most extreme case, called leave-one-out cross validation, we use as many folds as we have training examples. In practice, between 5 and 25 folds seem to be the most common. To get more stable estimates, one can also run cross validation multiple times with different random seeds and average the results. Here is a simple function that uses Smile's built-in cross validation support to calculate the RMSE for a regression tree with a given maximum number of leaf nodes:
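A sketch of what such a function might look like, assuming the Smile 1.x Scala API, where cv takes the data, the number of folds, one or more regression measures and a trainer function, and returns one score per measure:

```scala
import smile.regression.cart
import smile.validation.{cv, RMSE}

// Estimate the cross validation RMSE of a regression tree with the
// given maximum number of leaf nodes (sketch; API shape as assumed above).
def cvRmse(x: Array[Array[Double]], y: Array[Double],
           maxNodes: Int, folds: Int = 25): Double =
  cv(x, y, folds, new RMSE)((trainX, trainY) => cart(trainX, trainY, maxNodes)).head
```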

To select the maximum number of leaf nodes, I ran 25-fold cross validation 100 times for every maximum leaf count from 2 to 50, averaged the results, and plotted the cross validation error (red) together with the training error (blue) using EvilPlot. This is the result for time series of length one (the full code that generated this plot can be found here):

Cross validation RMSE vs. train RMSE for different numbers of leaf nodes when looking at time series of length 1
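The selection loop behind such a plot might look roughly like the following sketch (building on the cvRmse helper sketched above; the number of repetitions is reduced and the EvilPlot code omitted):

```scala
// Average repeated cross validation runs for each candidate number
// of leaf nodes, then pick the minimum.
val candidates = 2 to 50
val repeats = 10 // 100 in the experiment described above
val cvErrors = candidates.map { maxNodes =>
  val runs = (1 to repeats).map(_ => cvRmse(features, labels, maxNodes))
  maxNodes -> runs.sum / repeats
}
val (bestNodes, bestRmse) = cvErrors.minBy(_._2)
println(s"best maximum number of leaf nodes: $bestNodes (RMSE $bestRmse)")
```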

The trends we see are as expected: at first, as we increase the number of leaves, both the cross validation error and the training error go down. After a certain point, however, which happens to be 11 in our plot, the cross validation error, which is an approximation of the true error of our model, starts to increase again, although the training error keeps falling. Enter overfitting!

Unfortunately though, even with the optimal number of maximum leaf nodes, the cross validation error is still about 4.6mm. Repeating the same process for time series of length 2 (that is, using features that include metrics from one and two days before the day for which we want to make our prediction), we find that the optimal number of maximum nodes in this case is only 4. The same is in fact true for time series of length 3 and 9.

Cross validation RMSE vs. train RMSE for different numbers of leaf nodes when looking at time series of length 2

The following table summarizes our results and adds them to what we already had at the end of the last part of this series:

Note the two additional columns RMSE CV and MAE CV containing cross validation metrics. They have been calculated by running 25-fold cross validation 50 times and averaging the results. If we had to choose the best model, we would base our choice on these new columns rather than on the columns with the test error. Thus, at this point, we would pick OLS-2, since it has both the lowest cross validation RMSE and the lowest cross validation MAE.

Apart from that, as you can see, these numbers are quite similar to what we already had with ordinary least squares. However, as the plot with cross validation and training error above shows, trees were indeed able to adapt better to our training set than OLS, but they failed to generalize.

Before we discuss this in more detail, let’s take a closer look at the generated trees. As mentioned at the beginning of our discussion, one of the advantages of regression trees is their interpretability. Indeed, Smile RegressionTrees can be converted to DOT for inspection by simply invoking the dot method. Here for instance is the graph for the regression tree we get for time series of length 1, when the maximum number of leaf nodes is set to 4 (the graph was rendered using GraphvizOnline):
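The export itself might look like this sketch (per the Smile 1.x API, dot is assumed to return the DOT source as a plain string):

```scala
import java.nio.file.{Files, Paths}
import smile.regression.cart

// Train a small tree and dump its DOT representation to a file,
// ready to be rendered with Graphviz.
val smallTree = cart(features, labels, 4)
Files.write(Paths.get("tree.dot"), smallTree.dot.getBytes("UTF-8"))
```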

V2, V9 and V26 stand for the features with indices 2, 9 and 26, respectively. To make sense of this, you can consult the following table:

The score given in the various nodes is the reduction in mean squared error achieved by the associated split. Using the table above, we see that the model first checks whether the minimal sea level pressure is below or equal to 1013.65 hPa. If this is false, it predicts a precipitation of 1.2716mm. Otherwise, it inspects the high cloud cover: if the daily mean is below or equal to 48.27%, it predicts 3.247mm; if not, the total precipitation is examined. If that is less than or equal to 1.45mm, the prediction is 4.1571mm, otherwise it is 6.5427mm.
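Written out as plain code, the decision logic of this particular tree amounts to the following (the values are taken from the graph as just described; the parameter names are mine):

```scala
// The four-leaf regression tree as nested conditionals.
def predictPrecipitation(minSeaLevelPressure: Double,
                         meanHighCloudCover: Double,
                         totalPrecipitationBefore: Double): Double =
  if (minSeaLevelPressure > 1013.65) 1.2716       // high pressure: little rain
  else if (meanHighCloudCover <= 48.27) 3.247     // low pressure, few high clouds
  else if (totalPrecipitationBefore <= 1.45) 4.1571
  else 6.5427                                     // low pressure, cloudy, rain before
```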

Note that the aforementioned score can be used to get a sense of the importance of individual features. One can therefore use regression trees for feature selection, and actually run a different algorithm on the selected features later.
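Here is a sketch of ranking features this way, assuming the trained model exposes an importance method returning one score per feature column, as I believe the Smile 1.x tree models do:

```scala
// Rank features by their importance scores (sketch; importance is
// assumed to return one score per feature column, larger = more important).
val importance: Array[Double] = smallTree.importance
importance.zipWithIndex.sortBy(-_._1).take(5).foreach { case (score, index) =>
  println(f"V$index: $score%.3f")
}
```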

Now let’s come back to the actual performance of our tree models. As we have seen, regression trees were able to adapt much better to our training data than ordinary least squares, but in a non-generalizable manner. Their cross validation and test metrics are in fact consistently worse than what we got from ordinary least squares. This brings us to one of the major disadvantages of regression trees that was already mentioned above: their tendency to overfit.

One variation of the regression tree algorithm that attempts to deal with this problem is called Random Forest. Instead of training just one tree, this algorithm trains many trees on different samples of the training data, using randomly chosen subsets of the feature columns. To make predictions, the results of the individual trees are averaged. This indeed turns out to give us slightly better results than plain trees, but I wasn’t able to outperform ordinary least squares, even after extensive hyperparameter tuning.
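Training one in Smile might look like this (a sketch, assuming the Smile 1.x Scala API, whose defaults include 500 trees):

```scala
import smile.regression.randomForest

// Train a random forest with default parameters (sketch).
val forest = randomForest(features, labels)
val forestPrediction = forest.predict(features.head)
```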

Another variation of the regression tree algorithm is called Gradient Boosted Trees. Explaining this algorithm in detail is outside the scope of this article, but the basic idea is to train a sequence of trees, where each individual tree is trained to correct the errors of its predecessors. Unfortunately, as for the performance of this algorithm on our data, the situation is quite similar to what was already said about random forests.
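A corresponding sketch for gradient boosted trees, again assuming the Smile 1.x Scala API (the loss parameter name is an assumption; least squares loss mirrors the configuration used for the results below):

```scala
import smile.regression.gbm
import smile.regression.GradientTreeBoost

// Train gradient boosted trees with least squares loss and otherwise
// default parameters (sketch; the parameter name `loss` is assumed).
val boosted = gbm(features, labels, loss = GradientTreeBoost.Loss.LeastSquares)
val boostedPrediction = boosted.predict(features.head)
```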

Here is the updated table with evaluation metrics from random forests (Forest-1) and gradient boosted trees (GBM-1, GBM-2):

Random forests were trained with default parameters, and so were gradient boosted trees, except that I used GradientTreeBoost.Loss.LeastSquares instead of LeastAbsoluteDeviation. The careful reader will notice that random forests as well as gradient boosted trees suffer from overfitting, since in both cases the cross validation and test errors are significantly larger than the training errors. Also note the absence of Forest-3, Forest-9, GBM-3 and GBM-9. This is not only to keep the table more concise, but also because obtaining these numbers takes a considerable amount of computing resources (remember that we have to run 25-fold cross validation 50 times for each row). Here is a table with the training times for the different algorithms on my workstation (results were obtained using ScalaMeter):

As you can see, as long as we use ordinary least squares or simple trees, training times are well below 0.1s. The ensemble algorithms, however, might take almost a minute to train. Roughly speaking, a random forest with the default configuration takes about 100 times longer to train than a single tree with 11 leaf nodes (this factor would be even higher if we looked at CPU time, since the trees in the forest are trained in parallel).

When looking at prediction times, we get the following timings:

As before, simple trees and ordinary least squares are blazingly fast. Performing predictions with a forest of 500 trees is about 5000 times slower than with a single, simple tree. With gradient boosted trees (whose individual trees have only 6 leaf nodes), the slowdown factor is only (!) around 200.

Also note that while the serialized OLS-1 model takes about 80KB on disk, the uncompressed Forest-1 model weighs in at 69MB when using standard Java serialization.
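A quick way to measure such sizes is to serialize a model into an in-memory buffer (a sketch; the Smile 1.x models used here are Java serializable):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Serialize a model in memory and report its size in bytes.
def serializedSize(model: java.io.Serializable): Int = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  try out.writeObject(model) finally out.close()
  buffer.size()
}
```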

Their immense hunger for computing resources, together with the fact that they did not significantly outperform the other algorithms we have looked at so far, clearly disqualifies tree-based ensemble models for our particular problem. And while single trees are CPU-efficient and nicely interpretable, they still perform slightly worse than ordinary least squares.

We should be careful not to shoot the messenger though: all the metrics we have assembled so far clearly indicate that our data is mostly noise, and that a few readings from the preceding days simply do not allow for more accurate predictions. Even the best models cannot establish relations that are not there. In the next part of this series I will give the problem one last shot by applying neural networks, and use the opportunity to introduce DL4J and compare it with Smile.