Using Machine Learning to Predict the Weather in Basel — Pt. 2: Ordinary Least Squares With Smile

This is the second part of a series of articles that focuses on doing data analysis and machine learning on the JVM with Scala, using libraries such as Smile, Deeplearning4j and EvilPlot. The complete code for this blog post can be found on GitHub.

Remember that our goal, as stated in the first part of the series, is to predict the total amount of precipitation in millimeters for a given day, by using weather data from the days before. We looked into some baseline algorithms, and summarized our findings in the following table (RMSE stands for “Root Mean Squared Error”, and MAE for “Mean Absolute Error”):

Noting that the average daily precipitation is around 2.5mm, we concluded that there is definitely room for improvement.

Let’s see if the regression algorithms offered by the Smile library can help us here. Smile describes itself as a fast and comprehensive machine learning system that delivers state-of-the-art performance. It requires only the Java standard library, and can optionally use native BLAS/LAPACK implementations.
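
Smile’s Scala bindings are published on Maven Central under the com.github.haifengl organization, in case you want to follow along. A minimal sbt setup might look like this (the version shown here is only an assumption; pick whatever the accompanying repository uses):

```scala
// build.sbt — a minimal sketch; the Smile version is an assumption
libraryDependencies += "com.github.haifengl" %% "smile-scala" % "1.5.3"
```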

Before diving into details, let’s take a step back and look at what we are trying to do from a more abstract standpoint. Our training data consists of tuples of the form (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) (try installing the Symbola font if your browser fails to display some of the math symbols), where each xᵢ consists of various weather-related metrics from the time before a certain day, and yᵢ represents the amount of precipitation in millimeters on that day. The xᵢs, our features, are real-valued vectors, and their dimension m, as well as their exact values, depend on the selected feature encoding. We write xᵢ = (xᵢ₁ xᵢ₂ … xᵢₘ)ᵗ (ᵗ stands for transpose) for the individual components of our feature vectors. Our objective is to learn a mapping x → y from feature vectors to target values that generalizes well beyond our training data.

Ordinary least squares linear regression is one of the simplest algorithms that can do that for us. Given a row matrix A = (a₁ a₂ … aₘ) and a real number b, we can define the function F(A, b): ℝᵐ → ℝ as

F(A, b)(x) = Ax + b = a₁x₁ + a₂x₂ + … + aₘxₘ + b

Ideally, we would like to choose A and b such that F(A, b)(xᵢ) ≈ yᵢ for all i. Let’s express this more formally, in a way that allows the problem to be attacked mathematically. We define the loss for a single training example (xᵢ, yᵢ) by

Lᵢ(A, b) = (F(A, b)(xᵢ) − yᵢ)²

Of course, we are not interested in the loss on a single example only, but rather in the loss on our whole training set, which we define by

L(A, b) = Σᵢ₌₁ⁿ Lᵢ(A, b)

What ordinary least squares linear regression does is find A* and β for us such that

L(A*, β) = min { L(A, b) : A ∈ ℝ¹ˣᵐ, b ∈ ℝ }

The hope is that F(A*, β) will not only give us reasonable predictions on our training data, but also on data the model was not trained on.
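
As a side note, this minimization problem has a well-known closed-form solution: if we collect the feature vectors as the rows of a matrix X (with an extra constant 1 column that absorbs b), and the target values in a vector y, the optimal weights w* satisfy the so-called normal equations XᵗXw* = Xᵗy, so w* = (XᵗX)⁻¹Xᵗy whenever XᵗX is invertible. In practice, libraries like Smile solve this system numerically, typically via a QR or SVD decomposition, rather than forming the inverse explicitly.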

After this short theoretical excursion, let’s see how this can be accomplished in practice with Smile. Remember from the first part that our training data has the type Seq[Seq[Map[String, Double]]]. The Map[String, Double] represents a set of weather metrics from a single day, and the inner Seq stands for a time series, where the last element will be used to extract the value we want to predict. The inner Seq will therefore always have at least length 2. Looking at the relevant Smile Scaladocs or Javadocs, we see that in order to train a regression model with Smile, we need to convert our data into an Array[Array[Double]] containing the training features, and a corresponding Array[Double] containing the target values. There are infinitely many ways of doing this, but one of the most straightforward ones is probably this:
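
A minimal sketch of such a helper: the object name ExportDataUtils matches the reference below, but the method name flatten and the alphabetical key ordering are assumptions (sorting the keys guarantees that every row ends up with the same column order):

```scala
object ExportDataUtils {
  // Flatten one day of weather metrics into a plain numeric vector.
  // Sorting by key gives every row the same, deterministic column order.
  def flatten(metrics: Map[String, Double]): Seq[Double] =
    metrics.toSeq.sortBy { case (key, _) => key }.map { case (_, value) => value }
}
```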

Using ExportDataUtils, we can implement the full transformation Seq[Seq[Map[String, Double]]] => (Array[Array[Double]], Array[Double]) as outlined below:
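
Roughly like this; the function name toFeaturesAndLabels and the targetColumn parameter are assumptions:

```scala
def toFeaturesAndLabels(
    data: Seq[Seq[Map[String, Double]]],
    targetColumn: String
): (Array[Array[Double]], Array[Double]) = {
  // Every day except the last contributes its flattened metrics to the features...
  val features = data.map { series =>
    series.init.flatMap(ExportDataUtils.flatten).toArray
  }.toArray
  // ...while the value we want to predict is taken from the last day.
  val labels = data.map(series => series.last(targetColumn)).toArray
  (features, labels)
}
```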

Since all Smile regression algorithms implement smile.regression.Regression, we can easily implement our SingleValuePredictor from part one for all Smile regression implementations at once:
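
Here is a sketch of such an adapter. It assumes that SingleValuePredictor from part one exposes a single predict method; the exact signature may differ:

```scala
import smile.regression.Regression

class SmileRegressionPredictor(
    model: Regression[Array[Double]],
    extractFeatures: Seq[Map[String, Double]] => Array[Double]
) extends SingleValuePredictor {
  // Encode the preceding days into a feature vector and let Smile predict.
  override def predict(history: Seq[Map[String, Double]]): Double =
    model.predict(extractFeatures(history))
}
```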

An important detail can be noted here: the feature extraction algorithm Seq[Map[String, Double]] => Array[Double] is passed in explicitly as a constructor argument. This way, the SingleValuePredictorTrainer (defined in part one) has the flexibility of using custom transformations (like dropping specific columns), because it can pass them on to the predictors it creates.

Here is a generic implementation of this trainer, which acts as an adapter for the algorithms implemented under smile.regression:
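
Again a sketch, under the same assumptions about the traits from part one; the trainModel parameter is the only Smile-specific piece:

```scala
class SmileRegressionTrainer(
    trainModel: (Array[Array[Double]], Array[Double]) => Regression[Array[Double]],
    extractFeatures: Seq[Map[String, Double]] => Array[Double],
    targetColumn: String
) extends SingleValuePredictorTrainer {
  override def train(data: Seq[Seq[Map[String, Double]]]): SingleValuePredictor = {
    // Build the feature matrix from all days except the last of each series...
    val features = data.map(series => extractFeatures(series.init)).toArray
    // ...and take the target value from the last day.
    val labels = data.map(series => series.last(targetColumn)).toArray
    new SmileRegressionPredictor(trainModel(features, labels), extractFeatures)
  }
}
```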

Using this template, a trainer that performs ordinary least squares regression using smile.regression.ols is trivially defined:
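
For example, roughly like this. It assumes a Smile version whose Scala API accepts raw arrays in ols(x, y) (newer Smile releases expect a Formula and a DataFrame instead); the feature extractor and the target column name are assumptions as well:

```scala
val defaultFeatures: Seq[Map[String, Double]] => Array[Double] =
  days => days.flatMap(ExportDataUtils.flatten).toArray

val olsTrainer = new SmileRegressionTrainer(
  (x, y) => smile.regression.ols(x, y),
  defaultFeatures,
  "precipitation" // hypothetical name of the target column
)
```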

We finally have everything together to train an ordinary least squares model using our weather data. Employing the same code we already used to evaluate the trivial models in part one, here are the numbers we get, together with what we already had at the end of part one:

OLS-N stands for ordinary least squares using time series of length N. Now let’s think a bit about what these numbers actually mean:

  1. Compared to just predicting the mean amount of precipitation all the time, our root mean squared error “RMSE” went down from 4.8mm to approximately 4.4mm, which is an improvement of approximately 9%. The situation looks similar for the mean absolute error “MAE”.
  2. Using longer sequences doesn’t help.
  3. The training error and the test error are very similar. Thus, our error stems either from high bias or from noise in the data (see bias-variance tradeoff for further information). Even if we had more training data, just throwing it at the algorithm would not help, since this would neither reduce the bias of the model nor the noisiness of our data.

Let’s illustrate the last two conclusions with a few plots before we continue, to make sure that we got them right. To do so, we will take increasingly large parts of our training data, train an OLS model on them, and plot the training error and the test error. If we do this for OLS-1, we get the following picture (the full source that generated this plot can be found here):
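
The core of the loop behind such a plot could look roughly like this; trainingData, testData, the rmse helper and the step size of 250 are assumptions:

```scala
// Sketch of a learning-curve computation; rmse(predictor, data) is a
// hypothetical helper returning the root mean squared error on a data set.
val sizes = 250 to trainingData.size by 250
val learningCurve = sizes.map { n =>
  val subset    = trainingData.take(n)
  val predictor = olsTrainer.train(subset)
  (n, rmse(predictor, subset), rmse(predictor, testData))
}
```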

We see that the training error keeps rising until a training set size of about 3000, where it stabilizes at about 4.4mm. The test error, conversely, falls sharply until a training set size of around 1000, keeps falling somewhat until a training set size of 3000, and stays around 4.6mm afterwards. The careful reader might point out that in the table above, we recorded a test error of 4.4mm for OLS-1, and not 4.6mm as in the plot. The reason for this discrepancy is that when generating the aforementioned table, we removed all values from August 2018 from the training and test sets before doing our calculations, whereas in the plot above, these values have been included. This sensitivity to even minor changes in the training and test data we are working with tells us that we should not read too much into small RMSE changes, as they might be the result of pure chance.

If we create the same plot for OLS-2 (that is, ordinary least squares regression using data from the two days before), we get a somewhat similar picture, although the trend seems to be less clear and somewhat noisy:

Also note that the test error in this plot starts at a training set size of 500, whereas in the plot for OLS-1 we started at 250. This is because the test error for a training set of 250 examples for OLS-2 is almost 41mm. Had we included this point in the plot, we would have seen nothing else.

Last but not least, here is what we get if we do the same plot for OLS-3:

For small training set sizes, the test error is so large that we didn’t include it in the plot: over 1e10 for 250 training examples, and still over 22mm for 1000 training examples. This is overfitting in its purest form. What’s also quite unusual here is that the training error is considerably larger than the test error. This might be a statistical fluke, though. By changing the random seed that determines how we split our data into training and test sets, we get different plots. One time with seed 23:

Here is the same plot, but calculated with the seed 0xCAFEBABE:

Although the details differ somewhat, what all these plots have in common is that the test error shows no considerable improvement once we have about 3000 training examples.

Summarizing, we can say that using ordinary least squares regression didn’t gain us much compared to simply predicting the mean each time. This is not due to a lack of training data, but either because OLS models are too biased, or because the data is in fact mostly noise.

To find out which of these alternatives applies, it would be nice to have a model whose bias we can reduce at will. Decision tree based models allow us to do exactly that. We are going to look into them in depth in the next part of this series.