Using Machine Learning to Predict the Weather in Basel — Pt. 1 Data & Baselines

This is the first part of a series of articles that focuses on doing data analysis and machine learning on the JVM with Scala, using libraries such as Smile, Deeplearning4j and EvilPlot. The complete code for this blog post can be found on GitHub.

As someone who loves being outdoors, I find that the weather plays a significant role in my life. Since I also really like experimenting with data and machine learning, I wondered to what extent machine learning techniques could be applied to weather forecasting. Of course, any serious attempt (and I’m sure such attempts are being made as I write this article) would require access to large amounts of weather data, as well as considerable computing resources. That is not the point of this blog post series, however. I wanted to know how well I could do with my rather modest resources, using data that is freely available.

Since I live in Vienna, the most natural thing to do would have been to forecast the weather for my hometown. However, I then found very detailed, freely available historical weather data for Basel from meteoblue, and decided that it would serve my purpose just fine.

For starters, I decided to download a CSV file that contains a single row for each day, going back to January 1985. Data with hourly resolution is also available, but I wanted to keep it simple. There are 45 columns in total, including:

  • year, month and day
  • the temperature (min, mean, max)
  • the total daily precipitation and snowfall
  • the pressure (min, mean, max)
  • the humidity (min, mean, max)
  • wind direction and speeds (min, mean, max) at different heights
  • high, medium and low cloud covers (min, mean, max)
  • sunshine duration

I parsed this file into a Seq[Map[String, Double]], where each Map[String, Double] contains the data for a single day, keyed by column name.
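A minimal sketch of such a parser, using only the Scala standard library, might look as follows. It assumes a single header line followed by purely numeric cells; the actual export may contain extra metadata lines or empty cells that need to be handled first. The object and method names are hypothetical and not taken from the original code.

```scala
object CsvParsing {
  /** Parses CSV text with one header line into one Map per data row.
    * Assumes every cell after the header can be read as a Double. */
  def parse(csv: String, separator: Char = ','): Seq[Map[String, Double]] = {
    val lines = csv.split("\n").map(_.trim).filter(_.nonEmpty).toList
    lines match {
      case header :: rows =>
        val columns = header.split(separator).map(_.trim)
        rows.map { line =>
          val cells = line.split(separator).map(_.trim.toDouble)
          // Pair each column name with the cell in the same position.
          columns.zip(cells).toMap
        }
      case Nil => Seq.empty
    }
  }
}
```

With rows in this shape, a lookup such as `row("Total Precipitation")` (a made-up column name) returns the value of that column for the given day.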

To get a feeling for the data, let’s use EvilPlot to explore it a bit. First, let’s plot the min, mean and max temperature for each day in July 2018:

Doing this with EvilPlot is really a rewarding experience, and is accomplished by the following piece of code:

Note that since I parsed the CSV into a Seq[Map[String, Double]], expressions like r(Column) just select the respective column in row r. Here is another plot, which shows minutes of sunshine vs. millimeters of rain for July 2018:

This plot has been generated by the following snippet:

It’s exactly the total daily precipitation, represented by the blue bars in the plot above, that we want to predict, using data from the preceding days.

Before employing sophisticated models, let’s look at the performance of some very simple baseline algorithms. If we can’t beat those by a significant margin, either our data is mostly noise or our models are just not worth it. Consider reading How (not) to use Machine Learning for time series forecasting if you are interested in some background.

Our models will implement the following interface:
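The exact interface is not reproduced here, so the following is only a plausible sketch: a hypothetical `Predictor` trait that receives the rows of the preceding days and returns a forecast for the target value. The names `Predictor`, `predict`, `history` and `AlwaysDry` are assumptions made for illustration.

```scala
/** Hypothetical model interface: maps the preceding days to a forecast. */
trait Predictor {
  /** @param history rows of the days before the target day, oldest first
    * @return the predicted target value, e.g. precipitation in mm */
  def predict(history: Seq[Map[String, Double]]): Double
}

/** Toy implementation, only to show how the trait is meant to be used. */
object AlwaysDry extends Predictor {
  def predict(history: Seq[Map[String, Double]]): Double = 0.0
}
```

Keeping the interface this small means that baselines and more elaborate regression models can be evaluated by the same code.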

One of the simplest models one can imagine is the so-called persistence model. It just predicts that today will be the same as yesterday. In code:
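A persistence model could be sketched like this, assuming a hypothetical `Predictor` trait whose `predict` method receives the rows of the preceding days, oldest first. The trait is declared in compact form so that the snippet compiles on its own; the class and parameter names are illustrative.

```scala
// Assumed interface, declared here so the snippet is self-contained.
trait Predictor { def predict(history: Seq[Map[String, Double]]): Double }

/** Forecasts that the target day repeats the most recent observed day. */
final class PersistencePredictor(targetColumn: String) extends Predictor {
  def predict(history: Seq[Map[String, Double]]): Double = {
    require(history.nonEmpty, "need at least one previous day")
    // "Yesterday" is the last row of the history.
    history.last(targetColumn)
  }
}
```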

Another very simple algorithm is to just predict a constant value all the time:
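Under the same assumed `Predictor` interface (again declared in compact form so the snippet stands alone), a constant-value model reduces to a one-liner. The name `ConstantPredictor` is hypothetical.

```scala
// Assumed interface, declared here so the snippet is self-contained.
trait Predictor { def predict(history: Seq[Map[String, Double]]): Double }

/** Ignores its input and always returns the same value. */
final case class ConstantPredictor(value: Double) extends Predictor {
  def predict(history: Seq[Map[String, Double]]): Double = value
}
```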

Setting this constant to the mean of the observed target values minimizes the training error if we use root mean squared error as our measure of performance. To obtain such a predictor from training data, we will implement the following trait, which will also be used by more interesting models later:

An implementation that simply returns a constant value predictor for the mean of the training data could look like this:
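The trainer trait and the mean-based implementation described above might be sketched as follows. All names and signatures (`Trainer`, `train`, `MeanTrainer`, `targetColumn`) are assumptions for illustration; `Predictor` and `ConstantPredictor` are declared in compact form so that the snippet compiles on its own.

```scala
// Assumed supporting types, declared here so the snippet is self-contained.
trait Predictor { def predict(history: Seq[Map[String, Double]]): Double }
final case class ConstantPredictor(value: Double) extends Predictor {
  def predict(history: Seq[Map[String, Double]]): Double = value
}

/** Hypothetical trainer interface: fits a predictor to training rows. */
trait Trainer {
  def train(trainingData: Seq[Map[String, Double]], targetColumn: String): Predictor
}

/** Returns a constant predictor for the mean of the training targets. */
object MeanTrainer extends Trainer {
  def train(trainingData: Seq[Map[String, Double]], targetColumn: String): Predictor = {
    require(trainingData.nonEmpty, "training data must not be empty")
    val targets = trainingData.map(_(targetColumn))
    ConstantPredictor(targets.sum / targets.size)
  }
}
```

Separating training from prediction in this way lets a model with learned parameters be produced by the same `train` call as this trivial one.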

Before evaluating these models, we divide our data into training and test sets, using an 80:20 split. This makes only limited sense at this stage, but it will help when making comparisons later:
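One way to perform such a split is sketched below. It assumes a chronological split, where the oldest 80% of the days form the training set and the most recent 20% the test set, which is a common choice for time series; the text does not state whether the original code splits chronologically or at random. The names are hypothetical.

```scala
object TrainTestSplit {
  /** Splits in order: the first `trainFraction` of the rows become the
    * training set, the remainder the test set. No shuffling is done. */
  def split[A](data: Seq[A], trainFraction: Double = 0.8): (Seq[A], Seq[A]) = {
    require(trainFraction >= 0.0 && trainFraction <= 1.0,
      "trainFraction must lie between 0 and 1")
    val cut = (data.size * trainFraction).toInt
    data.splitAt(cut)
  }
}
```

Splitting without shuffling keeps every test day strictly later than every training day, so the evaluation never uses information from the future.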

Using this split, we get the following results with our trivial algorithms defined above (RMSE stands for root mean squared error and MAE stands for mean absolute error):
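For reference, the two error measures can be computed as in the following sketch. The formulas are the standard definitions; the object and method names are illustrative and not taken from the original code.

```scala
object Metrics {
  /** Root mean squared error: square root of the mean squared difference. */
  def rmse(predicted: Seq[Double], actual: Seq[Double]): Double = {
    require(predicted.nonEmpty && predicted.size == actual.size,
      "inputs must be non-empty and of equal size")
    val squaredErrors = predicted.zip(actual).map { case (p, a) => (p - a) * (p - a) }
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }

  /** Mean absolute error: mean of the absolute differences. */
  def mae(predicted: Seq[Double], actual: Seq[Double]): Double = {
    require(predicted.nonEmpty && predicted.size == actual.size,
      "inputs must be non-empty and of equal size")
    val absoluteErrors = predicted.zip(actual).map { case (p, a) => math.abs(p - a) }
    absoluteErrors.sum / absoluteErrors.size
  }
}
```

Because RMSE squares each error before averaging, it reacts more strongly to a few large misses than MAE does. As a toy illustration (not Basel data), absolute errors of (1, 1, 1, 1) and (0, 0, 0, 4) both give an MAE of 1, but RMSEs of 1 and 2 respectively.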

Note that the RMSEs of these models differ, while the MAEs are almost identical. Since RMSE penalizes large errors more heavily than MAE, this indicates that the persistence model makes fewer errors, but when it does, these errors are more severe. In any case, these numbers are quite high if we keep in mind that the mean total daily precipitation in Basel is around 2.5 mm and the median is 0.1 mm. In the next post of this series, we will see if we can do better using some of the regression algorithms from Smile.