Developing a Weather Model with Machine Learning in Python

Oct 25, 2022 · 8 min read

I decided to see whether it is possible to create a weather model that predicts future temperatures using past weather data and a machine learning algorithm. It will be a challenge, but I’m confident that I can at least create something that predicts temperatures fairly accurately for the next couple of hours.

Snowy winter day in the High Fens (Belgium)

Some Constraints

Of course, I won’t be developing a full-blown weather model. Real weather models need supercomputers to run. Furthermore, real weather models, like GFS and ECMWF, use gigantic amounts of data and complex equations to predict a wide variety of parameters. My little weather model will have some constraints:

It will only be able to predict the temperature and humidity
It will only make a prediction for a single location
It will only forecast for up to 24h ahead
The model will require that the cloud cover and wind speed remain constant over the course of 24h

Finding Data

I live in Central Belgium, so preferably I would want data for my location. The data from personal weather stations are quite limited. The biggest drawback is that they don’t measure cloud cover, which is one of the most important parameters influencing temperature.

The national weather service of Belgium (KMI) doesn’t provide open data, so I will have to find data elsewhere. The Dutch national weather service (KNMI) does offer open data, free of charge. It’s not ideal, but since the southeastern part of the Netherlands has a climate very similar to that of Central Belgium, this data will probably work.

Downloading the Data

Using the KNMI website, I downloaded data for the most southeastern weather station in the Netherlands: Maastricht. I started off by downloading hourly data for 10 years: between 1 October 2012 and 1 October 2022. I might download more data if needed, but roughly 87,000 lines of data will be sufficient for now.

I selected a bunch of parameters I thought could be useful to have for predicting temperature: temperature itself, wind speed, wind direction, cloud cover, humidity and air pressure.

How to Predict the Temperature?

My idea for the model is that the user enters the current weather conditions (temperature, humidity, wind speed, etc.). The model will then calculate the change in temperature based on those inputs. Finally, the program will add the temperature change to the current temperature.

So I will need to create a new column with the change in temperature over the next hour.

Adding Temperature Change

The algorithm for calculating the change in temperature is simple. We’ll just subtract the current temperature from the temperature in one hour.

ΔT = T(H+1) − T(H)

A simplified example of our desired end result

Python code to add the delta T column (T: temperature)
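As a minimal sketch of this step, assuming the hourly data is loaded into a pandas DataFrame with a temperature column T, the ΔT column can be built with a one-row shift. The "datetime" column name and the sample values are made up for illustration and are not taken from the KNMI file.

```python
import pandas as pd

# Tiny made-up sample standing in for the hourly data (illustrative values only).
times = pd.Timestamp("2022-10-01 00:00") + pd.to_timedelta([0, 1, 2, 3, 4], unit="h")
df = pd.DataFrame({"datetime": times, "T": [10.0, 9.5, 9.1, 9.4, 10.2]})

# Rows must be in chronological order before comparing neighbouring hours.
df = df.sort_values("datetime").reset_index(drop=True)

# DT = T(H+1) - T(H): shift(-1) brings the next hour's temperature onto the current row.
df["DT"] = df["T"].shift(-1) - df["T"]

# The last row has no "next hour", so its DT is NaN and the row is dropped.
df = df.dropna(subset=["DT"])
print(df)
```

Note that shift(-1) simply looks at the next row, so this only gives a true one-hour difference if the data has no missing hours.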

Analyzing the Data

With the ΔT column added, let's take a look at the data itself and at how some parameters might be correlated with the change in temperature.

Correlation heatmap for the parameters delta temperature (DT), humidity (U) and cloud cover (N)

The correlation plot shows only weak correlations between the variables. The upper row matters most for our model: it shows that the change in temperature (DT) is only weakly correlated with the other variables.

The reason for this is that the date and time aren’t accounted for yet.
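The numbers behind a heatmap like the one described above come from a plain correlation matrix. Here is a rough sketch of how it could be computed with pandas. The data is synthetic and generated only so the example runs on its own, so the printed values say nothing about the real Maastricht data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the KNMI measurements): cloud cover N in octas (0-8),
# humidity U loosely tied to N, and a DT column that is pure noise.
rng = np.random.default_rng(0)
n = 500
N = rng.integers(0, 9, size=n)
U = np.clip(60 + 4 * N + rng.normal(0, 8, n), 0, 100)
DT = rng.normal(0, 1, n)
df = pd.DataFrame({"DT": DT, "U": U, "N": N})

# Pearson correlation between every pair of columns; this matrix is what a heatmap displays.
corr = df[["DT", "U", "N"]].corr()
print(corr.round(2))
```

The resulting matrix can then be handed to a plotting library of choice to draw the heatmap.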

Date and Time

The most significant factor in the temperature change over the next hour isn’t any of the parameters but the time of day. In the evening, temperatures will fall most of the time, regardless of the parameters. The other input parameters will help to predict the extent of the temperature drop.

Temperature change over the next hour for each hour of the day (UTC)

It’s thus very important to take time into account. Besides the time of day, the date will also be of great importance. In summer, the temperatures start to rise at 6:00, while in winter it’s much later.

To account for the time of day and the time of year, we will need to split up the dataset.
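The hour-of-day pattern described above can be checked by averaging DT per hour. The sketch below does this on a synthetic temperature curve: a sine wave with its maximum at 15h, chosen purely for illustration. The "datetime" column name is an assumption as well.

```python
import numpy as np
import pandas as pd

# Three days of synthetic hourly temperatures following a simple daily cycle.
times = pd.Timestamp("2022-10-01 00:00") + pd.to_timedelta(np.arange(3 * 24), unit="h")
df = pd.DataFrame({"datetime": times})
df["T"] = 10.0 + 5.0 * np.sin(2 * np.pi * (df["datetime"].dt.hour - 9) / 24)

# Change in temperature over the next hour, as defined earlier.
df["DT"] = df["T"].shift(-1) - df["T"]
df = df.dropna(subset=["DT"])

# Average change in temperature for each hour of the day (0-23).
mean_dt_by_hour = df.groupby(df["datetime"].dt.hour)["DT"].mean()
print(mean_dt_by_hour.round(2))
```

Grouping by both month and hour in the same way would show how the pattern shifts through the year.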

Splitting up the Dataset

To take date and time into account, we will split up the dataset into datasets only consisting of the data of a particular month and time combination. An example is October-20h, which consists of all data measured in the month of October at 20h.

Since our dataset covers 10 years, each of these subsets contains roughly 10 × 30 (years × days in a month) = 300 lines of data. I figured this is enough for a decent forecast, though ideally you would want more data to predict the temperature more accurately.

The code for splitting the dataset. First, the data is split into 24 files by hour. In the second loop, it’s further split into hour-month combinations, which resulted in 288 files.
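As an alternative sketch of the same idea, a single pandas groupby can produce all hour-month subsets at once instead of two loops over files. This is not the original code: the subsets are kept in a dictionary rather than written to disk, and the "datetime", "month" and "hour" column names are assumptions.

```python
import numpy as np
import pandas as pd

# One non-leap year of placeholder hourly rows (the temperature value is a dummy).
times = pd.Timestamp("2021-01-01 00:00") + pd.to_timedelta(np.arange(365 * 24), unit="h")
df = pd.DataFrame({"datetime": times, "T": 10.0})

# Derive the two keys used for splitting.
df["month"] = df["datetime"].dt.month
df["hour"] = df["datetime"].dt.hour

# One DataFrame per (month, hour) combination: 12 months x 24 hours.
subsets = {key: group for key, group in df.groupby(["month", "hour"])}

# Example: all rows measured in October at 20h.
october_20h = subsets[(10, 20)]
print(len(subsets), len(october_20h))
```

Each subset could be saved with DataFrame.to_csv if separate files are preferred.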

If we take a look at a correlation heatmap for October-9h, for example, it’s clear that the correlation between delta temperature (DT) and cloud cover (N) and humidity (U) is now more notable. Both have a negative correlation, which means that at 9h the temperature rises quicker if there are fewer clouds and if there is lower humidity.

Another notable correlation is the correlation between humidity (U) and cloud cover (N). The positive correlation between those parameters means that when there are more clouds present, the humidity levels are higher.

Building the Machine Learning Model

I will be using the scikit-learn package for Python to train the model. For this application, I will be using its linear regression model. This type of machine learning model fits the linear relationship between the input variables and the target that best matches the training data.

The code for training the linear regression model

Coding a linear regression model is quite easy. We start off by opening the file for the corresponding hour and month. The x variable consists of all independent variables (the inputs the regression uses to make its prediction). I tried several combinations; the results were best when using humidity (U), cloud cover (N), wind speed (FF) and temperature (T) as independent variables.

Note that I will be using all available data for training the model. If you want to measure accuracy, the data should be split into training and test sets.

The y variable is the parameter that is going to be predicted (the dependent variable), here the change in temperature (DT). We then fit the linear regression model to the data. Afterwards, DT is predicted for the given inputs and added to the input temperature to get the predicted temperature in one hour.
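The steps above could look roughly like the following sketch. It uses scikit-learn's LinearRegression with the four independent variables named in the text. The training data is randomly generated so the example is self-contained, and the input values are made up, so the printed prediction has no meteorological meaning.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for one hour-month subset (about 300 rows, as in the text).
rng = np.random.default_rng(42)
n = 300
subset = pd.DataFrame({
    "U": rng.uniform(50, 100, n),   # relative humidity
    "N": rng.integers(0, 9, n),     # cloud cover in octas
    "FF": rng.uniform(0, 10, n),    # wind speed
    "T": rng.uniform(5, 18, n),     # temperature
})
# Made-up relationship so that there is something to learn.
subset["DT"] = 2.0 - 0.01 * subset["U"] - 0.1 * subset["N"] + rng.normal(0, 0.2, n)

features = ["U", "N", "FF", "T"]   # independent variables (x)
x = subset[features]
y = subset["DT"]                   # dependent variable (y)

# Fit on all available rows, as the article does.
model = LinearRegression().fit(x, y)

# Predict the temperature change for one set of (illustrative) current conditions.
current = pd.DataFrame([{"U": 91.0, "N": 6, "FF": 2.8, "T": 14.3}])[features]
predicted_dt = float(model.predict(current)[0])
predicted_temperature = float(current.loc[0, "T"]) + predicted_dt
print(round(predicted_dt, 2), round(predicted_temperature, 2))
```

To estimate accuracy, the same fit could be run on a training portion only, for example using sklearn.model_selection.train_test_split, and then scored on the held-out rows.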

Predicting Temperature for Multiple Hours

Predicting the temperature for more than one hour requires a small addition to the code. I created a simple loop which calls the function to calculate the temperature. That function returns the temperature prediction, which is subsequently stored in a list. In every iteration of the loop, the hour variable is incremented by one hour. Finally, there are two lines of code to format and add the hour to the list of “hours”.

During every iteration, the newly calculated temperature is fed into the “calculate_temperature” function.

Lastly, the matplotlib library is used to plot the “temperatures” and “hours” lists.
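The loop described above might be sketched as follows. The signature of calculate_temperature is hypothetical, and its body is a crude placeholder rule standing in for the per-hour regression models, so only the loop structure is meant to carry over. The humidity value is illustrative.

```python
def calculate_temperature(month, hour, temperature, humidity, cloud_cover, wind_speed):
    """Placeholder for the hour-month regression: returns the temperature one hour later.

    The fixed warming/cooling rule below is made up so the loop can run on its own.
    """
    delta = 0.8 if 7 <= hour < 15 else -0.5
    return temperature + delta


# Starting conditions; cloud cover and wind speed stay constant throughout.
month, hour = 10, 7
temperature, humidity, cloud_cover, wind_speed = 14.3, 91.0, 6, 10.0

temperatures = [temperature]
hours = [f"{hour:02d}h"]

for _ in range(12):
    # Feed the previous prediction back in as the new current temperature.
    temperature = calculate_temperature(month, hour, temperature,
                                        humidity, cloud_cover, wind_speed)
    hour = (hour + 1) % 24          # advance one hour, wrapping around midnight
    temperatures.append(temperature)
    hours.append(f"{hour:02d}h")

print(list(zip(hours, temperatures)))
```

The two lists can then be passed to matplotlib, for example with plt.plot(hours, temperatures). Note that this sketch keeps the month fixed, so a forecast running past the end of a month would need extra handling.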

The Results

The weather model only works when the cloud cover and wind speed remain relatively stable over the course of the prediction period. On 22.10.2022, the real weather models predicted a consistent cloud cover and wind speed between 7h and 19h (UTC+2). So I went ahead and entered the inputs.

Month: 10, Hour: 7
Temperature (°C): 14.3, Dew Point Temperature (°C): 12.9
Cloud Cover (0-8): 6, Wind Speed (km/h): 10

The temperature prediction for the given input

The national weather service predicted 18°C as the maximum temperature that day, so I was a bit worried that the temperature prediction would be too high. Nevertheless, the output looked quite realistic.

In the end, my model (blue line) performed quite well: it tracked the actual temperature measured by my weather station (yellow line) quite closely until about 16h. The Arome weather model (red line), considered one of the best local models in Western Europe, performed much worse.

From 16h onwards, the model didn’t do too well. One reason for this could be that most clouds had moved away. Nevertheless, I’m happy with the results.

Over the following days, I tested the model extensively. It performed quite well most of the time, as long as the cloud cover and wind speed didn’t change. During the night the predicted temperatures were often higher than the actual temperatures measured. During the day the results were better.

What about Humidity?

At the start of this post, I mentioned that the model would predict humidity as well as temperature. To implement this, a column called “delta humidity (DU)” is created in the same way as the DT column. The code for calculating humidity is almost identical to that for calculating temperature. The independent variables are the same; the dependent variable is DU this time instead of DT.

In the loop, the newly built “calculate_humidity” function is called after the “calculate_temperature” function. The model is now capable of using the predicted humidity values to calculate temperature and humidity for the following hours.

With the same inputs as described in “The Results”, “model T” (temperature model) generally performed better than “model H” (temperature-humidity model). Model H was more accurate from 16h onwards though.

Over multiple tests, I’ve concluded that the performance of both models is comparable. The big advantage of the humidity model is that it also calculates dew point temperatures. The calculations for the dew point temperatures were surprisingly accurate.
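The post does not say how the dew point is derived from the predicted temperature and humidity. One common option is the Magnus approximation, sketched below with the Alduchov and Eskridge coefficients. The function names are hypothetical, and this is an assumption about how such a conversion could be done, not a description of the actual program.

```python
import math

# Magnus coefficients (Alduchov & Eskridge, 1996); C is in degrees Celsius.
B = 17.625
C = 243.04


def dew_point_from_rh(temperature_c, relative_humidity):
    """Approximate dew point (°C) from temperature (°C) and relative humidity (%)."""
    gamma = math.log(relative_humidity / 100.0) + B * temperature_c / (C + temperature_c)
    return C * gamma / (B - gamma)


def rh_from_dew_point(temperature_c, dew_point_c):
    """Approximate relative humidity (%) from temperature and dew point (both °C)."""
    return 100.0 * math.exp(
        B * dew_point_c / (C + dew_point_c) - B * temperature_c / (C + temperature_c)
    )


# Round trip using the temperature and dew point from the input example.
rh = rh_from_dew_point(14.3, 12.9)
print(round(rh, 1), round(dew_point_from_rh(14.3, rh), 1))
```

The second function is the inverse of the first, which is handy when the input is a dew point but the model expects relative humidity (U).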

Drawbacks and Possible Improvements

The models are far from perfect, but keeping in mind that even models running on supercomputers can only make decent forecasts up to about 7 days ahead, I think it’s quite a big accomplishment.

The major drawback of this model is that it can’t handle changes in cloud cover and wind speed. Those parameters are difficult to predict based on the others. An option to input predicted changes in these parameters could be useful.

Other improvements include adding more data and making an ensemble, which just means running the model X times with slightly different inputs every time.
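To make the ensemble idea concrete, here is a small sketch: the same one-hour prediction is run several times with slightly perturbed inputs, and the mean and spread of the results are reported. The predict_next_temperature function is a made-up linear rule standing in for the trained model, and the perturbation sizes are arbitrary assumptions.

```python
import numpy as np


def predict_next_temperature(temperature, humidity, cloud_cover, wind_speed):
    """Placeholder for the trained model: a made-up linear rule for illustration only."""
    return temperature + 1.5 - 0.01 * humidity - 0.1 * cloud_cover


rng = np.random.default_rng(0)
members = 20   # the "X" in "running the model X times"
base = {"temperature": 14.3, "humidity": 91.0, "cloud_cover": 6.0, "wind_speed": 10.0}

predictions = []
for _ in range(members):
    # Perturb each input slightly, keeping values within their physical ranges.
    perturbed = {
        "temperature": base["temperature"] + rng.normal(0, 0.2),
        "humidity": float(np.clip(base["humidity"] + rng.normal(0, 2.0), 0, 100)),
        "cloud_cover": float(np.clip(base["cloud_cover"] + rng.normal(0, 0.5), 0, 8)),
        "wind_speed": max(0.0, base["wind_speed"] + rng.normal(0, 1.0)),
    }
    predictions.append(predict_next_temperature(**perturbed))

ensemble_mean = float(np.mean(predictions))
ensemble_spread = float(np.std(predictions))
print(round(ensemble_mean, 2), round(ensemble_spread, 2))
```

A wide spread between members would indicate that the forecast is sensitive to small errors in the inputs.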

Thanks for reading!